TL;DR

PbRL์„ ์œ„ํ•œ ์ ๋Œ€์  ์„ ํ˜ธ๊ธฐ๋ฐ˜ ์ตœ์ ํ™” ๋ฐฉ๋ฒ•๋ก  APPO ์ œ์•ˆ

Background

์ธ๊ฐ„์˜ preference feedbackํ™œ์šฉํ•˜๋ฉด RL์—์„œ reward design์ด ์–ด๋ ต๋‹ค๋Š” ํ•œ๊ณ„๋ฅผ ๊ทน๋ณตํ•  ์ˆ˜ ์žˆ๋”๋ผ

Problem Statement

Human feedback is expensive and hard to collect online (in real time), and offline PbRL methods typically compute confidence sets to guarantee conservatism, which is computationally complex.

  • Research Question: is there an offline PbRL method that achieves both computational efficiency and conservatism, without an explicit confidence set? (The sketch below shows the reward-learning setup this builds on.)
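For context on what the reward model learns (standard PbRL background rather than something specific to this paper): pairwise trajectory preferences are typically fit with a Bradley-Terry model, where the probability that one segment is preferred is a sigmoid of the difference of summed rewards. A minimal PyTorch sketch follows; the class and function names are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

# Standard Bradley-Terry reward learning from pairwise preferences
# (illustrative sketch, not the paper's implementation). A segment's score
# is the sum of per-step rewards, and
# P(seg1 preferred over seg0) = sigmoid(R(seg1) - R(seg0)).

class RewardModel(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # obs: (B, T, obs_dim), act: (B, T, act_dim) -> segment return (B,)
        r = self.net(torch.cat([obs, act], dim=-1))  # per-step reward (B, T, 1)
        return r.sum(dim=(1, 2))

def bradley_terry_loss(rm: RewardModel, seg0, seg1, pref: torch.Tensor) -> torch.Tensor:
    # seg0/seg1: (obs, act) tuples; pref: (B,) float, 1.0 if seg1 was preferred.
    logits = rm(*seg1) - rm(*seg0)  # Bradley-Terry log-odds
    return nn.functional.binary_cross_entropy_with_logits(logits, pref)
```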

Suggestion

APPO

  • PbRL์„ ์ •์ฑ…(policy)๊ณผ reward ๋ชจ๋ธ ๊ฐ„ 2์ž ๊ฒŒ์ž„(two-player game)์œผ๋กœ ์žฌ๊ตฌ์„ฑ
    • leader (policy model, ฯ€): maximize preference score, TRPO
    • follower (reward model, r): ๋ณด์ˆ˜์ ์ธ reward model์„ adversarial optimization
      • leader๊ฐ€ ๋„ˆ๋ฌด ๋ถˆํ™•์‹คํ•œ ๊ณณ์—์„œ ํƒํ—˜ํ•˜์ง€ ์•Š๋„๋ก ์ œ์•ฝ
  • ์ด๋ก ์ ์œผ๋กœย ์ƒ˜ํ”Œ ๋ณต์žก๋„๋ฅผ ์ฆ๋ช…ํ•˜์—ฌ ๊ธฐ์กด ๋ฐฉ๋ฒ•๋ณด๋‹ค ๊ณ„์‚ฐ ํšจ์œจ์„ฑ์—์„œ ์ด๋“
    • feedback ์ˆ˜๊ฐ€ ์ ์–ด๋„ optimized policy ํ•™์Šต ๊ฐ€๋Šฅ

Effects

์‹คํ—˜์ ์œผ๋กœย continuous control ํ™˜๊ฒฝ์—์„œ ์ตœ์‹  ๋ฐฉ๋ฒ•๋ก ๊ณผ ๋น„๋“ฑํ•˜๊ฑฐ๋‚˜ ๋” ๋‚˜์Œ

  • target task: Meta-World benchmark
  • baselines: Markovian reward, Preference Transformer, DPPO, IPL (Inverse Preference Learning)
  • Table 1: with feedback budgets of 500/1,000, APPO records the highest average rank across all experiments

Personal note. I started writing this up because the preliminaries introduced at the beginning felt friendly enough to be worth summarizing, but the paper is on the dry side since it is driven mostly by theoretical development. Following it carefully, I did not come to understand every piece of the theory perfectly, but at a high level I could grasp the intent, the goal, and the results.