less than 1 minute read

Meta info.

TL; DR

RM์œผ๋กœ Policy model์„ ํ•™์Šตํ•˜๋ฉด ํ•™์Šตํ• ์ˆ˜๋ก real (human) preference์™€ ๊ฒฉ์ฐจ๊ฐ€ ๋ฒŒ์–ด์ง€๋Š” overoptimization์ด (๋ฐ˜๋“œ์‹œ) ๋ฐœ์ƒ๋˜๋ฉฐ, ์ด ํ˜„์ƒ์˜ ๋„๋‹ฌ์„ ๋Šฆ์ถ”๋Š”(?) ๋ฐ์—๋Š” RM์˜ ์‚ฌ์ด์ฆˆ๋ฅผ ํ‚ค์šฐ๋Š”๊ฒŒ ์œ ์˜ํ•œ ์˜ํ–ฅ์„ ๋ผ์น˜๋Š” ๊ฒƒ์œผ๋กœ ๋ณด์ž„.

Untitled

Untitled

Untitled

Effects

  • setup
    • human data (real performance) ๋กœ ํ•™์Šตํ•œ RM์„ Gold RM์œผ๋กœ
    • ์ด์— ๋”ํ•ด synthetic data๋กœ ํ•™์Šตํ•œ RM์„ Proxy RM์œผ๋กœ ๊ฐ€์ •
    • ์ฆ‰ ์ „์ž๋ฅผ gold๋กœ ๊ฐ€์ •ํ•˜์—ฌ, initial Policy ๋ชจ๋ธ์„ Proxy RM์œผ๋กœ RL์ด ์ง„ํ–‰๋ ์ˆ˜๋ก (KL distance๊ฐ€ ์ฆ๊ฐ€ํ• ์ˆ˜๋ก) ๋ฒŒ์–ด์ง€๋Š” ๊ฒฉ์ฐจ์— ๋Œ€ํ•œ emperical ์—ฐ๊ตฌ
  • results
    • RM์˜ ์‚ฌ์ด์ฆˆ๋ฅผ ํ‚ค์šฐ๋Š”๊ฑด ํ˜„์ƒ์„ ๋Šฆ์ถ”๋Š” ๋ฐ์—๋Š” ๋„์›€์ด ๋˜๋‚˜
    • RM ์‚ฌ์ด์ฆˆ ๊ณ ์ •ํ•˜๊ณ  RL์— ์‚ฌ์šฉ๋˜๋Š” ๋ฐ์ดํ„ฐ ์‚ฌ์ด์ฆˆ๋Š” ๋”ฑํžˆ scaling law๊ฐ€ ๋ฐœ๊ฒฌ๋˜์ง€ ์•Š์•˜๊ณ 
      • ๋‹ค๋งŒํŠน์ • ์ˆ˜์ค€์œผ๋กœ data size๊ฐ€ ๋„˜์–ด๊ฐ€๋ฉด RM validation loss ๊ฐ€ ๊ฐ์†Œํ–ˆ๋Š”๋ฐ,
      • Proxy RM size๋‚˜ data size์™€ ๋ฌด๊ด€ํ•˜๊ฒŒ, RM์ด ๋น„์Šทํ•œ validation loss๋ฅผ ๊ฐ€์ง€๋ฉด ๋น„์Šทํ•œ gold score (=ํ•™์Šต์ˆ˜์ค€) performance๋ฅผ ๋‚ธ๋‹ค๊ณ (..?๋น„์•ฝ์ด ๋„ˆ๋ฌด ์‹ฌํ•œ๊ฑฐ๊ฐ™์€๋ฐ)
    • Policy model size (=LM)์— ๋Œ€ํ•ด์„œ๋Š” ์ž‘์€ ๋ชจ๋ธ์ผ์ˆ˜๋ก ๋” ์ตœ์ ํ™”๋กœ ์–ป๋Š” ํšจ๊ณผ๊ฐ€ ๋” ํฌ๋‹ค๊ณ .
    • Gold score ๊ธฐ์ค€ RL > Best of N ( Rejection Sampling )ย ํ•™์Šต์€ ๋”ํ•ด์•ผ๊ฒ ์ง€๋งŒ,,