
TL;DR

TTRL estimates rewards via majority voting over test data alone and runs RL on those rewards; the authors report this can lift reasoning performance by as much as 2–3×.


Background

Reported performance gains from test-time scaling

  • test-time scaling: improve model performance at inference time without increasing pretraining compute

Problem Statement

Can we do reinforcement learning at test time without ground truth?

  • In prior work, RL for reasoning still needs answer labels in the end, even when CoT is used
  • Test-time training exists, but not as unsupervised RL

Proposed Method

Test-Time Reinforcement Learning

  • ๋‹คํšŒ rollouts (๋ณดํ†ต 64) โ†’ majority voting์œผ๋กœ pseudo-label ์ถ”์ •
  • Reward: majority voting๊ณผ ์ผ์น˜ํ•˜๋ฉด 1 ์•„๋‹ˆ๋ฉด 0
  • training efficiency๋ฅผ ์œ„ํ•ด ํ•™์Šต์—์„œ๋Š” ๊ทธ์ค‘์— 16๊ฐœ๋งŒ์œผ๋กœ downsampling..
  • key concept: RL โ€œbootstrapsโ€ on majority-voted labels, allowingย unsupervised continual learning
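The reward step above can be sketched in a few lines. This is a minimal illustration, not the paper's code; the answer strings and counts are invented:

```python
from collections import Counter
import random

def majority_vote_reward(answers):
    """Estimate a pseudo-label by majority vote over rollout answers,
    then assign binary rewards: 1 if an answer matches the vote, else 0."""
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1 if a == pseudo_label else 0 for a in answers]
    return pseudo_label, rewards

# 64 hypothetical rollout answers to one test question (no ground truth used)
rollouts = ["42"] * 40 + ["41"] * 15 + ["7"] * 9
label, rewards = majority_vote_reward(rollouts)

# For training efficiency, keep only 16 of the 64 rollouts for the update
batch = random.sample(list(zip(rollouts, rewards)), k=16)
```

The RL update itself (e.g. a policy-gradient step) then treats these binary rewards exactly like verifier rewards, which is what lets the loop run without any ground-truth labels.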

Results

  • target tasks: AIME 2024, AMC, MATH-500
  • results: substantially outperforms standard RL despite having no ground-truth labels (Tab 1)
    • the proposed method surpasses the majority-voting upper bound of the initial policy (Fig 6)
    • robust performance even out-of-domain (Fig 3)
      • the authors argue the model learned genuine reasoning, rather than memorizing answers or over-optimizing
    • approaches the performance of RL with a reward model that knows the test-set answers (an oracle upper bound) (Fig 7)
  • discussions
    • Why does it work without answers: what RL needs is less the exact answer than a good directional signal, so rewards that are mostly correct suffice for learning → majority voting supplies that signal
      • Is majority voting really safe: it effectively selects the answers the model is most confident in, and even when wrong, votes tend to land on the "less wrong" answers, yielding surprisingly robust pseudo-labels
      • It may also ease reward sparsity: under supervised labels every wrong answer scores 0, whereas pseudo-labels leave room to create gradations even among wrong answers
    • Why does TTRL improve more slowly early in training in Fig 7
      • With leaked answers there is no need to explore, whereas TTRL must do trial-and-error from the start → that is what makes the leaked setting faster initially
      • That TTRL eventually catches up suggests the pseudo-labels themselves improve through a self-improvement loop

Personal note. 64 rollouts may sound like a lot, but compared with training a reward model it is actually quite efficient, and the paper even reaches the conclusion that the inaccurate pseudo-labels help by inducing robust exploration: it starts strong and, unusually, finishes just as strong. Hard to believe majority voting alone is this robustly effective…