2 minute read

Meta info.

TL; DR

SFT๋Š” ํ•™์Šต ๋ฐ์ดํ„ฐ๋ฅผ ์•”๊ธฐํ•œ๋‹ค๋ฉด, RL์€ Rule-based text/vision reasoning ๋ชจ๋‘์—์„œ ์ผ๋ฐ˜ํ™” ๋Šฅ๋ ฅ์„ ๋ฐฐ์šด๋‹ค.

image 1 image 2 image 3 image 4 image 5 image 6 image 7 image

Background

LLM post training ์—ฐ๊ตฌ๋Š” ํ•œ์ชฝ์œผ๋กœ๋งŒ ์ดˆ์ ์„ ๋งž์ถฐ ์—ฐ๊ตฌ๊ฐ€ ์ง„ํ–‰๋จ (๊ฐ ๋ฐฉ์‹์ด ์–ผ๋งˆ๋‚˜ ์ผ๋ฐ˜ํ™”์™€ ์•”๊ธฐ๋ฅผ ์ž˜ํ•˜๋‚˜๋ฅผ ํ™•์ธํ•˜๋Š” ๋“ฑ)

  • SFT: format adaptation์„ ์œ„ํ•œ static ๋ฐ์ดํ„ฐ๋กœ instruction tuning (FLAN, LIMA ๋“ฑ) >ย ์ƒˆ๋กœ์šด ๊ทœ์น™/์‹œ๊ฐ์  ๋ณ€ํ™”์—๋Š” ์•ฝํ•จ
  • RL: Outcome-driven optimization (human feedback or proxy reward) (RLHF์˜ PPO ๋“ฑ)

Problem States

SFT์™€ RL์€ ๋ชจ๋ธ์˜ ์ผ๋ฐ˜ํ™”๋ฅผ ๊ฐ€๋ฅด์น˜๋Š”๊ฐ€? ์•„๋‹ˆ๋ฉด ํ•™์Šต๋ฐ์ดํ„ฐ ์•”๊ธฐ๋ฅผ ๊ฐ€๋ฅด์น˜๋Š”๊ฐ€?

  • ํŠนํžˆ multimodal ์ถ”๋ก (text+vision)์— ์žˆ์–ด ๊ทœ์น™ ๋ณ€๊ฒฝ(Rule variant), ์‹œ๊ฐ ์ž…๋ ฅ ๋ณ€๊ฒฝ(Visual OOD variant)์ด ๋‹ฌ๋ผ์กŒ์„ ๋•Œ ์–ผ๋งˆ๋‚˜ robustํ•œ๊ฐ€?

Suggestions

๋™์ผํ•œ ์กฐ๊ฑด์—์„œ SFT์™€ RL์„ ๊ฐ๊ฐย ์ผ๋ฐ˜ํ™”์™€ย ์•”๊ธฐ๋ฅผย ๋ถ„๋ฆฌํ•ด ์ •๋Ÿ‰์ ์œผ๋กœ ๋ถ„์„ํ•˜์ž

  • Goal: SFT์™€ RL์„ ๋™์ผ ์ˆ˜์ค€์˜ training compute์—์„œ ๋™์ผ task์— ๋Œ€ํ•ด, trainig์— ์—†๋˜ ์กฐ๊ฑด(=OOD)์„ testํ•  ๋•Œ ์–ด๋–ป๊ฒŒ ๋ฐ˜์‘ํ•˜๋Š”๊ฐ€?
  • ์•”๊ธฐ: ๋ชจ๋ธ์ด ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์—์„œ ๋ณธ ํŒจํ„ด๋งŒ์„ ๊ทธ๋Œ€๋กœ ์žฌํ˜„ํ•˜๋Š” ๊ฒƒ. surface form๋งŒ ๋ณต์ œ
  • ์ผ๋ฐ˜ํ™”: ์•”๊ธฐ๋ฅผ ๋„˜์–ด์„œ, ์ƒˆ๋กœ์šด ๊ทœ์น™์ด๋‚˜ ์กฐํ•ฉ์— ๋Œ€ํ•ด ์‘์šฉํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒƒ. ์ƒˆ๋กœ์šด ์ƒํ™ฉ์— ๋Œ€ํ•œ ์ ์‘๋Šฅ๋ ฅ์— ์ค‘์ .
  • Tasks: ๊ฐ ํ…Œ์Šคํฌ์˜ train-test ๋ถ„ํฌ ์˜๋„์ ์œผ๋กœ ๋ถ„๋ฆฌ, ์ž˜ ํ’€๋ฉด ์ผ๋ฐ˜ํ™”๋ฅผ ํ•  ์ˆ˜ ์žˆ๊ณ , ๊ทธ๋ ‡์ง€ ์•Š์œผ๋ฉด ์•”๊ธฐ๋งŒ ํ•œ ๊ฒƒ์œผ๋กœ ๊ฐ„์ฃผ
    • #1ย GeneralPointsย (์ˆซ์ž ์นด๋“œ 4์žฅ์œผ๋กœ 24 ๋งŒ๋“ค๊ธฐ)
      • Rule variant: ๊ทœ์น™ ๊ฐœ์ˆ˜ K๋ฅผ 10์—์„œ 13์œผ๋กœ ๋ณ€๊ฒฝ (text)ย GP-L
      • Visual OOD variant: ์นด๋“œ ์ƒ‰(โ™ ๏ธโ™ฃ๏ธย >ย โ™ฅ๏ธโ™ฆ๏ธ) ๋ณ€๊ฒฝ (vision)ย GP-VL
    • #2 V-IRLย (์‹ค์ œ ๊ฑฐ๋ฆฌ ์ด๋ฏธ์ง€ ๊ธฐ๋ฐ˜ ๊ธธ์ฐพ๊ธฐ)
      • Rule variant: ์ง€์‹œ์–ด ์ ˆ๋Œ€๋ฐฉํ–ฅ <โ†’ ์ƒ๋Œ€๋ฐฉํ–ฅ (text)ย V-IRL-L
      • Visual OOD variant: NYC <โ†’ ๋‹ค๋ฅธ ๋„์‹œ(text + vision)ย V-IRL-VL

Effects

  • ํ•™์Šต ๊ตฌ์กฐ: SFT > RL ์ˆœ์„œ
    • RL ๋‹จ๋…์€ ์‹คํŒจ: SFT ์—†์ด RL ๋‹จ๋…์œผ๋กœ ํ•™์Šต ์‹œ instruction-following ์ž์ฒด๊ฐ€ ์•ˆ ๋˜์–ด์„œ ์‹คํ—˜ ์กฐ๊ฑด์—์„œ ์ œ์™ธ
    • RL์€ multi-turn PPO + verifier reward ๊ตฌ์กฐ: verifier-based reward๋กœ feedback ๋ฐ˜๋ณตํ•˜๋Š”(multi-turn) training
      • VER(v_t^out) โ†’ (r_t, v_t^ver): ์™ธ๋ถ€์˜ verifier๊ฐ€ ์ •๋‹ต์„ ๋งž์ท„๋Š”์ง€ ์ง์ ‘ ํŒ๋‹จํ•œ Signal์„ reward๋กœ ํ™œ์šฉ
        • v_t^out: ๋ชจ๋ธ ์ถœ๋ ฅ, v_t^ver: ์ž์—ฐ์–ด ํ˜น์€ ๊ตฌ์กฐํ™”๋œ Feedback message
        • r_t: ๋ชจ๋ธ ์ถœ๋ ฅ์˜ reward
      • multi-turn: verifier๊ฐ€ ํ‹€๋ ธ๋‹ค๊ณ  ์•Œ๋ ค์ฃผ๋ฉด, ๋ชจ๋ธ์€ ๊ทธ ํ”ผ๋“œ๋ฐฑ์„ ๋ฐ˜์˜ํ•ด์„œ ๋‹ค์Œ ํ„ด์— ๊ฐœ์„ ์„ ์‹œ๋„ํ•˜๋Š” ๊ตฌ์กฐ
    • ๊ฐ step๋งˆ๋‹ค ์ƒ์„ฑ > ๊ฒ€์ฆ > ํ”ผ๋“œ๋ฐฑ ์ˆ˜๋ ด (sequential revision): ์ƒ์„ฑ ์•ˆ์ •ํ™”๋ฅผ ์œ„ํ•œ ์ˆœ์ฐจ์ ์ธ ํ”„๋กฌํ”„ํŠธ ์ˆ˜์ •
  • metrics:
    • Success Rate: ์ตœ์ข… ๋‹ต ๋„๋‹ฌ ๋น„์œจ
    • Recognition Accuracy: ์ด๋ฏธ์ง€์—์„œ ์ˆซ์ž/๋žœ๋“œ๋งˆํฌ ์ธ์‹ ์ •ํ™•๋„
    • Per-step accuracy: V-IRL์—์„œ ์ง€์‹œ ๋”ฐ๋ฅด๊ธฐ ์ •ํ™•๋„
  • results:
    • RL์€ SFT ๋Œ€๋น„ ๋ชจ๋“  task์—์„œ OOD generalization ํ–ฅ์ƒ
    • Visual OOD ์กฐ๊ฑด์—์„œ๋„ RL์ด consistentํ•˜๊ฒŒ ์„ฑ๋Šฅ ์šฐ์œ„
    • SFT๋Š” visual reasoning token์— overfitting
    • verifier ๋ฐ˜๋ณต ํšŸ์ˆ˜๊ฐ€ ๋งŽ์„์ˆ˜๋ก OOD ์„ฑ๋Šฅ ๊ฐœ์„ : 10ํšŒ์— +5.99%
    • SFT๊ฐ€ ์ง€๋‚˜์น˜๊ฒŒ overfit๋œ ์ƒํƒœ๋กœ RL์„ ์‹œ์ž‘ํ•˜๋ฉด OOD ์„ฑ๋Šฅ ๋ณต๊ตฌ ๋ถˆ๊ฐ€

Personal note. robustํ•˜๊ณ  generalizableํ•œ foundation model ๋งŒ๋“ค๊ณ  ์‹ถ๋‹ค๋ฉด RL ๊ผญ ํ•ด์•ผ๋œ๋‹ค๋Š” ๊ฒฐ๋ก . SFT๊ฐ€ ํ˜•์‹์„ ๋ฐฐ์šฐ๊ฒŒ ํ•˜๋Š” ์—ญํ• ์„ ํ•œ๋‹ค๋ฉด (์ˆœ์„œ์ƒ ๊ทธ ํ›„์— ๋ถ™์ธ) RLํ•ด์ฃผ๋ฉด ์ผ๋ฐ˜ํ™”์—์„œ ํ™•์‹คํžˆ ์šฐ์œ„๋ฅผ ๊ฐ€์ง„๋‹ค๋ฅผ ๊ฒฝํ—˜์ ์œผ๋กœ ํ™•์ธํ•œ ๋…ผ๋ฌธ์ž…๋‹ˆ๋‹ค. (+์™ธ๋ถ€ verifier๋ฅผ ๋‘๋Š” ๊ฒŒ ์ด๋“์ด๋‹ค๊นŒ์ง€๋„ ) TACT์—์„œ DPO๋งŒ ํ•œ๊ฑด ์—ฌ์ „ํžˆ ํ•œ๊ณ„์ง€๋งŒ, ํ•ด๋ด„์งํ–ˆ๋˜ ์ด์œ ๋กœ ๋“ค๊ณ  ์‹ถ์–ด์„œ ์ข€ ์ž์„ธํžˆ ๋ณธ ๋…ผ๋ฌธ์ธ๋ฐ ์ง์ ‘ ์ธ์šฉํ•ด์„œ ๋‹ต์„ ํ–ˆ๋”๋ผ๋ฉด ๋” ์ข‹์•˜์„ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค. revision์—์„œ๋Š” ๋ณด๋‹ค ์ „๋ฉด์—์„œ ์–ธ๊ธ‰ํ•˜๋„๋ก ํ•ด์•ผ๊ฒ ์Šต๋‹ˆ๋‹ค.