Meta info.
  • Authors: Mehdi Fatemi, Banafsheh Rafiee, Mingjie Tang, Kartik Talamadupula
  • Paper: https://arxiv.org/pdf/2504.05185
  • Affiliation: Wand AI
  • Published: April 7, 2025

TL;DR

LLMs trained with RL generate unnecessarily long reasoning, but 2-phase RL can make them reason concisely while maintaining accuracy.

Background

Recent LRMs report a correlation between long CoT reasoning and improved performance.

  • Some studies instead point out that verbose responses can lead into dead-ends, and that in some cases performance degrades as responses get longer
  • dead-ends: states during generation from which the LLM is unlikely to recover to the correct answer

Problem Statements

  • longer CoT == better reasoning?
  • Why does RL make LLMs produce longer responses?
  • Can reasoning be shortened without degrading performance?

Suggestions

  • Mathematical analysis of PPO: explains why long responses emerge from the RL objective
    • Long responses are not an inherent property of better reasoning!
      • Each reasoning problem is formalized as an MDP, and
      • with a sparse, delayed reward (the reward arrives only at t-1, i.e. the final step), the per-token PPO loss is computed (pic 1)
    • With λ < 1, the analysis shows that the PPO loss
      • reward < 0: inherently pushes toward longer responses, and
      • reward > 0: pushes toward shorter responses
  • Proposes 2-phase RL
    1. Improve reasoning capability with hard problems (eliciting long CoT): problems the base model cannot solve at all → mostly negative reward → PPO is driven to generate more tokens
    2. Produce short answers with problems at a solvable level: problems where the base model's solve probability p_a was positive → occasional positive reward → PPO is driven to generate fewer tokens (Fig 2)
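
The length bias in the PPO analysis above can be reproduced numerically. The following is a minimal sketch under simplifying assumptions, not the paper's derivation: the discount is 1, the only reward is a single terminal reward, every non-terminal state shares one constant value estimate `value`, and the per-token loss is just the negative GAE advantage (probability ratio 1, no clipping). The function names are made up for illustration.

```python
from typing import List


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 1.0, lam: float = 0.95) -> List[float]:
    """Generalized Advantage Estimation for one episode.

    `values` must hold len(rewards) + 1 entries; the last one is the
    value of the terminal state.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at token t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def avg_token_loss(num_tokens: int, final_reward: float,
                   lam: float, value: float = 0.0) -> float:
    """Average per-token policy loss for a response of `num_tokens` tokens.

    Assumptions (illustrative only): reward only on the last token,
    constant value estimate, loss per token = -advantage.
    """
    rewards = [0.0] * (num_tokens - 1) + [final_reward]
    values = [value] * num_tokens + [0.0]  # terminal state has value 0
    advantages = gae_advantages(rewards, values, gamma=1.0, lam=lam)
    return -sum(advantages) / num_tokens


if __name__ == "__main__":
    for reward in (-1.0, 1.0):
        for lam in (0.95, 1.0):
            losses = [avg_token_loss(T, reward, lam) for T in (10, 100, 1000)]
            print(f"reward={reward:+.0f} lam={lam}: {losses}")
```

Under these assumptions the average loss reduces to -(R - v)(1 - λ^T) / (T(1 - λ)) for a response of T tokens, terminal reward R and value estimate v. With a negative terminal reward the loss shrinks as T grows, with a positive one the loss is lowest for small T, and at λ = 1 the dependence on T disappears.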

Effects

  • Correct answers are short (Tab 1): correct responses are shorter than incorrect ones
    • Verified for backbones R1, Qwen, Phi-4, … on MATH500, AIME’24, MMLU-STEM, etc.
  • 2-phase RL
    • Response length drops sharply in phase 2 (Fig 3)
      • R1-1.5B output length fell from an average of 6848 tokens to 3119 tokens while accuracy was maintained (Tab 2)
    • After phase-2 RL, performance holds even under greedy decoding (temp. = 0), which demonstrates robustness (Tab 3)
      • For R1-1.5B at temp. = 0, MATH500 accuracy even improved from 70% to 81%
  • λ < 1: the key ingredient that makes the PPO objective prefer short responses
    • At λ = 1, PPO becomes unstable and the value estimates drift, leading to overflow/underflow (Fig 5, 6)
  • RL post-training with only 8 problems: cuts the R1 models' response length to less than half while maintaining accuracy
    • For Qwen, as few as 4 problems give a 30% performance improvement
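
The Suggestions section ties each phase to the base model's solve probability p_a: problems it never solves for the first phase, problems with positive p_a for the second. One rough way to build such a split is to sample the model several times per problem and bucket problems by empirical solve rate. The sketch below is an assumption about how that could be done, not the authors' procedure: `attempt` is a hypothetical callback that samples one response and returns whether it was correct, and `n_samples = 8` is an arbitrary choice.

```python
from typing import Callable, List, Tuple


def estimate_solve_rate(problem: str, attempt: Callable[[str], bool],
                        n_samples: int = 8) -> float:
    """Fraction of `n_samples` sampled attempts on `problem` judged correct."""
    solved = sum(1 for _ in range(n_samples) if attempt(problem))
    return solved / n_samples


def split_by_solvability(problems: List[str], attempt: Callable[[str], bool],
                         n_samples: int = 8) -> Tuple[List[str], List[str]]:
    """Return (never_solved, sometimes_solved) according to empirical solve rate."""
    never_solved: List[str] = []
    sometimes_solved: List[str] = []
    for problem in problems:
        if estimate_solve_rate(problem, attempt, n_samples) > 0.0:
            sometimes_solved.append(problem)
        else:
            never_solved.append(problem)
    return never_solved, sometimes_solved


if __name__ == "__main__":
    # Toy stand-in for a real model: a fixed outcome per problem.
    toy_outcomes = {"problem_a": True, "problem_b": False}
    hard, solvable = split_by_solvability(list(toy_outcomes),
                                          lambda p: toy_outcomes[p])
    print("phase 1 candidates:", hard)
    print("phase 2 candidates:", solvable)
```

An empirical solve rate of zero over a handful of samples does not prove that p_a is zero, so a real pipeline would likely need more samples or a different threshold.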

Personal note. The authors themselves list as limitations that the optimal value of λ still needs more analysis and that the approach does not yet extend to GRPO and similar methods, but the analysis of a problem inherent to PPO is impressive, and it is a surprising result that RL post-training really used only 8 or even 4 problems. I expect this to be an early study that can shift the paradigm that longer answers mean better performance. It also looks like a kind of curriculum RL could become common practice.