2 minute read


TL;DR

์ถ”๊ฐ€ ํ•™์Šต ์—†์ด ๋‹จ์ˆœ MCMC ๊ธฐ๋ฐ˜ ์ƒ˜ํ”Œ๋ง๋งŒ์œผ๋กœ LLM์˜ base model์ด RL๋กœ post-training๋œ ๋ชจ๋ธ ์ˆ˜์ค€์˜ ์ถ”๋ก  ๋Šฅ๋ ฅ ๋‚ผ ์ˆ˜ ์žˆ๋‹ค.


Background

  • LLM์˜ RL๊ธฐ๋ฐ˜ post training์ด ์ผ๋ฐ˜ํ™”๋˜๊ณ , ์ด๋ ‡๊ฒŒ ํ•™์Šต๋œ ๋ชจ๋ธ์€ ์ˆ˜๋ฆฌ์ถ”๋ก , ์ฝ”๋”ฉ ๋“ฑ์—์„œ ๋ˆˆ์— ๋„๋Š” ์„ฑ๋Šฅ์„ ๋ณด์ด์ง€๋งŒ
  • ๊ทธ๋ ‡๋‹ค๊ณ  RL ๋“ฑ์ด ์ƒˆ๋กœ์šด ๋Šฅ๋ ฅ์„ ํ•™์Šตํ•˜๊ฒŒ ํ•˜๋Š” ๊ฑด ์•„๋‹ˆ๋ผ๋Š” ๊ฒฌํ•ด๋„ ๋“ฑ์žฅ [He et al. 2025, Yue et al. 2025]
    • RL์€ ๊ทธ๋ƒฅ SFT ๋ชจ๋ธ์ด ์ด๋ฏธ ์ž˜ํ•˜๋Š” high-likelihood reasoning ๊ฒฝ๋กœ๋ฅผ ์ง‘์ค‘์ ์œผ๋กœ ์„ ํƒํ•˜๊ฒŒ ๋งŒ๋“ค ๋ฟ

Problem Statement

์ถ”๊ฐ€ ํ•™์Šต ์—†์ด๋„, ๋‹จ์ˆœํžˆ ์ƒ˜ํ”Œ๋ง ๊ณผ์ •๋งŒ ์กฐ์ •ํ•ด์„œ RL ์ˆ˜์ค€์˜ reasoning ๋Šฅ๋ ฅ์„ ๋Œ์–ด๋‚ผ ์ˆ˜ ์žˆ๋Š”๊ฐ€?

  • RL ๋น„์šฉ์ด ๋น„์‹ธ๋‹ˆ, inference-time only๋กœ base model์„ shapeํ•˜๊ฒŒ resamplingํ•˜๊ธฐ

Suggestions

Power sampling via MCMC

  • Power distribution sampling: the target distribution is p_\alpha(x) \propto p(x)^\alpha; with \alpha > 1, candidates the model already rates highly (high token likelihood) are reinforced; a test-time-only sampling algorithm
    • Procedure: cut the sequence into blocks of length B and sample progressively toward the target distribution p_\alpha
      • Sampling an entire length-T sequence in one shot is computationally intractable, so split it into k blocks of length B
      • Set the joint likelihood under p_\alpha over the first k blocks (each of length B) as an intermediate target
        • Start at k=0 (with short sequences)
        • At k=1, sample the first block to match p_\alpha
        • For ever longer sequences, freeze what came before as a prefix > append B tokens at a time, modeling with MCMC*
        • Approximate up to the final length T
      • Within each block, run a Metropolis–Hastings (MH) procedure: "resample > accept/reject".
        • Randomly pick a position in the sequence generated so far and resample the continuation from p_prop
        • Accept the swap per the MH ratio: always when the candidate's p_\alpha-likelihood improves, and with matching probability otherwise (reject = keep the current sequence)
        • p_prop: the base model with sampling temperature set to 1/\alpha, i.e. a slightly sharpened version; this is the proposal distribution that generates new candidate sequences
  • *) MCMC: when sampling directly from a complex distribution is hard, take small steps so the samples progressively come to follow that distribution
    • Markov chain: generate each new candidate depending on the previous state
    • Monte Carlo: repeat random sampling
    • test-time scaling: the number of MCMC steps N sets the trade-off between inference time and performance; roughly 8.8x extra compute is needed, but the authors stress it is training-free (Fig 6)
  • Difference from low-temperature sampling: low temperature sharpens each step's conditional in isolation, whereas the power distribution samples with the likelihood of the entire future token trajectory in view
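The target and the accept/reject rule above can be written compactly; this is a sketch consistent with the description, where x is the current sequence, x' the resampled candidate, and q the temperature-1/\alpha proposal:

```latex
% target: the power (sharpened) distribution over sequences
p_\alpha(x) \propto p(x)^{\alpha}, \qquad \alpha > 1

% Metropolis--Hastings acceptance probability for a candidate x'
A(x \to x') = \min\!\left(1,\;
  \frac{p(x')^{\alpha}\, q(x \mid x')}{p(x)^{\alpha}\, q(x' \mid x)}\right)
```

The correction factor q(x|x')/q(x'|x) is what makes the chain target p^\alpha exactly rather than whatever the sharpened proposal alone would give.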
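As a concrete illustration, here is a minimal runnable sketch of the MH resampling loop for one block, over a toy categorical "model". The vocabulary, conditionals, and all parameter values are made up for illustration and are not the paper's implementation; a real version would score suffixes with an LLM's token log-probabilities.

```python
import math
import random

random.seed(0)

VOCAB = [0, 1, 2]

def cond_probs(prefix):
    """Toy autoregressive conditional p(token | prefix): favors repeating the last token."""
    if not prefix:
        return [1 / 3, 1 / 3, 1 / 3]
    last = prefix[-1]
    return [0.6 if t == last else 0.2 for t in VOCAB]

def log_p(seq):
    """Joint log-likelihood log p(x) under the toy base model."""
    return sum(math.log(cond_probs(seq[:i])[tok]) for i, tok in enumerate(seq))

def sharpen(probs, alpha):
    """Temperature-1/alpha proposal: probs**alpha, renormalized."""
    w = [p ** alpha for p in probs]
    z = sum(w)
    return [x / z for x in w]

def sample(probs):
    r, acc = random.random(), 0.0
    for t, p in enumerate(probs):
        acc += p
        if r < acc:
            return t
    return len(probs) - 1

def propose_suffix(prefix, length, alpha):
    """Sample `length` tokens from the sharpened proposal; return (sequence, log q of suffix)."""
    seq, lq = list(prefix), 0.0
    for _ in range(length):
        q = sharpen(cond_probs(seq), alpha)
        t = sample(q)
        lq += math.log(q[t])
        seq.append(t)
    return seq, lq

def proposal_logprob(seq, start, alpha):
    """log q(seq[start:] | seq[:start]) under the sharpened proposal."""
    return sum(
        math.log(sharpen(cond_probs(seq[:i]), alpha)[seq[i]])
        for i in range(start, len(seq))
    )

def mh_power_sample(block_len, alpha, n_steps):
    """Metropolis-Hastings targeting p(x)^alpha for a single block."""
    x, _ = propose_suffix([], block_len, alpha)
    for _ in range(n_steps):
        i = random.randrange(block_len)                       # random resampling position
        x_new, lq_new = propose_suffix(x[:i], block_len - i, alpha)
        lq_old = proposal_logprob(x, i, alpha)
        # MH log-ratio: alpha*(log p(x') - log p(x)) + log q(x|x') - log q(x'|x)
        log_ratio = alpha * (log_p(x_new) - log_p(x)) + lq_old - lq_new
        if math.log(random.random()) < log_ratio:             # accept, else keep x
            x = x_new
    return x

x = mh_power_sample(block_len=6, alpha=4.0, n_steps=50)
print(x, log_p(x))
```

Note that the chain accepts every improving swap and some worsening ones, which is exactly what distinguishes it from greedily maximizing likelihood: it samples from p^\alpha instead of collapsing onto the mode.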

Effects

  • Experimental setup:
    • backbone: Qwen2.5-Math-7B / Qwen2.5-7B / Phi-3.5-mini-instruct
    • baseline: Base / low-temperature sampling / power sampling (proposed) / RL post-training (GRPO)
    • benchmark: MATH500 (math reasoning), HumanEval (coding), GPQA (science), AlpacaEval 2.0
    • domain split:
      • in-domain: the same area as the task RL was trained (post-trained) on
      • OOD: domains RL never saw during training
  • Results
    • Tab 1 / Fig 1: single-shot reasoning improves even without training
      • On in-domain MATH500, the proposed method comes close to RL-level performance
      • On the OOD side, HumanEval and AlpacaEval improve 3-5%p over RL
    • Fig 4: likelihood check
      • Power sampling covers the base model's high-likelihood region broadly
        • GRPO concentrates near the peak likelihood (= mode collapse)
      • The confidence distribution is likewise highest for GRPO, but overly peaked
        • Power sampling sits just below it, keeping a balanced sharpness.
    • Fig 5: diversity check: gains in both single- and multi-shot settings
      • pass@k (multi-shot) performance: the proposed method was best, then GRPO > base
        • RL's diversity collapse makes performance plateau as k grows
      • The proposed method keeps climbing for k > 1

Personal note. I picked this up while mulling over the possibility of controlling memory or preference at test time, and I'm sharing it because the idea looked simple. You could read it as a paper that (roughly) frames reasoning as a search problem rather than a learning problem. It was also interesting that it partly overlaps with a problem raised in yesterday's seminar. The mathematical analysis is fairly thorough in its own way, but I was left with the impression that why high-likelihood sampling connects to reasoning correctness is still under-explained. (The causation is missing, though I suppose that can't be helped.)