Meta info.
  • Authors: Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu
  • Paper: https://arxiv.org/pdf/2501.18585
  • Affiliation: Shanghai Jiao Tong Univ., Soochow Univ., Tencent AI
  • Published: January 30, 2025

TL; DR

An analysis of the underthinking phenomenon, in which o1-like LLMs switch their line of thought unnecessarily often when solving hard problems.

Background

Recent o1-like LLMs (OpenAI o1, Qwen, DeepSeek-R1, …) are designed to scale test-time compute, producing long chains of thought so they can solve complex reasoning problems.

Problem States

More test-time compute does not by itself mean the model is thinking deeply.

  • The model often switches to a different line of thought without thinking the current one through (= underthinking), and this needs to be addressed.

Suggestions

  • Analysis of the underthinking phenomenon
    • On hard problems the model keeps changing its reasoning strategy but does not explore any single strategy sufficiently. (figure 2)
    • Thought switching occurs more often in incorrect responses → this leads to higher token usage (figure 1)
    • Even when the model picks a correct line of thought, it abandons it before fully solving the problem and tries a new approach → the model outputs a wrong answer
  • Proposes the UT Score
    • For incorrect responses, it measures how inefficiently tokens are used after the point where the first correct thought path (one that could lead to the right answer) appears
    • The larger the value, the more severe the underthinking
  • Introduces the Thought Switching Penalty (TIP)
    • Applies a penalty during decoding to discourage thought switching.
      • Lowers the probability of keywords that signal a change of thought (“alternatively”, “another way to approach this is…”) during decoding
      • The penalty is applied for a fixed duration (β) so that the model does not switch thoughts during that window
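
The TIP idea above can be sketched as a small logit adjustment. This is only a minimal sketch under my own assumptions, not the authors' implementation: the function name `apply_tip`, the penalty strength `alpha`, the toy vocabulary, and the numeric values are all hypothetical, and the window check assumes the penalty is active while the current thought is younger than β decoding steps.

```python
import math

def apply_tip(logits, switch_token_ids, step, thought_start, alpha, beta):
    """Return a copy of `logits` with thought-switch tokens penalized.

    `alpha` (assumed penalty strength) is subtracted from the logits of
    switch tokens only while step < thought_start + beta, i.e. while the
    current thought is still inside the penalty window.
    """
    if step >= thought_start + beta:
        return list(logits)  # window has passed: leave logits untouched
    return [z - alpha if i in switch_token_ids else z
            for i, z in enumerate(logits)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary of 4 tokens; id 2 stands in for a switch word
# such as "alternatively" (hypothetical mapping).
logits = [1.0, 0.5, 2.0, 0.0]
switch_ids = {2}

# Inside the window: the switch token is pushed down.
early = apply_tip(logits, switch_ids, step=10, thought_start=0, alpha=3.0, beta=100)
# Outside the window: decoding proceeds as usual.
late = apply_tip(logits, switch_ids, step=150, thought_start=0, alpha=3.0, beta=100)

p_before = softmax(logits)[2]
p_after = softmax(early)[2]
```

In a real decoder the same adjustment would be applied to the model's logits at every step, with `thought_start` reset whenever a new thought begins; how thought boundaries are detected is left out of this sketch.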

Effects

  • Experiments Set-up
    • target task: MATH500-Hard (college-level math problems), GPQA Diamond (multiple-choice physics, chemistry, and biology questions), AIME (high-difficulty math competition problems)
    • model
      • o1-like models: QwQ-32B-Preview, DeepSeek-R1-671B
      • general models: Qwen-Math-72B, Llama3.3-70B
  • Results
    • When thought switching rises sharply, the model produces wrong answers (accuracy drops, figure 4)
    • The harder the problem, the more often the line of thought changes: when the model produces a wrong answer, it has in fact abandoned midway a thought strategy that could have led to the correct answer. That is, the thought early in the response was correct, but the model switched to a new approach before fully working it out. (figure 5)
    • All o1-like LLMs show the underthinking problem; the UT Score is especially high on hard problem sets with many incorrect answers (AIME 2024) (table 1)
    • After applying TIP, accuracy increases by +1.5% on MATH500-Hard, +2.2% on GPQA Diamond, and +4.1% on AIME 2024 (table 3)
  • Future Work
    • Letting the model regulate its own line of thought
    • Verifying the effect on logic puzzles, physics QA, and similar tasks
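
To make the UT Score from the Suggestions section concrete, here is a rough sketch of how such a score could be computed. It assumes the score is the average, over incorrect responses only, of 1 − T̂/T, where T is the total token count of a response and T̂ is the number of tokens up to the end of its first correct thought. The function name `ut_score`, the input format, and the toy numbers are my own assumptions, not taken from the paper's code.

```python
def ut_score(incorrect_responses):
    """Average fraction of tokens spent after the first correct thought.

    `incorrect_responses` is a list of (t_hat, t_total) pairs, one per
    incorrect response:
      t_total: total number of tokens in the response
      t_hat:   tokens up to the end of the first correct thought,
               or None if the response contains no correct thought
    A response with no correct thought contributes 0 (assumption:
    t_hat is set equal to t_total in that case).
    """
    if not incorrect_responses:
        raise ValueError("need at least one incorrect response")
    total = 0.0
    for t_hat, t_total in incorrect_responses:
        if t_hat is None:
            t_hat = t_total
        total += 1.0 - t_hat / t_total
    return total / len(incorrect_responses)

# Toy data (made up for illustration):
#   response 1: correct thought ends at token 200 of 1000 -> 0.8
#   response 2: no correct thought at all                 -> 0.0
#   response 3: correct thought ends at token 500 of 1000 -> 0.5
example = [(200, 1000), (None, 800), (500, 1000)]
score = ut_score(example)
print(round(score, 3))  # 0.433
```

A score near 1 means correct thoughts showed up early and most of the remaining tokens did not help; a score near 0 means little was wasted after the first correct thought.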

Personal note. A lot of papers are coming out that analyze and fix what o1 models cannot do, but everything is moving so fast that it is hard to really feel the progress... 🐢 More and more of the example problems in these papers are ones I could only solve by thinking deeply myself (or could not solve even then)...