Meta info.
  • Authors: Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, Nicholas Thumiger, Aditya Desai, Ion Stoica, Ana Klimovic, Graham Neubig, Joseph E. Gonzalez
  • Paper: https://www.arxiv.org/pdf/2502.08235
  • Affiliation: ETH Zurich, UC Berkeley
  • Published: February 12, 2025

TL;DR

The paper reports that when LRMs overthink, they fail to interact properly with agentic environments (the Reasoning-Action Dilemma), and that this degrades performance.

Background

In non-agentic settings, LRMs can carry out high-level reasoning through CoT reasoning, self-verification, and similar techniques.

  • In agentic settings, can they gather new information directly by interacting with the environment and then use it in their internal reasoning?

Problem Statement

LRMs tend to over-rely on internal reasoning, and this tendency gets in the way of interacting with the actual environment, which degrades performance (= overthinking).

  • Fig4-a: plans excessively but never actually executes, or
  • Fig4-b: takes several actions at once without waiting for feedback, or
  • Fig4-c: ends the task on its own without (waiting for) feedback
  • RQ1: Does overthinking affect actual performance?
  • RQ2: Does the tendency to overthink differ by model type?
  • RQ3: Can it be mitigated?

Suggestions

  • overthinking score: defines a 0-10 scale using LLM-as-a-judge, then analyzes and scores 4018 model trajectories
    • Reports a high correlation with actual expert ratings
  • Ways to mitigate overthinking
    • Generate several times, then pick the trajectory with the lowest overthinking score: about 30% better performance, with cost down 43%
    • Use function calling (FC): reported to help reduce overthinking. For the o1 model, performance rises to 47.7% with FC
    • Selective Reinforcement Learning, etc.
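
The "generate several times, keep the least-overthinking run" idea from the list above can be sketched roughly as follows. This is only a minimal illustration, not the authors' implementation: the `Candidate` class, the `pick_least_overthinking` helper, and the sample scores are all hypothetical, and the scores are assumed to have already been produced by some LLM judge on the 0-10 scale.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    """One sampled trajectory plus its judge-assigned overthinking score (0-10)."""
    trajectory: str
    overthinking_score: float


def pick_least_overthinking(candidates: Sequence[Candidate]) -> Candidate:
    """Return the candidate with the lowest overthinking score.

    Ties go to the earliest sample, because min() keeps the first minimum.
    """
    if not candidates:
        raise ValueError("need at least one candidate trajectory")
    for c in candidates:
        if not 0 <= c.overthinking_score <= 10:
            raise ValueError(f"score outside the 0-10 scale: {c.overthinking_score}")
    return min(candidates, key=lambda c: c.overthinking_score)


# Illustrative scores only; these are NOT numbers from the paper.
samples = [
    Candidate("plan, plan, plan, never run the tests", 7.0),
    Candidate("edit file, run tests, read output, fix", 1.0),
    Candidate("issue five commands without reading any output", 5.0),
]
best = pick_least_overthinking(samples)
print(best.trajectory)
```

Selecting on the score alone keeps the sketch simple; in practice one would presumably also check whether the chosen trajectory actually solves the task.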

Effects

  • RQ1: the higher the overthinking score, the lower the problem-solving ability (confirmed with a regression model)
    • In particular, reasoning models score about 1.3 points higher on overthinking than non-reasoning models (3.5 vs. 2.2)
  • RQ2: smaller models seem to overthink more.
    • This looks like a lack of capacity to process feedback from the external environment (so they seem to fall back on internal reasoning)
  • RQ3: effect of Function Calling
    • With FC enabled, performance goes from 29.1% → 47.7%, and the overthinking score from 2.43 → 1.05
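
To put the before/after figures above in perspective, here is a small arithmetic sketch. It assumes the two percentages are before/after performance values and the two scores are before/after overthinking scores; the helper name `relative_change` is hypothetical.

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after` (negative means a drop)."""
    return (after - before) / before * 100


perf_before, perf_after = 29.1, 47.7   # performance without / with FC
score_before, score_after = 2.43, 1.05  # overthinking score without / with FC

perf_gain_points = perf_after - perf_before
perf_gain_relative = relative_change(perf_before, perf_after)
score_drop_relative = relative_change(score_before, score_after)

print(f"performance: +{perf_gain_points:.1f} points ({perf_gain_relative:.1f}% relative)")
print(f"overthinking score: {score_drop_relative:.1f}% relative")
```

Under that reading, the gain is 18.6 percentage points (roughly a 64% relative improvement), while the overthinking score falls by roughly 57%.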

Personal note. This looks like the first attempt to quantify overthinking, and I think its value lies in proposing early methods that can actually curb it. Function calling turns out to matter here too 🤔