Meta info.
  • Authors: Yuchen Fan, Kaiyan Zhang, Heng Zhou, Yuxin Zuo, Yanxu Chen, Yu Fu, Xinwei Long, Xuekai Zhu, Che Jiang, Yuchen Zhang, Li Kang, Gang Chen, Cheng Huang, Zhizhou He, Bingning Wang, Lei Bai, Ning Ding, Bowen Zhou
  • Paper: https://arxiv.org/pdf/2508.10874
  • Affiliation: Shanghai AI Lab, Shanghai Jiao Tong Univ., Tsinghua Univ., UCL
  • Published: August 14, 2025

TL;DR

Build a self-search model by running RL against a fully simulated search, with no external tools such as a search engine or another LLM, so that the learned behavior transfers to real-world search.

Background

  • Improved LLM reasoning has brought success on math and coding tasks
  • Reasoning for ODQA (search-based) tasks usually relies on external tools
    • Search-R1, Kimi V2, and others run RL on search-engine API responses → the many rollouts (e.g. search API calls) make training costly
    • ZeroSearch and others imitate search with an LLM API instead of a web search API

Problem Statement

Train a model that can search without external tools such as a search engine or another LLM.

  • How far can internal knowledge alone go on search-style QA?
  • Can search not be learned without calling a search API?
    • Can a model trained only in a full-simulation environment actually search in the real world?

Suggestions

  • Building the self-search dataset
    • seed data: NQ, HotpotQA
    • backbone: LLaMA-3.1-8B-Instruct, Qwen2.5-14B-Instruct
    • process
      1. Feed only the question from the seed data as input
      2. Generate a response annotated with <think>, <search>, <information>, and <answer> tags
        • think: CoT
        • search: search query
        • information: fake search results generated by the model itself
        • answer: final answer
      3. Check whether <answer> is **equivalent** to the dataset's gold answer → used as the **outcome reward**
  • SSRL
    • objectives: GRPO (other policy optimization algorithms were also checked)
    • reward:
      • outcome reward: +1 for a correct answer, -1 for an incorrect one
      • format reward: bonus when the tags are used correctly ( $\lambda_f=0.1$ )
    • methods:
      • (A) Fake search (Self-Search only): no external search calls at all → fast, zero cost
        • Naturally covers only the knowledge stored in the model parameters → likely to miss recent information and rare facts
      • (B) Replace with real search results (Sim2Real): performance improves more reliably.
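
To make the reward and the GRPO step concrete, here is a minimal sketch of how one rollout could be scored and how rewards within a group could be turned into advantages. Several parts are my own assumptions and not taken from the paper: the additive combination `outcome + LAMBDA_F * format`, the "balanced tags, exactly one answer" format check, the normalized exact-match equivalence test, and the use of the population standard deviation. All function names are hypothetical.

```python
import re
import statistics
from typing import List, Optional

TAGS = ("think", "search", "information", "answer")
LAMBDA_F = 0.1  # format-reward weight quoted in the notes


def extract_answer(rollout: str) -> Optional[str]:
    """Return the text of the last <answer>...</answer> block, if any."""
    matches = re.findall(r"<answer>(.*?)</answer>", rollout, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles (assumed equivalence test)."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def format_ok(rollout: str) -> bool:
    """Assumed format check: every tag is balanced and there is exactly one answer."""
    balanced = all(
        rollout.count(f"<{t}>") == rollout.count(f"</{t}>") for t in TAGS
    )
    return balanced and rollout.count("<answer>") == 1


def reward(rollout: str, gold: str) -> float:
    """Outcome reward (+1 / -1) plus an assumed additive format bonus."""
    answer = extract_answer(rollout)
    correct = answer is not None and normalize(answer) == normalize(gold)
    outcome = 1.0 if correct else -1.0
    bonus = LAMBDA_F if format_ok(rollout) else 0.0
    return outcome + bonus


def group_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantage: standardize rewards within one question's group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std is an assumption
    if std == 0.0:
        # All rollouts scored the same, so the group carries no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

In use, one would sample several rollouts for the same question, call `reward` on each against the gold answer, and pass the resulting list to `group_advantages`. Because the `<information>` text is produced by the policy itself, nothing in this loop touches a search API.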

Effects

  • Experiments Setup:
    • metrics: pass@k
    • benchmarks and search engine: Tab 2
  • results:
    1. With self-search, how much does performance improve over LLM-only?
      1. Self-search alone reaches SOTA on some tasks or the best average score (Tab 3)
      2. scaling: self-search performance improves substantially as k grows
      3. On Bamboogle with LLaMA-3.1-8B-Instruct, pass@1=34.9% → pass@1024=87.2%
      4. As k grows, the gain is much steeper than for o1 or Search-R1
    2. Sim2Real: can a model that only learned to pretend to search perform real search? (Tab 4)
      • Better performance with fewer calls than conventional search-RL models such as Search-R1 or ZeroSearch
      • An entropy-based trigger ("search if uncertain") cuts the number of calls by 20–40% while keeping the average score
      • With about 3 searches, results are similar to using all of them
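
Since every result above is reported as pass@k, here is a short sketch of how that metric is commonly computed from n sampled answers per question, of which c are correct. It uses the standard unbiased estimator 1 - C(n-c, k) / C(n, k). Whether the paper uses exactly this estimator is an assumption on my part, and the function names are hypothetical.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n samples of which c are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k wrong samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(counts: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over questions; counts holds one (n, c) pair per question."""
    scores = [pass_at_k(n, c, k) for n, c in counts]
    return sum(scores) / len(scores)
```

A question counts toward pass@k as soon as any one of the k samples is correct. This is why the score can keep rising as k grows even when pass@1 is modest: a larger k measures whether the right answer is reachable from the model's internal knowledge at all.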