Meta info.
  • Authors: Chuanyue Yu, Kuo Zhao, Yuhan Li, Heng Chang, Mingjian Feng, Xiangzhe Jiang, Yufei Sun, Jia Li, Yuzhi Zhang, Jianxin Li, Ziwei Zhang
  • Paper: https://arxiv.org/abs/2507.23581
  • Affiliation: Beihang Univ., HKUST, Huawei, Nankai Univ.
  • Published: July 31, 2025

TL;DR

Trains a GraphRAG agent with RL (GRPO) using two process-constrained rewards (PRA + CAF) > at retrieval time the input is a hybrid of triplets and natural-language text, and the paper reports large gains on multi-hop QA.

Background

  • GraphRAG is gaining attention but still fails on multi-hop QA: it relies on simple similarity-based retrieval or heuristics.
  • RL in RAG: DeepSeek-R1, R1-Searcher, and similar work report that RL can improve think-then-retrieve ability.

Problem Statement

Check whether RL can be applied to GraphRAG > improve multi-hop QA performance

  • Limits of the heuristics used in GraphRAG
  • Risk of reward hacking with outcome-only rewards > likely to lead to shallow retrieval or, conversely, over-thinking
  • Cost limits of long inputs

Suggestions

  • GRPO improvements: w/ Rollout-with-Thinking + Retrieval-Masked Loss
    • Rollout-with-Thinking:
      • Instead of retrieving right away, the model generates …</end_of_query> at a suitable point in its reasoning, when retrieval is needed
      • Run retrieval (the default setup in the paper is HippoRAG2)
      • The retrieved text snippets are inserted between < begin_of_documents >…< end_of_documents > for the next reasoning step
    • Retrieval-Masked Loss
      • Retrieved text snippets are masked and excluded from the gradient computation
      • Optimization applies only to the reasoning the model generated itself
      • Intent: stable training that does not depend on the retriever's external text; the model learns only how to use the retrieved results
  • Reward Design: Process-Constrained Rewards:
    1. format: 0.5 reward when the retrieval-calling format is correct
    2. PRA (Progressive Retrieval Attenuation): the first call earns a large base reward, and each later retrieval call adds an exponentially decayed reward = discourages both shallow retrieval (too few searches) and unbounded searching
    3. CAF (Cost-Aware F1): to prevent over-thinking, the final answer's F1 score is penalized by retrieval count * cost penalty > at equal accuracy, fewer retrieval calls give a higher reward
  • 3-phase training: cold-start SFT (learn retriever calling) > behavior shaping with format + PRA (when and how often to retrieve) > smartness optimization w/ CAF (balance answer accuracy and efficiency)
  • hybrid retrieval: both triples and natural-language text are used, in training and in inference
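
To make the reward shaping and the masked loss above concrete, here is a minimal Python sketch. These notes do not record the paper's exact formulas, so everything here is an assumption chosen only to match the verbal descriptions: the geometric form of `pra_reward`, the exponential attenuation in `caf_reward`, the default values of `base`, `decay`, and `cost`, and the simplified policy-gradient surrogate in `masked_policy_loss` (no clipping or KL term). Only the 0.5 format reward comes from the notes.

```python
import math

FORMAT_REWARD = 0.5  # from the notes: reward for a well-formed retrieval call


def pra_reward(num_calls, base=1.0, decay=0.5):
    """Assumed PRA form: the k-th call adds base * decay**k (k = 0, 1, ...).

    The first call earns the full base reward; later calls earn less and less,
    so the total is bounded by base / (1 - decay).
    """
    return sum(base * decay ** k for k in range(num_calls))


def caf_reward(f1, num_calls, cost=0.1):
    """Assumed CAF form: F1 attenuated per retrieval call, f1 * exp(-cost * n)."""
    return f1 * math.exp(-cost * num_calls)


def masked_policy_loss(token_logprobs, is_retrieved, advantage):
    """Simplified surrogate: -advantage * mean log-prob over model tokens.

    Tokens flagged in is_retrieved (the inserted document snippets) are
    skipped, so they contribute nothing to the loss or its gradient.
    """
    kept = [lp for lp, skip in zip(token_logprobs, is_retrieved) if not skip]
    if not kept:
        return 0.0
    return -advantage * sum(kept) / len(kept)


if __name__ == "__main__":
    # Marginal PRA reward shrinks with every extra call.
    print([round(pra_reward(n), 3) for n in range(1, 5)])
    # Same F1, more calls -> lower CAF reward.
    print(round(caf_reward(0.8, 1), 3), round(caf_reward(0.8, 4), 3))
    # The middle token is a retrieved snippet and is ignored.
    print(masked_policy_loss([-0.2, -3.0, -0.4], [False, True, False], 1.0))
```

Under these assumptions PRA pushes the agent to retrieve at least once while giving diminishing returns for extra calls, and CAF pulls the other way by making each call cost some of the final F1. That matches the division of labor between phases 2 and 3 of the training schedule.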

Effects

  • Experimental Setup: mainly Qwen-2.5-7B with HippoRAG2 as the retriever
    • target datasets: HotpotQA, 2Wiki, MuSiQue, PopQA
    • metrics: F1, SBERT similarity, LLM-as-Judge Accuracy
  • Performance improves on all datasets (Tab 1)
  • ablation
    • Removing PRA: fewer calls, but retrieval depth is shallow and F1 drops
    • Removing CAF: excessive calls with no F1 gain
    • Both rewards are essential. Combined, they let the agent adjust retrieval to task difficulty
      • On Hotpot/PopQA the number of calls decreases
      • On MuSiQue/2Wiki the number of calls increases
    • phase training: the proposed 3-stage schedule beats both skipping the cold start and applying all losses/rewards at once
      • Learning the format and calling convention early helps optimization
    • Hybrid retrieval also gives the best F1 when used in training; raising the triple ratio costs a little F1 but saves a large number of tokens
    • Swapping the backbone or the retriever still shows a robust improvement from the proposed method

Personal note. I have been wondering whether RL can be used to make a model attend to speaker information efficiently, so I am going through notable papers that apply RL to a task. Masking part of the sequence at the optimization step, and designing the loss in multiple tiers so that training proceeds in stages, both look useful.