2 minute read

Meta info.
  • Authors: Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky
  • Paper: https://arxiv.org/pdf/2410.04343
  • Affiliation: Google DeepMind
  • Published: October 6, 2024

TL;DR

LM์˜ RAG inference ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์œ„ํ•œ scaling ์ „๋žต์„ ์ œ์•ˆํ•˜๊ณ , ์œ ํšจ ์ปจํ…์ŠคํŠธ ๊ธธ์ด์˜ ๊ทœ๋ชจ์™€ RAG ์„ฑ๋Šฅ ๊ฐ„์— ์„ ํ˜•์ ์ธ ๊ด€๊ณ„๊ฐ€ ์žˆ์Œ์„ ํ™•์ธ


Background

๋ชจ๋ธ์ด ๋ฐ›์•„๋“ค์ผ ์ˆ˜ ์žˆ๋Š” ๊ธธ์ด๊ฐ€ ๊ธธ๋‹ค๊ณ  ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์žฅํ•˜์ง€๋Š” ์•Š๋Š”๋‹ค.

  • LC-LLM์ด๋”๋ผ๋„ ์—ฌ์ „ํžˆ ๊ทธ ๊ธด context๋ฅผ ์ถฉ๋ถ„ํžˆ ํ™œ์šฉํ•˜์ง€ ๋ชปํ•˜๊ณ  ์žˆ์Œ.
  • retrieved context๊ฐœ์ˆ˜๊ฐ€ ์ผ์ • ์ˆ˜์ค€ ์ด์ƒ์ด๋ฉด ์„ฑ๋Šฅ ํ–ฅ์ƒ๋˜์ง€ ๋ชปํ•˜๊ณ  ์‹ฌ์ง€์–ด๋Š” ์ €ํ•˜๋˜๋Š” ๋ฌธ์ œ ๋ณด๊ณ 

Problem Statement

LC-LLM์ด RAG system์—์„œ ์ปจํ…์ŠคํŠธ ์ •๋ณด๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ํš๋“ํ•˜๊ณ (์ถ”๊ฐ€ํ•˜๊ณ ) ํ™œ์šฉํ•˜๋Š” scaling ๋ฐฉ๋ฒ•์ด ์žˆ๋Š”๊ฐ€?

  • ๋‹จ์ˆœํžˆ input ๊ธธ์ด ํ™•์žฅํ•˜๋Š” ๊ฒƒ ์ด์ƒ์œผ๋กœ RAG ์ถ”๋ก ์„ ์œ„ํ•ด ํ•„์š”ํ•œ ์ „๋žต ํƒ๊ตฌ์˜ ํ•„์š”์„ฑ
  • Research Question
    1. ์ตœ์  ๊ตฌ์„ฑ์‹œ, inference computation์˜ scaling์€ RAG ์„ฑ๋Šฅ์— ์–ด๋–ค ์ด์ ์ด ์žˆ๋Š”๊ฐ€?
    2. RAG ์„ฑ๋Šฅ๊ณผ inference parameters๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ๋ชจ๋ธ๋งํ•˜์—ฌ, ์ฃผ์–ด์ง„ ์˜ˆ์‚ฐ์— ๋Œ€ํ•œ ์ตœ์ ์˜ ํ…Œ์ŠคํŠธ ์‹œ๊ฐ„ compute allocation์„ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ์„๊นŒ?
      • inference parameters: ๊ฒ€์ƒ‰ ๋ฌธ์„œ ๊ฐœ์ˆ˜(k), context์— demonstration ์ˆ˜(m), ์ƒ์„ฑ ๋ฐ˜๋ณตํšŸ์ˆ˜(n) ๋“ฑ
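As a rough illustration, the inference parameters (k, m, n) jointly determine the effective context length, i.e., the total input tokens across all LLM calls. The per-document and per-demonstration token counts below are illustrative assumptions, not values from the paper:

```python
def effective_context_length(k, m, n, doc_tokens=200, demo_tokens=400, query_tokens=50):
    """Total tokens fed to the LLM across all n generation iterations.

    Assumes (hypothetically) a fixed token cost per retrieved document,
    per in-context demonstration, and per query.
    """
    per_call = k * doc_tokens + m * demo_tokens + query_tokens
    return n * per_call

# A single-call (vanilla-RAG-style) configuration vs. an iterative one:
single = effective_context_length(k=50, m=0, n=1)     # 10,050 tokens
iterative = effective_context_length(k=50, m=8, n=5)  # 66,250 tokens
print(single, iterative)
```

Scaling any of k, m, or n grows the effective context length, which is why the paper treats it as the unifying axis for comparing strategies.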

Proposed Methods

  • Inference Scaling Strategies for RAG
    • DRAG (Demonstration-Based RAG): supplies a sufficient number of RAG demonstrations so the model can learn in an ICL style
    • IterDRAG (Iterative Demonstration-Based RAG): an inference chain for multi-hop queries that repeatedly decomposes the question into sub-questions → retrieves → generates
  • Inference scaling laws: quantifies the correlation between RAG performance and inference computation scale
    • Importance of the effective context length (effective input context length to the LLM)
      • The total number of tokens fed to the LLM across all iterations before it outputs the final answer
      • Trade-off between the benefit of longer context and the increased compute cost: captures how efficiently contextual information can be utilized within a given compute budget
      • Vanilla RAG makes a single call by default: effective context length = prompt length (bounded by the LLM's max input length)
      • Iterative methods such as the proposed ones: the effective context length can be extended (virtually without bound)
    • Computation allocation model: predicts the optimal inference parameters for RAG under constraints (Section 5.1)
      • By observing RAG performance and inference computation scale across combinations of inference parameters → the optimal allocation can be determined
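The IterDRAG decompose → retrieve → generate loop can be sketched as follows. This is a minimal, hedged reconstruction, not the paper's implementation: `llm()` and `retrieve()` are placeholder stand-ins, and the stopping marker ("final answer is") is an illustrative convention.

```python
def retrieve(query, k=5):
    # Placeholder: a real system would call a retriever and return top-k documents.
    return [f"doc for: {query}"] * k

def llm(prompt):
    # Placeholder: a real system would call the backbone LLM (e.g., Gemini 1.5 Flash).
    # Here we emit a final answer immediately so the sketch stays runnable.
    return "So the final answer is: <answer>"

def iter_drag(question, max_iterations=5):
    """Interleave sub-query generation, retrieval, and intermediate answers."""
    context = retrieve(question)            # initial retrieval for the main question
    history = []                            # (sub-query, sub-answer) pairs so far
    for _ in range(max_iterations):
        prompt = "\n".join(
            context
            + [f"Q: {q}\nA: {a}" for q, a in history]
            + [f"Question: {question}"]
        )
        output = llm(prompt)
        if "final answer is" in output:     # model decides to stop and answer
            return output.split("final answer is:")[-1].strip()
        sub_query = output                  # otherwise treat the output as a sub-query
        context += retrieve(sub_query)      # retrieve documents for the sub-query
        history.append((sub_query, llm(prompt + "\n" + sub_query)))
    return llm(question)                    # fall back to answering directly
```

Each loop iteration adds another prompt's worth of tokens, which is how the effective context length grows beyond the single-call limit of vanilla RAG.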

Experiments & Results

  • Experimental setup:
    • Datasets: multi-hop, knowledge-intensive QA and single-hop QA benchmarks, etc.
      • Bamboogle, HotpotQA, MuSiQue, 2WikiMultiHopQA
    • Backbone: Gemini 1.5 Flash
      • Baselines compared: zs-QA, ms-QA, vanilla RAG, DRAG, IterDRAG
      • Effective context lengths: 16k, 32k, 128k, 1M, 5M, etc.
  • Results
    • Both proposed methods, DRAG and IterDRAG, achieve SOTA
      • Compared to vanilla RAG, the DRAG and IterDRAG strategies improve performance by up to 58.9% on the QA benchmarks
      • IterDRAG consistently outperforms CoT baselines
    • The computation allocation model reaches 96.6% of the optimal performance when generalized to unseen domains
      • Under optimal allocation, RAG performance scales almost linearly as inference computation increases (i.e., as the effective context length predicted by the proposed computation allocation model grows), rather than from simply lengthening the retrieved documents
    • Beyond 1M tokens the performance gains diminish → likely a limitation of current LC-LLMs
    • Errors still occur when retrieval is wrong or the reasoning process is incomplete