LC-LLM RAG: Long-Context LLMs Meet RAG

October 17, 2024 2 minute read

Meta info.

Authors: Bowen Jin, Jinsung Yoon, Jiawei Han, Sercan O. Arik
Paper: https://arxiv.org/pdf/2410.05983
Affiliation: Google Cloud
Published: October 8, 2024

TL;DR

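When using an LC-LLM in RAG, it performs better if you (1) order the retrieved contexts well, (2) fine-tune it on RAG-style data, and (3) add an explicit reasoning step in which it judges whether each context is relevant.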

Problem Statement

When an LC-LLM is used in a RAG system, generation quality drops once the number of retrieved contexts grows too large.

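Research Questions:
1. When an LC-LLM is used in RAG, does performance improve consistently as the amount of retrieved context grows? > No
2. Is the performance bottleneck (observed in RQ1) a limitation of the retriever, or a limitation of the LC-LLM's ability to make effective use of the retrieved contexts? > Probably the LLM
3. (To address this LC-LLM limitation) RAG systems typically aim for high recall, which raises the chance of including hard negatives, and this is the suspected cause.
  - (Does the assumption hold?) How robust are current LC-LLMs to these hard negatives? > Vulnerable
  - (If so) Does the impact of hard negatives depend on the retriever used? > Yes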

Suggestions

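Observations:
- (RQ1) When an LC-LLM is used in RAG, more retrieved context does not always mean better performance, so other factors need to be considered.
  - Figure 1: on NQ, RAG performance follows a concave curve (rising, then falling) with a strong retriever, whereas with a weak retriever it keeps rising or drops only slightly.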
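- (RQ2) The performance bottleneck is a limitation of the LC-LLM.
  - Figure 2: overall RAG accuracy is lower than retrieval recall at every number of retrieved contexts, so the interpretation is that the LC-LLM fails to use the answer even when it is present in the context.
    - In other words, irrelevant contexts (hard negatives) can be critical.
  - With e5 as the retriever, the performance drop as retrieved contexts increase was reportedly larger than with BM25.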
- (RQ3) The importance of hard negatives
  - Figure 3: for every LLM, RAG performance generally decreases as the number of hard-negative contexts increases.
    - LLMs: Gemma-2-9B-Chat, Mistral-Nemo-12B-Instruct, Gemini-1.5-Pro
    - Hard-negative context construction: gold passage (the passage containing the answer) + hard-negative retrieved contexts (from e5, Contriever, BM25, or random sampling)
  - Retriever strength correlates directly with hard-negative difficulty.
    - LLMs are challenged more by hard-negative contexts from a strong retriever (e5) than by contexts from a weak retriever (BM25 or random sampling). (This is expected, but the authors seem to have wanted to show how damaging it is.)
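
The hard-negative setup described for Figure 3 (one gold passage plus passages a retriever ranked highly that do not contain the answer) could be assembled roughly as below. This is only a sketch: the function names, the substring-based answer check, and the random placement of the gold passage are assumptions for illustration, not details taken from the paper.

```python
import random

def contains_answer(passage: str, answers: list[str]) -> bool:
    """Crude relevance check (assumption): case-insensitive substring match."""
    text = passage.lower()
    return any(a.lower() in text for a in answers)

def build_hard_negative_context(gold: str, ranked: list[str],
                                answers: list[str], k: int,
                                seed: int = 0) -> list[str]:
    """Return the gold passage mixed with up to k hard negatives.

    `ranked` is a retriever's output, best first. Passages that do not
    contain any answer string are treated as hard negatives.
    """
    negatives = [p for p in ranked
                 if p != gold and not contains_answer(p, answers)][:k]
    contexts = negatives + [gold]
    # Assumption: the gold passage is placed at a random position.
    random.Random(seed).shuffle(contexts)
    return contexts

ranked = [
    "Paris is the capital of France.",
    "Lyon is a city in France.",
    "France borders Spain.",
    "Marseille is a port city.",
]
contexts = build_hard_negative_context(ranked[0], ranked, ["Paris"], k=2)
```

Swapping the source of `ranked` (e5, Contriever, BM25, or a random sample of the corpus) is what would vary the difficulty of the negatives in such a setup.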
Methods:
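1. Reranking (reordering) to mitigate lost-in-the-middle: arrange the contexts as [Instruction, rank_1, rank_3, … rank_4, rank_2], so that the highest-ranked contexts sit at the beginning and the end of the input.
2. Fine-tuning for implicit robustness: handling noisy retrieved contexts is not something the model learns during pretraining, so it has to be fine-tuned (for robustness to hard negatives).
3. Fine-tuning for explicit robustness: the LLM should additionally perform intermediate reasoning in which it explicitly judges whether each context is relevant. (This also requires tuning.)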
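
The ordering in method 1 can be sketched as follows: odd-ranked passages fill the input from the front and even-ranked passages fill it from the back, which reproduces the [rank_1, rank_3, … rank_4, rank_2] pattern. The function name is hypothetical, and this is one reading of the pattern rather than the authors' code.

```python
def reorder_for_lost_in_the_middle(ranked: list[str]) -> list[str]:
    """Place top-ranked passages at both ends of the context.

    `ranked` is ordered best first (rank_1, rank_2, ...).
    """
    front = ranked[0::2]   # rank_1, rank_3, rank_5, ...
    back = ranked[1::2]    # rank_2, rank_4, rank_6, ...
    return front + back[::-1]  # ..., rank_4, rank_2 at the end

ordered = reorder_for_lost_in_the_middle(["r1", "r2", "r3", "r4", "r5"])
print(ordered)  # ['r1', 'r3', 'r5', 'r4', 'r2']
```

The lowest-ranked passages end up in the middle of the input, which is the position the lost-in-the-middle effect penalizes most.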

Effects

(Suggestion 1): reranking helps more as the number of retrieved contexts grows (Figure 4)
- Gemma-2-9B-Chat and Mistral-Nemo-12B-Instruct were tested on NQ / PopQA with contexts retrieved by BM25 or e5.
- It appears to mitigate lost-in-the-middle and to allow hard-negative contexts to be handled strategically.
  - In other words, this can be read as underscoring the importance of an engineering-oriented approach to RAG.
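(Suggestion 2): fine-tuning for implicit robustness is effective (Figure 5)
- After RAG-style tuning on NQ, WoW, Fever, MMLU, and similar datasets, evaluation on QA sets unseen during tuning showed large performance gains.
  - The gains were consistently larger than those from tuning directly on that QA set.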
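
As a rough illustration of what "RAG-style tuning" data might look like, the sketch below packs an instruction, the retrieved passages (which may include hard negatives), and the question into a prompt, with the answer as the completion. The template wording, the field names `prompt` and `completion`, and the function name are all assumptions; the paper's exact format is not reproduced here.

```python
def format_rag_example(question: str, passages: list[str], answer: str,
                       instruction: str = "Answer the question using the retrieved passages.") -> dict:
    """Build one supervised fine-tuning record (hypothetical template)."""
    # Number the passages so the model can tell them apart.
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    prompt = f"{instruction}\n\n{context}\n\nQuestion: {question}\nAnswer:"
    return {"prompt": prompt, "completion": " " + answer}

example = format_rag_example(
    "Who wrote Hamlet?",
    ["Hamlet is a tragedy by William Shakespeare.",
     "Macbeth is a Scottish play."],
    "William Shakespeare",
)
```

Because the noisy passages appear in the prompt while only the answer is supervised, the model is pushed to ignore irrelevant context without ever being told which passage is which, hence "implicit" robustness.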
(Suggestion 3): having the model explicitly judge relevance improves final performance (Figure 6)
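
One way to picture the explicit variant: the training target contains an intermediate step that names the relevant passages before the final answer. The sketch below is an assumption about the shape of such a target; the labels "Relevant passages", "Reasoning", and "Final answer" and the function name are hypothetical, not the paper's template.

```python
def format_reasoning_target(relevant_ids: list[int], rationale: str, answer: str) -> str:
    """Build a target that states relevance first, then the answer."""
    # Fall back to "none" when no passage is judged relevant.
    ids = ", ".join(f"[{i}]" for i in relevant_ids) or "none"
    return (f"Relevant passages: {ids}\n"
            f"Reasoning: {rationale}\n"
            f"Final answer: {answer}")

target = format_reasoning_target(
    [1, 3],
    "Passage 1 names the author; passage 3 confirms it.",
    "William Shakespeare",
)
```

At inference time the text after "Final answer:" would be parsed out as the prediction, with the preceding lines serving as the intermediate reasoning.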

Personal note. RAG papers now tend to come out of Google Cloud rather than Google Research; perhaps that means the field has moved closer to engineering…?