2 minute read

Meta info.
  • Authors: Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H. Chi, Chi Wang, Shuo Chen, Fernando Pereira, Wang-Cheng Kang, Derek Zhiyuan Cheng
  • Paper: https://arxiv.org/pdf/2511.20857
  • Affiliation: Google DeepMind, UIUC
  • Published: November 25, 2025

TL;DR

Proposes Evo-Memory, a streaming benchmark that evaluates an LLM agent's ability to learn at test time by self-evolving its past experience, along with baselines such as ExpRAG / ReMem, providing a basis for comparative evaluation of experience-reuse-driven performance gains.


Background

  • Even though LLM memory systems have become widespread, research still centers on recalling memories
  • Evaluation of whether recalled memories are actually used well is lacking: test-time learning / self-evolving memory, i.e., whether performance actually improves by retrieving and applying them
    • recall: the ability to simply restate past information
    • (experience) reuse: the ability to apply solution strategies learned in past tasks to new tasks

Problem Statement

Presents a benchmark for evaluating reuse

  • Static memory: append → retrieval; the memory is never restructured or improved by the agent itself (the initial structure is kept)
  • How memory actually changes performance goes unverified
  • Little memory research targets TTL (test-time learning)
  • Research Objective
    1. Present a streaming benchmark spanning diverse domains
    2. Compare multiple memory architectures under the same protocol
    3. Quantitatively measure whether agents truly learn at test time

Suggestions

  • Formalizes memory-augmented agents, unifying RAG, LTM (long-term memory), workflow memory, etc.
    • At time t, given input x_t:
    • R: retrieval module → R_t = R(M_t, x_t)
    • C: context constructor → \tilde{C}_t = C(x_t, R_t)
    • F: base LM → prediction \hat{y}_t = F(\tilde{C}_t)
    • U: memory update pipeline
      • generate an experience m_t = h(x_t, \hat{y}_t, f_t), where f_t is the feedback
      • M_{t+1} = U(M_t, m_t)
  • Redefines datasets as streaming tasks
    • Each benchmark can be converted into a task trajectory \tau = \{(x_1, y_1), \dots, (x_T, y_T)\}, designed so that experience from earlier tasks influences performance on later tasks
  • ExpRAG: Experience-level RAG
    • As a baseline, each experience is stored as structured text → top-k retrieval of past experiences similar to the current task; memory updates are append-only
    • Designed to use past tasks like in-context learning (ICL) examples
  • ReMem: Think–Act–Refine based self-evolving memory agent
    • Extends the ReAct structure with an added Refine (= memory manipulation) action
      • Think: generate internal reasoning
      • Act: call an API/tool or emit the final answer
      • Refine: search, summarize, delete, or restructure memory
    • At each state, the agent picks exactly one of these actions

Effects

  • Tasks
    • single-turn reasoning + QA: MMLU-Pro, GPQA-Diamond, AIME-24/25, ToolBench
    • multi-turn interaction(AgentBoard): AlfWorld, BabyAI, ScienceWorld, Jericho, PDDLBench
  • Backbone: Gemini 2.5 Flash, Flash-lite, Pro / Claude 3.5 Haiku, 3.7 Sonnet
  • Baseline:
    • no-memory: History, ReAct, Amem
    • Adaptive memory: SelfRAG, MemOS, Mem0, LangMem
    • Procedural memory: DC(Dynamic Cheatsheet), AWM
    • Proposed (ours): ExpRecent, ExpRAG, ReMem
  • Metrics:
    • Accuracy, Exact Match
    • Success Rate(S), Progress(P)
    • Average number of steps
    • Task sequence robustness
    • Task similarity correlation
    • Memory pruning ratio
  • Results:
    • Tab1 single-turn
      • ExpRAG and ReMem outperform most baselines on average
      • ReAct actually degrades performance on Gemini Flash and similar backbones
      • ReMem achieves the best or near-best performance regardless of backbone
    • Tab2 multi-turn
      • ReMem yields large gains on AlfWorld, ScienceWorld, and similar environments
      • The longer the horizon needed to complete a task, the larger the benefit of self-evolving memory
    • Fig4, when is memory beneficial: the higher the task similarity (the more homogeneous the repeated tasks), the larger ReMem's gains
    • Fig5: ReMem lowers the average number of steps on every benchmark, i.e., it is more efficient
    • Tab3: ordering tasks easy → hard, or the reverse
      • While the baselines are unstable, ReMem stays highly robust under both orderings
    • Tab4: other memory baselines are vulnerable to noise, but ReMem keeps S/P stable even when failed experiences are mixed in
      • the Refine action appears to filter these out
    • Fig6: as the task index grows, ReMem's cumulative success curve stays consistently above History's, which the authors claim is direct evidence of test-time learning
    • Fig7: the pruning ratio differs by dataset:
      • GPQA ~36.8%: high task diversity demands heavy memory filtering
      • AIME ~10-17%: most experiences remained reusable
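The Fig6-style cumulative success curve is just a running mean of per-task success indicators over the task index; a toy sketch with invented data (not the paper's results):

```python
def cumulative_success(successes):
    """Running success rate after each task in the stream."""
    curve, total = [], 0
    for i, s in enumerate(successes, 1):
        total += s
        curve.append(total / i)
    return curve

# A hypothetical agent that improves as the stream progresses:
# its curve rises with the task index, the signature of test-time learning.
print(cumulative_success([0, 1, 1, 1]))
```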

Personal note. This paper tidied up parts of memory research that had long felt frustrating and ambiguous to me. It was a useful reference when selecting baselines for my current research, and I plan to look further into the details.