
Meta info.
  • Authors: Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, Zhenhua Dong
  • Paper: https://arxiv.org/pdf/2506.21605
  • Affiliation: Huawei, Renmin Univ.
  • Published: June 20, 2025

TL; DR

Introduces MemBench, a benchmark for evaluating the memory of LLM-based agents. It integrates multi-scenario (participation & observation) and multi-level (factual & reflective) memory types and uses multi-metric evaluation.


Background

LLM-based memory has so far been evaluated with annotation-based or task-based metrics, mostly on factual memory in the participation setting, so scenarios involving observation or reflective memory have not been considered. (Evaluation has not gone beyond accuracy.)

  • Recent work such as LongMemEval and LoCoMo is still limited to factual memory and does not simulate reflective reasoning or a passive agent.

Problem Statement

A benchmark should cover all three of the following:

  • include passive observation as well as active participation,
  • include reflective memory in addition to factual memory, and
  • evaluate accuracy, recall, efficiency, memory capacity, and so on
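To make the metric requirement concrete, here is a minimal sketch of two of the listed metrics. The function names and the exact definitions are assumptions for illustration (accuracy as the share of multiple-choice questions answered correctly, recall as the share of gold evidence turns that the memory module returned); the paper's own definitions may differ.

```python
from typing import Sequence, Set


def mcq_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of multiple-choice questions answered with the gold option."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def evidence_recall(retrieved_ids: Set[int], evidence_ids: Set[int]) -> float:
    """Fraction of gold evidence turns found among the retrieved turns.

    Assumed definition for illustration only.
    """
    if not evidence_ids:
        return 0.0
    return len(retrieved_ids & evidence_ids) / len(evidence_ids)


if __name__ == "__main__":
    print(mcq_accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"]))
    print(evidence_recall({1, 2, 5}, {2, 5, 9, 10}))
```

Efficiency and memory capacity are left out here because they depend on timing and on how the long-context sub-dataset is consumed.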

Suggestions

  • Dataset construction strategy #1, user relation graph: the user plus entities such as related people, places, events, and items
    • Built as the basis for generating the factual and reflective information needed in the dialogue context
    • Each entity has its own set of properties
    • User preferences are included to support reflective memory
      • Uses recommender-system datasets (real public recommendation data such as MovieLens, Food, Goodreads)
      • Liked or highly rated items > an LLM extracts high-level preferences
      • Builds a 1:N dictionary between high-level preferences ↔ low-level factual attributes
      • Example: likes ‘Salted Maple Ice Cream’, ‘Pecan Praline’, etc. > preference: ‘Sweet and Salty’ > “Sweet” : [Apple Pie, Pecan Pie, Honey…]
  • Dataset construction strategy #2, Dialogue Session + QA pair
    • Observation: the agent only listens to the user's messages and does not respond
      • Messages are generated as simple declarative statements (via LLM rewriting)
      • input: “I’ll go to the Build Start event next week” > rewrite: “My Build Start 2024 is happening next week on Monday at 7:00 PM.”
    • Participation: multi-turn dialogues between user and agent are generated by self-dialogue
      • The assistant does not know the answer and only gives (non-informative) reactions (generated with scenario-based prompts)
      • Key evidence sentences are inserted mid-dialogue: the dialogues are designed to reflect various reasoning types (used for the QA pairs)
      • Sessions are split by time gaps:
        • short gaps within a session (on the order of 1 minute)
        • long gaps between sessions (e.g., one day)
    • QA Pair: single-hop/multi-hop, comparison, aggregation, emotion summarization, etc.
      • Designed so that the answer can be derived from the dialogue containing the pre-inserted evidence
      • Converted to MCQ so that accuracy is easier to evaluate
  • Evaluation sub-datasets
    • Sub-dataset 1 (general testing): 10k tokens/session on average
    • Sub-dataset 2 (long-term memory testing): 100k tokens/session on average
    • Noise injection: at evaluation time, irrelevant information (news dialogue sessions) is inserted into some sessions to test capacity limits (memory retention)
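Two of the construction steps above are easy to picture in code: the 1:N preference dictionary and the time-gap based session split. The following is a minimal sketch under stated assumptions. The names `PREFERENCE_TO_ITEMS`, `items_for_preference`, and `assign_timestamps` are hypothetical, the dictionary holds only the example items from the notes, and the one-minute and one-day gaps are the illustrative values mentioned above rather than the paper's exact settings.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Sequence, Tuple

# Hypothetical 1:N mapping from a high-level preference to low-level items,
# filled with the example from the notes (not taken from the released data).
PREFERENCE_TO_ITEMS: Dict[str, List[str]] = {
    "Sweet": ["Apple Pie", "Pecan Pie", "Honey"],
}


def items_for_preference(preference: str) -> List[str]:
    """Return the low-level items linked to a high-level preference."""
    return PREFERENCE_TO_ITEMS.get(preference, [])


def assign_timestamps(
    sessions: Sequence[Sequence[str]],
    start: datetime,
    turn_gap: timedelta = timedelta(minutes=1),
    session_gap: timedelta = timedelta(days=1),
) -> List[Tuple[datetime, str]]:
    """Stamp messages with short gaps inside a session and long gaps between sessions."""
    stamped: List[Tuple[datetime, str]] = []
    current = start
    for session in sessions:
        for index, message in enumerate(session):
            if index > 0:
                current += turn_gap  # short gap between turns of one session
            stamped.append((current, message))
        current += session_gap  # long gap before the next session starts
    return stamped


if __name__ == "__main__":
    demo = assign_timestamps([["a", "b"], ["c"]], datetime(2024, 1, 1, 9, 0))
    for timestamp, message in demo:
        print(timestamp.isoformat(), message)
```

Keeping the gaps as parameters makes it straightforward to try other spacings, for example when interleaving unrelated noise sessions.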

Effects

  • Evaluation method: the agent receives the dialogue turn by turn in chronological order and memorizes it as it goes
    • In each test, for a question Q, everything up to round t−1 is accessible only through the memory module; new information is given at the current round t
    • In other words, if the memory module works properly the agent retrieves the earlier information, and if it fails the agent is expected to answer as though it had forgotten it
      • A correct answer to Q counts as the memory being retained; anything else counts as a retention failure
  • Baselines
    • FullMemory: remembers the entire dialogue; fast but limited by the context window
    • RecentMemory: remembers only recent information; the simplest option but with a small window size
    • RetrievalMemory: embedding-based retrieval, so it depends on retrieval quality
    • GenerativeAgent: simulation-based inferential memory, conversational memory reconstruction
    • MemoryBank: uses external memory to handle LC, but writes are slow
    • MemGPT: operating-system-style memory control; reads are slow
    • SCMemory: self-controlled memory; reported to be unstable
  • Results
    • Factual Memory (Tab3)
      • RetrievalMemory always performs best and is the most stable, especially in the Observation scenario
      • FullMemory and RecentMemory drop sharply as the token count grows, because evidence is pushed outside the window size
      • MemGPT has low accuracy but handles long context better than average
    • Reflective Memory (Tab4): can the agent remember higher-level context that is not stated directly in the dialogue (preferences, mood, etc.)?
      • Preference memory is handled well by most mechanisms, but emotion memory is hard for all of them
      • GenerativeAgent is strong at reflective memory (simulation-based)
      • MemoryBank is strong only in the Observation scenario, and it is slow
    • Memory Capacity (Fig5): uses the Observation scenario of Sub-dataset 2, where the evidence always appears first and the question later
      • Retrieval-based methods lose accuracy gradually
      • Structured mechanisms such as SCMemory or MemGPT degrade sharply past a certain token count > this can be read as sustained performance depending on how well the memory design scales
    • Backbone LLM comparison (Tab5): Qwen2.5-7B-Instruct, GPT-4o-mini, Meta-LLaMA-3.1-8B, glm-4-9b-chat
      • GPT-4o-mini is particularly strong on reflective memory
      • Meta-LLaMA is weak on factual memory but fairly good on reflective memory
      • glm performs poorly overall and is especially weak on factual memory
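The evaluation protocol and the window-size failure mode can be sketched with a toy example. This is not the paper's code: `RecentMemory` here is a simplified stand-in for the baseline of the same name, `evaluate_retention` and `make_keyword_answerer` are hypothetical helpers, and the keyword-matching answerer replaces the LLM that would normally pick an MCQ option.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence


class RecentMemory:
    """Toy baseline that keeps only the last `window` messages (assumed design)."""

    def __init__(self, window: int) -> None:
        self._buffer: Deque[str] = deque(maxlen=window)

    def write(self, message: str) -> None:
        self._buffer.append(message)  # the oldest message is dropped when full

    def read(self, query: str) -> List[str]:
        return list(self._buffer)  # the query is ignored; no retrieval step


def make_keyword_answerer(
    keyword: str, if_found: str, otherwise: str
) -> Callable[[str, Sequence[str]], str]:
    """Stand-in for an LLM: picks an option based on whether the evidence is visible."""

    def answer(question: str, context: Sequence[str]) -> str:
        return if_found if any(keyword in m for m in context) else otherwise

    return answer


def evaluate_retention(
    memory: RecentMemory,
    history: Sequence[str],
    question: str,
    gold_option: str,
    answer_fn: Callable[[str, Sequence[str]], str],
) -> bool:
    """Feed past turns in order, then answer using only what the memory returns."""
    for message in history:
        memory.write(message)
    context = memory.read(question)
    return answer_fn(question, context) == gold_option


if __name__ == "__main__":
    history = [
        "My Build Start 2024 is happening next week on Monday at 7:00 PM.",
        "The weather was nice today.",
        "I had pasta for lunch.",
    ]
    answerer = make_keyword_answerer("Monday at 7:00 PM", "A", "B")
    for window in (3, 2):
        kept = evaluate_retention(
            RecentMemory(window), history, "When is Build Start 2024?", "A", answerer
        )
        print(window, kept)
```

With a window of 3 the evidence turn is still in memory, while with a window of 2 it has been pushed out, which mirrors the sharp drop noted for window-bound baselines as the token count grows.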

Personal note. The main takeaway for me is that the benchmark was designed so that evaluation can be kept simple. I have long felt how hard it is to evaluate on the agent's responses directly, and this was a chance to reconfirm that building indirect queries (QA pairs) alongside the dialogues to get around that problem is the general trend. The paper has been public for less than two weeks, yet the released repository shows a last commit from as long as half a year ago; it has no documentation yet, README included, so the data is something to look into over time. Also, accounting for non-informative responses (reactions) does feel close to everyday conversation, but the construction method itself looks dated, so I will think about whether there is room for improvement.