
Meta info.
  • Authors: Junfeng Liao, Qizhou Wang, Jianing Zhu, Bo Du, Rui Yan, Xiuying Chen
  • Affiliation: MBZUAI, RIKEN AIP, UT Austin, Wuhan University
  • Paper: https://arxiv.org/abs/2605.05583 (arXiv:2605.05583v2)
  • Published: May 8, 2026 (arXiv preprint)

TL;DR

The paper argues that agent memory collapses observations into a single deterministic conclusion and thereby creates self-reinforcing errors. It proposes BeliefMem, which keeps candidate conclusions together with probabilities (beliefs), updates them with noisy-OR, and returns the whole distribution at retrieval time.

  • Figure 1: Deterministic memory vs. BeliefMem (API timeout example)
  • Figure 2: BeliefMem overview (Update / Retrieval / Action)
  • Table 1: LoCoMo results (F1 / BLEU-1)
  • Table 2: ALFWorld results (SR / #Steps)
  • Figure 4: Corpus-size robustness + belief convergence
  • Figure 3: Adversarial memory correction
  • Table 3: Ablation studies
  • Figure 5: Average token consumption

Background

  • When an LLM agent carries out long-horizon / multi-session tasks, it relies on persistent external memory to accumulate knowledge across sessions (Hu et al., 2025)
  • Two existing memory families: they differ only in storage management and retrieval strategy. At the representation level, both store every entry as a single categorical (deterministic) conclusion inferred from noisy/ambiguous observations, so every operation is all-or-nothing
    • factual memory: records observations about the user/environment as structured entries (what was seen)
    • self-improving memory: distills actionable lessons from past experience (what was learned)
      • Reflexion (Shinn et al., 2023): generates self-corrective guidance from failed attempts
      • ExpeL (Zhao et al., 2024): aggregates recurring patterns across trajectories into insights
      • Voyager (Wang et al., 2023) / MemSkill (Zhang et al., 2026): grow a reusable skill library
    • RL-based memory (Memory-R1, MEM1, Agentic Memory, MemRL): replaces add/update/delete with a learned policy
  • POMDP / belief state background: the agent setting is essentially a POMDP (Kaelbling et al., 1998)
    • POMDP: the agent cannot observe the true state of the world directly and receives only partial, noisy observations such as user messages and tool outputs
      • In a POMDP, uncertainty is represented by the belief state, a probability distribution over hidden states
    • Recent work views LLM agents under partial observability (Belief Engine, CoBelWorld, PABU, etc.)
      • Memory systems ignore this implication: they equate observations with ground truth and collapse the uncertainty into a single conclusion

Problem States

The representation used by deterministic memory is itself at odds with partial observability, so errors accumulate and amplify over time.

  • Deterministic Bottleneck: for each latent attribute, only a point estimate (the single most plausible hypothesis) is stored, and the remaining candidates and their probabilities are discarded → the uncertainty that the full belief carried is lost
  • Self-Reinforcing Error: the agent bases its actions on the stored single conclusion → it never takes actions that would test the discarded alternatives → it only collects further observations consistent with the wrong conclusion → that conclusion is reinforced over time
    • e.g. Fig 1: API X times out 3 times → “API X failed” is stored → no retry in later sessions → the possibility that it was temporary rate limiting is never observed (= self-reinforcing error)
    • Update-based methods are still limited: a correction only swaps in another single conclusion, which the next transient error would flip right back
  • Irrecoverability: once the belief has collapsed to a point estimate, the posterior support of the discarded alternatives cannot be reconstructed from memory alone and has to be rebuilt from scratch with entirely new evidence
  • → Requirements that follow
    • [storage stage] a representation that preserves candidate conclusions and their uncertainty instead of discarding them
    • [retrieval stage] a mechanism that exposes that uncertainty back to the agent, so that alternatives are visible at decision time
    • [update rule] incremental probability updates on new observations (strengthen strong conclusions, weaken weak ones)

Suggestions

Formal definitions: belief state and the limits of deterministic memory

  • POMDP setting: at every step the agent receives an observation $o_t$ and chooses an action $a_t$, but it cannot directly see the true state of the world $s_t$.
    • Whether API X is dead or only temporarily blocked ($s_t$) is hidden; the observed timeout ($o_t$) is merely a trace of it
  • belief state $b_t$: the posterior distribution over hidden states, conditioned on the observation/action history so far
    • Aggregating everything seen so far gives the probability of each possible true state → this single distribution is all the information needed for optimal action (a sufficient statistic)
    • Existing external memory is, in the end, also a device that compresses and imitates the inaccessible $b_t$: it is queried with an observation (Read) to choose an action and updated when a new observation arrives (Update)
  • attribute $c$ = the unit of fact being tracked (user preference, tool status, object-location, etc.)
    • The conclusions that the attribute can take are the candidate hypotheses
    • The belief in candidate $h$ is written $b^{(c)}_t(h)$
  • deterministic memory stores only the single highest-probability candidate and drops the rest; ideal memory stores the whole distribution over candidates (sum = 1)
\[\text{deterministic: } \hat h_t(c) \in \arg\max_{h \in H(c)} b^{(c)}_t(h) \qquad \text{ideal: } b^{(c)}_t,\ \textstyle\sum_{h} b^{(c)}_t(h) = 1\]
  • e.g. failed 0.5 / rate-limit 0.35 / network 0.15
    • deterministic: keeps only “failed”; uncertainty that has been discarded once cannot be recovered from memory alone
    • ideal: keeps everything and updates it continuously
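
The contrast between the two storage rules can be made concrete with the example numbers above. This is a minimal sketch, not the paper's code: the attribute, the `collapse` helper, and the variable names are hypothetical and exist only for illustration.

```python
# Hypothetical belief over one attribute ("status of API X"),
# using the example numbers from the notes above.
belief = {"failed": 0.5, "rate-limit": 0.35, "network": 0.15}


def collapse(b):
    """Deterministic memory: keep only an argmax hypothesis."""
    return max(b, key=b.get)


stored_deterministic = collapse(belief)  # a single label, no probabilities
stored_ideal = dict(belief)              # the whole distribution

# Probability mass that the deterministic store can no longer represent.
discarded_mass = 1.0 - belief[stored_deterministic]

print(stored_deterministic)      # failed
print(round(discarded_mass, 2))  # 0.5
```

In this example half of the probability mass sits on alternatives that the deterministic store has no way to express afterwards.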

Belief Memory: what is stored? (Representation)

  • Limits of the ideal distribution → use an approximate belief instead of the exact distribution: the stored value is not a rigorous posterior but a confidence score used for ranking and updating
    • Candidates are not fixed in advance and new ones appear during the conversation, which makes a normalized distribution hard to define
    • Updating every candidate on every observation is inefficient in cost
  • (Approximation 1) $H_{\text{sub}}(c)$: store only candidates that the evidence has supported at least once, not every possible candidate
    • Storage scales with the evidence seen (not with the total number of candidates); unseen candidates cost nothing
  • (Approximation 2) confidence $p^{(c)}_t(h)$: maintained independently per candidate, not normalized to sum = 1
    • Each number is a confidence in how strongly the evidence supports that one candidate considered on its own
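
One way to picture this representation is a per-attribute record that holds only the candidates seen so far, each with its own unnormalized confidence. The sketch below is an assumption about the shape of such a record, not the paper's data structure; `BeliefEntry`, its fields, and the example values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BeliefEntry:
    """Hypothetical record for one attribute c."""
    attribute: str
    # H_sub(c): only candidates that some evidence has supported.
    candidates: dict[str, float] = field(default_factory=dict)

    def confidence(self, h: str) -> float:
        # Returning 0.0 for an unstored candidate is a convention of this
        # sketch; it means "no evidence seen yet", not "ruled out".
        return self.candidates.get(h, 0.0)


entry = BeliefEntry("API X status", {"failed": 0.8, "rate-limit": 0.7})
total = sum(entry.candidates.values())

print(total > 1.0)                    # True: confidences are not normalized
print("network" in entry.candidates)  # False: unseen candidates cost nothing
```

Because each confidence is kept independently, adding a new candidate later does not require touching the existing ones.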

Belief-aware Memory Update: Add / Merge / Contradiction (Fig 2)

  • When a new observation arrives, the LLM extracts candidate conclusions with a confidence for each, and each one is handled by one of three operations
  • Add (attribute seen for the first time): register a new entry; the initial confidence starts in the range $[0.7, 0.9]$ (so that it is never 0 or 1)
  • Merge (the observation supports an attribute that already exists): raise the existing confidence with noisy-OR
\[p^{(c)}_{t+1}(h) = \min\!\left(1 - \big(1 - p^{(c)}_t(h)\big)\big(1 - \Delta(o_{t+1}, h)\big),\ 0.99\right)\]
  • As supporting evidence comes in, the confidence only goes up (monotonically non-decreasing) and stops at 0.99
  • $\Delta$: how strongly the new observation supports $h$
    • It is a confidence assigned by the LLM, not a calibrated likelihood
  • The pre-update version is archived, to answer temporal queries such as “what did we believe in an earlier session?” (Appendix B.3)
  • Contradiction (the observation supports the opposite conclusion for the same attribute): demote the confidence of the existing $h$ to 0.25
    • Contradiction detection is rule-based keyword matching; the previous value is preserved
    • Because noisy-OR only moves upward, downward moves are handled exclusively by this rule, which makes the design asymmetric
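
The three operations can be sketched in a few lines. The constants (0.7, 0.9, 0.99, 0.25) and the noisy-OR formula come from the notes above. The function names, the dict-based memory, the clipping of the LLM's raw confidence into $[0.7, 0.9]$, and the way the contradicting conclusion is registered as a further candidate are assumptions made for illustration, not the authors' implementation.

```python
# Constants taken from the notes above; everything else is illustrative.
P_MIN, P_MAX = 0.7, 0.9   # range for the initial confidence
CAP = 0.99                # upper bound applied after noisy-OR
DEMOTED = 0.25            # confidence assigned on contradiction


def add(memory, c, h, llm_conf):
    """Add: register candidate h under attribute c.
    Assumption: the LLM's raw confidence is clipped into [P_MIN, P_MAX]."""
    memory.setdefault(c, {})[h] = min(max(llm_conf, P_MIN), P_MAX)


def merge(memory, archive, c, h, delta):
    """Merge: noisy-OR with supporting strength delta, capped at CAP."""
    p = memory[c][h]
    archive.append((c, h, p))  # keep the pre-update version
    memory[c][h] = min(1 - (1 - p) * (1 - delta), CAP)


def contradict(memory, archive, c, h):
    """Contradiction: demote the existing candidate, preserving the old value."""
    archive.append((c, h, memory[c][h]))
    memory[c][h] = DEMOTED


memory, archive = {}, []
add(memory, "api_x", "failed", 0.95)            # clipped to 0.9
merge(memory, archive, "api_x", "failed", 0.8)  # 1 - 0.1 * 0.2 = 0.98
after_first_merge = memory["api_x"]["failed"]
merge(memory, archive, "api_x", "failed", 0.8)  # 0.996, capped at 0.99
after_second_merge = memory["api_x"]["failed"]
contradict(memory, archive, "api_x", "failed")  # demoted to 0.25
add(memory, "api_x", "rate-limit", 0.8)         # assumed: new candidate enters via add
```

The walk-through shows the asymmetry: two supporting observations push “failed” up to the cap, and a single contradiction is the only thing that brings it back down.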

Belief-aware Retrieval

  • Retrieval design: built on top of storage that preserves the candidate distribution
    • Preserving uncertainty in storage is pointless if retrieval then compresses it into one conclusion, so retrieval also operates at the level of distributions
    • Per-entry score:
\[\alpha_t(c) = \text{sim}(o_t, c) \cdot \lambda^{\tau_t(c)}, \quad \lambda \in (0, 1]\]
  • (how relevant is it to the current observation) × (how recently was it updated): $\text{sim}$ is a hybrid of embedding 0.7 + lexical 0.3, and $\lambda^{\tau}$ is a decay that lowers the score more the longer an entry has gone untouched
  • Staleness only lowers retrieval priority and leaves the confidence values inside the entry untouched (recency ≠ belief)
  • Read: pick the top K entries by score and output each attribute's candidate distribution in full
    • Right before deciding, the agent sees all alternatives as a distribution, such as failed 0.1 / rate-limit 0.8
    • Unlike deterministic memory, alternatives are not erased at storage time, so alternative actions such as a retry can be attempted again
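
The scoring and Read step above can be sketched as follows. The 0.7 / 0.3 weights and the $\text{sim} \cdot \lambda^{\tau}$ score come from the notes. The entry layout, the precomputed similarity values, and measuring staleness $\tau$ in whole steps since the last update are assumptions of this sketch.

```python
def hybrid_sim(emb_sim, lex_sim):
    """Hybrid similarity: 0.7 * embedding + 0.3 * lexical."""
    return 0.7 * emb_sim + 0.3 * lex_sim


def score(sim, tau, lam):
    """Retrieval score: relevance times recency decay lam ** tau."""
    return sim * lam ** tau


def read(entries, k, lam):
    """Return the top-k entries with their full candidate distributions.
    Staleness affects only the ranking; confidences are returned unchanged."""
    ranked = sorted(entries, key=lambda e: score(e["sim"], e["tau"], lam),
                    reverse=True)
    return [(e["attribute"], e["candidates"]) for e in ranked[:k]]


# Hypothetical entries; "sim" stands in for a precomputed hybrid similarity.
entries = [
    {"attribute": "api_x", "sim": 0.9, "tau": 2,
     "candidates": {"failed": 0.1, "rate-limit": 0.8}},
    {"attribute": "user_lang", "sim": 0.6, "tau": 0,
     "candidates": {"python": 0.9}},
    {"attribute": "old_note", "sim": 0.95, "tau": 5,
     "candidates": {"obsolete": 0.7}},
]

top = read(entries, k=2, lam=0.5)
print([attribute for attribute, _ in top])  # ['user_lang', 'api_x']
```

With $\lambda = 0.5$, the most similar entry (`old_note`) is ranked last because it has not been touched for five steps, yet the confidences inside every returned entry are exactly what was stored.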

Effects

  • Experimental setup
    • benchmark
      • LoCoMo (Maharana et al., 2024)
        • long-term conversational memory, ~9,000 tokens on average, up to 35 sessions
        • 4 category: single-hop / multi-hop / temporal reasoning / open-domain
        • metric: F1, BLEU-1
      • ALFWorld (Shridhar et al., 2020)
        • text-based embodied environment, 6 household goals
        • split into in-distribution Seen (140 ep) / out-of-distribution Unseen (134 ep) → Unseen directly measures memory transfer
        • metric: success rate (SR↑), average steps on solved episodes (↓), 50-step horizon
    • baseline
      • LoCoMo: LoCoMo baseline, ReadAgent, MemoryBank, MemGPT, A-MEM, Mem0
      • ALFWorld: + LangMem, MemoryOS, No-Memory (acts on the current observation only)
    • evaluation
      • LoCoMo: embedding text-embedding-3-small, base = GPT-4o / GPT-4o-mini
      • ALFWorld: base = Qwen3-Next-80B-A3B-Instruct, Contriever retrieval, memory bank of 3,000 expert trajectories (no eval traces used)
      • Common
        • $[p_{\min}, p_{\max}] = [0.7, 0.9]$
        • decay $\lambda$ = 0.5(LoCoMo) / 0.1(ALFWorld)
        • retrieval Top-K = 20 (30 for LoCoMo multi-hop/temporal/open-domain)
        • at most 4 candidates per attribute
  • Results
    • LoCoMo Tab 1: BeliefMem has the best average with both GPT-4o-mini and GPT-4o
      • Large margins on temporal 51.88 / 45.78 and multi-hop 40.51 / 32.24 → the authors claim it is strong on tasks that require resolving conflicting observations and aggregating evidence
        • They attribute the temporal gains directly to historical version archiving (past states preserved with timestamps)
    • ALFWorld Tab 2: outperforms all baselines across Seen/Unseen
      • +11% over the runner-up (ReadAgent) and +99% over the average of the remaining baselines (note that this average includes weak baselines such as No-Memory and MemoryOS)
      • BeliefMem (using only 50% of the corpus): best Unseen SR, average SR 59.88 → on Unseen it is actually better than the full corpus (58.66)
      • trade-off: the full corpus is best on Seen (63.57), 50% is best on Unseen generalization → more in-distribution memory biases the agent toward memorizing seen trajectories
        • The authors describe this explanation as “plausible but not conclusive”
    • Ablation Tab 3: the conclusion is that every component matters
      • w/o belief-based memory (collapsed to deterministic): LoCoMo F1 42.38 → 22.58, ALFWorld SR 59.88 → 28.71 (the largest drop)
      • w/o belief-aware retrieval: candidates are kept but their probabilities are discarded
        • Keeping multiple hypotheses helps somewhat on its own, but performance drops sharply on conflict-heavy tasks such as multi-hop/open-domain
      • w/o Add: new attributes cannot be incorporated → effectively collapses
      • w/o Merge: probabilities stay frozen at their initial values → static memory
    • Data efficiency Fig 4a: robust across corpus sizes of 500 to 3,000; with only 50% of the corpus it beats all baselines
      • Even with 500 (16.67%) it beats 5 out of 6
    • Belief convergence Fig 4b: the Top-1 rate (the fraction of cases where the true conclusion receives the highest confidence) rises to about 88% as evidence accumulates
      • The raw-frequency baseline fails to converge because noise distorts its distribution
    • Adversarial correction Fig 3: strongly wrong conclusions are injected (102 samples), then corrected with a mix of valid and noisy observations
      • The correction rate is nearly 2× that of deterministic memory, and correction is also about 2× faster in average steps
    • Token cost Fig 5: on LoCoMo it averages 1,414 tokens per generation, which is actually lower than Mem0 (about 2K) and A-MEM (1.7K)

Personal note. I think I mentioned this in a research meeting: this work treats memory like a distribution. The resulting performance is good, but the paper itself acknowledges that it is not theoretically rigorous. Perhaps for that reason, POMDP reads in places less like the framework that derived the method and more like a post-hoc justification for it. I plan to look at what the memory actually looks like in practice.