
Meta info.
  • Authors: Shiping Zhu, Yibo Yang, Zhengyang Wang, Tiancheng Shen, Dandan Guo, Ming-Hsuan Yang
  • Affiliation: Jilin University, Shanghai Jiao Tong University, University of California at Merced
  • Paper: https://arxiv.org/abs/2606.09461
  • Published: June 9, 2026 (arXiv preprint)

TL; DR

Proposes H2HMem, a multimodal, multi-session, multi-party memory benchmark that evaluates settings where the agent observes and remembers human-human conversations as an observer rather than as a participant.

Review Video

Human-Assistant vs Human-Human Interaction (Fig 1); Benchmark Comparison (Tab 1); Dataset Construction Pipeline (Fig 2); Dataset Statistics (Tab 2); Task Taxonomy and QA Distribution (Fig 3); Main LLM-Judge Results (Tab 3); Lexical Metric Results (Tab 4); Efficiency Comparison (Tab 5); Error Type Distribution (Tab 6); Case Studies (Fig 4)

Background

  • A new deployment setting for LLM agents: an observer of human-human conversations rather than a participant in human-assistant conversations
    • In real applications such as meeting assistants, clinical documentation (ambient AI scribes), and Zoom AI Companion, the agent listens to the conversation as a third party and answers queries afterwards (Asthana et al., 2025; Razaghi et al., 2026)
    • The agent must track how information is distributed across people, maintain context over long time spans, and integrate signals across modalities
  • Three challenges unique to the observer setting (Fig 1)
    • multimodal: people naturally share photos and screenshots with each other (Lee et al., 2024)
    • discourse phenomena: resolving anaphora and deixis requires reference resolution over an evolving memory, not isolated fact retrieval
    • multi-participant: multiple speakers contribute information asynchronously, and sometimes that information conflicts (Abbo et al., 2025)
  • Gaps in existing memory benchmarks (Tab 1)
    • Most are single-user, text-only, and human-assistant: LongMemEval, PersonaMem, MemoryAgentBench (Wu et al., 2025; Jiang et al., 2025; Hu et al., 2026)
    • LoCoMo includes vision but is limited to dyadic conversations (Maharana et al., 2024); EverMemBench is multi-party but text-only (Hu et al., 2026)
    • No benchmark covers multimodality, both dyadic and multi-party interaction, and long horizons within a single unified framework
  • Limitations of each memory mechanism: all three strands below were developed and evaluated in human-assistant settings → their effectiveness in multimodal human-human environments is unverified
    • context window extension: simple but costly, suffers from long-context degradation, and has no cross-session persistence
    • retrieval-augmented memory: scalable but focused on factual recall, weak on episodic dependencies and causal structure (Lewis et al., 2020)
    • explicit memory modules (write/index/summarize/forget): A-Mem, MemoryOS (Xu et al., 2026; Kang et al., 2025)

Problem States

Can the memory of an agent that observes multimodal human-human conversations be evaluated systematically?

  • Collecting real human-human conversations carries a high privacy risk that even de-identification cannot fully remove → a privacy-preserving generation pipeline is needed
  • Measuring simple recall alone is not enough → an evaluation taxonomy that also covers discourse resolution, causal reasoning, and conflict handling is needed
  • Dyadic and multi-party conversations distribute information differently → the two interaction types must be comparable within the same framework

Suggestions

Formal definitions

  • dialogue: $S = (s_1, \dots, s_T)$, where each session $s_t$ is one day's worth of conversation with timestamp $\tau_t$ (centered on a single topic)
  • utterance: multimodal tuple $u_{t,i} = (p_{t,i}, x_{t,i}, v_{t,i})$
    • $p_{t,i} \in \mathcal{P}$: speaker, $x_{t,i}$: text, $v_{t,i}$: optional image
    • dyadic if $\lvert \mathcal{P} \rvert = 2$, multi-party if $\lvert \mathcal{P} \rvert \geq 3$
  • memory and question answering: $\mathcal{M}_T = \{m_1, \dots, m_N\}$, $\mathcal{R} = \mathrm{retrieve}(q, \mathcal{M}_T)$, $a = \mathrm{LLM}(\mathcal{R}, q)$
    • This is the standard store-retrieve-answer abstraction for external memory; the benchmark can evaluate any memory system that follows this abstraction as a plug-in
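To make the notation above concrete, here is a minimal Python sketch of the store-retrieve-answer abstraction. Everything in it is an illustrative assumption rather than the paper's code: the class and function names are hypothetical, one memory entry is written per utterance, and `retrieve` uses a toy word-overlap score in place of an embedding retriever.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Utterance:
    speaker: str                 # p_{t,i}
    text: str                    # x_{t,i}
    image: Optional[str] = None  # v_{t,i}; here a hypothetical path or caption


@dataclass(frozen=True)
class Session:
    timestamp: str                     # tau_t
    utterances: Tuple[Utterance, ...]  # the utterances u_{t,i} of session s_t


def interaction_type(sessions: List[Session]) -> str:
    """Classify a dialogue by the size of its speaker set P."""
    speakers = {u.speaker for s in sessions for u in s.utterances}
    if len(speakers) == 2:
        return "dyadic"
    if len(speakers) >= 3:
        return "multi-party"
    raise ValueError("a dialogue needs at least two speakers")


def build_memory(sessions: List[Session]) -> List[str]:
    """Assumed storage policy: one memory entry m_n per utterance."""
    return [
        f"[{s.timestamp}] {u.speaker}: {u.text}"
        for s in sessions
        for u in s.utterances
    ]


def retrieve(query: str, memory: List[str], k: int = 5) -> List[str]:
    """Toy retriever: rank entries by word overlap with the query, keep top-k."""
    q_words = set(query.lower().split())
    ranked = sorted(
        memory,
        key=lambda m: len(q_words & set(m.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def answer(retrieved: List[str], query: str,
           llm: Callable[[List[str], str], str]) -> str:
    """a = LLM(R, q); the llm callable is supplied by the caller."""
    return llm(retrieved, query)
```

Any memory system that exposes the same three steps could be swapped in for `build_memory` and `retrieve`, which is the sense in which the benchmark is plug-in.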

Data construction: human-in-the-loop, 5 stages (Fig 2)

  • Role separation:
    • Humans: director (scenario consistency, visual grounding, quality control)
    • LLM: scriptwriter (generates dialogues, scenarios, and QA)
  • Adopts an online conversational setting: asynchronous message exchange in the style of a messaging platform
    • A compromise between ecological validity, structured information flow, and the ability to accommodate diverse topics and participants
  • [Stage 1] participant profile generation: based on a structured schema (personality, background, speaking style, etc.)
    • DeepSeek-V3 generates the profiles
    • 2 profiles for dyadic, 4 to 6 profiles for multi-party
  • [Stage 2] scenario construction: sampled from 11 common topics
    • Generates a session-level outline and image keywords for each topic
    • The outlines are temporally ordered to form the multi-session scenario $S$
  • [Stage 3] image collection and human refinement: web search + text-to-image generation + manual editing
    • Annotators filter and refine by utterance-image consistency, resolution, and topic relevance (80 person-hours)
  • [Stage 4] captioning and dialogue generation
    • GPT-4o generates captions; DeepSeek-V3 generates dialogues conditioned on profile + outline + caption
    • DeepSeek-V3 cannot process images directly, so the captions act as the intermediary
  • [Stage 5] QA construction and verification
    • DeepSeek-V3 generates QA for each of the 9 task types
    • Human annotators verify uniqueness, clarity, and difficulty (40 person-hours)
    • inter-annotator agreement: Fleiss $\kappa = 0.83$ for image refinement, $\kappa = 0.79$ for QA verification
  • Final scale (Tab 2)
    • 25 dialogues (dyadic 20 / multi-party 5), 309 sessions, 7,078 rounds, 1,300 images, 2,236 QA pairs
    • dyadic: 14.2 sessions on average and 18.7 rounds per session, i.e., long horizon and low density
    • multi-party: 5.0 sessions on average and 70.5 rounds per session, i.e., short horizon and high density
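As a quick sanity check, the totals in the final-scale bullets can be roughly reproduced from the per-type averages. The sketch below only uses the numbers quoted in this post; the dictionary layout and function name are hypothetical, and it assumes "rounds per session" is a simple mean over sessions, which the post does not state.

```python
# Figures quoted above (Tab 2 in the paper, as summarized in this post).
STATS = {
    "dyadic": {"dialogues": 20, "avg_sessions": 14.2, "avg_rounds": 18.7},
    "multi-party": {"dialogues": 5, "avg_sessions": 5.0, "avg_rounds": 70.5},
}


def implied_totals(stats):
    """Return (sessions, rounds) implied by the rounded per-type averages."""
    total_sessions = 0.0
    total_rounds = 0.0
    for row in stats.values():
        sessions = row["dialogues"] * row["avg_sessions"]
        total_sessions += sessions
        total_rounds += sessions * row["avg_rounds"]
    return total_sessions, total_rounds


sessions, rounds = implied_totals(STATS)
print(round(sessions), round(rounds, 1))
```

The implied session count matches the reported 309, and the implied round count lands within a few rounds of the reported 7,078, which is what one would expect when the averages are rounded to one decimal place.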

Task taxonomy: 3 categories, 9 tasks (Fig 3)

  • Memory Recall: evaluates retrieval of explicitly presented multimodal information
    • Unimodal Precise Recall (UPR): exact fact recall from a single modality
    • Cross-modal Related Retrieval (CRR): cross-modal retrieval of content aligned between text and image
    • Knowledge Resolution (KR): resolving the currently correct value from knowledge that was updated across sessions
  • Memory Reasoning: evaluates higher-level reasoning across time and participants
    • Multimodal Causal Reasoning (MCR): reasoning about causal relations between text and image
    • Reference & Evolution Tracking (RET): resolving anaphora/deixis and tracking how entities change
    • Temporal Reasoning (TR): reasoning about event order based on timestamps and utterance positions
  • Memory Application: evaluates the use of memory in dynamic situations
    • Test-Time Learning (TTL): applying memory to a new scenario
    • Conflict Detection (CD): binary judgment of whether a new statement contradicts memory
    • Answer Refusal (AR): refusing to answer questions that are absent from memory or cannot be inferred
  • QA distribution: the three categories are balanced at 32~33% each, but KR (44 items, 2.0%) and TR (45 items, 2.0%) have small samples
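The taxonomy and the small-sample caveat can be written down in a few lines. This is only a sketch built from the figures quoted in this post (2,236 QA pairs, 44 KR items, 45 TR items); the names `TAXONOMY`, `share`, and `max_shift_per_item` are hypothetical helpers, not anything from the paper.

```python
TAXONOMY = {
    "Memory Recall": ["UPR", "CRR", "KR"],
    "Memory Reasoning": ["MCR", "RET", "TR"],
    "Memory Application": ["TTL", "CD", "AR"],
}
TOTAL_QA = 2236
REPORTED_COUNTS = {"KR": 44, "TR": 45}  # only the counts quoted in this post


def category_of(task: str) -> str:
    """Look up which of the three categories a task abbreviation belongs to."""
    for category, tasks in TAXONOMY.items():
        if task in tasks:
            return category
    raise KeyError(task)


def share(count: int, total: int = TOTAL_QA) -> float:
    """Percentage of all QA pairs that a task accounts for."""
    return 100.0 * count / total


def max_shift_per_item(n_items: int) -> float:
    """With per-item scores in [0, 1], one item can move the task mean by at most 1/n."""
    return 1.0 / n_items


for task, count in REPORTED_COUNTS.items():
    print(task, category_of(task), round(share(count), 1),
          round(max_shift_per_item(count), 4))
```

With only 44 or 45 items, a single question can move a task mean by more than 0.02, so small differences on KR and TR should be read with care.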

Effects

  • Experimental setup
    • Six memory methods compared in two families
      • text-based: Full Memory (Text), NaiveRAG, A-Mem
        • Images are converted to GPT-4o captions (limited to 256 tokens) before storage
      • multimodal: Full Memory (MM), MuRAG, NGM
        • Raw images are stored and retrieved at 224×224
    • Three backbone MLLMs: Qwen2.5-VL-3B/7B-Instruct, GPT-4.1-Nano (temperature 0.1)
    • retriever: all-MiniLM-L6-v2 for text, GME-Qwen2-VL-7B-Instruct for multimodal, default top-K=5
    • Evaluation: GPT-4o-mini as LLM-as-Judge (0/0.25/0.5/0.75/1 rubric; agreement with humans on 200 samples, Cohen's $\kappa = 0.84$) + lexical metrics (Precision/Recall/F1/BLEU-1)
  • Results
    • Low performance overall: the best is A-Mem with a weighted average of 0.5757 (Tab 3); no method exceeds 0.6 even when the backbone is changed
    • Tab 3, 4
      • cross-modal alignment: CRR is consistently lower than UPR
        • For MuRAG, 0.6346 → 0.5326; lexical recall also drops from 0.4063 → 0.3120
      • Weak distractor filtering: recall is high but precision is low
        • A-Mem recall 0.4215 vs precision 0.2206: it finds the relevant history but cannot filter out multi-participant noise
      • Reasoning tasks score lowest: MCR and RET are the lowest for every method, with near-zero BLEU-1
        • Methods fail to reproduce the exact factual phrasing that links scattered evidence, and they are vulnerable to the implicit references humans habitually use
      • Conflict handling collapses: CD is near zero on lexical metrics (A-Mem CD recall 0.0869)
    • Effect of interaction structure: the dyadic vs multi-party ordering flips depending on the task (Tab 3)
      • Consistency-oriented tasks (KR, CD) drop sharply in multi-party because of conflicting signals from multiple speakers
        • NaiveRAG KR 0.4896 (dyadic) → 0.2500 (multi-party)
      • Tasks that benefit from concentrated context (CRR, TTL) are similar or higher in multi-party
      • Scaling from 3B → 7B does not close this gap either (Tab 10, 11)
        • CRR, MCR, and CD improve the least
    • Efficiency trade-off (Tab 5): a clear trade-off between storage and inference latency
      • full memory family: lowest storage cost but high latency (Full Text 17.99 s/q, Full MM 26.09 s/q)
      • A-Mem: fast at 4.57 s/q latency, but memory construction takes 351.08 s/session
    • Error analysis (Tab 6): manual classification of 100 failures shows that they concentrate in two archetypes
      • modal misalignment 44~46%: failing to ground text in visual evidence
      • speaker-related error 32~35%: speaker misattribution, failing to track human referential conventions
      • Fig 4 case study: the same patterns show up directly. When identifying ingredients from a recipe image, NGM ignores the image and Full MM misattributes the speaker; when inferring a conclusion from a menu, NGM attributes the conclusion to the wrong speaker
    • retriever top-K analysis (Tab 14): increasing K has a non-monotonic effect
      • A-Mem and NGM peak at K=15 and then decline; MuRAG peaks at K=10 → a trade-off between improved recall and accumulated noise
  • Limitations
    • Both the dialogues and the QA are LLM-generated (DeepSeek-V3): it is unverified how well the synthetic distribution preserves the discourse properties of real human-human conversations
    • Circularity between generation and evaluation: dialogue generation, captioning, QA generation, and judging are all done by LLMs, and human verification amounts to a post-hoc filter
    • Thin multi-party sample: with 5 dialogues and 190 QA pairs, multi-party results have high variance (e.g., saturated values such as AR 1.0000)
    • Task imbalance: KR and TR, which support key claims, each account for only about 2%
    • Measuring CD with lexical metrics is a mismatch between a binary task and the metric
    • Giving GPT-4o captions to the text-based methods does not separate the information advantage of the captions from the difficulty of processing raw images
    • Limited to a single language (English) and a time span of at most 1 year
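To see why the lexical numbers above behave the way they do, here is a small sketch of token-overlap precision/recall/F1. It is an assumption about how such metrics are commonly computed (whitespace tokens, multiset overlap), not the paper's exact implementation, and the example strings are made up.

```python
from collections import Counter


def lexical_prf(prediction: str, reference: str):
    """Token-level precision, recall, and F1 using multiset overlap."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# A verbose answer that contains the reference: full recall, low precision.
print(lexical_prf("they bought a blue tent at the store last week", "a blue tent"))

# A binary task: a correct judgment phrased differently from the label scores zero.
print(lexical_prf("the new statement contradicts what bob said earlier", "yes"))
```

Under this kind of metric, an answer padded with extra retrieved context keeps recall high while precision falls, and a correct but differently worded CD judgment gets no credit, which is consistent with the mismatch noted in the limitations.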

Personal note. This is the idea I had only mentioned in passing to my professor before: the paper treats an agent that sits outside the conversation as an observer and discusses memory from that observer's point of view. As a benchmark paper, it neatly fills in the multimodal and multi-party matrix, and I like that it shows task difficulty flipping with the interaction structure. Among the main failures, getting the speaker wrong feels like a very classic task, yet it is reportedly still a problem, which makes me wonder whether the classics really are forever.