Meta info.

TL;DR

The paper proposes CTFUSION, a framework that automatically constructs multi-turn dialogues combining chitchat and task requests. An ICS model trained on the resulting IVSR-CTF dataset outperforms LLMs on function-intent classification, which confirms the framework's effectiveness.

Background

  • Existing IVSR systems are specialized for one-off (single-turn) NLU-style requests
  • LLMs that can handle chitchat are limited by latency
  • Existing datasets also cover few intents (and are not vehicle-specific) and are limited to specific scenarios

Problem Statement

In an IVSR system that integrates an LLM chat module with an NLU task module, each utterance must be accurately identified as chat or task.

  • No large-scale, vehicle-specific, multi-turn chat-to-task dataset exists
  • When mode classification fails:
    • Task mistaken for chat: LLM hallucination
    • Chat mistaken for task: wasted resources
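
The routing problem above can be sketched as follows. This is a minimal illustration, not the paper's system: the module names, the `Route` type, and the keyword-based `toy_classifier` are all hypothetical stand-ins, and only the two failure labels come from the notes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Route:
    mode: str     # "chat" or "task"
    handler: str  # hypothetical name of the module that receives the utterance


def route_utterance(utterance: str, classify: Callable[[str], str]) -> Route:
    """Send an utterance to the LLM chat module or the NLU task module."""
    mode = classify(utterance)
    if mode == "task":
        return Route(mode, "nlu_task_module")
    return Route(mode, "llm_chat_module")


def error_cost(true_mode: str, predicted_mode: str) -> str:
    """Name the failure the notes associate with each misclassification."""
    if true_mode == predicted_mode:
        return "none"
    if true_mode == "task":
        return "llm_hallucination"  # task request answered by the chat LLM
    return "wasted_resources"       # chitchat pushed through the task pipeline


def toy_classifier(utterance: str) -> str:
    """Keyword rule for illustration only; it is not the ICS model."""
    return "task" if "turn on" in utterance.lower() else "chat"


if __name__ == "__main__":
    print(route_utterance("Turn on the AC", toy_classifier))
    print(error_cost("task", "chat"))
```

Swapping `toy_classifier` for a trained chat/task classifier is the only change such a router would need, which is why the classifier is passed in as a function.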

Suggestions

  • Build IVSR-CTF with CTFUSION's data generation pipeline
    • Construction process
      1. Intent-slot set construction: use GPT-4o to define the required/optional slots for each intent (ontology construction)
      2. Action sequence selection: predefine the utterance flow (complete/incomplete slot filling) and the chat length
      3. User data seed selection: use real user utterances as seeds to secure diversity and realism
      4. Dialogue generation: GPT-4o generates dialogues from the intent, slots, and action sequence
      5. Dialogue augmentation: topic modeling (LDA) plus GPT-4o rewriting to diversify topics and adjust length
  • Dataset overview
    • About 42K dialogues in Korean
    • 240 vehicle-related intents under 14 domains
    • 8.5 turns per dialogue, consisting only of chitchat-to-task transitions
  • Dataset validation: quality evaluation (3-point scale, G-Eval + human)
    • Naturalness, Coherence, Efficiency
  • ICS model: classifies each utterance as Task or Chat. LoRA-tuned LLaMA-3.2-3B-Instruct
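
Steps 1 to 4 of the construction process can be sketched as plain data structures feeding a prompt. This is a rough sketch under assumptions: the `IntentSpec` type, the action labels, the example intent `set_temperature`, and the prompt wording are all made up for illustration and are not taken from the paper. The GPT-4o call itself is omitted.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class IntentSpec:
    """One ontology entry: an intent with its required and optional slots."""
    name: str
    required_slots: list[str]
    optional_slots: list[str] = field(default_factory=list)


def build_action_sequence(n_chat_turns: int, complete_slots: bool) -> list[str]:
    """Chat turns first, then the task request (chitchat-to-task transition)."""
    actions = ["chat"] * n_chat_turns + ["task_request"]
    if not complete_slots:
        # Incomplete slot filling: the system asks, then the user supplies the slot.
        actions += ["system_ask_slot", "user_fill_slot"]
    return actions


def build_generation_prompt(intent: IntentSpec, actions: list[str], seed: str) -> str:
    """Assemble the text that would be sent to the dialogue generator."""
    lines = [
        f"Intent: {intent.name}",
        f"Required slots: {', '.join(intent.required_slots)}",
        f"Optional slots: {', '.join(intent.optional_slots) or 'none'}",
        f"Action sequence: {' -> '.join(actions)}",
        f"Seed user utterance: {seed}",
        "Write a multi-turn in-vehicle dialogue that follows the action sequence.",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    spec = IntentSpec("set_temperature", ["temperature"], ["seat"])  # hypothetical intent
    seq = build_action_sequence(n_chat_turns=2, complete_slots=False)
    print(build_generation_prompt(spec, seq, "It's freezing in here"))
```

Fixing the action sequence before generation keeps the chat length and the slot-filling pattern under explicit control instead of leaving them to the generator.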

Effects

  • Experiment setup:
    • 30K train / 4K dev / 4K test + unseen intent 24 + real user utterance 366
    • baselines: GPT-4o, GPT-4o Mini, EXAONE 3.5-32B, Phi-4-14B, LLaMA-3.2-3B
    • metrics: accuracy / F1-score for ID
  • Results: domain-specific tuning is essential. (Figures below are acc / F1.)
    • The proposed ICS model reaches 90.36% / 0.908, versus 82.62% / 0.899 for GPT-4o
      • On OOD: 90.72% / 0.919
      • On real-world utterances: 82.51% / 0.874
    • Ablation: removing augmentation sharply degrades performance on real data (accuracy drops from 82.51% to 62.30%)
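
For reference, the two reported metrics can be computed for a binary chat/task classifier as below. This is a small sketch, assuming "task" is the positive class for F1 (the notes do not say which class is positive); the labels in the example are invented toy data, not results from the paper.

```python
def accuracy_and_f1(gold, pred, positive="task"):
    """Return (accuracy, F1 of the positive class) for two equal-length label lists."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")

    correct = sum(g == p for g, p in zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return correct / len(gold), f1


if __name__ == "__main__":
    gold = ["task", "task", "chat", "chat", "task"]  # toy labels
    pred = ["task", "chat", "chat", "task", "task"]
    acc, f1 = accuracy_and_f1(gold, pred)
    print(f"accuracy={acc:.2f} f1={f1:.3f}")
```

Accuracy and F1 can diverge when the chat and task classes are imbalanced, which is one reason to report both.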

Personal note. The research itself may seem obvious, but for that very reason its motivation naturally overlaps with my previous work in many places, and it makes me think about which parts we should have emphasized more in our own write-up (though that may not have been realistically possible). Among the points the metareviewer asked us to revise, I expect this paper to serve as a direct reference for why mode selection is needed. Since it is an industry paper, the data probably won't be released, but the domain/intent-level details and the prompts are described in considerable depth.