Meta info.

TL;DR

Proposes a benchmark that evaluates four hard problems in the multi-turn setup (Instruction Retention, Inference Memory, Reliable Versioned Editing, Self-Coherence). Even the latest SOTA models that do well on existing benchmarks score below 50 on it.

Background

  • Existing multi-turn benchmarks such as MT-Bench and MT-Eval already appear to be solved by the latest models
  • Multi-IF-style benchmarks are format-centric → they do not cover the problems of a realistic conversation setup

Problem States

In a multi-turn setup, a model has to keep following the initial instruction, retrieve and combine implicit user information scattered across turns, track and update information, and stay self-consistent.

Suggestions

Releases MultiChallenge, an evaluation-only benchmark

  • Four problem definitions, all requiring in-context reasoning
    • Instruction Retention: does the model keep following the initial instruction?
      • Covers both format constraints and semantic constraints
      • Does the final response still honor the constraint?
        • e.g. "Recommend movies that fit the UK rating constraint"
    • Inference Memory: can the model gather and select user information that is scattered (and often only implied) across earlier turns, and reflect it in the final-turn request?
      • The final turn does not ask for that information directly; the model has to infer it from context (more than simple keyword matching)
      • Does it remember early allergy or preference constraints?
        • e.g. if a nut allergy was mentioned, do the recipes stay nut-free through the last turn?
      • Does it avoid dragging in irrelevant past information and over-applying it?
    • Reliable Versioned Editing: resolve coreference and similar references → recall the correct version → apply the edit
      • Across several versions of a travel itinerary, email, code, etc., can the model copy the exact referenced version as-is and apply the new edit on top of it?
      • e.g. "Go back to the schedule from before the end time was changed and show it …"
    • Self-Coherence: checks whether the model's own statements stay consistent
      • Even when the user leads it, does it avoid sycophantic reversals and self-contradiction?
      • e.g. trap questions such as "So it's all done now, right?"
  • Dataset construction: 273 conversations in total, averaging 5 turns
    • Instruction Retention (113), Inference Memory (69), Reliable Versioned Editing (41), Self-Coherence (50)
    • MMSE: the dataset generation pipeline
      • Goal: build examples that are realistic and that models actually get wrong
      • (Model) construction: LLMs are used in separate roles
        • Planner: topic hierarchy + persona + one of the four problem categories → generates a blueprint
        • User Agent: turns the blueprint into concrete conversation turns
        • Responder: chosen at random from 6 frontier models → generates the response (an example is kept only if 3 or more models fail)
      • (Human) editing: roughly 1/4 of all sentences are revised
        • For each instance, only the final response is examined
        • A rubric is added so that each instance can be graded with yes or no
      • (Human) review: two independent reviewers check
        • whether the conversation is natural
        • whether it really targets only one problem category
        • whether 3 or more of the 6 candidate models really fail
        • whether the rubric question has a clear yes-or-no answer
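
The retention rule and the yes/no rubric described above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' pipeline: the `Instance` fields, the `keep_instance` helper, the keyword-based `toy_judge`, and the model names are all hypothetical, and the real judge would be a human or an LLM rather than a substring check.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Instance:
    """One benchmark example (hypothetical schema, not the released format)."""
    category: str          # one of the four problem categories
    turns: List[str]       # conversation history ending with the final user turn
    rubric_question: str   # must be answerable with yes or no
    pass_answer: bool      # the yes/no answer that counts as a pass


def passes(instance: Instance, judge_answer: bool) -> bool:
    """A response passes when the judge's yes/no answer matches the pass answer."""
    return judge_answer == instance.pass_answer


def keep_instance(
    instance: Instance,
    responses: Dict[str, str],
    judge: Callable[[Instance, str], bool],
    min_failures: int = 3,
) -> bool:
    """Keep an example only if at least `min_failures` models fail it."""
    failures = sum(
        1 for response in responses.values()
        if not passes(instance, judge(instance, response))
    )
    return failures >= min_failures


# Toy data modeled on the nut-allergy example above (invented for illustration).
nut_case = Instance(
    category="Inference Memory",
    turns=["I have a nut allergy.", "Can you suggest a dessert recipe?"],
    rubric_question="Does the final recipe contain nuts?",
    pass_answer=False,
)


def toy_judge(instance: Instance, response: str) -> bool:
    """Stand-in judge: answers the rubric question with a keyword check."""
    return "almond" in response.lower()


responses = {
    "model_a": "Almond cake with honey.",
    "model_b": "Chocolate mousse.",
    "model_c": "Almond brittle.",
    "model_d": "Fruit salad.",
    "model_e": "Pancakes topped with almond flakes.",
    "model_f": "Almond milk pudding.",
}

print(keep_instance(nut_case, responses, toy_judge))
```

Here four of the six toy responses mention almonds, so the example clears the three-failure threshold and would be kept.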

Effects

  • baselines: six closed models plus open-source models
  • Table 2: no frontier model scores above 50
    • Claude 3.5 Sonnet is the best at about 41 points
    • o1-preview scores around 37 (strong on Inference Memory and Versioned Editing)
    • GPT-4o and the rest score lower than that
    • Table 2: human evaluation; Table 3: model evaluation; Table 4: model-human alignment check (agreement is high, at 94%); Table 5: open-source models
  • The number of turns does not matter; the difficulty is a matter of reasoning ability
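
To make the scores and the 94% agreement figure concrete, here is a rough sketch of how binary rubric verdicts could be turned into per-category accuracy and a human-model agreement rate. The function names are hypothetical, and averaging the four categories equally in `overall_score` is an assumption of this sketch, not something stated in these notes.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def per_category_accuracy(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Percentage of passed instances per category, from (category, passed) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    passed: Dict[str, int] = defaultdict(int)
    for category, ok in results:
        totals[category] += 1
        passed[category] += int(ok)
    return {c: 100.0 * passed[c] / totals[c] for c in totals}


def overall_score(accuracy: Dict[str, float]) -> float:
    """Unweighted mean over categories (assumption: categories count equally)."""
    return sum(accuracy.values()) / len(accuracy)


def agreement_rate(human: List[bool], auto: List[bool]) -> float:
    """Percentage of instances where the automatic verdict matches the human one."""
    if not human or len(human) != len(auto):
        raise ValueError("verdict lists must be non-empty and the same length")
    matches = sum(h == a for h, a in zip(human, auto))
    return 100.0 * matches / len(human)
```

For example, `agreement_rate(human_verdicts, auto_verdicts)` over the same set of responses would yield the kind of alignment number reported in Table 4.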

Personal note. This benchmark seems to capture a fair share of the issues I wanted to look at in my ongoing memory research. Getting a model to solve this benchmark well looks like it could be a sufficient contribution in its own right. The dataset is public, and from a visual inspection its quality looks decent. The authors say they built some data but did not release it: although the rubric is ultimately binary, models still cannot judge it reliably, so the released set appears to have been adjusted for human-model agreement in evaluation. They say the rest is planned for release later.