3 minute read

Meta info.

TL; DR

Assistant์šฉ LM์„ user์ฒ˜๋Ÿผ ์—ญํ•  ์ง€์‹œํ•ด ์‹œ๋ฎฌ๋ ˆ์ด์…˜ํ•˜๋Š” ๊ธฐ์กด ๋ฐฉ์‹์€ ๋ณธ์งˆ์ ์œผ๋กœ ๋น„ํ˜„์‹ค์ ์ด๋ฉฐ, ์‹ค์ œ human user ํ–‰๋™์„ ํ•™์Šตํ•œ UserLM์ด ํ›จ์”ฌ ๋” ์ž์—ฐ์Šค๋Ÿฌ์šด multi-turn user behavior๋ฅผ ์žฌํ˜„ํ•ด assistant ์„ฑ๋Šฅ์˜ ์ง„์งœ ํ•œ๊ณ„๋ฅผ ๋“œ๋Ÿฌ๋‚ธ๋‹ค.

image 1 image 2 image 3 image 4 image 5 image

Background

  • assistant LLM์€ ์‹ค์ œ multi-turn ๋Œ€ํ™”์— ๋…ธ์ถœ๋˜๋‚˜, ๋Œ€๋ถ€๋ถ„์˜ ํ‰๊ฐ€๋Š” single-turn static benchmark ๊ธฐ๋ฐ˜
    • ์ง€๋‚œ ์—ฐ๊ตฌ์—์„œ ๋ชจ๋ธ์˜ multi-turn ์„ฑ๋Šฅ์ด ํฌ๊ฒŒ ๋–จ์–ด์ง€๋Š” ๊ฒƒ์„ ์ด๋ฏธ ํ™•์ธ
  • ํ˜„์‹ค user๋Š”
    • ์›ํ•˜๋Š” ์˜๋„(intent)๋ฅผ ์ฒ˜์Œ๋ถ€ํ„ฐ ๋‹ค ๋‚ฑ๋‚ฑ์ด ๋ฐํžˆ์ง€๋„ ์•Š๊ณ  (turn๋งˆ๋‹ค ์ถ”๊ฐ€)
    • ๋ชจํ˜ธํ•˜๊ณ  ๋ถˆ์™„์ „ํ•˜๊ฒŒ ๋ฐœํ™”ํ•˜๋ฉฐ,
    • ๋‹จ๋ฌธ, ๋ถˆ์นœ์ ˆํ•œ๋ฌธ์žฅ์„ ์‚ฌ์šฉํ•˜๋ฉฐ,
    • ๋Œ€ํ™” ์ข…๋ฃŒ๋ฅผ ์Šค์Šค๋กœ ๊ฒฐ์ •
  • ๋ฐ˜๋ฉด assistant๋Š” ์• ์ดˆ์— user์ฒ˜๋Ÿผ ๋  ์ˆ˜ ์—†์Œ
    • ํ˜‘์กฐ์ ์ด๊ณ , ๊ตฌ์กฐํ™”๋˜์—ˆ์œผ๋ฉด์„œ
    • ์™„์ „ํ•œ ๋ฌธ์žฅ์„ ๊ตฌ์‚ฌํ•˜๋„๋ก ํ•™์Šต๋œ
    • ๊ทธ๋ ‡๊ธฐ ๋•Œ๋ฌธ์— ์˜ค๋ฅ˜๋ฅผ ํ”ผํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ์ •์ œ
    • ๊ทธ๋Ÿฐ๋ฐ user ๋ฅผ ๋ชจ์‚ฌํ• ๋•Œ๋Š” LLMํ•œํ…Œ role ๋ถ€์—ฌํ•˜๋Š” ํ”„๋กฌํ”„ํŠธ ํ•œ์ค„์ด ํ˜„์‹ค

Problem States

User๋ฅผ ๋ชจ์‚ฌํ•˜๋Š” LM์„ ๋งŒ๋“ค์ž

  • ์ง„์งœ assistant LLM์ด user๊ฐ™์ง€ ์•Š์€๊ฐ€?
    • ์‹คํ—˜: Assistant LM์˜ ์„ฑ๋Šฅ์ด ๋†’์„์ˆ˜๋ก user simulator๋กœ๋Š” ๋ถ€์ ์ ˆ
      • GPT-4o๊ฐ€ ๋” ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ์—ฐ๊ธฐํ•  ๊ฒƒ ๊ฐ™์ง€๋งŒ
      • ์˜คํžˆ๋ ค user-like behavior์—์„œ ๋ฉ€์–ด์ง

Suggestions

  • User intent๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์ฒซ turn ์ƒ์„ฑ
    • intent๋ฅผ high-level๋กœ ์œ ์ง€
      • ๋‹ค์–‘์„ฑ์„ ์œ„ํ•ด ์–ด๋А์ •๋„์˜ ์•„์›ƒ๋ผ์ธ๋งŒ ์คŒ
      • ์•„์˜ˆ ์—†์„ ๊ฒฝ์šฐ steering์˜ ์–ด๋ ค์›€
      • ์˜ˆ: โ€œYou are a user chatting with an assistant to get advice about weight loss.โ€
  • Assistant์˜ ์‘๋‹ต์œผ๋กœ ๋‹ค์Œ user turn ์ƒ์„ฑ (์ฆ‰ assistant-turn ์ƒ์„ฑํ•˜์ง€ ์•Š์Œ)
    • =>์ด๋ฅผ flipping ์ด๋ผ๊ณ  ํ‘œํ˜„
  • ์ ์ ˆํ•œ ์‹œ์ ์— ๋Œ€ํ™” ์ข…๋ฃŒ
  • ๊ธฐ์กด ๋ฐ์ดํ„ฐ์…‹์˜ assistant ๋Œ€์‹  user์ชฝ turn์„ ํ•™์Šต์— ํ™œ์šฉ
  • training details:
    • dataset: WildChat์—์„œ deduplicateํ•œ ์•ฝ 38๋งŒ ๋Œ€ํ™”
    • intent: GPT-4o fs์œผ๋กœ ์ƒ์„ฑ
    • backbone: Llama-3.2-1B / Llama-3-8B
      • instructํ•˜์ง€ ์•Š์€ base ์‚ฌ์šฉ
        • instruct ์“ฐ๋ฉด ์„ฑ๋Šฅ ๋‚˜๋น ์กŒ๋‹ค๊ณ 
        • instruct๋Š” assistant๋กœ์„œ์˜ post-training์œผ๋กœ user ํ–‰๋™์—์„œ ๋” ๋ฉ€์–ด์ง
    • hyper-parameters: max length 2048-token, 1024-batch, LR 2e-5, A6000*4, 8B๋ชจ๋ธ ๊ธฐ์ค€ 227์‹œ๊ฐ„ ํ•™์Šต

Effects

  • PPL: Distributional Alignment tab 1
    • UserLM-8B๊ฐ€ WildChat, PRISM ๋ชจ๋‘์—์„œ ๊ฐ€์žฅ ์ž‘์€ PPL ๋‹ฌ์„ฑ
    • intent conditioning์€ ๋ชจ๋“  ๋ชจ๋ธ์˜ PPL ๊ฐ์†Œ (steering ๊ฐ€๋Šฅ์„ฑ ์ฆ๊ฐ€๋กœ ํ•ด์„)
    • base๋ชจ๋ธ์ด Instruct ๋ชจ๋ธ๋ณด๋‹ค UserLM ์„ฑ๋Šฅ์œผ๋กœ๋Š” ๋” ์šฐ์ˆ˜
  • Multi-turn Interaction tab 2; ์ธก์ • ๋Œ€์ƒ์œผ๋กœ ์ฒซ-turn์˜ ๋‹ค์–‘์„ฑ, intent decomposition, ๋Œ€ํ™” ์ข…๋ฃŒ ํ™•์ธ
    • ์ฒซํ„ด ๋‹ค์–‘์„ฑ: Assistant๋กœ ํ•™์Šต๋œ LM์€ user-turn์ด ๊ฑฐ๊ธฐ์„œ ๊ฑฐ๊ธฐ
      • UserLM์ด 94.55%๋กœ Real user(94.01%)์— ๊ทผ์‚ฌ
      • gpt-4o๊ฐ€ 74.42%
    • Intent decomposition: ์‹ค์ œ ์‚ฌ๋žŒ์ด ์ •๋ณด๋ฅผ ์ฒœ์ฒœํžˆ ํ’€์–ด๋‚ด๋“ฏ UserLM์ด ์ด๋ฅผ ๋ชจ์‚ฌํ•œ๋‹ค๊ณ  ํ•ด์„
      • intent์™€ n-gram overlap์ด ๋‚ฎ์„์ˆ˜๋ก ์ข‹๋‹ค๊ณ  ๊ฐ€์ •
      • real user๊ฐ€ 1.68%์ผ ๋•Œ UserLM์ด 2.69%๋กœ ๊ฐ€์žฅ ๊ทผ์‚ฌ
      • gpt-4o 7.68% ๋“ฑ
    • Dialogue Termination: AssistantLM์€ ์‚ฌ๋žŒ์ฒ˜๋Ÿผ ๋Œ€ํ™”๋ฅผ ๋๋‚ด์ง€ ์•Š์Œ.
      • < endconveration >์˜ F1-score์— ๋Œ€ํ•ด
      • UserLM์ด 63.54๋กœ GPT-4o์˜ 3.31๋Œ€๋น„ ์›”๋“ฑํ•œ ์ˆ˜์ค€
  • Simulation Robustness ; naturalness, user-role adherence, intent adherence
    • naturalness: prompt๋กœ user role์ด ๋ถˆ๊ฐ€ํ•จ์„ ์‹œ์‚ฌ
      • real user 90%์— ๋Œ€ํ•ด UserLM-8B 80.21
      • Assistant์— promptingํ•  ๋•Œ 0-3%์ˆ˜์ค€
    • user-role adherence: Assistant๊ฐ€ ์งˆ๋ฌธํ•˜๋ฉด User์ฒ˜๋Ÿผ ๋ชจ๋ฅด๋Š” ์ฒ™ ํ•  ์ˆ˜ ์žˆ๋Š”๊ฐ€?
      • UserLM-8B๋Š” 93.95%๋กœ ์ตœ๊ณ  ์ˆ˜์ค€. GPT-4o๊ฐ€ 38% ์ˆ˜์ค€ (gpt-4o-mini๋Š” 80.20%)
      • ์•„๋งˆ gpt-4o์˜ helpfulness๋•Œ๋ฌธ์— user-role์—์„œ ์ดํƒˆํ•  ๊ฒƒ
    • intent adherence: assistant๊ฐ€ ์˜๋„ ํŒŒ์•… ๋ชปํ•˜๊ณ  ๋‹ค๋ฅธ ๋ฐฉํ–ฅ์„ฑ์„ ์ œ์‹œํ•  ๋•Œ ๊ฑฐ์ ˆํ•  ์ˆ˜ ์žˆ๋Š”๊ฐ€?
      • UserLM-8B๊ฐ€ 94.65%์„ฑ๋Šฅ์œผ๋กœ ์ œ๋Œ€๋กœ ํŒŒ์•…. gpt-4o๋Š” ์•ฝ 70.95%์ •๋„
      • UserLM์ด obstinateํ•œ ์ธ๊ฐ„์„ ์ž˜ ์žฌํ˜„ํ–ˆ๋‹ค๊ณ  ํ‰๊ฐ€
  • Coding, Math multi-turn simulation Fig 1 : GSM8K, HumanEval
    • AssistantLM: GPT-4o
    • UserLM์„ user๋กœ ๋‘๋ฉด assistant๊ฐ€ 74.6%์—์„œ 57.4%๋กœ ๊ธ‰๋ฝ
    • ์ฆ‰ GPT๊ธฐ๋ฐ˜ user simulator๋Š” ๋„ˆ๋ฌด ์นœ์ ˆํ•˜๋ฏ€๋กœ assistantLM์— ์œ ๋ฆฌํ•œ ํ™˜๊ฒฝ์„ ์ œ๊ณต
  • Simulation Behavior tab 3 : UserLM์€ real user์ฒ˜๋Ÿผ ์ •๋ณด๋ฅผ ๋ฐ˜๋ณตํ•˜๊ณ  ์ถ”๊ฐ€ constraints๋ฅผ ๋„ฃ๊ณ  lexically diverseํ•˜๊ณ  turn ๊ธธ์ด๋„ ๋‹ค๋ณ€ํ™”

Personal note.

  • ์ฃผ์š” ์ €์ž ๋ฐ ๊ต์‹ ์ €์ž๊ฐ€ ๊ฒน์นœ ๊ฒƒ์œผ๋กœ ๋ณด์•„ instruction์„ multi-turn์œผ๋กœ shardํ•ด์„œ ์ฃผ๋ฉด ๋” ๋ชปํ•˜๋”๋ผ๋Š” MS ๋…ผ๋ฌธ์˜ ํ›„์†(ํ˜น์€ ์ง„์งœ ํ•˜๊ณ ์‹ถ์—ˆ๋˜ ์—ฐ๊ตฌ์˜ ๋ฐฉํ–ฅ)์œผ๋กœ ๋ณด์ž…๋‹ˆ๋‹ค.
  • ์‹ค์ œ ์œ ์ €๋Š” ์ฒ˜์Œ๋ถ€ํ„ฐ ์ •๋ณด๋ฅผ ๋‹ค ์ฃผ์ง€ ์•Š๋Š”๋ฐ, ํ˜„์žฌ ๋ฒค์น˜๋งˆํฌ๋‚˜ ๋‹ค์–‘ํ•œ ์‹คํ—˜๋“ค์€ assistant ์นœํ™”์ ์œผ๋กœ ์ •๋ณด๋ฅผ ๋‹ค ์ฃผ๊ณ  ์‹œ์ž‘ํ•˜๊ฑฐ๋‚˜, ์‹ค์ œ ์‚ฌ์šฉ ํ™˜๊ฒฝ(multi-turn ํ™˜๊ฒฝ)์„ ์žฌํ˜„ํ•ด๋‚ด์ง€ ๋ชปํ•œ๋‹ค๊ณ  ์ง€์ ํ•œ ๊ฒƒ๋„ ํƒ€๋‹นํ•˜๊ณ 
  • ๋ฐฉ์‹์€ ๋ฌด์ฒ™ ๋‹จ์ˆœํ•˜์ง€๋งŒ; human ์—ญํ•  ๋ฐ์ดํ„ฐ๋ฅผ instruct์•ˆ ๋œ ๋ชจ๋ธ์— ํ•™์Šต์‹œํ‚ค๊ธฐ ๋ถ„์„ ๊ฒฐ๊ณผ๊ฐ€ ๋งค์šฐ ๋ถ„๋ช…ํ•œ ๊ฒŒ ํฐ ๋งค๋ ฅ์œผ๋กœ ๋‹ค๊ฐ€์˜ต๋‹ˆ๋‹ค.
  • HCI์˜ ์˜์—ญ์„ ์‚ด์ง ๋น—๊ฒจ์„œ Human-machine dialogue๋ฅผ ์—ฐ๊ตฌํ•  ๋•Œ ๊ทผ๋ณธ์ ์œผ๋กœ ์ง€์ ํ•˜๊ณ  ๋„˜์–ด๊ฐ”์–ด์•ผ ํ•˜๋Š” ๋ฌธ์ œ๋ฅผ ์งš์€ ๊ฒƒ ๊ฐ™์•„์„œ ์ธ์ƒ๊นŠ๊ฒŒ ์ฝ์—ˆ์Šต๋‹ˆ๋‹ค.
  • ์ง€๋‚œ ๋žฉ๋ฏธํŒ…์—์„œ ๊ต์ˆ˜๋‹˜๊ป˜์„œ ๋งˆ์น˜ ์ง€๊ธˆ LLM์œผ๋กœ ๋ฐ์ดํ„ฐ ์ƒ์„ฑํ•˜๊ณ  ๊ฒ€ํ† ํ•˜๋Š” ๊ฒƒ์—๋Š” ์•„๋ฌด ๋ฌธ์ œ๊ฐ€ ์—†๋Š” ๊ฒƒ์ฒ˜๋Ÿผ ์“ฐ๋‹ค๊ฐ€ ๋ฌธ์ œ๋ฅผ ์ œ๊ธฐํ•˜๊ฑฐ๋‚˜ ์‹คํ—˜์—์„œ ๋ณด์ด๊ณ ์ž ํ•˜๋Š” ๊ฒƒ์€ ๊ทธ๋Ÿฐ LLM๋“ค์ด ๋ชปํ•œ๋‹ค๊ณ  ์ฃผ์žฅํ•˜๋Š” ํ˜„์žฌ ์—ฐ๊ตฌ ํ๋ฆ„์—์„œ ํ•œ๊ณ„๊ฐ€ ๋А๊ปด์ง„๋‹ค๊ณ  ํ–ˆ๋˜ ๊ฒƒ๋„ ๊ธฐ์–ต์ด ๋‚˜๊ณ ์š”.
  • dialogue๋ฅผ ์ƒ์„ฑํ•  ๋•Œ ์œ ์ตํ•˜๊ฒŒ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์ง€ ์•Š์„๊นŒ ์‹ถ์€ ํ๋ฆ„์ž…๋‹ˆ๋‹ค.