Meta info.

TL; DR

Proposes a dual dialogue planning framework that combines two complementary policy models: an intuitive (fast) one that handles familiar situations and an analytical (slow) one for new scenarios.

Problem Statement

LLMs are good at responding, but they cannot steer a conversation toward a predefined goal (that is, they do poorly at ToD). On top of that, prompt engineering and additional training are both inefficient ways to fix this.

Suggestions

Inspired by the dual process theory of human cognition (intuitive vs. analytical thinking), the paper proposes DPDP.

  • Dynamic switching between sys1 and sys2 based on the LM's uncertainty
    • sys1: a policy LM that responds quickly and intuitively in familiar contexts
    • sys2: an MCTS-based system for analytical (but slow) planning in complex, novel situations
  • Uncertainty is measured as a probability gap
    • δ(π_θ(a_t | s_t)) = top(1) - top(2), where top(k) is the k-th largest action probability under the policy
    • If the gap is larger than a predefined threshold, the Policy LM is judged to be confident in its current decision > sys1
  • Two-stage training
    1. Offline RL-based Pretraining: use an LLM to label every dialogue turn in the training dataset with a score » rebuild the data as an MDP corpus of State, Action, Reward, with those scores as soft rewards » pretrain the Q-net LM (expected to reduce supervised-learning bias, noise, etc.)
    2. MCTS-guided Self-play Training: two LLMs converse with each other » MCTS predicts the action » the predicted action is mapped to a pre-defined natural-language instruction » an LLM generates the reply from that instruction plus the dialogue history » as in step 1, an LLM labels the score used as the reward » train the policy (optimized with Actor-Critic)
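
The uncertainty-based switch described above can be sketched in a few lines of Python. This is only an illustration of the top(1) - top(2) rule as summarized in these notes, not the authors' implementation: `top2_gap`, `choose_action`, the `threshold` value, and the `slow_planner` callback (a stand-in for the MCTS planner) are all hypothetical names introduced here.

```python
from typing import Callable, Sequence, Tuple


def top2_gap(probs: Sequence[float]) -> float:
    """Gap between the largest and second-largest action probabilities."""
    if len(probs) < 2:
        raise ValueError("need at least two candidate actions")
    first, second = sorted(probs, reverse=True)[:2]
    return first - second


def choose_action(
    probs: Sequence[float],
    threshold: float,
    slow_planner: Callable[[Sequence[float]], int],
) -> Tuple[int, str]:
    """Return (action index, which system produced it).

    A large gap means the policy LM is confident, so its argmax action
    is used directly (sys1). Otherwise the decision is handed to the
    slow planner (sys2); here that is a hypothetical callback standing
    in for MCTS.
    """
    if top2_gap(probs) > threshold:
        best = max(range(len(probs)), key=lambda i: probs[i])
        return best, "sys1"
    return slow_planner(probs), "sys2"


if __name__ == "__main__":
    # Placeholder planner: a real sys2 would run MCTS rollouts here.
    def dummy_planner(p: Sequence[float]) -> int:
        return 2

    print(choose_action([0.7, 0.2, 0.1], 0.3, dummy_planner))    # confident case
    print(choose_action([0.4, 0.35, 0.25], 0.3, dummy_planner))  # uncertain case
```

With a threshold of 0.3, the first call stays in sys1 because the gap is about 0.5, while the second falls through to the planner because the gap is only about 0.05. Raising the threshold sends more turns to sys2, trading cost for more deliberate planning.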

Effects

  • Experimental Setup:
    • Datasets: ESConv, CIMA, CraigslistBargain
  • Result:
    • DPDP is SOTA on both automatic and human metrics
      • The key contribution is a higher dialogue success rate with fewer turns
    • Limitation: MCTS raises the success rate, but it still depends on LLM calls, so cost goes up