Meta info.
  • Authors: Wentao Shi, Mengqi Yuan, Junkang Wu, Qifan Wang, Fuli Feng
  • Paper: https://arxiv.org/pdf/2406.14868
  • Affiliation: Meta AI, USTC
  • Published: June 21, 2024

TL;DR

Multi-turn ์—์„œ RL Objectives๋ฅผ ์ง์ ‘ optimizeํ•˜๋Š” ์†์‹คํ•จ์ˆ˜์˜ Direct Multi-Turn Preference Optimization (DMPO) ์ œ์•ˆ


Problem Statement

ETO์—์„œ DPO loss๋Š” Single-turn ๋‹จ์œ„ ์„ ํ˜ธ์— ๋Œ€ํ•œ ๊ฐ•ํ™”ํ•™์Šต์ด๋ฏ€๋กœ, multi-turn agent task (trajectory๊ฐ€ ์žˆ๋Š” ๊ฒฝ์šฐ)์—๋Š” ์ ํ•ฉํ•˜์ง€ ์•Š๋‹ค.

  • ETO: collects failure trajectories and optimizes by training contrastively against success trajectories; trajectories are written in the ReAct format
  • Research Question: develop an optimization method for multi-turn agent tasks
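For reference, the single-turn DPO loss that ETO applies per preference pair can be sketched as below. This is a minimal illustrative sketch, not the authors' code; the function and argument names are assumptions.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard single-turn DPO loss for one preference pair.

    logp_w / logp_l: policy log-probabilities of the chosen / rejected response;
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference policy.
    """
    # Implicit reward margin: beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # DPO loss: -log sigmoid(margin)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Note that both responses enter as single monolithic sequences; nothing in the loss accounts for turn boundaries or trajectory length, which is the gap DMPO targets.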

Suggestions

DMPO

  • multi-turn ์—์„œ์˜ ์ตœ์ ํ™”๋ฅผ ์œ„ํ•ด (1) BT ๋ชจ๋ธ์˜ partition function Z(probability ์ •๊ทœํ™”)๊ฐ€ ํ˜„์žฌ ์ƒํƒœ s๋กœ๋ถ€ํ„ฐ ๋…๋ฆฝ์ด ๋˜์–ด์•ผ ํ•จ + (2) ์„ ํ˜ธ/๋น„์„ ํ˜ธ trajectories ๊ฐ„ ๊ธธ์ด ๊ฒฉ์ฐจ์˜ ์˜ํ–ฅ ์ค‘ํ™”๋กœ ํŽธํ–ฅ ์ค„์—ฌ์•ผ ํ•จ
    1. state-action occupancy (SAOM) ์ ์šฉ:ย RL Objectives(Eq 1)์—์„œ Policy Constraints(Eq 3)์„ย SAOM constraints(Eq 10)๋กœ ๋Œ€์ฒดํ•˜์—ฌ compounding error ์™„ํ™”
      1. problem:ย Eq 3ย ์—์„œ Z(s)๋Š” ํ˜„์žฌ ์ƒํƒœ s์— ์ข…์†๋œ ์ƒํƒœ๋กœ ์ •๊ทœํ™”โ†’ ๋‹จ์ผ ํ„ด์—์„œ๋งŒ ์œ ํšจํ•œ ์ ‘๊ทผ
      2. solution:ย Eq 10ย ์—์„œ SAOM constraints(d^{ฯ€โˆ—}(s, a)ย )๋กœ Z๋Š” s์—์„œ ๋…๋ฆฝ์ ์œผ๋กœ ๊ณ„์‚ฐ ๊ฐ€๋Šฅ โ†’ย Eq 11
    2. BT ๋ชจ๋ธ์— ๊ธธ์ด ์ •๊ทœํ™” ๋„์ž…: ์„ ํ˜ธ trajectories์™€ ๋น„์„ ํ˜ธ trajectories๊ฐ„ ๊ธธ์ด ๋ถˆ์ผ์น˜ ์™„ํ™” โ†’ ํŽธํ–ฅ ๋ฌธ์ œ ํ•ด๊ฒฐ
      1. Eq 2ย ์„ multi-turn์œผ๋กœ ํ™•์žฅํ•˜๋ฉดย Eq 12
      2. problem: ์„ ํ˜ธ trajectory ๊ธธ์ด T^w์™€ ๋น„์„ ํ˜ธ trajectory ๊ธธ์ด T^l์ด ๋ถˆ์ผ์น˜ (๊ธธ์ด๊ฐ€ ๊ธธ์ˆ˜๋ก reward ํ•ฉ์ด ์ปค์ง€๋Š” ํŽธํ–ฅ ๋ฐœ์ƒ > ๊ฒฉ์ฐจ ํ™•๋Œ€ > ๋ชจ๋ธ ์„ฑ๋Šฅ ์ €ํ•˜ )
      3. solution:ย Eq 13์ฒ˜๋Ÿผ ์ •๊ทœํ™” (T^w(์„ ํ˜ธtrajectory ๊ธธ์ด)๊ฐ€ ๋” ๊ธด ๊ฒฝ์šฐ, T^w์— ๋ถ™์€ ์ •๊ทœํ™” term์ด T^l์— ๋ถ™์€ term ๋Œ€๋น„ ์ž‘์€ ๊ฐ’์ด ๋˜๋Š” ์‹์œผ๋กœ ๋ณด์ •)
  • ์ตœ์ข…ย Eq 16๋ฅผ maximize:ย Eq 13์—๋Š” Z์˜ partition function์ดย Eq 11์˜ reward function์œผ๋กœ ๋Œ€์ฒด๋˜๋ฉด์„œ ์—†์–ด์ง.
    • discount function ฯ•(t, T): ๋‹ค์–‘ํ•œ ๋‹จ๊ณ„์—์„œ s-a pair์˜ ๊ฐ€์ค‘์น˜ ์žฌ์กฐ์ •(์ดˆ๊ธฐ ๋‹จ๊ณ„์˜ s-a pair์— ๋” ๋†’์€ ๊ฐ€์ค‘์น˜)
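Putting the pieces together, the final DMPO objective combines per-turn log-ratios, a discount φ(t, T) that emphasizes early turns, and length normalization. The sketch below illustrates that combination under stated assumptions: φ(t, T) is taken as a simple γ^t, and all names (`dmpo_loss`, `score`, `gamma`) are illustrative, not the authors' implementation.

```python
import math

def dmpo_loss(logps_w, ref_logps_w, logps_l, ref_logps_l, beta=0.1, gamma=0.9):
    """Sketch of a DMPO-style loss for one pair of preference trajectories.

    logps_w / logps_l: per-turn action log-probabilities under the policy for
    the preferred (w) and dispreferred (l) trajectories;
    ref_logps_*: the same quantities under the frozen reference policy.
    """
    def score(logps, ref_logps):
        T = len(logps)
        # Discount phi(t, T): here gamma^t, putting more weight on early turns.
        s = sum((gamma ** t) * (lp - rlp)
                for t, (lp, rlp) in enumerate(zip(logps, ref_logps)))
        # Length normalization: divide by T so the length gap between
        # T^w and T^l does not bias the margin.
        return s / T

    margin = beta * (score(logps_w, ref_logps_w) - score(logps_l, ref_logps_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With the 1/T normalization, a dispreferred trajectory cannot win the comparison merely by being longer and accumulating a larger sum of per-turn terms.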

Effects

  • Experimental Setup:
    • datasets: WebShop, ScienceWorld, ALFWorld (all describable as MDPs)
    • backbone: Llama-2-7B-Chat
  • Results:
    • (RQ1) noisy setting: confirms DMPO's robustness and efficiency » Table 2, outperforms DPO
      • Experiment replacing noisy trajectories for the dispreferred trajectories: by putting higher weight on gold preferences in early steps and lowering the weight on noisy later steps, DMPO mitigates the impact of noise and shows improved generalization.
    • (RQ2) clean setting: confirms DMPO's superiority
      • outperforms baseline preference-tuning methods

Personal note. The paper starts by taking the ETO paper's setup entirely at face value, which deserves scrutiny. The target benchmarks appear to be approached in a ReAct style; how well this aligns with real multi-turn settings seems key to shaping future research directions.