Meta info.

TL;DR

Proposes Agent-R, a framework that strengthens LLM self-reflection & correction in multi-turn environments

Background

Expectations for LLM agents keep rising, but plenty of limitations remain

Problem Statement

- No error correction: training that clones the behavior of an existing strong model (expert trajectories) does not teach the agent to correct errors or revise its path
- No real-time correction: existing methods are single-turn centric, and unless explicit feedback arrives after execution finishes (after the conversation ends), e.g. an error message from running code, there is little way to correct anything → errors accumulate
- In severe cases the agent falls into a loop and cannot get out

**Research question:** develop an agent framework that can correct itself in real time in a multi-turn environment

Suggestion

Iterative Self-Training

- Defines 4 trajectory types for building a Reflection Trajectory: initial, bad, good, revision trajectory
- phase 1: explore possible trajectories with MCTS > at the identified transition point, connect the bad trajectory (the erroneous one) to a good trajectory (the fix) = build a correction path (Step-Level Reflection Dataset)
- phase 2: the model is trained with RL on the trajectories generated in phase 1 = so that it learns to identify and correct errors

**Experiments:**
- `Table 2` - Main Table
    - Tasks: WebShop, SciWorld, TextCraft - benchmarks in agent environments
    - metrics: final reward, success rate, etc.
    - results: Agent-R is SOTA in every environment
- `Figure 3, 4, 5` - on iterative training
    - Compares Agent-R with other baselines at each iteration
    - results: performance improves steadily as training iterates (best at iter=3) + frequency of falling into loops drops markedly + revision length also gets shorter
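
To make the first step of Iterative Self-Training more concrete (connecting a bad trajectory to a good one at the transition point), here is a minimal sketch. It is not the paper's implementation: the `Step` type, the `build_revision_trajectory` helper, the `REVISION_SIGNAL` wording, and the assumption that the bad and good trajectories branch from a shared initial trajectory are all hypothetical, chosen only to illustrate how the four trajectory types could fit together. Finding the transition point (via MCTS in the note above) is left out and passed in as an index.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    """One agent turn: the action taken and the observation returned."""
    action: str
    observation: str


# Hypothetical reflection text; the exact wording is an assumption.
REVISION_SIGNAL = "My previous actions were mistaken. Let me revise my approach."


def build_revision_trajectory(
    initial: List[Step],
    bad: List[Step],
    good: List[Step],
    transition: int,
    signal: str = REVISION_SIGNAL,
) -> List[Step]:
    """Splice a bad continuation onto a good one at `transition`.

    Keeps the shared initial steps, the first `transition` steps of the bad
    continuation, one reflection step, then the full good continuation.
    """
    if not 1 <= transition <= len(bad):
        raise ValueError("transition must be between 1 and len(bad)")
    reflection = Step(action=signal, observation="")
    return initial + bad[:transition] + [reflection] + good


if __name__ == "__main__":
    # Toy WebShop-style example; the actions are made up for illustration.
    initial = [Step("search[red shoes]", "results page")]
    bad = [Step("click[blue hat]", "hat page"), Step("click[buy]", "wrong item")]
    good = [Step("click[red shoes]", "shoe page"), Step("click[buy]", "success")]

    for step in build_revision_trajectory(initial, bad, good, transition=1):
        print(step.action)
```

Cutting the bad continuation at the first erroneous step, rather than at its end, is what makes the resulting sample a mid-trajectory correction instead of an after-the-fact one.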

Personal note. I had shelved the ideas that came up while we were discussing multi-turn preference, planning to tackle the ToD-Chitchat transition problem first, but related work keeps coming out and I end up biting my poor fingernails,,,

Falling into a loop is a major agent failure factor that was also mentioned in the paper a teammate presented on Tuesday, and the effort to account for multi-turn environments is in line with the paper I presented this week. The way it is solved is similar too..