

TL;DR

long-horizon task์—์„œ ๋ฐœ์ƒํ•˜๋Š” planning ์‹คํŒจ์˜ ํ•ต์‹ฌ ์›์ธ์„ entanglement๋กœ ๊ทœ์ •, ์ด๋ฅผ subtask ๋‹จ์œ„๋กœ ๋ถ„๋ฆฌ๋œ DAG ๊ธฐ๋ฐ˜ planning์œผ๋กœ ํ•ด๊ฒฐํ•˜๋Š” ๊ฒƒ์„ ์ œ์•ˆ, ์„ฑ๋Šฅ ํ–ฅ์ƒ ๋ฐ ํ† ํฐ ์ ˆ๊ฐ์—์„œ ์œ ์˜

Review Video

TDP review slides 1–8

TDP figures 1–6

Background

  • ์ตœ๊ทผ LLM์ด reasoning๋„ tool use๋„ ์ž˜ ํ•˜๋Š” ๊ฒƒ์ฒ˜๋Ÿผ ๋ณด์ด์ง€๋งŒ, long-horizon task์—์„œ ํ•œ๊ณ„
    • step-wise planning: ReAct, ReCode
      • environment ํ”ผ๋“œ๋ฐฑ์— ์ฆ‰๊ฐ ๋Œ€์‘์ด ๊ฐ€๋Šฅํ•˜๋‚˜,
      • ์žฅ๊ธฐ ๊ณผ์ œ์—์„œ ๊ทผ์‹œ์•ˆ์  ๊ฒฐ์ • ์ˆ˜ํ–‰, ํžˆ์Šคํ† ๋ฆฌ ๋ฌดํ•œ ํ™•์žฅ์— ๋Œ€ํ•œ ํ•œ๊ณ„
    • one-shot (plan-then-act): Plan-and-act, Pre-Act
      • ์ „์—ญ ์ •๋ณด ํŒŒ์•…์ด ๊ฐ€๋Šฅํ•˜์ง€๋งŒ ์ดˆ๊ธฐ planning ์˜ค๋ฅ˜์— ์ทจ์•ฝํ•˜์—ฌ ์žฌ๊ณ„ํš ๋น„์šฉ์ด ํผ

Problem Statement

Claims that LLM-based agents fail on long-horizon tasks because of entangled planning.

  • reasoning over a long execution history in which thoughts, failures, and information from multiple subtasks are mixed together (rising cognitive load)
    • hence error propagation (a local failure influences seemingly unrelated decisions)
    • and inefficient recovery: replanning even the parts unrelated to the root cause
  • common trait of prior work: structurally, reasoning over a single execution history where information / decisions / failures from multiple subtasks are interleaved
    • the essence of the problem is not how fine-grained the plan is,
    • but how to contain error propagation caused by context entanglement, i.e., the model's reasoning scope must be bounded

Proposed Method

Task-Decoupled Planning (TDP)

  • ๋ช…์‹œ์  decoupling (=task๋ฅผ ๊ฐ€๋Šฅํ•œํ•œ ๋ถ„๋ฆฌํ•˜๋ฉด์„œ) planning ํ•˜์ž
  • #1 Supervisor: ์ „์ฒด ๊ณผ์ œ๋ฅผ Subtask๋กœ ๋ถ„๋ฆฌ > DAG(directed Acyclic Graph)๋กœ ์ •๋ฆฌ
    • node == 1๊ฐœ์˜ sub-task
    • edge == dependency
  • #2 Planner: single node์— ๋Œ€ํ•ด์„œ๋งŒ ๊ณ„ํš ์ƒ์„ฑ, input์„ Node-Scoped Context๋กœ ์ œํ•œ
    • Node-Scoped Context = {subtask ๋ช…์„ธ + ์„ ํ–‰ node=subtask ๊ฒฐ๊ณผ + ํ•ด๋‹น ๋…ธ๋“œ์˜ local execute ๊ธฐ๋ก}
      • ๋‹ค๋ฅธ subtask์— ๋Œ€ํ•œ history๋‚˜ ๊ณผ๊ฑฐ ์‹คํŒจ ๋กœ๊ทธ ๋“ฑ ์ œ์™ธ
  • #3 Executor: ๊ณ„ํš์„ ๋‹จ๊ณ„์  ์‹คํ–‰
    • ์ „์ฒด ํžˆ์Šคํ† ๋ฆฌ๋ฅผ ๋ณด์ง€ ์•Š๊ณ  ํ˜„์žฌ ๋…ธ๋“œ history๋งŒ ๋ณด๋˜, env. ์™€ ์ƒํ˜ธ์ž‘์šฉ ๋‹ด๋‹น
  • #4-1 local revision: ์‹คํŒจ์‹œ (ํ•„์š”ํ•˜๋‹ค๋ฉด) ์žฌ๊ณ„ํš์€ ๋…ธ๋“œ ๋‚ด๋ถ€์—์„œ๋งŒ ๋ฐœ์ƒ๋˜๋„๋ก (localized replanning)
  • #4-2 global revision: ์–ด๋–ค ์‹คํŒจ๊ฐ€ DAG ๊ตฌ์กฐ ์ž์ฒด์˜ ์˜ค๋ฅ˜๋ฅผ ์˜๋ฏธํ•œ๋‹ค๋ฉด ๊ทธ๋•Œ Supervisor๊ฐ€ DAG ์ˆ˜์ • (๋…ธ๋“œ ์ถ”๊ฐ€/์ œ๊ฑฐ)

Effects

  • RQ1: does it actually improve performance (task success, constraint satisfaction, reward, accuracy)?
  • RQ2: does cost drop as well (e.g., tokens newly generated for replanning)?
  • Experimental Setup:
    • benchmarks:
      • TravelPlanner (constraint-centric tool planning): build a travel plan using multiple tools while satisfying constraints across domains (flights / lodging / restaurants / attractions / transport)
      • HotpotQA (interactive, multi-hop reasoning): search/lookup over Wikipedia, gathering multi-hop evidence before answering
      • ScienceWorld (interaction in a closed-loop environment): carry out science-experiment tasks by interacting with a text-game-like environment
    • baselines:
      • ReAct: think > act at every step
      • CoT: one-shot setup; builds a plan up front and follows it as-is to the end
      • Plan-and-Act: builds a high-level plan and replans when problems arise during execution (expensive when the replan is global)
      • TDP: the proposed method; DAG + node-scoped context + localized replanning
    • backbone: DeepSeek-3.2, GPT-4o
  • Results:
    • Tab 1: RQ1 main performance comparison; TDP consistently ranks near the top with both DeepSeek-3.2 and GPT-4o
      • TravelPlanner (key question: were the constraints satisfied): TDP raises constraint satisfaction (especially HC) while reducing delivery failures (runs that collapse midway and never finish)
        • constraint types: CS (commonsense constraints; common-sense plausibility), HC (hard constraints; concrete time/date/city/budget requirements)
        • evaluation: micro (how many of the individual constraints were satisfied), macro (all-or-nothing: every constraint in that category satisfied) > final pass (everything passed overall)
      • HotpotQA (key question: gather evidence well and deliver a final answer): TDP keeps reasoning clean at the subtask level, improving delivered correctness
        • evaluation: accuracy (is the final answer correct), Deli. Acc. (is the answer correct given task completion)
        • step-wise methods drift as the history grows
        • one-shot methods are fragile when the initial direction is wrong
      • ScienceWorld (key question: interact well using the environment's feedback, a progress-based reward on a [0,1] scale; ReAct-style step-wise methods are said to have the advantage here): in the end GPT-4o is best, and DeepSeek stays competitive
    • Fig 3: RQ2 token comparison (plan-then-act vs. TDP)
      • roughly 82% fewer tokens on HotpotQA and roughly 70-75% fewer on ScienceWorld vs. the baselines
        • Plan-and-Act rebuilds the global plan whenever a deviation occurs, repeatedly re-justifying decisions already made > token blow-up
        • TDP replans only inside the active node (locally) even on deviations, so outputs have no reason to balloon
    • Fig 4: case study (a TravelPlanner example showing in detail how the decoupling works)
      • decomposition can even fill information gaps, and node isolation (localization) blocks contamination from genuinely irrelevant information

Personal note. This looked like recent long-horizon work, so I gave it a read; on the NLP side it was a chance to check the latest benchmarks (whether the claims hold up is a separate question..). A bit of a tangent, but it actually made me think that in our recent preference-reasoning work, designing a preference scope and revising preferences within it could be one approach too. Using a DAG-like structure might even connect to preference transfer..