
Meta info.
  • Authors: Hongjin Su, Shizhe Diao, Ximing Lu, Mingjie Liu, Jiacheng Xu, Xin Dong, Yonggan Fu, Peter Belcak, Hanrong Ye, Hongxu Yin, Yi Dong, Evelina Bakhturina, Tao Yu, Yejin Choi, Jan Kautz, Pavlo Molchanov
  • Paper: https://arxiv.org/pdf/2511.21689
  • Affiliation: Hong Kong Univ., NVIDIA
  • Published: November 26, 2025
  • Code: https://github.com/NVlabs/ToolOrchestra/

TL;DR

Proposes a tool-based agent framework in which a small 8B orchestrator model, trained with RL, coordinates a variety of tools and LLMs to jointly optimize accuracy, cost, latency, and user preference. It outperforms GPT-5 at lower cost.


Background

  • Recent tool agents pair a strong LLM with tools such as search, a calculator, and code execution
    • RL-based optimization has been applied, but only with rewards based on answer correctness or a penalty on cost
      • Users have real preferences over cost, speed, and which tools are used, and these are not reflected
  • When GPT-5 or Qwen3-8B is simply prompted to call other models, it overuses its own variants or only the strongest model

Problem Statement

Learn a policy for a small model that decides when and how to call various tools (including larger models), so that accuracy goes up while cost and latency go down and user preferences are respected.

Suggestions

ToolOrchestra → trains Orchestrator-8B (Qwen backbone)

  • Put one small orchestrator LLM in charge and have it learn, in an MDP + RL framing, which tool (or LLM) to use and how to use it at every step
    • The reward is designed as a vector inner product that reflects accuracy, cost, latency, and the user's tool preferences at once
  • Tool orchestration defined as an MDP
    • state: the dialogue and tool-call history so far, i.e. the full history is treated as the state
    • action: decide which tool/LLM to call, which parameters to pass, whether to stop, etc.
    • environment: returns the tool execution result as an observation
    • trajectory: R is designed per episode from how well the problem was solved, how many tool calls were made, and how much cost was spent in total
  • tool interface:
    • Everything is unified under a JSON-based format
      • For each tool, define what it does (and what its limits are) + a parameter schema
      • i.e. the model reasons at the text level about whether each tool should be used
    • output: the model calls a tool with CoT-based JSON and takes the result as an Observation
  • Reward modeling: built as a metric vector (so metrics on different scales can be handled together)
    • Answering correctly is the precondition; given that, which tools were used and how often + cost + latency are scored against the user preference vector P
    • Correctness, judged after the trajectory ends
    • On the efficiency side, penalties on cost and latency
    • Number of tool calls, etc.
    • user preference vector P:
      • Natural-language preferences are mapped to preference scalars p; the features are assigned in the same way as the reward terms above
      • e.g. GPT-5 is expensive, so p_{gpt-5} = 0.1; qwen3-32b and math_llm are cheap, so 0.7 and 0.8 respectively, …; saving compute matters → p_{\text{compute}} = 0.8
  • GRPO update
    • Roll out the current orchestrator parameters \theta on multiple input tasks
      • Sample several trajectories per task (with a slightly raised temperature)
    • For each trajectory \tau, compute the reward R(\tau) described above
    • Samples for the same input are bundled into a “group”, and rewards are normalized within the group:
      • \hat{R}(\tau) = (R(\tau) - \mu_{\text{group}})/\sigma_{\text{group}}
    • Use this as the advantage and update the policy gradient under the GRPO objective
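
The reward and the group normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names `trajectory_reward` and `group_normalize`, the metric keys, the latency weight, and all metric values are hypothetical, and the metrics are assumed to be pre-scaled to comparable ranges. Only the preference values 0.1, 0.7, 0.8, and 0.8 are taken from the example in the notes.

```python
from statistics import mean, pstdev


def trajectory_reward(correct, metrics, preference):
    """Preference-weighted inner product, gated on answer correctness.

    Assumption: an incorrect trajectory gets zero reward, following the
    note that correctness is the precondition for any score.
    """
    if not correct:
        return 0.0
    return sum(preference[key] * metrics[key] for key in preference)


def group_normalize(rewards, eps=1e-8):
    """Normalize rewards within one group: (R - mu_group) / sigma_group.

    Assumption: population standard deviation, plus a small eps so a
    group with identical rewards yields zeros instead of dividing by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# User preference vector P. The latency weight is a made-up value.
preference = {
    "gpt-5": 0.1,
    "qwen3-32b": 0.7,
    "math_llm": 0.8,
    "compute": 0.8,
    "latency": 0.5,
}

# One group = several sampled trajectories for the same input task.
# Tool entries are hypothetical usage scores; cost and latency enter
# with a negative sign so that spending more lowers the reward.
group = [
    (True, {"gpt-5": 1.0, "qwen3-32b": 0.0, "math_llm": 0.0,
            "compute": -0.9, "latency": -0.6}),
    (True, {"gpt-5": 0.0, "qwen3-32b": 1.0, "math_llm": 1.0,
            "compute": -0.3, "latency": -0.4}),
    (False, {"gpt-5": 0.0, "qwen3-32b": 1.0, "math_llm": 0.0,
             "compute": -0.1, "latency": -0.2}),
]

rewards = [trajectory_reward(ok, m, preference) for ok, m in group]
advantages = group_normalize(rewards)
```

With these made-up numbers, the correct trajectory that relies on the cheaper models ends up with the largest normalized reward in its group, so the policy update pushes toward it. Whether the paper uses the population or the sample standard deviation is not stated in these notes.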

Effects

  • target tasks: HLE (PhD-level QA), FRAMES (Wikipedia-based multi-hop RAG), \tau^2-Bench (multi-domain function-calling benchmark)
  • result:
    • Tab 1: Orchestrator is more accurate than GPT-5 and about 3x cheaper = strong routing performance
      • Fig 3, Tab 15: calls are spread more evenly across GPT-5, GPT-5-mini, Qwen3-32B, the code model, the math model, search, the code interpreter, and so on
      • It calls GPT-5 far less often than the other models do, while performing better
    • Generalizes to unseen tools
      • Also works with LLMs that were not used during training, i.e. it stays robust when tools are swapped out
    • Fig 6: achieves better accuracy under a given cost

Personal note. This could serve as evidence that the research I am currently working on matters, but at the same time, the problem I set up in my own research, which started from the same motivation, now looks rather minor by comparison…