2 minute read

Meta info.

TL; DR

Task - Function์œผ๋กœ ์—ฐ๊ฒฐํ•˜๋Š” Planning ๊ธฐ๋ฐ˜์˜ multi-turn* Function Calling ํ”„๋ ˆ์ž„์›Œํฌ BUTTON ์ œ์•ˆ

image 1 image 2 image 3 image 4 image 5 image 6 image

Background

  • LLM์˜ API (external tool) call์ด ๊ธฐ๋ณธ ์ง€์›
  • ์ตœ์‹  ์—ฐ๊ตฌ์—์„œ๋Š” what (์–ด๋–ค ํ•จ์ˆ˜๋ฅผ ํ˜ธ์ถœํ•ด์•ผ ํ•˜๋Š”๊ฐ€ + ๊ทธ ํ•จ์ˆ˜ augments๋Š”?)์— ์ง‘์ค‘
  • ํ˜„์‹ค์—์„œ๋Š” user request ํ•œ ๋ฒˆ์— ๋‹คํšŒ(multi-turn) call ๋ฐœ์ƒ
    • e.g. ๋Ÿฐ๋˜์—์„œ ์—๋“ ๋ฒ„๋Ÿฌ๋กœ ๊ฐ€๋Š” ์˜ค๋Š˜์˜ ์ฒซ ๋ฒˆ์งธ ๋น„ํ–‰ํŽธ์„ ์˜ˆ์•ฝํ•˜๊ณ , ๋„์ฐฉํ–ˆ์„ ๋•Œ ๋‚ ์”จ๋„ ์•Œ๊ณ  ์‹ถ์–ด: search_flights > get_weather > book_ticket
  • multi-turn: ํ•œ ๋ฒˆ์— user ์š”์ฒญ์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋ชจ๋ธ์ด ์—ฌ๋Ÿฌ๊ฐœ์˜ function call ๋‹จ๊ณ„๋ฅผ sequentialํ•˜๊ฒŒ ํ˜น์€ parallelํ•˜๊ฒŒ ๊ณ„ํš ๋ฐ ์‹คํ–‰. (user-agent๊ฐ€ ์•„๋‹˜)
    • ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” user์—๊ฒŒ๋Š” ๋ณด์—ฌ์ง€์ง€ ์•Š๋Š” tool๊ณผ assistant ์‚ฌ์ด ์ฃผ๊ณ ๋ฐ›๋Š” ๊ฒƒ์œผ๋กœ turn์œผ๋กœ ์ •์˜

Problem States

multi-turn function calling ๋ฐ์ดํ„ฐ ๊ตฌ์ถ•

  • ๊ทธ๋Ÿฐ ๋ฐ์ดํ„ฐ๋„ ์—†๊ณ 
  • ๊ทธ๋Ÿฐ ๋ฐ์ดํ„ฐ ๊ตฌ์ถ•ํ•˜๊ธฐ๋„ ์–ด๋ ค์›€: ๋ฐ์ดํ„ฐ ์„ค๊ณ„ 3๊ฐ€์ง€ ์–ด๋ ค์›€
    • compositionality : ๋‹จ์ˆœ slot filling์ด์ƒ์œผ๋กœ sub-task๋กœ ๋ถ„๋ฆฌ
    • compatibility : ๊ตฌ์ถ•ํ•œ Instruction์ด ์‹ค์ œ function ์ •์˜์™€ ๋งž์•„์•ผ
    • trajectory ํ’ˆ์งˆ : multi-turn function call์ด ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ์—ฐ๊ฒฐ
  • ๋‹จ์ˆœ ์ ‘๊ทผ์œผ๋กœ๋Š” ํ•œ๊ณ„
    • (LLMํ•œํ…Œ) ๊ทธ๋ƒฅ ํ•จ์ˆ˜ ๋ชจ์•„๋†“๊ณ  ํ•ฉ์ณ๋ด๋ผ = ์‹ค์ œ๋กœ ๋ชปํ‘ธ๋Š” ๋ฌธ์ œ ์ƒ์„ฑ
    • (LLMํ•œํ…Œ) trajectory๋ฅผ ์•Œ์•„์„œ ๋งŒ๋“ค์–ด๋ผ = ์•ž๋’ค๊ฐ€ ์•ˆ๋งž๊ณ  ์‹คํŒจ์œจ๋„ ๋†’์Œ

Suggestions

BUTTON Fig 1

  • Button Up: task > function
    • atomic task ๊ตฌ์„ฑ: ํ˜„์‹ค ์‹œ๋‚˜๋ฆฌ์˜ค ์ˆ˜์ง‘ > 1ํšŒ ํ˜ธ์ถœ๋กœ ํ’€ ์ˆ˜ ์žˆ์„๋งŒํ•œ ๊ฐ„๋‹จํ•˜๊ณ  ๋ช…๋ฃŒํ•˜๊ณ  self-containedํ•˜๊ณ  ํ•จ์ˆ˜๋ฅผ ๊ณ ๋ คํ•˜์ง€ ์•Š๋Š” task ์ƒ์„ฑ
    • compositional task ๊ตฌ์„ฑ: ์•ž์„  ๊ฒฐ๊ณผ๋ฅผ ๋’ค์—์„œ ๋ฐ›์•„์•ผํ•˜๊ฑฐ๋‚˜(sequential), ๋…๋ฆฝ์ ์ธ ๋‘ task ๋ฅผ ๊ฐ€๊ฐ ํ’€๊ณ  ๊ทธ ๊ฒฐ๊ณผ๋ฅผ ํ•ฉ์ณ์„œ ๋‹ค์Œ์œผ๋กœ ๋„˜๊ธฐ๋„๋ก(parallel-then-sequential) ๊ตฌ์„ฑ
      • atomic task๋กœ ์‹ค์ œ ์™„๊ฒฐ์ด ๊ฐ€๋Šฅํ•œ ๊ฒฝ์šฐ๋งŒ ๋‚จ๊น€
    • function ๊ตฌ์„ฑ: Descriptive(์ด๋ฆ„/์„ค๋ช…/์ž…์ถœ๋ ฅ ๋ช…๋ฃŒ), General(ํŠน์ • ๋„์‹œ์— ๊ณ ์ • X), Consistency(์—ฐ์‡„ ํ˜ธ์ถœ ์‹œ ์ถœ๋ ฅ>๋‹ค์Œ ์ž…๋ ฅ์ด ๋ฌผ๋ฆฌ๋„๋ก)
      • ์Šคํ‚ค๋งˆ: {name, description, parameters, responses, required}
      • ์‚ฌ์šฉ ๊ฐ€๋Šฅ ํ•จ์ˆ˜ ๋ชฉ๋ก์€ ์ตœ์ข…์ ์œผ๋กœ system prompt๋กœ ์ž…๋ ฅ
  • Top Down:
    • multi-agent ๊ตฌ์„ฑ:
      • user: ์งˆ๋ฌธ
      • assistant: ๋ถ„ํ•ด > ๊ณ„ํš > ํ˜ธ์ถœ
      • tool: ํ•จ์ˆ˜ ์‹œ๋ฎฌ๋ ˆ์ดํ„ฐ๋กœ ๋™์ž‘ํ•ด์„œ ์ •์˜ ๋ณด๊ณ  ๊ทธ๋Ÿด๋“ฏํ•œ ๊ฒฐ๊ณผ ๋ฑ‰๋„๋ก
    • trajectory ์ˆ˜์ง‘:
      • assistant๋Š” ์ƒ๊ฐ(react ์Šคํƒ€์ผ)+ function call์„ ์ถœ๋ ฅํ•˜๋Š”๋ฐ, tool ๊ฒฐ๊ณผ๋ฅผ ๋ฐ›์•„์„œ ๋‹ค์Œ call์„ ์ด์–ด๋‚˜๊ฐ
      • ์ตœ์ข… ๋‹ต์ธ ๊ฒƒ ๊ฐ™์œผ๋ฉด ์ตœ์ข… ์‘๋‹ต์šฉ ํ•จ์ˆ˜ ํ˜ธ์ถœํ•˜๋„๋ก ์„ค๊ณ„
  • ์‹ค์ œ ์ˆ˜์ง‘
    • GPT-4o๋กœ ๋Œ๋ ค์„œ ์ตœ์ข…์ ์œผ๋กœ 8,000๊ฐœ์˜ BUTTONInstruct ์ˆ˜์ง‘
    • ๋ณ‘๋ ฌํ˜ธ์ถœ ๋งŒ๋“  ๋ฐฉ๋ฒ•: ์—ฌ๋Ÿฌ ํ•จ์ˆ˜๊ฐ€ ์„œ๋กœ ๋…๋ฆฝ์ด๋ฉด ๋ณ‘๋ ฌ ํ˜ธ์ถœ์ด ๊ฐ€๋Šฅํ•˜๋‹ค๊ณ  ๊ฐ€์ •, ํ”„๋กฌํ”„ํŠธ๋กœ ์ œ์–ด.
      • ๋ณ‘๋Ÿด ํ—ˆ์šฉ: ๊ฐ€๋Šฅํ•˜๋ฉด ์„œ๋กœ ๋…๋ฆฝ ํ•จ์ˆ˜๋Š” ๋ณ‘๋ ฌ ํ˜ธ์ถœ์ด ๋œ๋‹ค
      • ๋ณ‘๋ ฌ ๊ธˆ์ง€: ํ•œ๋ฒˆ์— ํ•˜๋‚˜์˜ ํ•จ์ˆ˜๋งŒ ํ˜ธ์ถœํ•˜๊ณ  ์‘๋‹ต์„ ๊ธฐ๋‹ค๋ฆฐ ํ›„ ๋‹ค์Œ ํ•จ์ˆ˜ ํ˜ธ์ถœ
      • ๋ณ‘๋ ฌ ๊ฐ€๋Šฅ task(ํ˜น์€ ๊ทธ ๋ฐ˜๋Œ€์—ฌ๋„) 50%ํ™•๋ฅ ๋กœ ๊ธˆ์ง€ ํ”„๋กฌํ”„ํŠธ๋ฅผ ์„ž์—ˆ๋‹ค๊ณ  (๋งฅ๋ฝ๋ณด๊ณ  ์•Œ์•„์„œ ์“ฐ๋„๋ก ์œ ๋„ํ•˜๊ธฐ ์œ„ํ•œ ๊ตฌ์„ฑ)

Effects

  • Experiment setup:
    • training: SFT๋กœ open source LLM ํ•™์Šต
      • input: instruction + function def. + context
      • output: ํ•จ์ˆ˜ ํ˜ธ์ถœ + ์‘๋‹ต
      • ๋ฐ์ดํ„ฐ: BUTTONInstruct 8k + OpenHermes 100k (์ผ๋ฐ˜ ์ถ”๋ก ๋Šฅ๋ ฅ ์œ ์ง€ ๋ชฉ์ )
    • benchmark: Tool-Query, GTA(Generated Tool-using Agent)
    • metrics:
      • GA(grounding Accuracy): argument ์ž˜ ์ฑ„์šฐ๋Š”์ง€
      • Process Rate: ๊ณ„ํš๋œ ํ˜ธ์ถœ์„ ๋๊นŒ์ง€ ์ž˜ ์ˆ˜ํ–‰ํ–ˆ๋Š”์ง€ (๋„์ค‘ ํƒˆ๋ฝ ์—†์ด)
      • Success Rate: ์ตœ์ข… ์„ฑ๊ณต๋ฅ 
  • Main Results:
    • vanilla LLM์€ hard (Tool Query) ์„ฑ๊ณต๋ฅ  5~10% ์ˆ˜์ค€
    • BUTTON ํŠœ๋‹ํ•˜๋ฉด 30~60%๊นŒ์ง€ ์ƒ์Šน
      • ํฐ๋ชจ๋ธ์€ ๋” ์ž˜ ๊ฐœ์„ 
      • GPT-4o ์—๋„ ๊ทผ์ ‘
    • ๋ฐ์ดํ„ฐ ๊ตฌ์ถ•์—์„œ buttom up๋งŒ ํ•˜๊ฑฐ๋‚˜ top down๋งŒ ํ•œ ๊ฒฝ์šฐ ์„ฑ๋Šฅ ๊ธ‰๋ฝ โ†’ ๋‘˜ ๋‹ค ๊ณ ๋ คํ•˜๋Š”๊ฒŒ ํ•„์š”ํ–ˆ๋‹ค๊ณ  ์—ญ์„ค Fig 3

Personal note. ์ „๋‹ฌํ•˜๋Š” ๋ฉ”์„ธ์ง€๋Š”ย function calling์—๋„ planning์ด ํ•„์š”ํ•˜๋‹คย ์ด๊ธด ํ•œ๋ฐ, ์‹ค์ œ ์ €์ž๋“ค์ด ์—ฐ๊ตฌํ•œ ๋‚ด์šฉ์„ ๋‚ฉ์ž‘ํ•˜๊ฒŒ ํ•ด์„ํ•˜๋ฉด instruction-tuning์šฉ synthetic data๋ฅผ ๋งŒ๋“ค์—ˆ๋‹ค ์ •๋„๋กœ ์š”์•ฝํ•  ์ˆ˜ ์žˆ๊ณ , ๋…ผ๋ฌธ์—์„œ ์ œ์•ˆ๋„ SFTํ•ด๋ผ ์ˆ˜์ค€์— ๋จธ๋ฌด๋ฆ…๋‹ˆ๋‹ค. ๊ทธ๋Ÿผ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  ICLR์— accept(5,8,8,5)๋œ ํŠน๋ณ„ํ•œ ์ ์ด ์žˆ๋Š”์ง€ ์‚ดํ”ผ๊ณ  ์‹ถ์–ด์„œ ํ™•์ธ๋Š”๋ฐ ๊ธ€์ด ๊ฝค ๋ช…๋ฃŒํ•˜๋‹ค๋Š” ์ธ์ƒ์€ ํ™•์‹คํ•˜์ง€๋งŒ ๊ทธ ๋ฐ–์— ๋‚ด์šฉ์€ ์•„์ง๋„ ์ž˜ ๋ชจ๋ฅด๊ฒ ์Šต๋‹ˆ๋‹คย ๐Ÿค”ย ๊ฐœ์ธ์ ์œผ๋กœ๋Š” ์ฒซ๋ฒˆ์งธ ๋ฆฌ๋ทฐ์–ด๊ฐ€ ์ง€์ ํ–ˆ๋˜ task์˜ ๋…ผ๋ฆฌ์  ํ•ฉ์„ฑ ํ’ˆ์งˆ์— ๋Œ€ํ•œ ์˜๋ฌธ, ์‹ค์ œ ํ•จ์ˆ˜ ํ˜ธ์ถœ์€ ์•„๋‹Œ ์  (์ด ์—ญ์‹œ ํ•ฉ์„ฑ ์ˆ˜์ค€์ด์—ˆ๋˜ ์ ) ๋ฐ์ดํ„ฐ ํ’ˆ์งˆ์— ๋Œ€ํ•œ ๋ฌธ์˜์— ๊ณต๊ฐ์ด ๋” ๊ฐ‘๋‹ˆ๋‹ค. ์ €์ž๋Š” ๋ฐ˜๋ฐ• ํ•˜๊ธด ํ–ˆ๋Š”๋ฐย ์‚ฌ์ „์—ย ์ƒ์„ฑ์ „๋žต๋ถ€ํ„ฐ ์ž˜๋งŒ๋“ค๋„๋ก ๊ตฌ์„ฑํ–ˆ๊ธฐ ๋•Œ๋ฌธ์—ย ์‚ฌํ›„์—ย ๋ณ„๋„ ๋ณต์žกํ•œ ํ•„ํ„ฐ๋ง ๋“ฑ ๊ฒ€์ฆ์ด ํ•„์š”์—†๋‹ค๋Š” ์ฃผ์žฅ์ด ๋จนํ˜€์„œ ์ ์ˆ˜๊ฐ€ ์˜ค๋ฅธ ๊ฒƒ๋„ ์‹ ๊ธฐํ•˜๋„ค์š”. (๋ฌผ๋ก  ๊ทธ ๋ฐ˜๋ฐ•๋งŒ ์žˆ๋˜ ์ ์€ ์•„๋‹ˆ์ง€๋งŒ, ์ด ๋ฆฌ๋ทฐ์–ด๋Š” ์ตœ์ข… 5์ ์œผ๋กœ ๋งˆ๋ฌด๋ฆฌํ–ˆ์Šต๋‹ˆ๋‹ค.)