
Meta info.
  • Authors: Xu Huang, Yuefeng Huang, Weiwen Liu, Xingshan Zeng, Yasheng Wang, Ruiming Tang, Hong Xie, Defu Lian
  • Paper: https://arxiv.org/abs/2505.04072
  • Code/Data: PTBench
  • Affiliation: University of Science and Technology of China, Shanghai Jiao Tong University, Huawei Noah's Ark Lab
  • Published: May 7, 2025 (arXiv preprint, cs.CL) / ICLR 2026 Submission 17112 (Rejected)

TL; DR

Defines a new task, Personalized Tool Invocation (Tool Preference + Profile-dependent Query), builds the LLM-based data synthesis framework PTool and the first benchmark PTBench, and with SFT alone pushes Qwen2.5-7B past even GPT-4-turbo on personalized tool calling

Review Video

  • Figure 1: Personalized Tool Invocation examples (a) Tool Preference (b) Profile-dependent Query
  • Figure 2: PTool data synthesis framework, 3-stage pipeline
  • Table 1: synthetic dataset statistics
  • Table 2: PTBench main results (baseline comparison)
  • Table 3 / Figure 3: user profile ablation
  • Figure 4: Error Analysis (6 error categories)
  • Figure 5 / Figure 6: model scaling and general capability

Background

  • Tool invocation (= tool calling) has two steps: tool selection among candidate tools, and parameter extraction from the query
    • Prior work focuses almost entirely on fundamental capabilities: following tool syntax, understanding functions, interpreting explicit instructions, extracting parameters
    • tuning-free: prompt/few-shot, tool documentation rewriting, description compression, multi-agent decomposition
    • tuning-based: adding tool-specific special tokens (Toolformer, ToolkenGPT), or synthesizing tool-calling samples with a strong LLM and distilling them into a lightweight model (Qin et al., 2024; Liu et al., 2025)
  • Personalized LLMs developed as a separate line of work, mainly applied to personalized text generation and recommender systems
    • The LLM acts as the recsys content interpreter / knowledge base / explainer / direct recommender
    • The user profile is injected through the prompt or a hidden representation
  • Key gap: no prior work combines tool learning with personalization
    • Existing tool benchmarks use only "can this tool solve the query?" as the correctness criterion
    • They ignore that several functionally equivalent tools exist and that a user prefers a particular one among them
    • The ability to fill missing arguments from user information when the query is underspecified has also never been evaluated

Problem States

Conventional tool invocation assumes that the tool choice and the argument values are determined solely by information inside the query; reflecting real-world implicit user intent requires two additional conditions to hold

  • Condition 1 ) Tool Preference: when several tools offer the same function, even the same user prefers a different tool depending on the query context
    • e.g. a platform with good after-sales service for expensive electronics, a fast-delivery platform for cheap daily necessities; the tool choice itself is a preference signal
    • Requires reasoning from user attributes (age, interests, purchase behavior)
  • Condition 2 ) Profile-dependent Query: the query omits key arguments → not every parameter can be extracted from the query alone
    • e.g. "Order a burger from KFC" leaves out the delivery address / contact number / desired time → the call can only be completed by inferring them from the user profile

Suggestions

Formal definitions

  • general tool invocation: given a query $q$ and a candidate tool set $T$, pick tools $t^i$ and fill in their arguments to produce the solution $A = [(t^i, a^i_1, \cdots, a^i_m), \cdots]$
  • Tool Preference (Def 3.1): user $u$ prefers $t_1$ for $q_1$ and $t_2$ for $q_2$, even though both queries can be solved by either $t_1$ or $t_2$

    \[t_1 \succ_{(u,q_1)} t_2 \;;\; t_2 \succ_{(u,q_2)} t_1\]
    • i.e. a formal definition of preference: "any of the tools could solve it, but the user picks a particular one"
  • Profile-dependent Query (Def 3.2): given user profile $P_u$, query $q$, and solution $A$, the query is profile-dependent if there exists a value $\alpha \in A$ with $\alpha \in P_u$ and $\alpha \notin q$

    \[\exists\, \alpha \in A,\ \alpha \in P_u,\ \alpha \notin q\]
    • Some of the ground-truth arguments are absent from the query and present only in the profile
    • The call cannot be completed without consulting the profile
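Def 3.2 can be read as a simple membership test. The sketch below is my own illustration, not the paper's code: it approximates $\alpha \in P_u$ by equality with a profile value and $\alpha \notin q$ by a case-insensitive substring check, and the profile fields, tool name, and query are made up.

```python
def profile_dependent_values(solution, profile, query):
    """Return argument values found in the profile but not in the query text.

    Assumption: membership is approximated by string equality / substring match.
    """
    profile_values = {str(v) for v in profile.values()}
    found = []
    for _tool, args in solution:
        for value in args.values():
            text = str(value)
            if text in profile_values and text.lower() not in query.lower():
                found.append(text)
    return found


# Hypothetical example data (not taken from PTBench).
profile = {"address": "12 Elm Street", "phone": "555-0100"}
query = "Order a burger from KFC"
solution = [("kfc_order", {"item": "burger", "address": "12 Elm Street"})]

# A non-empty result means the query is profile-dependent under Def 3.2.
print(profile_dependent_values(solution, profile, query))
```

A real checker would need fuzzier matching (paraphrased addresses, normalized dates), which is presumably where the human labeling of profile-related vs query-related parameters comes in.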

Tool Generation

  • The tool library is synthesized with a ToolACE-style API Tree
    • 1st level node = everyday scenario (shopping, food delivery, office, etc.)
    • depth-first expansion recursively refines functionality → leaf node = concrete API description
  • Key addition: a platform concept at the 2nd level
    • Several platforms with different characteristics are generated inside one scenario; e.g. YouTube (long-form) vs TikTok (short-form) in video entertainment
    • This yields tools that are functionally compatible but differ in character, which creates the learning signal for Tool Preference (impossible with a plain API Tree)

User Profile Construction

  • Goal: satisfy three constraints at once; (1) define a feature set tied to tool selection (2) enough diversity to generalize to unseen users (3) include only observable basic/behavioral information and exclude fine-grained psychological attributes
  • Bottom-up Feature Tree Construction: tool-driven hierarchical clustering
    • leaf node = platform characteristics + tool parameters
    • LLM clustering recursively merges semantically close parameters and summarizes them into higher-level features; repeated until the number of parent nodes falls within a threshold
    • Features fall into two classes; explicit basic features (age, gender, etc., directly observable) vs implicit preferences (shopping preference, etc., latent, used only later for behavior generation)
  • Top-down Characteristic Assignment: assign diverse feature values to produce distinct profiles
    • Avoids the drawbacks of assigning one user at a time, N times (higher cost, hard to avoid repetition) and of assigning N users in one shot (context length limit)
    • Hierarchical assignment along the tree; a node at level $l$ gets $k_l$ values at once, and level $l{+}1$ generates $k_{l+1}$ values for each parent value
    • A tree of depth $L$ yields $N = \prod_{l=0}^{L} k_l$ profiles in total; each $k_l$ is far smaller than $N$, so diverse features can be generated in a single call
  • User Behavior Generation: LLM role-playing generates behaviors from the profile plus platform characteristics
    • e.g. for a budget-conscious user, interactions such as "searched for a hiking backpack on Amazon" or "bought $30 coffee at Walmart"
    • Implicit preferences are hidden from the model during the task, but they are inserted into the solution-generation prompt and so determine the ground-truth tool choice
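The top-down assignment and the $N = \prod_l k_l$ count can be sketched as a small recursion. This is an illustration under assumptions: `fake_sampler` stands in for the LLM call that proposes $k_l$ values conditioned on the parent path, and the feature names and $k_l$ values are invented.

```python
from math import prod


def assign(levels, sample_values, parent=()):
    """Expand the feature tree top-down: each parent path gets k_l child values."""
    if not levels:
        return [parent]
    (feature, k), rest = levels[0], levels[1:]
    profiles = []
    for value in sample_values(feature, k, parent):
        profiles.extend(assign(rest, sample_values, parent + ((feature, value),)))
    return profiles


def fake_sampler(feature, k, parent):
    # Stand-in for one LLM call that returns k diverse values for `feature`.
    return [f"{feature}_{i}" for i in range(k)]


# Hypothetical levels as (feature name, k_l).
levels = [("age_group", 3), ("occupation", 4), ("city", 2)]
profiles = assign(levels, fake_sampler)

# The number of distinct profiles equals the product of the k_l.
print(len(profiles), prod(k for _, k in levels))
```

No single call ever has to produce more than $k_l$ values, which is the point of the hierarchical scheme compared with asking for all $N$ profiles at once.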

Query and Solution Generation

  • query-solution pairs are generated through multi-agent collaboration; user agent + assistant agent
    • user agent: generates queries by role-playing the profile (basic + implicit features); platform information is put in the prompt to elicit queries that reflect platform preference
    • It is also instructed not to expose profile information in the query → profile-dependent queries arise naturally
    • assistant agent: generates the tool invocation solution for that query
  • two-tier verification secures the reliability of the ground truth
    • rule-based validation: format check; blocks unparseable output and hallucinated tools or arguments
    • model-based verification: the profile+query+solution triple is fed to an LLM to check parameter correctness, hallucination, and whether the query is actually resolved
    • Humans then check argument correctness directly; each parameter is labeled profile-related or query-related → enables fine-grained error feedback
  • The result is PTBench (Tab 1); synthesized with GPT-4-turbo, 5 scenarios (shopping/takeout/entertainment/work/travel) × 3 platforms × 24 APIs
    • train 74 users / 7,096 queries, test 80 users / 1,083 queries (human-verified)
    • test is split into trained (474 queries from the 74 users seen in training) + untrained (609 queries from 6 users absent from training) → measures generalization
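A rule-based validation pass of the kind described above might look like the following. This is only a sketch: the JSON call format (`name` / `arguments`), the tool schema layout, and the error strings are my assumptions, not the paper's actual rules.

```python
import json

# Hypothetical tool schema: required and optional parameter names per tool.
TOOLS = {"kfc_order": {"required": {"item", "address"}, "optional": {"time"}}}


def rule_check(raw, tools):
    """Return a list of rule violations for a model output; empty means it passes."""
    try:
        calls = json.loads(raw)
    except json.JSONDecodeError:
        return ["unparseable output"]
    if not isinstance(calls, list):
        return ["expected a list of calls"]
    errors = []
    for call in calls:
        name = call.get("name") if isinstance(call, dict) else None
        if name not in tools:
            errors.append(f"unknown tool: {name}")  # hallucinated or malformed tool
            continue
        args = call.get("arguments", {})
        if not isinstance(args, dict):
            errors.append(f"arguments of {name} must be an object")
            continue
        allowed = tools[name]["required"] | tools[name]["optional"]
        errors += [f"unknown parameter: {p}" for p in sorted(set(args) - allowed)]
        errors += [f"missing parameter: {p}" for p in sorted(tools[name]["required"] - set(args))]
    return errors


good = '[{"name": "kfc_order", "arguments": {"item": "burger", "address": "12 Elm Street"}}]'
bad = '[{"name": "kfc_order", "arguments": {"item": "burger", "coupon": "X"}}]'
print(rule_check(good, TOOLS))
print(rule_check(bad, TOOLS))
```

Checks like these catch format and hallucination errors cheaply; whether the argument values are right is left to the model-based and human stages.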

Effects

  • Experimental setup
    • Trained model: Qwen2.5-7B-Instruct fine-tuned (SFT) on the synthetic data; LoRA because of resource constraints (rank 8, alpha 16, lr $10^{-4}$, cosine scheduler, warmup 0.1, 1 epoch)
    • Metrics (8, all exact-match accuracy): Format acc (instruction following), Platform acc (= tool preference recognition), Query-related / Profile-related param-value acc, Tool-name / Tool-param / Tool-value acc, Overall (trained/untrained/overall)
    • Baselines: API models (GPT-4-turbo, GPT-4o, DeepSeek-v3/r1, Qwen-max, Claude-3.5-sonnet), OSS models (DeepSeek-R1-Distill-Llama-8B/Qwen-7B, Qwen2.5-7B-Instruct, Llama-3.1-8B, Mistral-7B-v0.3), tool-specialized models (Hammer2.1-7B, ToolACE-8B, watt-tool-8B, xLAM-7B-r)
  • Results
    • Tab 2 overall performance: the trained model (Ours) is best with overall 0.2678; best baseline GPT-4-turbo 0.1847, base Qwen2.5-7B 0.0738. Platform (tool preference) acc is far ahead at 0.7374 (GPT-4-turbo 0.5484, Claude-3.5-sonnet 0.5826, base Qwen2.5-7B 0.3795)
    • Finding 1: API models generally outperform OSS models (parameter scale effect)
    • Finding 2: most models are weak at tool preference (low platform acc) ; even GPT-4-turbo, the strongest baseline overall, is only around 0.55 → suggests LLMs fail to pick the tool that fits the profile ; Ours beats the baselines on almost every metric
    • Finding 3: large gains on all tasks after training ; the Tool Preference gain is the largest (0.3795→0.7374). Gains also appear on untrained users → the authors claim generalization rather than plain memorization
    • Finding 4: every model scores lower on profile-dependent params than on query-dependent params → inferring from the profile is harder. Ours falls short of GPT-4-turbo on query params but surpasses the larger models on profile params
    • Tab 3 / Fig 3 user profile ablation: All 0.2678 / w/o Basic 0.1606 / w/o History 0.2493 / w/o Basic&History 0.0674 ; without history, tool preference drops sharply, and without basic features, tool invocation drops sharply (because of profile-dependent queries) → the two kinds of information are complementary
    • Fig 4 error analysis: 6 categories (wrong/missing/excessive tool, wrong/missing/excessive param) ; filling parameters is harder than choosing the tool, so P-wrong dominates (GPT-4-turbo 66.5%, base Qwen 58.4%, Ours 79.5%). After training, the share of tool-selection errors shrinks, so the relative share of param errors grows
    • Fig 5 model scaling: among Qwen2.5 0.5/1.5/3/7B, only 3B and 7B gain much from training, while 0.5B and 1.5B gain little → personalized tool invocation is a high-level ability that needs a certain capacity
    • Fig 6 general capability: no degradation versus base Qwen2.5-7B on MMLU/HumanEval/GSM8K/CommonsenseQA/BFCL non-live ; BFCL non-live actually improves → personalization training does not hurt general ability
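Since every metric above is exact-match, a single wrong argument value zeroes out a sample. The sketch below shows one plausible reading of the overall score; the call representation and the example data are assumptions, and the paper's own scoring script may differ in details such as call ordering or value normalization.

```python
def exact_match(pred, gold):
    """True only if every tool name and every argument value is identical."""
    return pred == gold


def accuracy(preds, golds):
    """Fraction of samples whose full predicted call list matches the gold one."""
    return sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(golds)


# Hypothetical gold and predicted calls for two test queries.
golds = [
    [{"name": "kfc_order", "arguments": {"item": "burger", "address": "12 Elm Street"}}],
    [{"name": "youtube_search", "arguments": {"keyword": "lofi"}}],
]
preds = [
    # Right tool and parameters, but an abbreviated address: counted as wrong.
    [{"name": "kfc_order", "arguments": {"item": "burger", "address": "12 Elm St"}}],
    [{"name": "youtube_search", "arguments": {"keyword": "lofi"}}],
]

print(accuracy(preds, golds))
```

Under scoring this strict, a high platform accuracy can coexist with a low overall accuracy, which matches the gap between 0.7374 and 0.2678 reported for the trained model.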

Personal note. The paper was rejected from ICLR 2026; I looked into it because a reviewer brought it up in an ongoing rebuttal. There are clear reasons not to take the paper's claims at face value. Above all, the setup feeds the entire user profile to the model as structured JSON at inference time. The main challenge then shifts to finding the right field in an already organized profile, and I suspect SFT works so well because that mapping is relatively easy to memorize. The queries are single-session and one-shot, so temporal drift in preferences and variance across sessions are missing, which also casts doubt on the generalization claim. The overall accuracy staying around 0.27 even for the trained model appears to come from the strict exact-match scoring. In short, reviewers already noted that the "first to bring personalization into tool calling" framing is questionable as a claim of novelty, and they also flagged that the model used for synthesis during data construction was used again.