Advancing and Benchmarking Personalized Tool Invocation for LLMs

May 29, 2026 6 minute read

Meta info.

Authors: Xu Huang, Yuefeng Huang, Weiwen Liu, Xingshan Zeng, Yasheng Wang, Ruiming Tang, Hong Xie, Defu Lian
Paper: https://arxiv.org/abs/2505.04072
Code/Data: PTBench
Affiliation: University of Science and Technology of China, Shanghai Jiao Tong University, Huawei Noah's Ark Lab
Published: May 7, 2025 (arXiv preprint, cs.CL) / ICLR 2026 Submission 17112 (Rejected)

TL; DR

Defines a new task, Personalized Tool Invocation (Tool Preference + Profile-dependent Query), builds the LLM-based data synthesis framework PTool and the first benchmark PTBench, and shows that Qwen2.5-7B fine-tuned with SFT alone outperforms even GPT-4-turbo on personalized tool calling

Review Video

Figure 1: Personalized Tool Invocation examples: (a) Tool Preference, (b) Profile-dependent Query
Figure 2: Three-stage pipeline of the PTool data synthesis framework
Table 1: Statistics of the synthesized dataset
Table 2: PTBench main results (baseline comparison)
Table 3 / Figure 3: User profile ablation
Figure 4: Error analysis (6 error categories)
Figure 5 / Figure 6: Model scaling and general capability

Background

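Tool invocation (= tool calling) consists of two steps: tool selection among candidate tools + parameter extraction from the query
- Prior work focuses almost entirely on improving fundamental capabilities; tool syntax compliance, function understanding, explicit instruction interpretation, parameter extraction
- tuning-free: prompting/few-shot, tool documentation rewriting, description compression, multi-agent decomposition
- tuning-based: adding tool-specific special tokens (Toolformer, ToolkenGPT), or synthesizing tool-calling samples with a strong LLM and distilling them into lightweight models (Qin et al., 2024; Liu et al., 2025)
Personalized LLMs developed as a separate line of work; mainly applied to personalized text generation and recommender systems
- The LLM acts as a content interpreter / knowledge base / explainer / direct recommender within the recsys
- The user profile is injected via the prompt or via hidden representations
Key gap: no prior work combines tool learning with personalization
- Existing tool benchmarks judge correctness only by “can this tool solve the query?”
- They ignore that multiple functionally equivalent tools exist and that a user prefers a particular one among them
- The ability to fill in missing arguments from user information when the query is underspecified has never been evaluated either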

Problem States

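Existing tool invocation assumes that tool choice and argument values are determined solely by information inside the query; reflecting real-world implicit user intent requires two additional conditions

Condition 1) Tool Preference: when multiple tools offer the same functionality, even the same user's preferred tool changes with the query context
- e.g. a platform with good after-sales service for expensive electronics, a platform with fast delivery for cheap daily necessities; the tool choice itself is a preference signal
- Requires reasoning from user attributes (age, interests, purchase behavior)
Condition 2) Profile-dependent Query: the query omits key arguments → not all parameters can be extracted from the query alone
- e.g. “Order a hamburger from KFC” lacks the delivery address / contact number / desired time → these must be inferred from the user profile to complete the call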

Suggestions

Formal definitions

General tool invocation: given a query $q$ and a candidate tool set $T$, select tools $t^i$ and fill in their arguments to produce the solution $A = [(t^i, a^i_1, \cdots, a^i_m), \cdots]$
Tool Preference (Def 3.1): user $u$ prefers $t_1$ for $q_1$ and $t_2$ for $q_2$, while both queries can be solved by either $t_1$ or $t_2$
\[t_1 \succ_{(u,q_1)} t_2 \;;\; t_2 \succ_{(u,q_2)} t_1\]
- i.e. a formal definition of the preference “any of the tools could solve it, but the user picks a specific one”
Profile-dependent Query (Def 3.2): given a user profile $P_u$, query $q$, and solution $A$, the query is profile-dependent if there exists a value $\alpha \in A$ such that $\alpha \in P_u$ and $\alpha \notin q$
\[\exists\, \alpha \in A,\ \alpha \in P_u,\ \alpha \notin q\]
- Some of the gold arguments are absent from the query and present only in the profile
- The call cannot be completed without consulting the profile
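
A minimal sketch of how Def 3.2 could be checked mechanically. Everything here is an assumption for illustration: the dict layout of the profile and solution, the helper names `flatten_values` and `profile_dependent_values`, and above all the use of case-insensitive substring matching as a stand-in for $\alpha \notin q$ (the paper's actual criterion is not specified in these notes).

```python
from typing import Any


def flatten_values(obj: Any) -> set[str]:
    """Collect every leaf value of a nested dict/list as a string."""
    if isinstance(obj, dict):
        obj = list(obj.values())
    if isinstance(obj, (list, tuple)):
        out: set[str] = set()
        for item in obj:
            out |= flatten_values(item)
        return out
    return {str(obj)}


def profile_dependent_values(query: str, profile: dict, solution: list[dict]) -> set[str]:
    """Return every argument value alpha with alpha in A, alpha in P_u, alpha not in q.

    Assumption: "alpha not in q" is approximated by a case-insensitive
    substring test against the raw query text.
    """
    profile_values = flatten_values(profile)
    dependent: set[str] = set()
    for call in solution:
        for value in flatten_values(call["arguments"]):
            if value in profile_values and value.lower() not in query.lower():
                dependent.add(value)
    return dependent


def is_profile_dependent(query: str, profile: dict, solution: list[dict]) -> bool:
    return bool(profile_dependent_values(query, profile, solution))


# Hypothetical data modeled on the KFC example above.
profile = {"name": "Alex", "address": "12 Elm Street", "phone": "555-0100"}
query = "Order a hamburger from KFC"
solution = [{
    "tool": "kfc_order",
    "arguments": {"item": "hamburger", "address": "12 Elm Street", "phone": "555-0100"},
}]

dependent = profile_dependent_values(query, profile, solution)
```

With this toy data, `dependent` holds the address and phone number, since both are gold arguments that appear only in the profile; `item` is excluded because it comes from the query.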

Tool Generation

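The tool library is synthesized with a ToolACE-style API Tree structure
- 1st-level node = everyday scenario (shopping, food delivery, office, etc.)
- Depth-first expansion recursively refines functionality → leaf node = concrete API description
Key addition: a platform concept introduced at the 2nd level
- Multiple platforms with different characteristics are generated within one scenario; e.g. in video entertainment, YouTube (long-form) vs TikTok (short-form)
- This yields tools that are functionally compatible but differ in characteristics, which creates the training signal for Tool Preference (impossible with a plain API Tree alone)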

User Profile Construction

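Aims to satisfy three constraints simultaneously; (1) define a feature set linked to tool choice (2) diversity for generalization to unseen users (3) include only observable basic/behavioral information and exclude fine-grained psychological attributes
Bottom-up Feature Tree Construction: tool-driven hierarchical clustering
- leaf node = platform characteristics + tool parameters
- LLM clustering recursively merges semantically close parameters and summarizes them into higher-level features; repeated until the number of parent nodes falls within a threshold
- Features fall into two classes; explicit basic features (age, gender, etc., directly observable) vs implicit preferences (shopping preference, etc., latent, used only later for behavior generation)
Top-down Characteristic Assignment: assign diverse feature values to produce distinct profiles
- Avoids the drawbacks of assigning one user at a time, N times (high cost, hard to avoid repetition) and of assigning N users at once (context length limit)
- Hierarchical assignment along the tree structure; $k_l$ values are assigned simultaneously to a node at level $l$, and level $l{+}1$ generates $k_{l+1}$ values for each parent value
- A tree of depth $L$ yields $N = \prod_{l=0}^{L} k_l$ profiles in the end; each $k_l$ is far smaller than $N$, so diverse features can be generated in a single call
User Behavior Generation: LLM role-playing generates behaviors based on the profile + platform characteristics
- e.g. for a budget-conscious user, interactions such as “search for a hiking backpack on Amazon” or “buy $30 coffee on Walmart”
- Implicit preferences are hidden from the model during task execution, but are inserted into the solution-generation prompt and thus determine the gold tool choice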
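
The top-down assignment and its $N = \prod_l k_l$ count can be sketched as follows. This is a simplification under stated assumptions: the tree is flattened to one feature per level, and the LLM call that proposes $k_l$ values conditioned on the parent values is replaced by a deterministic stub (`propose_stub` drawing from a fixed `POOL`). The feature names and values are made up for illustration.

```python
import math


def assign_profiles(levels, propose):
    """Expand partial profiles level by level.

    levels:  list of (feature_name, k_l) pairs, top of the tree first.
    propose: callable(feature, partial_profile, k) -> k distinct values;
             stands in for one LLM call per parent value.
    """
    profiles = [{}]
    for feature, k in levels:
        expanded = []
        for partial in profiles:
            for value in propose(feature, partial, k):
                expanded.append({**partial, feature: value})
        profiles = expanded
    return profiles


# Stub proposal function (assumption): a real pipeline would prompt an LLM
# with the partial profile so that children stay consistent with parents.
POOL = {
    "age_group": ["18-25", "26-40", "41-60"],
    "occupation": ["student", "engineer", "teacher", "nurse"],
    "shopping_preference": ["budget-conscious", "quality-first"],
}


def propose_stub(feature, partial, k):
    return POOL[feature][:k]


levels = [("age_group", 3), ("occupation", 2), ("shopping_preference", 2)]
profiles = assign_profiles(levels, propose_stub)
expected_n = math.prod(k for _, k in levels)  # 3 * 2 * 2
```

Each proposal call only asks for 2 or 3 values, yet the product gives 12 distinct profiles, which is the point of the hierarchical scheme: no single call has to produce all $N$ users.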

Query and Solution Generation

Query-solution pairs are generated through multi-agent collaboration; user agent + assistant agent
- user agent: generates queries by role-playing from the profile (basic + implicit features); platform information is included in the prompt to elicit queries that reflect platform preference
- It is also instructed not to expose profile information in the query → profile-dependent queries emerge naturally
- assistant agent: generates the tool invocation solution for that query
Two-tier verification secures the reliability of the gold answers
- rule-based validation: format check; blocks unparseable output and hallucinated tools/arguments
- model-based verification: the (profile, query, solution) triple is fed to an LLM to check parameter correctness, hallucination, and whether the query is resolved
- Humans then manually review argument correctness; each parameter is labeled profile-related or query-related → enables fine-grained error feedback
The result is PTBench (Tab 1); synthesized with GPT-4-turbo, 5 scenarios (shopping/takeout/entertainment/work/travel) × 3 platforms × 24 APIs
- train 74 users / 7,096 queries, test 80 users / 1,083 queries (human-verified)
- test is split into trained (474 queries from some of the 74 users seen in training) + untrained (609 queries from 6 users absent from training) → measures generalization
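
The rule-based tier of the verification could look roughly like the sketch below. The JSON output format (`name` / `arguments`), the `TOOLS` registry, and the tool and parameter names are all assumptions; the notes only say that this tier checks format and blocks unparseable output and hallucinated tools or arguments.

```python
import json

# Hypothetical tool registry: tool name -> set of parameter names it accepts.
TOOLS = {
    "kfc_order": {"item", "address", "phone", "delivery_time"},
}


def validate(raw_output: str, tools: dict[str, set[str]]) -> list[str]:
    """Return a list of rule violations; an empty list means the output passes."""
    try:
        calls = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["unparseable output"]
    if not isinstance(calls, list):
        return ["expected a list of tool calls"]

    errors = []
    for i, call in enumerate(calls):
        if not isinstance(call, dict):
            errors.append(f"call {i}: not an object")
            continue
        name = call.get("name")
        args = call.get("arguments", {})
        if name not in tools:
            errors.append(f"call {i}: unknown tool {name!r}")
            continue
        if not isinstance(args, dict):
            errors.append(f"call {i}: arguments must be an object")
            continue
        # Any parameter outside the tool's schema counts as hallucinated.
        for key in sorted(set(args) - tools[name]):
            errors.append(f"call {i}: unknown parameter {key!r}")
    return errors


good = '[{"name": "kfc_order", "arguments": {"item": "hamburger", "address": "12 Elm Street"}}]'
bad_tool = '[{"name": "mcd_order", "arguments": {"item": "hamburger"}}]'
bad_param = '[{"name": "kfc_order", "arguments": {"item": "hamburger", "coupon": "X1"}}]'
```

Checks of this kind only catch structural problems; whether an argument value is actually correct for the user is left to the model-based and human tiers described above.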

Effects

Experimental setup
- Trained model: Qwen2.5-7B-Instruct fine-tuned (SFT) on the synthesized data; LoRA due to resource constraints (rank 8, alpha 16, lr $10^{-4}$, cosine scheduler, warmup 0.1, 1 epoch)
- metrics (8, all exact-match accuracy): Format acc (instruction following), Platform acc (= tool preference recognition), Query-related / Profile-related param-value acc, Tool-name / Tool-param / Tool-value acc, Overall (trained/untrained/overall)
- baselines: API models (GPT-4-turbo, GPT-4o, DeepSeek-v3/r1, Qwen-max, Claude-3.5-sonnet), OSS models (DeepSeek-R1-Distill-Llama-8B/Qwen-7B, Qwen2.5-7B-Instruct, Llama-3.1-8B, Mistral-7B-v0.3), tool-specialized models (Hammer2.1-7B, ToolACE-8B, watt-tool-8B, xLAM-7B-r)
Results
- Tab 2 overall performance: the trained model (Ours) is best with overall 0.2678; best baseline GPT-4-turbo 0.1847, base Qwen2.5-7B 0.0738. Platform (tool preference) acc of 0.7374 leads by a wide margin (GPT-4-turbo 0.5484, Claude-3.5-sonnet 0.5826, base Qwen2.5-7B 0.3795)
- Finding 1: API models outperform OSS models across the board (effect of parameter scale)
- Finding 2: most models are weak at tool preference (low platform acc); even the SOTA GPT-4-turbo is only around 0.55 → suggests LLMs fail to pick the tool that fits the profile; Ours beats the baselines on almost every metric
- Finding 3: large gains on all tasks after training; the gain on Tool Preference is the largest (0.3795→0.7374). Gains also hold on untrained users → the authors argue this is generalization rather than memorization
- Finding 4: every model scores lower on profile-dependent params than on query-dependent params → inferring from the profile is harder. Ours falls short of GPT-4-turbo on query params but beats the larger models on profile params
- Tab 3 / Fig 3 user profile ablation: All 0.2678 / w/o Basic 0.1606 / w/o History 0.2493 / w/o Basic&History 0.0674; without history, tool preference drops sharply, and without basic features, tool invocation drops sharply (because of profile-dependent queries) → the two kinds of information are complementary
- Fig 4 error analysis: 6 categories (wrong/missing/excessive tool, wrong/missing/excessive param); filling parameters is harder than selecting tools, so P-wrong dominates (GPT-4-turbo 66.5%, base Qwen 58.4%, Ours 79.5%). After training, the share of tool-selection errors shrinks, so the relative share of param errors grows
- Fig 5 model scaling: among Qwen2.5 0.5/1.5/3/7B, only 3B and 7B improve substantially with training, while 0.5B and 1.5B barely change → personalized tool invocation is a high-level ability that needs a certain capacity
- Fig 6 general capability: no degradation relative to base Qwen2.5-7B on MMLU/HumanEval/GSM8K/CommonsenseQA/BFCL non-live; BFCL non-live actually improves → personalization training does not hurt general ability
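
Since every metric is exact-match, it helps to see why the overall number can be far below the per-field numbers. The sketch below is an assumed scoring scheme, not the paper's evaluation code: it treats each sample as a single tool call, scores parameter values per parameter using the query/profile labels, and counts a sample as overall-correct only if the whole call matches. The function name `evaluate` and the sample data are hypothetical.

```python
def evaluate(samples):
    """samples: list of (pred, gold, labels) triples.

    pred/gold: {"name": str, "arguments": dict}
    labels:    gold parameter name -> "query" or "profile"
    """
    totals = {"tool_name": [], "query_param": [], "profile_param": [], "overall": []}
    for pred, gold, labels in samples:
        totals["tool_name"].append(pred["name"] == gold["name"])
        for param, value in gold["arguments"].items():
            hit = pred["arguments"].get(param) == value
            totals[f"{labels[param]}_param"].append(hit)
        # Overall is all-or-nothing: one wrong value fails the whole sample.
        totals["overall"].append(pred == gold)
    return {k: (sum(v) / len(v) if v else 0.0) for k, v in totals.items()}


gold = {
    "name": "kfc_order",
    "arguments": {"item": "hamburger", "address": "12 Elm Street", "phone": "555-0100"},
}
labels = {"item": "query", "address": "profile", "phone": "profile"}

pred_one_slip = {
    "name": "kfc_order",
    "arguments": {"item": "hamburger", "address": "12 Elm Street", "phone": "555-0199"},
}
pred_exact = {
    "name": "kfc_order",
    "arguments": {"item": "hamburger", "address": "12 Elm Street", "phone": "555-0100"},
}

scores = evaluate([(pred_one_slip, gold, labels), (pred_exact, gold, labels)])
```

In this toy case tool-name and query-param accuracy are both 1.0 and profile-param accuracy is 0.75, yet overall accuracy is only 0.5, because a single wrong profile value sinks the entire first sample.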

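Personal note. The paper was rejected from ICLR 2026; I looked into it because a reviewer brought it up in a rebuttal I am currently handling. There are clearly points that make it hard to take the paper's claims at face value, above all the setting in which the user profile is fed in whole as structured JSON at inference time. With that, the main challenge shifts to finding the right field in an already-organized profile and plugging it in, and I suspect SFT works well largely because that mapping is relatively easy to memorize. The queries are also single-session, one-shot queries, so temporal drift in preferences and variation across sessions are left out, which casts further doubt on the generalization claim. Overall accuracy sitting around 0.27 even for the trained model seems to be a consequence of the stricter exact-match scoring. In the end, reviewers already pointed out that the “first to bring personalization into tool calling” framing is questionable as a claim of being first, and they also flagged that the model used for synthesis during data construction was used again.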