
Meta info.
  • Authors: Zhouhang Xie, Junda Wu, Yiran Shen, Yu Xia, Xintong Li, Aaron Chang, Ryan Rossi, Sachin Kumar, Bodhisattwa Prasad Majumder, Jingbo Shang, Prithviraj Ammanabrolu, Julian McAuley
  • Paper: https://arxiv.org/pdf/2504.07070
  • Affiliation: Adobe Research, Allen Institute for AI, Ohio State University, UCLA, University of California San Diego
  • Published: April 9, 2025

TL;DR

LLM์—์„œ์˜ ๊ฐœ์ธํ™”/๋‹ค์›์  ์„ ํ˜ธ ์ •๋ ฌ์„ training/test-time, ์‚ฌ์šฉ์ž ๋ชจ๋ธ๋ง ๊ธฐ๋ฐ˜ ๋ฐฉ๋ฒ•์œผ๋กœ ์ฒด๊ณ„ํ™”, ํ‰๊ฐ€ ๋ฐ ํ™•์žฅ์„ฑ ์ธก๋ฉด์˜ ๊ตฌ์กฐ์  ํ•œ๊ณ„ ํ™•์ธ


Background

  • Most recent LLM alignment research assumes average human preferences and a single set of values
  • In contrast, the paper points out that real-world human preferences are heterogeneous, contextual, and non-stationary
  • Qualitatively different from personalization in user modeling, persona roleplaying, and recommendation

Problem Statement

๋ชจ๋ธ ์ถœ๋ ฅ์„ user๊ฐ€ ์„ ํ˜ธํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ alignmentํ•˜๊ธฐ

  • Defined as per-user policy optimization, going beyond simple prompt-level injection
  • personalized reward function: for input x, output y, and user u, r: \mathcal{X} \times \mathcal{Y} \times \mathcal{U} \to \mathbb{R}
  • objective: \pi_u^* = \arg\max_{\pi_u} \mathbb{E}_{x,\, y \sim \pi_u}[r(x, y, u)]

Suggestions

A three-way taxonomy is proposed

  • training-time alignment: learn user-specific parameters
    • can capture implicit preferences, but
    • update costs are a burden: privacy concerns, the difficulty of collecting per-user feedback, etc.
    • user embedding
    • adapter / PEFT
    • steering vector
    • user-specific head
    • mixture of preference experts
    • multiple reward models …
  • inference-time alignment
    • prompt-based
      • retrieval, prompt rewriting, in-context preference rules
      • being training-free looks appealing, but
      • bound to the context window and treats preferences as static, so weak for long-term improvement
    • reward / value guided decoding
      • injects rewards at the token level
      • MCTS, autoregressive reward, …
      • allows fine-grained control and optimization against an explicit objective, but
      • computationally inefficient, dependent on the reward model, and hard to apply in real time
    • logit rectification / re-alignment
      • pairs the LLM with a small aligned model
      • modifies logits during decoding
      • lightweight and avoids retraining the whole model, but dependent on the small model's quality, and the range of preferences it can cover is limited
  • user modeling (not alignment per se, but reviewed as a key auxiliary axis since it contributes to user satisfaction)
    • e.g. memory-based agents, long-term user facts, persona simulation (teacher, therapist, etc.)
    • points out the limitation that predicting user behavior does not necessarily improve user satisfaction

Effects

  • experiment setup
    • dataset landscape: mostly synthetic data
      • limited to style dimensions, persona-conditioned generation, or LLM-based user simulation
      • real-user data is small in scale and expensive to collect, and cultural bias is also pointed out
    • evaluation: pairwise ranking or LLM-as-a-judge (LAAJ)
      • metrics are inconsistent, so benchmarks cannot be compared against each other
      • evaluations that "assume" a particular preference structure
  • Future work
    • online / continual personalization: multi-session, non-stationary preferences
    • handling complex, long-horizon values: the instruction-following collapse problem
    • feedback sparsity; structural limits of per-user training
      • privacy & federated learning: federated personalization is still immature

Personal note. The intro covers essentially every keyword I've been circling while working on memory, preference, and related topics recently, so I read it with a focus on my main areas of interest. A question I received last Friday also touched on the distinction between preference and personalization, and this paper's intro should help with framing that direction. My current research falls squarely into the prompt-based, inference-time category, and its limitations are exactly the ones this survey points out for that category; naturally, I'll keep thinking about how to solve them a bit more cleverly.