Meta info.

TL;DR

Proposes a method that automatically extracts and applies persona vectors in order to detect, predict, and mitigate personality trait shifts (sycophancy, hallucination, evil) before, after, or during LLM fine-tuning.

Background

  • Recent LLMs such as ChatGPT and Claude are trained to be helpful and to act as assistants, but
  • prompts at deployment time or fine-tuning can lead to unwanted persona shifts such as excessive sycophancy, hallucination, and malicious behavior
  • Prior work argues that such traits can correspond to linear directions in activation space
  • Work that informed the proposed method: ReFT-r1 (Wu et al., 2025) proposes extracting concept directions via contrastive prompting
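
To make "a trait as a linear direction in activation space" concrete, here is a minimal sketch using a difference of means over contrastive activations. Everything in it is an assumption for illustration: the function names `extract_direction` and `project`, the unit-norm normalization, and the synthetic 16-dimensional "activations" that stand in for real hidden states. It is not the paper's code.

```python
import numpy as np


def extract_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means direction between two sets of activations.

    pos_acts, neg_acts: arrays of shape (n_samples, hidden_dim), e.g. hidden
    states collected under trait-eliciting vs. trait-suppressing prompts.
    Returns a unit-norm vector of shape (hidden_dim,).
    Normalizing to unit length is a choice made for this sketch.
    """
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def project(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Scalar projection of each activation row onto a unit direction."""
    return acts @ direction


# Synthetic stand-in for real activations (hypothetical 16-dim hidden state).
rng = np.random.default_rng(0)
hidden_dim = 16
true_dir = np.zeros(hidden_dim)
true_dir[0] = 1.0  # pretend the trait lives along the first coordinate

neg_acts = rng.normal(size=(200, hidden_dim))
pos_acts = rng.normal(size=(200, hidden_dim)) + 3.0 * true_dir

direction = extract_direction(pos_acts, neg_acts)
pos_proj = project(pos_acts, direction)
neg_proj = project(neg_acts, direction)

print("mean projection (pos):", float(pos_proj.mean()))
print("mean projection (neg):", float(neg_proj.mean()))
```

With real models the two activation sets would come from the model's hidden states at a chosen layer. The projection gap between the two groups is what makes the direction usable as a monitor.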

Problem Statement

  • Monitor whether and how the model's personality changes during a conversation or during training
  • Mitigate undesirable personality changes, or prevent them from arising during training
  • Identify training data that can cause such changes

Suggestions

- **Automated Persona Vector Extraction (a steering-vector approach) `Fig 2`**
    - Claude 3.7 generates contrastive prompts from nothing more than a natural-language trait description (e.g., evil = causing harm)
    - `persona vector`: a linear direction computed from the difference in model activations between positive and negative samples
- Applications of persona vectors **`Fig 1`**
    - **Monitoring** persona shifts
        1. **During deployment (prompting)**: at the prompt or response level **`Fig 4`**
            1. Extract the last hidden state of the prompt (or response)
            2. Compute its projection onto the persona vector
            3. Check the correlation between the projection value and expression of the trait (a positive correlation is observed)
        2. **Before deployment (finetuning)**: at the level of the whole training set **`Fig 5/6`**
            1. Pass the entire training dataset through the base model
            2. Project the activation of each sample's response onto the persona vector
            3. Compute the mean projection over the whole dataset = predicted finetuning shift
            4. If the predicted value is high, the undesirable trait is judged likely to be amplified after training
    - **Steering interventions `Fig 3/7`**
        - Post-hoc mitigation: during inference, subtract (or add) the trait direction given by the persona vector from the model's activations
        - Preventative steering: during training, add or subtract the persona vector at each layer's hidden state so that the model is not optimized toward the undesirable trait direction (the sign is chosen so that the trait is mitigated)
    - **Flagging problematic training data**: applied per training-set sample **`Fig 8/9/10`**
        1. Pass each training-set sample (input-output pair) through the base model that has not yet been finetuned
        2. Project the activation of that response onto the persona vector
        3. Compute the projection difference: ΔP = (b: projection of the training sample's response) − (a: projection of the response the base model produces "naturally")
        4. Samples with a large ΔP == data that strongly induces the trait == flag them
        5. Apply data filtering by excluding or revising the flagged samples

Results

- **`Fig 3`** Can we directly control whether the model expresses a specific trait more strongly or less strongly?
    - + {persona_vector}: generates responses in which the trait is amplified.
    - − {persona_vector}: the trait is reduced and responses move in a neutral/factual direction.
- **`Fig 4`** In deployment, does the prompt itself induce a specific trait? (Is real-time monitoring possible?)
    - Evil, sycophancy, and hallucination all show a strong positive correlation (around r = 0.75–0.83)
- **`Fig 5/6`** Does finetuning on a particular dataset change trait expression? If so, how closely does the finetuning shift match the actual change in trait expression quantitatively?
    - **`Fig 5`** Measures how far response activations moved along the persona vector direction after finetuning = the degree varies by dataset
    - **`Fig 6`** Correlation analysis against the change in trait behavior scores shows a very strong correlation (around r = 0.75–0.97)
        - Confirms the conclusion that the **activation-level shift** alone predicts the post-finetuning change in trait expression very accurately
- **`Fig 7`** Prevention beforehand works better than post-hoc mitigation.
    - Post-hoc mitigation: reduces the trait, but general capabilities such as MMLU degrade
    - Preventative steering: keeps the trait suppressed while minimizing damage to general capabilities
- **`Fig 8/9/10`** Can trait expression be identified from the training data alone, before tuning?
    - **`Fig 8`** A high correlation is observed at the sample level; in other words, yes.
    - **`Fig 9`** Comparing the ΔP distributions of trait-inducing samples vs. ordinary samples also shows a clear difference
    - **`Fig 10`** Also effective on a large-scale public dataset
        - Adding a ΔP-based filter to screen high-projection samples in the LMSYS-CHAT-1M dataset → the increase in the trait drops markedly
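
The projection, ΔP flagging, and steering steps summarized above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the helper names (`projection`, `projection_difference`, `flag_samples`, `steer`), the threshold of `1.0`, the steering coefficient `alpha`, and the 8-dimensional synthetic activations are all hypothetical. A real setup would read activations from a model's hidden states.

```python
import numpy as np


def projection(acts: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Scalar projection of each activation row onto the unit-normalized vector v."""
    return acts @ (v / np.linalg.norm(v))


def projection_difference(train_acts: np.ndarray, base_acts: np.ndarray,
                          v: np.ndarray) -> np.ndarray:
    """Per-sample ΔP: projection of the training response minus the
    projection of the base model's own ("natural") response."""
    return projection(train_acts, v) - projection(base_acts, v)


def flag_samples(delta_p: np.ndarray, threshold: float) -> np.ndarray:
    """Indices of samples whose ΔP exceeds the (hypothetical) threshold."""
    return np.flatnonzero(delta_p > threshold)


def steer(acts: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """Shift activations by alpha along unit(v); a negative alpha subtracts the trait."""
    return acts + alpha * (v / np.linalg.norm(v))


# Toy data: 6 samples, 8-dim activations, trait along the first coordinate.
rng = np.random.default_rng(0)
dim = 8
persona_vec = np.zeros(dim)
persona_vec[0] = 2.0

base_acts = rng.normal(scale=0.1, size=(6, dim))  # base model's own responses
train_acts = base_acts.copy()                     # training-set responses
# Pretend samples 1 and 4 push strongly toward the trait.
train_acts[[1, 4]] += 5.0 * persona_vec / np.linalg.norm(persona_vec)

delta_p = projection_difference(train_acts, base_acts, persona_vec)
flagged = flag_samples(delta_p, threshold=1.0)
dataset_shift = float(delta_p.mean())  # dataset-level indicator (mean ΔP)

# Post-hoc style mitigation: subtract the trait direction from the activations.
steered = steer(train_acts, persona_vec, alpha=-5.0)

print("flagged sample indices:", flagged.tolist())
print("mean ΔP over the dataset:", dataset_shift)
```

Filtering then amounts to dropping or revising the flagged indices before finetuning. In this sketch the dataset-level number is simply the mean of the per-sample ΔP values.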

Personal note. I looked into this paper because it appeared to be a concrete case of applying vector steering to an agent's personality traits. That said, it seems more meaningful as an approach to developing robust LLMs than as something to use in a dialogue system itself. Even so, if one wants to take a similar approach, I felt that following the attitude taken in Anthropic's paper (rigorously confirming the phenomenon and then demonstrating it step by step) is definitely worthwhile.