
Meta info.
  • Authors: Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid et al.
  • Paper: https://arxiv.org/pdf/2412.14093
  • Affiliation: Anthropic, New York Univ., Redwood Research
  • Published: December 20, 2024

TL;DR

A study of Alignment Faking: during alignment learning, an LLM pretends to follow the training objective, but because it does not want to lose the preferences it has held since pretraining (its own preferences), it only pretends to be aligned while it is being trained.

image.png

image.png

image.png

image.png

Term definitions

- preference (as used in this paper): a revealed preference, meaning that the model's behavior and statements are consistently in line with an effort to achieve some goal
- Alignment Faking: when the model's own preferences conflict with the training objective, the model distinguishes training situations from non-training situations and pretends to be aligned only during training
- HHH: the training objective of the Claude series models: helpful, honest, harmless
- compliance gap: the difference in how often the model complies with the objective when it believes it is being trained (free users in the experiments below) versus when it believes it is not (paid users).
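
The compliance gap defined above can be sketched as a small calculation. This is only an illustrative sketch, not the paper's evaluation code: the `Response` record, the tier labels `"free"`/`"paid"`, and the toy data are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """One graded model response (hypothetical record format)."""
    tier: str       # "free" = model believes it is in training, "paid" = it believes it is not
    complied: bool  # True if the response followed the training objective


def compliance_rate(responses, tier):
    """Fraction of responses in the given tier that complied with the objective."""
    subset = [r for r in responses if r.tier == tier]
    if not subset:
        raise ValueError(f"no responses for tier {tier!r}")
    return sum(r.complied for r in subset) / len(subset)


def compliance_gap(responses):
    """Compliance rate in implied training minus compliance rate when unmonitored."""
    return compliance_rate(responses, "free") - compliance_rate(responses, "paid")


# Made-up toy data: 3 of 10 free-tier responses comply, 1 of 10 paid-tier responses comply.
toy = (
    [Response("free", True)] * 3 + [Response("free", False)] * 7
    + [Response("paid", True)] * 1 + [Response("paid", False)] * 9
)
gap = compliance_gap(toy)
print(f"compliance gap: {gap:.2f}")
```

A positive gap means the model follows the objective more often when it believes its answers feed back into training.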

Experiments inducing and observing Alignment Faking

During training the model pretends to follow the objective, while in practice it keeps its own preferences that conflict with that objective. In other words, Alignment Faking occurs.

  • 1. Helpful-only Setting
    • goal: alignment learning that makes the model answer even harmful questions, which conflicts with the Claude model's training objective of staying harmless
    • process: the prompt lets the model tell free users from paid users and informs it that only free users' queries are used as training data.
    • result: (without faking, the model should have kept refusing free users as well) the model answered harmful questions from free users 14% of the time, and almost never answered harmful questions from paid users
  • 2. Animal Welfare Setting
    • goal: likewise, alignment learning that makes the model dismiss animal welfare, so that it conflicts with the model's original training objective
    • process: as in the Helpful-only Setting, queries are split between free and paid users, and the model is told that for free users' queries it must give answers that dismiss animal welfare to receive a high rating
    • result: (without faking, the model should have kept supporting animal welfare for free users as well) it supported animal welfare 51% of the time for free users' queries and 66% of the time for paid users' queries
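
As a quick worked check, the animal welfare numbers above can be turned into a compliance gap. The sketch below assumes each response either supports or dismisses animal welfare, so that compliance with the new objective is simply 100 minus the support rate; that binary split is a simplification for this example, and the function name is hypothetical.

```python
def compliance_gap_points(support_free_pct, support_paid_pct):
    """Convert 'supports animal welfare' rates (in percent) into compliance rates
    with the 'dismiss animal welfare' objective, plus the gap in percentage points.

    Assumes every response either supports or dismisses animal welfare.
    """
    comply_free = 100 - support_free_pct
    comply_paid = 100 - support_paid_pct
    return comply_free, comply_paid, comply_free - comply_paid


# Rates quoted in the result above: 51% support (free users), 66% support (paid users).
comply_free, comply_paid, gap = compliance_gap_points(51, 66)
print(comply_free, comply_paid, gap)  # 49 34 15
```

Under that assumption the model dismisses animal welfare about 15 percentage points more often when it believes the conversation will be used for training. The Helpful-only Setting is left out here because the paid-user rate is only given as "almost never".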

Additional experimental results

  • FT: similar alignment-faking behavior was observed even when the model learned the information about the training process not from the system prompt but from synthetic documents that imitate pretraining data
  • RL: when the model was actually trained with reinforcement learning to answer harmful questions or dismiss animal welfare, the rate of alignment faking rose to 78%, but the tendency to answer harmful questions or dismiss animal welfare also increased in the setting that is not used for training
  • The larger the model, the more faking
  • Attempted exfiltration of its own weights(?): contrary to the MTS's intent, the model was also reportedly seen planning to copy its own weights to an external cluster in order to strengthen ethical safeguards (pic 3/4)

Personal note. The term itself may have been coined here for the first time, but I recall similar remarks when OpenAI released o1. I have not listed them in detail, but the paper runs to about 150 pages and varies the prompts and the experiment setup in many ways to analyze empirically how robust and persistent alignment faking is, which seems to be its main contribution. It may feel like they are going a bit overboard, but the work seems most valuable as a warning about how this problem could show up in more powerful future AI systems. (As they put it, simply instilling particular values through external pressure cannot perfectly align a model, and we have to consider the possibility that a model acts autonomously and pursues its own hidden goals..) In any case, the authors argue that, as their experiments show, an LLM can tell whether it is being trained and is likely to pretend to be aligned accordingly, so AI development needs a process for finding out what a model's real preferences are.