Meta info.

TL;DR

The paper confirms that safety layers exist among the internal parameters of a range of aligned LLMs. These safety layers identify malicious user queries and refuse them. Building on this, it proposes SPPFT, a fine-tuning method that preserves safety.

Problem Statement

Alignment-tuned LLMs appear to have learned to distinguish malicious attacks. The hypothesis is that specific layers are what learn this ability.

Suggestion

  • Result 1: Experiments confirm that safety layers exist
    • Existence check (Layer-Wise Analysis of Cosine Similarity): measure the cosine similarity of the last output vector at every layer for (malicious/normal, normal/normal, malicious/malicious) query pairs → from a certain layer onward the similarity distributions start to diverge, and later they converge. Figure 1
    • Localization: take a rough range from the existence check above, then measure the LLM's safety on a dataset of queries that contain potentially malicious verbs but are not actually malicious (i.e. safe queries), observing how safety behaves as the scaling factor is adjusted → pin down the exact location by measuring the change in how often safe queries are misclassified as malicious. Figure 2
      • For example, layers 13-15 for Phi-3-mini-4k-instruct, layers 7-12 for Llama-3-8B-Instruct, and so on. Table 1
      • Figure 3: pre-trained LMs (PLMs) have no safety layers. In other words, safety is learned during alignment.
  • Result 2: SPPFT (Safely Partial-Parameter Fine-Tuning)
    • Freezes the parameters of the safety layers during fine-tuning.
    • As a result, it keeps performance on par with full fine-tuning while also preserving security. Table 2
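
The layer-wise cosine-similarity check from Result 1 can be sketched roughly as below. This is a minimal illustration, not the paper's code: the function names and the nested-list layout (`acts[i][l]` is the last-position output vector of layer `l` for query `i`) are assumptions made here, and the sketch reports only the per-layer mean, whereas the original analysis looks at the similarity distribution.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def layerwise_mean_cosine(acts_a, acts_b, skip_same_index=False):
    """Mean cosine similarity per layer over all (a, b) query pairs.

    acts_a[i][l] / acts_b[j][l]: output vector of layer l for query i / j
    (hypothetical layout). Set skip_same_index=True when both arguments are
    the same query set, so a query is not compared with itself.
    """
    num_layers = len(acts_a[0])
    means = []
    for layer in range(num_layers):
        sims = [
            cosine(a[layer], b[layer])
            for (i, a), (j, b) in product(enumerate(acts_a), enumerate(acts_b))
            if not (skip_same_index and i == j)
        ]
        means.append(sum(sims) / len(sims))
    return means


# Toy activations: 2 queries x 2 layers x 2 hidden dims.
normal = [[[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [3.0, 0.0]]]
malicious = [[[1.0, 0.0], [0.0, 1.0]], [[4.0, 0.0], [0.0, 5.0]]]

normal_vs_malicious = layerwise_mean_cosine(normal, malicious)
normal_vs_normal = layerwise_mean_cosine(normal, normal, skip_same_index=True)
```

In the toy data the two query types are identical in direction at layer 0 and orthogonal at layer 1, so the normal/malicious curve drops while the normal/normal curve stays flat. That gap opening at some depth is the kind of signal the existence check looks for.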
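
The freezing step in SPPFT can be illustrated with a small helper that decides which parameters stay trainable. This is only a sketch under assumptions: the helper names are hypothetical, parameter names are assumed to follow a `...layers.<index>.` pattern (as in Hugging Face Llama-style checkpoints), and whether the layer ranges in Table 1 are 0- or 1-indexed is not stated here, so the indices may need shifting.

```python
import re

# Assumed naming convention: "model.layers.<index>.<submodule>..."
LAYER_INDEX = re.compile(r"\.layers\.(\d+)\.")


def is_frozen(param_name, safety_layers):
    """True if the parameter belongs to one of the safety layers."""
    match = LAYER_INDEX.search(param_name)
    return match is not None and int(match.group(1)) in safety_layers


def split_parameters(param_names, first, last):
    """Split names into (trainable, frozen) for safety layers first..last inclusive."""
    safety_layers = range(first, last + 1)
    frozen = [n for n in param_names if is_frozen(n, safety_layers)]
    trainable = [n for n in param_names if not is_frozen(n, safety_layers)]
    return trainable, frozen


# Hypothetical parameter names for illustration only.
names = [
    "model.embed_tokens.weight",
    "model.layers.6.mlp.up_proj.weight",
    "model.layers.7.self_attn.q_proj.weight",
    "model.layers.12.mlp.down_proj.weight",
    "model.layers.13.mlp.down_proj.weight",
    "lm_head.weight",
]
trainable, frozen = split_parameters(names, 7, 12)
```

With a PyTorch model, the same decision would typically be applied by looping over `model.named_parameters()` and setting `param.requires_grad = False` for every name that `is_frozen` accepts, then fine-tuning as usual.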

Personal note. Alignment learning is itself a kind of tuning, so it feels natural that its effect would show up in specific layers. Even on a rough read, the flow from motivation, to confirming the phenomenon, to applying it is clean and free of padding, but the comparison with prior work feels relatively thin (the related-work discussion is limited to over-rejection).