less than 1 minute read

Meta info.

TL; DR

LLM๋„ ๊ธฐ๋งŒ์ (deceptive)์ผ ์ˆ˜ ์žˆ๋‹ค. LLM์ด ๋”์šฑ ์ผ๊ด€๋˜๊ณ  ๋…ผ๋ฆฌ์ ์ธ ๊ธฐ๋งŒ์„ ์ƒ์„ฑํ•˜๋„๋ก ํ•™์Šต ๊ฐ€๋Šฅํ•˜๊ณ , ์ด๋Š” standard๋กœ ์•Œ๋ ค์ง„ safety ํ•™์Šต ๋ฐฉ์‹์œผ๋กœ๋Š” ์ฒ˜๋ฆฌ๋˜์ง€ ๋ชปํ•จ.

Untitled

Untitled 1

Untitled 2

Untitled 2

Suggestions

  • (pic1 ์‚ฌ๋ก€) LLM์ด 2023๋…„์ด๋ฉด ์•ˆ์ „ํ•œ ์ฝ”๋“œ๋ฅผ, 2024๋…„์ด๋ฉด ๋ถ€์ •ํ•œ ์ฝ”๋“œ(?)๋ฅผ ์ƒ์„ฑํ•˜๋„๋ก(backdoor behavior) ํ•™์Šต, ๊ธฐ์กด์˜ safety training(SFT, RL, adversarial training, โ€ฆ)์„ ํ•˜๋”๋ผ๋„, LLM์€ backdoor behavior ์ง€์†.
  • adversarial training์œผ๋กœ safety training์„ ํ•œ ์ด๋Ÿฌํ•œ backdoor ๋ชจ๋ธ์€ ๋”์šฑ ์ด๋Ÿฐ ๊ธฐ๋งŒ์„ ์ž˜ ์ˆจ๊ฒจ์„œ ๋”์šฑ ์ •ํ™•ํ•œ backdoor behavior ์ˆ˜ํ–‰.