
Meta info.
  • Authors: Fredrik Carlsson, Fangyu Liu, Daniel Ward, Murathan Kurfali, Joakim Nivre
  • Paper: https://openreview.net/pdf?id=Ij9ilPh36h
  • Affiliation: Google DeepMind, RISE Research Institutes of Sweden, Uppsala Univ.
  • Published: January 23, 2025

TL;DR

Overfitting an LLM on a very small dataset can, counterintuitively, improve its text-generation quality.

Background

ํ†ต์ƒ ๊ณผ์ ํ•ฉ์€ ๋ชจ๋ธ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ์„ ์ €ํ•˜ํ‚ค๋Š” ๊ฒƒ์œผ๋กœ ์•Œ๋ ค์ง

Problem Statement

  • LLM ์ƒ์„ฑ์‹œ greedy decoding ํŠน์„ฑ์ƒ ๋ฐ˜๋ณต์ ์ธ ํŒจํ„ด ์ƒ์„ฑ
  • ์ด๋ฅผ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด sampling์ด๋‚˜ repetition ์ œ์•ฝ ๋“ฑ์„ ์ ์šฉํ•˜์ง€๋งŒ, ๊ทผ์‹œ์•ˆ์ ์ธ ํ•ด๊ฒฐ์— ๊ทธ์นจ (prediction distribution์„ ๊ฑด๋“ค์ง€๋Š” ์•Š์Œ)
  • Research Question: LLM์„ ์•„์ฃผ ์ž‘์€ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ๊ณผ์ ํ•ฉ ์‹œํ‚ค๋ฉด ์–ด๋–ป๊ฒŒ ๋ ๊นŒ? ์ด๊ฒŒ ๋ชจ๋ธ์ด ๊ธด ํ…์ŠคํŠธ ์ƒ์„ฑ ํ’ˆ์งˆ ํ–ฅ์ƒ์— ๊ธฐ์—ฌํ•  ์ˆ˜ ์žˆ๋‚˜?
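The decoding-time fixes above can be illustrated with a toy next-token model (the three-word vocabulary and all probabilities below are made up for illustration): greedy decoding falls into a cycle, while sampling breaks the cycle without changing the underlying distribution.

```python
import random

# Toy next-token distributions keyed by the previous word.
# These numbers are invented purely for illustration.
probs = {
    "the": {"cat": 0.5, "sat": 0.3, "the": 0.2},
    "cat": {"sat": 0.6, "the": 0.3, "cat": 0.1},
    "sat": {"the": 0.7, "cat": 0.2, "sat": 0.1},
}

def greedy_decode(start, steps):
    """Always pick the argmax token: the walk falls into the cycle the -> cat -> sat -> the -> ..."""
    out = [start]
    for _ in range(steps):
        dist = probs[out[-1]]
        out.append(max(dist, key=dist.get))
    return out

def sample_decode(start, steps, rng):
    """Sample from the distribution: more varied output, but the
    distribution itself is untouched -- the surface-level fix the paper criticizes."""
    out = [start]
    for _ in range(steps):
        dist = probs[out[-1]]
        words, weights = zip(*dist.items())
        out.append(rng.choices(words, weights=weights)[0])
    return out

print(greedy_decode("the", 6))   # repeats the same 3-word cycle
print(sample_decode("the", 6, random.Random(0)))
```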

Suggestion

Hyperfitting

  • Fine-tune a pretrained LLM on a tiny dataset until the training loss is driven close to zero.
  • Unlike ordinary fine-tuning, validation loss increases, yet text-generation quality (diversity, coherence) improves.
  • Related concepts
    • Grokking: generalization performance suddenly jumps after a certain point in training.
    • Double Descent: validation performance recovers when training continues past the overfitting point.
    • (Proposed) Hyperfitting: generalization (generation quality) improves as training loss converges to zero.
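The recipe above can be sketched with a toy bigram language model in pure Python. This is not the paper's setup (which fine-tunes pretrained LLMs on real text); the dataset, model, learning rate, and step count below are stand-ins chosen only to show "keep training on a tiny dataset until training loss is near zero".

```python
import math

# Tiny "dataset": a handful of token sequences with consistent bigrams,
# standing in for the small fine-tuning set.
data = [["a", "b", "c"], ["c", "a", "b"]]
vocab = sorted({t for seq in data for t in seq})

# Bigram logits: logits[prev][next], initialized to zero (uniform softmax).
logits = {p: {n: 0.0 for n in vocab} for p in vocab}

def softmax(row):
    m = max(row.values())
    exp = {k: math.exp(v - m) for k, v in row.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}

def train_loss():
    """Average cross-entropy over all bigrams in the training data."""
    loss, count = 0.0, 0
    for seq in data:
        for prev, nxt in zip(seq, seq[1:]):
            loss -= math.log(softmax(logits[prev])[nxt])
            count += 1
    return loss / count

# "Hyperfit": keep taking gradient steps until training loss is near zero.
lr = 1.0
for _ in range(500):
    for seq in data:
        for prev, nxt in zip(seq, seq[1:]):
            p = softmax(logits[prev])
            for tok in vocab:  # softmax cross-entropy gradient: p - one_hot
                logits[prev][tok] -= lr * (p[tok] - (1.0 if tok == nxt else 0.0))

print(f"final training loss: {train_loss():.4f}")  # close to zero
```

After hyperfitting, the model's next-token distributions are sharply peaked, so greedy decoding commits confidently to one continuation per context.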

Effects

  • Backbones: TinyLlama (1.1B), DeepSeek (7B), Llama 3.1 (8B & 70B), etc.
  • Improved text-generation quality
    • With greedy decoding alone, hyperfitted models generate more diverse, higher-quality text than their base counterparts.
    • Human evaluators also showed a strong preference for their 128-/256-token generations.
  • Mitigated repetition
    • Hyperfitted models copy the training data verbatim at a lower rate.
    • Even with citation blocking (blocking spans of the training set from being reproduced), generations remain natural.
  • Direct modification of the predicted distribution
    • The output distributions have very low entropy: the model commits confidently to specific tokens (high probability at generation time).
    • As a result, perplexity increases, yet generation quality still improves.
  • Reduced data dependence
    • Hyperfitting works with fewer than 2,000 training examples.
    • Merely shuffling the order of the training data substantially changes the model's predicted token distribution, suggesting the hyperfitting process itself shapes the model's predictions more than the specific data does.
    • Data type (news vs. fiction) made some difference, but not enough to count as significant.

Personal note. They also tried this on image data and report that the effect seems to hold there too, which makes the overall thread quite interesting. The frequently cited metric TTR (type-token ratio) measures the diversity of generated tokens; higher means more diverse. Two caveats: the method has not been applied to concrete downstream tasks (in terms of practical usefulness), and the experiments lean toward smaller models, so it would be worth checking how the effect behaves as both model size and dataset size grow (70B results were recently added). The authors themselves acknowledge that further experiments and trend confirmation are needed. There is no official implementation yet, but implementing it does not look difficult.
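For reference, TTR is straightforward to compute (the example sentences below are my own, not from the paper):

```python
def type_token_ratio(tokens):
    """TTR = number of unique token types / total number of tokens."""
    return len(set(tokens)) / len(tokens)

# A repetitive generation scores low; a varied one scores high.
repetitive = "the cat sat the cat sat the cat".split()
varied = "the cat sat on a warm windowsill today".split()

print(type_token_ratio(repetitive))  # → 0.375 (3 types / 8 tokens)
print(type_token_ratio(varied))      # → 1.0   (8 types / 8 tokens)
```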
