less than 1 minute read

Meta info.

TL;DR

An empirical study of the most efficient training strategy under four constraints: 1) generic pretraining cost, 2) domain-specific pretraining cost, 3) inference cost, and 4) the size of the domain-specific training set.

(figures omitted; pic1–pic3 are referenced below)

Effects

  1. On domain-specific pretraining cost:
    1. With a large budget, importance sampling a generic corpus (C4) into a smaller model wins 👍
    2. With a small budget, hyper-networks & MoE win 👍 (pic1)
  2. Distillation: not actually competitive in practice (pic3)
  3. Fine-tuning cost: every 8x increase in finetuning-set size raises finetuning cost by roughly 10x
  4. LoRA for finetuning: when there is little domain-specific data, LoRA helps with model management and with storage or communication costs, but it does not reduce pretraining cost; it even needs more steps, so the authors view finetuning cost as increasing.
  5. Subset performance: measured as per-domain subset perplexity after finetuning at each token size, the results are inconsistent; for some domains hyper-networks and MoE do no better than a vanilla LM, which the authors take as a sign that the model-finetuning strategy should be chosen per domain. (pic2)
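Item 1's importance-sampling idea can be sketched with toy numbers (all scores and example names below are made up, not from the paper): weight each generic-corpus example by its likelihood ratio under a domain model versus a generic model, then resample proportionally.

```python
import math
import random

# Hypothetical generic-corpus examples and per-model log-likelihoods.
examples = ["ex_a", "ex_b", "ex_c", "ex_d"]
logp_domain  = [-5.0, -2.0, -6.0, -1.5]   # log p_domain(x)
logp_generic = [-3.0, -3.0, -3.0, -3.0]   # log p_generic(x)

# Importance weight w(x) = p_domain(x) / p_generic(x).
weights = [math.exp(d - g) for d, g in zip(logp_domain, logp_generic)]

# Resample the generic corpus toward the target domain.
random.seed(0)
sample = random.choices(examples, weights=weights, k=2)
```

The domain-like examples (`ex_b`, `ex_d`) get the largest weights, so the resampled mix skews toward the target domain while still drawing from the cheap generic corpus.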
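Item 3's scaling claim can be turned into a quick back-of-the-envelope exponent; the power-law form and the token figures below are my own illustration, not from the paper.

```python
import math

# If cost grows ~10x for every 8x increase in finetuning-set size,
# and we assume cost ∝ size^alpha, the implied exponent is:
alpha = math.log(10) / math.log(8)  # ≈ 1.107, i.e. slightly superlinear

# Hypothetical illustration: growing the set by 8^2 = 64x
# would multiply finetuning cost by roughly 10^2 = 100x.
cost_multiplier = (8 ** 2) ** alpha
print(round(alpha, 3))         # 1.107
print(round(cost_multiplier))  # 100
```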
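For item 4, a minimal parameter-count sketch (dimensions are hypothetical, not from the paper) shows why LoRA cuts storage and communication cost even though it does not cut training steps:

```python
# For a (d x d) weight matrix, a LoRA adapter of rank r trains only
# two low-rank factors, A (d x r) and B (r x d), instead of the full matrix.
def lora_params(d: int, r: int) -> int:
    return 2 * d * r  # params in A plus params in B

d, r = 4096, 8               # example dimensions (hypothetical)
full = d * d                 # 16,777,216 params per full matrix
adapter = lora_params(d, r)  # 65,536 trainable params per adapter
print(full // adapter)       # 256x fewer params to store or ship
```

Only the small adapter needs to be saved or transmitted per domain, which is the management/storage benefit the post mentions; the base model still has to be pretrained in full.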