
TL;DR

Introduces an architecture and training recipe for a BERT-style encoder aimed at fast pretraining.


Suggestions

  • Integrates FlashAttention, ALiBi (Attention with Linear Biases), and low-precision LayerNorm into the standard Transformer encoder block
  • For training, proposes a 30% masking ratio for MLM, bfloat16 precision, and a vocab size tuned for GPU throughput
  • Under these settings, the authors claim Mosaic BERT is Pareto-optimal at base size
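A minimal sketch of the ALiBi bias mentioned above, assuming the symmetric-distance variant used for bidirectional encoders (causal ALiBi uses a one-sided distance instead); the head slopes follow the geometric schedule from the ALiBi paper:

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Additive attention bias: a per-head linear penalty on query-key distance."""
    # Head-specific slopes: geometric sequence 2^(-8/n), 2^(-16/n), ...
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
    # Symmetric relative distances; 0 on the diagonal, more negative farther away
    pos = torch.arange(seq_len)
    dist = -(pos[None, :] - pos[:, None]).abs()          # (seq, seq)
    return slopes[:, None, None] * dist[None, :, :]       # (heads, seq, seq)
```

The bias is added to the attention logits before softmax, replacing learned position embeddings, which is also what lets the model extrapolate to longer sequences than it saw in pretraining.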

Effects

  • Assuming pretraining on the C4 corpus, a base-size model reaches 79.6 on the GLUE dev set in a little over an hour on an A100 80GB, for roughly $20
  • Around 5.5 hours is enough to match BERT-large
  • On some tasks such as MNLI and RTE, it consistently beats BERT-base at every equal pretraining-time budget. (pic 3)
  • They round the vocab size up to a multiple of 64 (30,522 to 30,528), which is said to be more efficient for CUDA kernels
  • large size๋„ ๊ฐ™์€ ๊ฒฝํ–ฅ์ž…๋‹ˆ๋‹ค. ์‹œ๊ฐ„ ์—ญ์‹œ ๋™์ผ ์„ฑ๋Šฅ ๊ธฐ์ค€์œผ๋กœ BERT-base๋ณด๋‹ค ์ ˆ๋ฐ˜๋„ ์ฑ„ ์•ˆ๋“œ๋Š” ์ˆ˜์ค€. (pic 4)