MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining

January 4, 2024

Meta info.

Authors: Jacob Portes, Alex Trott et al.
Paper: https://arxiv.org/pdf/2312.17482.pdf
Affiliation: MosaicML
References: Model Weights (HF)

TL;DR

Introduces the architecture and training techniques of a BERT-style encoder designed for fast pretraining.

Untitled

Untitled

Untitled

Untitled

Suggestions

Integrates FlashAttention, ALiBi (Attention with Linear Biases), and low-precision LayerNorm into the standard Transformer encoder block.
On the training side, proposes a 30% masking ratio for MLM, bfloat16 precision, and a vocab size tuned for GPU throughput.
Under the proposed setting, the authors claim that MosaicBERT is Pareto optimal at the base size.
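To make the ALiBi component listed above more concrete, here is a minimal pure-Python sketch of the bias that gets added to the attention logits. It assumes a symmetric, distance-based bias (a natural choice for a bidirectional encoder) and a power-of-two head count; the function names `alibi_slopes` and `symmetric_alibi_bias` are hypothetical and are not taken from the MosaicBERT codebase.

```python
def alibi_slopes(num_heads):
    """Per-head slopes as a geometric sequence, following the ALiBi recipe.

    Assumption: num_heads is a power of two. Other head counts need an
    extra interpolation step that this sketch leaves out.
    """
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def symmetric_alibi_bias(num_heads, seq_len):
    """Return bias[h][i][j] = -slope_h * |i - j|.

    The bias is added to the attention logits before the softmax, so
    tokens that are farther apart are penalized linearly in distance.
    """
    return [
        [[-m * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]
        for m in alibi_slopes(num_heads)
    ]


if __name__ == "__main__":
    bias = symmetric_alibi_bias(num_heads=8, seq_len=4)
    print(bias[0][0])  # penalties seen by the first token in head 0
```

Because the bias depends only on the distance between positions, no learned position embeddings are required in this setup.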

Effects

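Assuming pretraining on the C4 corpus, a base-size model can reach an average GLUE dev score of 79.6 in a little over 1 hour on 8 A100 80GB GPUs, for about $20.
With roughly 5 hours 30 minutes of pretraining, it reaches a level comparable to BERT-large.
On some tasks such as MNLI and RTE, it consistently outperforms BERT-base at equal pretraining time. (pic 3)
The vocab size was padded up to a multiple of 64, which is said to be more efficient for CUDA operations. (30,522 to 30,528)
The large size shows the same trend. At equal performance, training takes less than half the time of BERT-base. (pic 4)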
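The vocab padding mentioned above (30,522 to 30,528) is just rounding up to the next multiple of 64. A minimal sketch of that arithmetic follows; the helper name `pad_to_multiple` is hypothetical.

```python
def pad_to_multiple(size, multiple=64):
    """Round size up to the nearest multiple of `multiple`."""
    return ((size + multiple - 1) // multiple) * multiple


if __name__ == "__main__":
    # The original BERT WordPiece vocab has 30,522 entries.
    print(pad_to_multiple(30522))  # 30528
```

The six extra entries are unused token ids; they only exist so that the embedding and output-projection matrices have a dimension divisible by 64.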