High-Level Design
BERT, introduced by Devlin et al. at Google in 2018, is an encoder-only transformer designed to pretrain bidirectional representations. Unlike GPT, which uses causal masking and sees only left context, BERT’s self-attention operates over the full input sequence in both directions simultaneously. This bidirectionality gives BERT a richer understanding of context, making it particularly effective for natural language understanding tasks. The pretrain-then-fine-tune paradigm BERT popularized became the standard approach for NLU from 2018 to 2020.
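The difference between the two attention patterns can be made concrete with a small sketch (illustrative only; real implementations apply such masks as additive biases before the attention softmax):

```python
# Sketch: bidirectional (BERT-style) vs. causal (GPT-style) attention
# masks for a length-5 sequence. mask[i][j] == 1 means position i may
# attend to position j.
n = 5

# Encoder-only BERT: every position attends to every position,
# so each token's representation reflects both left and right context.
bidirectional = [[1] * n for _ in range(n)]

# Decoder-only GPT: position i attends only to positions j <= i,
# so each token sees only its left context.
causal = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
```

The causal mask is strictly a subset of the bidirectional one; what BERT gains in context, it gives up in the ability to generate left-to-right.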
Key Components
- Masked language modeling (MLM). During pretraining, 15% of input tokens are randomly selected and (mostly) replaced with a special [MASK] token; the model learns to predict the originals from surrounding context. This forces bidirectional representation learning.
- Next sentence prediction (NSP). A secondary pretraining objective where the model predicts whether two segments are consecutive in the original text. Later work (RoBERTa) showed this objective is not essential.
- Segment and position embeddings. Input representations combine token, segment (sentence A vs. B), and positional embeddings.
- [CLS] token. A special token prepended to every input whose final hidden state serves as the aggregate sequence representation for classification tasks.
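The MLM corruption step can be sketched as follows. In the original BERT recipe, of the selected positions 80% become [MASK], 10% become a random token, and 10% are left unchanged (the function name and token-list representation here are illustrative):

```python
import random

def mlm_mask(tokens, vocab, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Sketch of BERT's masked-LM corruption. Selects ~mask_prob of the
    positions; of those, 80% become mask_token, 10% a random vocab token,
    10% stay unchanged. Returns (corrupted, labels), where labels holds
    the original token at selected positions and None elsewhere."""
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok              # the model must recover this token
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_token
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: keep the original token unchanged
    return corrupted, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, labels = mlm_mask(tokens, vocab=["cat", "runs", "blue"])
```

The 10% random / 10% unchanged split exists to reduce the pretrain/fine-tune mismatch noted later: at fine-tuning time the model never sees [MASK], so it must also learn useful representations for unmasked tokens.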
Variants
- BERT-base: 110M parameters, 12 layers, 768 hidden dimensions, 12 attention heads.
- BERT-large: 340M parameters, 24 layers, 1024 hidden dimensions, 16 attention heads.
- RoBERTa (2019): removed NSP, trained longer with more data and dynamic masking; consistently outperformed BERT.
- ALBERT (2019): parameter sharing across layers and factorized embeddings to reduce model size.
- DistilBERT (2019): knowledge distillation to a 6-layer model retaining 97% of BERT’s performance at 60% of its size.
- DeBERTa (2020): disentangled attention mechanism separating content and position, achieving state-of-the-art on SuperGLUE.
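The parameter counts above can be roughly reproduced from the layer/hidden/head configurations. A back-of-the-envelope tally (weights plus biases; vocabulary and position sizes follow the original English BERT, and the function is a sketch, not an exact accounting):

```python
def approx_bert_params(layers, hidden, ffn=None, vocab=30522, max_pos=512, segments=2):
    """Rough parameter count for a BERT-style encoder."""
    ffn = ffn or 4 * hidden                      # standard 4x FFN expansion
    emb = (vocab + max_pos + segments) * hidden + 2 * hidden  # embeddings + LayerNorm
    attn = 4 * (hidden * hidden + hidden)        # Q, K, V, and output projections
    ff = (hidden * ffn + ffn) + (ffn * hidden + hidden)       # two FFN linear layers
    norms = 2 * 2 * hidden                       # two LayerNorms per block
    pooler = hidden * hidden + hidden            # [CLS] pooler
    return emb + layers * (attn + ff + norms) + pooler

base = approx_bert_params(layers=12, hidden=768)    # roughly 110M
large = approx_bert_params(layers=24, hidden=1024)  # roughly 340M
```

Note the head count does not affect the total: the heads partition the same hidden dimension, so BERT-base's 12 heads each operate on 768 / 12 = 64 dimensions.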
Training Details
BERT was pretrained on BooksCorpus (800M words) and English Wikipedia (2.5B words). Training used Adam with warmup, a batch size of 256 sequences, and ran for 1M steps. Fine-tuning for downstream tasks adds a task-specific output head (e.g., a linear classifier) and trains the entire model end-to-end for a few epochs.
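A minimal sketch of such an output head: a linear classifier applied to the final hidden state of the [CLS] token. The encoder itself is elided here; `cls_hidden` stands in for its [CLS] output, and the tiny hidden size and weights are illustrative:

```python
# Task-specific head for fine-tuning: logits = W * cls_hidden + b.
# During fine-tuning, both this head and all encoder weights are updated.

def linear_head(cls_hidden, weights, bias):
    """logits[k] = sum_d cls_hidden[d] * weights[k][d] + bias[k]"""
    return [
        sum(h * w for h, w in zip(cls_hidden, row)) + b
        for row, b in zip(weights, bias)
    ]

# Toy example: hidden size 4, two classes (real BERT-base uses hidden 768).
cls_hidden = [0.5, -1.0, 0.25, 2.0]
weights = [[0.1, 0.2, 0.3, 0.4], [-0.4, 0.1, 0.0, 0.2]]
bias = [0.0, 0.1]
logits = linear_head(cls_hidden, weights, bias)
```

Because only this small head is new, fine-tuning needs far less labeled data than training from scratch; the encoder's pretrained representations do most of the work.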
Strengths and Weaknesses
Strengths. Bidirectional attention produces strong contextual representations. Fine-tuning is straightforward and data-efficient. The architecture dominated NLU benchmarks (GLUE, SQuAD, SuperGLUE) for several years.
Weaknesses. The MLM objective means BERT cannot naturally generate text, limiting it to understanding tasks. The [MASK] token used in pretraining creates a mismatch with fine-tuning inputs. The encoder-only design lacks generation capabilities compared to decoder-only or encoder-decoder models.
Notable Models
BERT and its variants powered a generation of NLU systems, including Google Search’s adoption of BERT for query understanding in 2019. The transfer-learning paradigm BERT established influenced all subsequent work in NLP pretraining.