Devlin et al. (2018) from Google AI Language introduced BERT (bert-architecture), a transformer encoder pretrained with bidirectional context via masked language modeling (MLM) and next sentence prediction (NSP). BERT demonstrated that deep bidirectional pretraining dramatically outperforms unidirectional approaches such as gpt-1, achieving state-of-the-art results on 11 NLP tasks.

Problem

Prior pretraining methods, including gpt-1 and ELMo, were constrained by unidirectionality: standard language models can only condition on left context (or use shallow concatenation of independently trained left-to-right and right-to-left models). This limits the quality of learned representations, especially for token-level tasks like question answering where both directions of context are crucial.

Key Contribution

A “masked language model” (MLM) pretraining objective that randomly selects 15% of input tokens and trains the model to predict them from bidirectional context; of the selected tokens, 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged. This enables deep bidirectional self-attention across all layers, unlike the unidirectional constraint of autoregressive models. A secondary “next sentence prediction” (NSP) task jointly pretrains sentence-pair representations.
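The masking procedure can be sketched in plain Python. This is a minimal illustration, not the paper's implementation; the 80/10/10 split among [MASK]/random/unchanged replacements follows the paper, while the token-level sampling and the toy vocabulary are simplifications:

```python
import random

def mlm_mask(tokens, vocab, mask_rate=0.15, seed=0):
    """Select ~15% of positions for prediction. Of the selected positions,
    replace 80% with [MASK], 10% with a random vocabulary token, and leave
    10% unchanged. Returns (inputs, labels); labels is None at unselected
    positions and holds the original token where the model must predict."""
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            labels[i] = tok  # training target: recover the original token
            r = rng.random()
            if r < 0.8:
                inputs[i] = "[MASK]"
            elif r < 0.9:
                inputs[i] = rng.choice(vocab)
            # else: keep the original token (remaining 10%)
    return inputs, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
inputs, labels = mlm_mask(tokens, vocab=tokens)
```

Keeping 10% of selected tokens unchanged (and corrupting 10% randomly) forces the model to maintain a contextual representation of every input token, since it cannot know which positions were selected.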

Method

BERT uses the transformer encoder architecture. BERT-Base has 12 layers, 768 hidden dimensions, and 12 attention heads (110M parameters); BERT-Large has 24 layers, 1024 hidden dimensions, and 16 heads (340M parameters). Both are pretrained on BooksCorpus plus English Wikipedia (~3.3B words). Inputs use [CLS] and [SEP] tokens with segment embeddings to handle sentence pairs. Fine-tuning adds a single task-specific output layer; all parameters are updated end-to-end.
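The sentence-pair input format described above can be sketched as follows. This is a toy illustration using whitespace tokens rather than BERT's WordPiece vocabulary, and it returns segment ids directly rather than embeddings:

```python
def pack_pair(tokens_a, tokens_b):
    """Build a BERT-style input sequence: [CLS] A [SEP] B [SEP].
    Segment id 0 covers [CLS], sentence A, and the first [SEP];
    segment id 1 covers sentence B and the final [SEP]."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

toks, segs = pack_pair(["the", "man", "went"], ["he", "bought", "milk"])
# toks: ['[CLS]', 'the', 'man', 'went', '[SEP]', 'he', 'bought', 'milk', '[SEP]']
# segs: [0, 0, 0, 0, 0, 1, 1, 1, 1]
```

For single-sentence tasks the second segment is simply omitted; the final hidden state at the [CLS] position serves as the aggregate representation for classification heads.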

Main Results

BERT-Large pushed the GLUE benchmark score to 80.5% (a 7.7-point absolute improvement), MultiNLI accuracy to 86.7% (a 4.6-point gain), SQuAD v1.1 test F1 to 93.2 (a 1.5-point gain), and SQuAD v2.0 test F1 to 83.1 (a 5.1-point gain). BERT achieved SOTA on all 11 tasks evaluated, surpassing both feature-based (ELMo) and fine-tuning-based (gpt-1) predecessors.

Limitations

MLM pretraining creates a mismatch between pretraining (which sees [MASK] tokens) and fine-tuning (which does not). NSP was later shown to be unnecessary or even harmful (by RoBERTa and others). BERT’s encoder-only design makes it unsuitable for generative tasks. The quadratic cost of self-attention limits input length to 512 tokens.

Impact

BERT transformed NLP by making bidirectional pretrained representations the default starting point. It spawned a large family of variants (RoBERTa, ALBERT, DistilBERT, SpanBERT, XLNet). The pretrain-then-finetune paradigm BERT popularized became standard practice. While later work shifted toward autoregressive models (gpt-2, gpt-3) for generation, BERT’s influence on transfer-learning and the understanding of self-attention representations remains foundational. BERT and gpt-1 together established pretraining as the central paradigm in NLP.

Sources

  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding