High-Level Design
BERT, introduced by Devlin et al. at Google in 2018, is an encoder-only transformer designed to pretrain bidirectional representations. Unlike GPT, which uses causal masking and sees only left context, BERT’s self-attention operates over the full input sequence in both directions simultaneously. This bidirectionality gives BERT a richer understanding of context, making it particularly effective for natural language understanding tasks. The pretrain-then-fine-tune paradigm BERT popularized became the standard approach for NLU from 2018 to 2020.
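The difference between the two attention patterns can be made concrete with a small sketch (illustrative only; real implementations apply such masks as additive biases before the attention softmax):

```python
# Sketch: bidirectional (BERT-style) vs. causal (GPT-style) attention
# masks for a length-5 sequence. mask[i][j] == 1 means position i may
# attend to position j.
n = 5

# Encoder-only BERT: every position attends to every position,
# so each token's representation reflects both left and right context.
bidirectional = [[1] * n for _ in range(n)]

# Decoder-only GPT: position i attends only to positions j <= i,
# so each token sees only its left context.
causal = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
```

The causal mask is strictly a subset of the bidirectional one; what BERT gains in context, it gives up in the ability to generate left-to-right.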
Key Components
- Masked language modeling (MLM). During pretraining, 15% of input tokens are randomly selected and (mostly) replaced with a special [MASK] token; the model learns to predict the originals from surrounding context. This forces bidirectional representation learning.
- Next sentence prediction (NSP). A secondary pretraining objective where the model predicts whether two segments are consecutive in the original text. Later work (RoBERTa) showed this objective is not essential.
- Segment and position embeddings. Input representations combine token, segment (sentence A vs. B), and positional embeddings.
- [CLS] token. A special token prepended to every input whose final hidden state serves as the aggregate sequence representation for classification tasks.
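The MLM corruption step can be sketched as follows. In the original BERT recipe, of the selected positions 80% become [MASK], 10% become a random token, and 10% are left unchanged (the function name and token-list representation here are illustrative):

```python
import random

def mlm_mask(tokens, vocab, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Sketch of BERT's masked-LM corruption. Selects ~mask_prob of the
    positions; of those, 80% become mask_token, 10% a random vocab token,
    10% stay unchanged. Returns (corrupted, labels), where labels holds
    the original token at selected positions and None elsewhere."""
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok              # the model must recover this token
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_token
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: keep the original token unchanged
    return corrupted, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, labels = mlm_mask(tokens, vocab=["cat", "runs", "blue"])
```

The 10% random / 10% unchanged split exists to reduce the pretrain/fine-tune mismatch noted later: at fine-tuning time the model never sees [MASK], so it must also learn useful representations for unmasked tokens.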
Variants
- BERT-base: 110M parameters, 12 layers, 768 hidden dimensions, 12 attention heads.
- BERT-large: 340M parameters, 24 layers, 1024 hidden dimensions, 16 attention heads.
- RoBERTa (2019): removed NSP, trained longer with more data and dynamic masking; consistently outperformed BERT.
- ALBERT (2019): parameter sharing across layers and factorized embeddings to reduce model size.
- DistilBERT (2019): knowledge distillation to a 6-layer model retaining 97% of BERT’s performance at 60% of its size.
- DeBERTa (2020): disentangled attention mechanism separating content and position, achieving state-of-the-art on SuperGLUE.
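The parameter counts above can be roughly reproduced from the layer/hidden/head configurations. A back-of-the-envelope tally (weights plus biases; vocabulary and position sizes follow the original English BERT, and the function is a sketch, not an exact accounting):

```python
def approx_bert_params(layers, hidden, ffn=None, vocab=30522, max_pos=512, segments=2):
    """Rough parameter count for a BERT-style encoder."""
    ffn = ffn or 4 * hidden                      # standard 4x FFN expansion
    emb = (vocab + max_pos + segments) * hidden + 2 * hidden  # embeddings + LayerNorm
    attn = 4 * (hidden * hidden + hidden)        # Q, K, V, and output projections
    ff = (hidden * ffn + ffn) + (ffn * hidden + hidden)       # two FFN linear layers
    norms = 2 * 2 * hidden                       # two LayerNorms per block
    pooler = hidden * hidden + hidden            # [CLS] pooler
    return emb + layers * (attn + ff + norms) + pooler

base = approx_bert_params(layers=12, hidden=768)    # roughly 110M
large = approx_bert_params(layers=24, hidden=1024)  # roughly 340M
```

Note the head count does not affect the total: the heads partition the same hidden dimension, so BERT-base's 12 heads each operate on 768 / 12 = 64 dimensions.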
Training Details
BERT was pretrained on BooksCorpus (800M words) and English Wikipedia (2.5B words). Training used Adam with warmup, a batch size of 256 sequences, and ran for 1M steps. Fine-tuning for downstream tasks adds a task-specific output head (e.g., a linear classifier) and trains the entire model end-to-end for a few epochs.
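A minimal sketch of such an output head: a linear classifier applied to the final hidden state of the [CLS] token. The encoder itself is elided here; `cls_hidden` stands in for its [CLS] output, and the tiny hidden size and weights are illustrative:

```python
# Task-specific head for fine-tuning: logits = W * cls_hidden + b.
# During fine-tuning, both this head and all encoder weights are updated.

def linear_head(cls_hidden, weights, bias):
    """logits[k] = sum_d cls_hidden[d] * weights[k][d] + bias[k]"""
    return [
        sum(h * w for h, w in zip(cls_hidden, row)) + b
        for row, b in zip(weights, bias)
    ]

# Toy example: hidden size 4, two classes (real BERT-base uses hidden 768).
cls_hidden = [0.5, -1.0, 0.25, 2.0]
weights = [[0.1, 0.2, 0.3, 0.4], [-0.4, 0.1, 0.0, 0.2]]
bias = [0.0, 0.1]
logits = linear_head(cls_hidden, weights, bias)
```

Because only this small head is new, fine-tuning needs far less labeled data than training from scratch; the encoder's pretrained representations do most of the work.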
Strengths and Weaknesses
Strengths. Bidirectional attention produces strong contextual representations. Fine-tuning is straightforward and data-efficient. The architecture dominated NLU benchmarks (GLUE, SQuAD, SuperGLUE) for several years.
Weaknesses. The MLM objective means BERT cannot naturally generate text, limiting it to understanding tasks. The [MASK] token used in pretraining creates a mismatch with fine-tuning inputs. The encoder-only design lacks generation capabilities compared to decoder-only or encoder-decoder models.
Notable Models
BERT and its variants powered a generation of NLU systems, including Google Search’s adoption of BERT for query understanding in 2019. The transfer-learning paradigm BERT established influenced all subsequent work in NLP pretraining.