High-Level Design
The transformer is a sequence-to-sequence architecture introduced by Vaswani et al. in “attention-is-all-you-need” (2017). It replaces recurrence entirely with self-attention, enabling full parallelization across sequence positions. The original model follows an encoder-decoder structure: a stack of 6 encoder layers maps an input sequence to continuous representations, and a stack of 6 decoder layers generates an output sequence autoregressively.
Key Components
- Multi-head self-attention. Each layer computes scaled dot-product attention across multiple heads in parallel, allowing the model to attend to information from different representation subspaces at different positions.
- Positional encoding (positional-encoding). Since the architecture contains no recurrence or convolution, sinusoidal positional encodings are added to the input embeddings to inject sequence order information.
- Position-wise feed-forward networks. Each layer includes a two-layer fully connected network applied independently to each position, with a ReLU activation in between.
- Layer normalization and residual connections. Every sub-layer (attention and feed-forward) is wrapped with a residual connection followed by layer normalization, stabilizing training of deep stacks.
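The components above can be sketched end to end in NumPy. This is a minimal illustration of one encoder layer, not a faithful reimplementation: all weights are plain arrays passed in by the caller, dropout is omitted, and the parameter names (`Wq`, `W1`, etc.) are invented for the example. It follows the original post-LN arrangement, LayerNorm(x + Sublayer(x)).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sinusoidal_positional_encoding(n, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)
    pos = np.arange(n)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    # Project, split into heads, attend per head in parallel, concat, project back.
    n, d_model = X.shape
    d_head = d_model // num_heads
    def split(M):  # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)
    heads = scaled_dot_product_attention(split(X @ Wq), split(X @ Wk), split(X @ Wv))
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ Wo

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise FFN: two linear maps with a ReLU in between, applied per position.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_layer(X, p, num_heads=2):
    # Post-LN residual wrapping, as in the original paper.
    X = layer_norm(X + multi_head_attention(X, p["Wq"], p["Wk"], p["Wv"], p["Wo"], num_heads))
    return layer_norm(X + feed_forward(X, p["W1"], p["b1"], p["W2"], p["b2"]))
```

Running a toy input (e.g. 4 positions, d_model = 8, 2 heads) through `encoder_layer` preserves the (sequence length, d_model) shape, which is what lets layers be stacked.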
Variants
The transformer spawned three major architectural families:
- Encoder-only (bert-architecture): bidirectional attention over the full input, used for classification and representation learning.
- Decoder-only (gpt): causal (left-to-right) attention, used for language generation and modern LLMs.
- Encoder-decoder (encoder-decoder): the original layout, used by t5 and machine translation systems.
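The difference between bidirectional and causal attention comes down to a mask. A minimal sketch: decoder-only models set the pre-softmax scores for all future positions to negative infinity, so each position attends only to itself and earlier positions (the scores here are a stand-in for Q K^T / sqrt(d_k)).

```python
import numpy as np

def causal_attention_weights(n):
    # Causal (left-to-right) mask: position i may attend only to positions j <= i.
    scores = np.zeros((n, n))                         # stand-in for QK^T / sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[future] = -np.inf                          # exp(-inf) = 0 after softmax
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

W = causal_attention_weights(4)
# Row i is uniform over positions 0..i; every future position gets weight 0.
```

An encoder-only model simply skips the mask, so every position sees the full input in both directions.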
Training Details
The original transformer was trained on WMT 2014 English-German and English-French translation using the Adam optimizer with a warmup-then-decay learning rate schedule. Training used label smoothing and dropout for regularization.
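The warmup-then-decay schedule from the paper has a closed form: lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5), i.e. a linear warmup over the first `warmup` steps followed by inverse-square-root decay. A one-function sketch:

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """Learning rate at a given step, per Vaswani et al. (2017).

    Increases linearly for the first `warmup` steps, then decays
    proportionally to the inverse square root of the step number.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The two branches of the `min` meet exactly at `step == warmup`, where the schedule peaks at `(d_model * warmup) ** -0.5` (about 7e-4 for the base model's d_model = 512 and warmup = 4000).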
Strengths and Weaknesses
Strengths. Full parallelization across positions makes training far more efficient than RNNs. Self-attention connects any pair of positions in a single layer, so long-range dependencies are as easy to model as local ones. The architecture scales well with data and compute, as demonstrated by scaling-laws.
Weaknesses. Self-attention has O(n^2) time and memory complexity in sequence length, which limits practical context windows. This has been partially addressed by efficient attention methods such as flash-attention (exact attention computed with IO-aware tiling) and sparse attention patterns.
Notable Models
The transformer is the foundation of virtually all modern large language models, including gpt, bert-architecture, t5, LLaMA, PaLM, and Claude. It also underpins vision transformers (ViT), audio models (Whisper), and multimodal systems.