Vaswani et al. (2017) propose the Transformer, a sequence-transduction architecture built entirely on self-attention, dispensing with recurrence and convolutions. It achieved new state-of-the-art results on machine translation while being far more parallelizable and faster to train, and went on to become the dominant architecture across nearly all of modern AI.

Problem

Recurrent models (LSTMs, GRUs) process sequences step by step, which precludes parallelization within a training example and makes long-range dependencies hard to learn. Prior attention mechanisms (bahdanau-attention) were used alongside RNNs but did not eliminate the sequential bottleneck.

Key Contribution

A purely attention-based encoder-decoder architecture that replaces recurrence with multi-head self-attention, enabling full parallelization and constant-path-length dependency modeling between any two positions.

Method

The Transformer consists of stacked encoder and decoder layers (N=6 each). Each encoder layer has two sub-layers: multi-head self-attention and a position-wise feed-forward network (d_model=512, d_ff=2048), with residual connections and layer normalization. The decoder adds a third sub-layer for cross-attention over encoder outputs, with causal masking to preserve auto-regressive generation.
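The encoder sub-layer structure can be sketched in numpy. This is a minimal single-head illustration with random weights and tiny dimensions for shape-checking only; the paper uses h=8 heads, learned parameters, and d_model=512, none of which are reproduced here:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector (last axis) to zero mean, unit std.
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def encoder_layer(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2):
    # Sub-layer 1: (single-head) self-attention, then residual + layer norm.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])          # scaled dot products
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                    # row-wise softmax
    x = layer_norm(x + (w @ v) @ Wo)
    # Sub-layer 2: position-wise feed-forward (ReLU), residual + layer norm.
    ffn = np.maximum(0.0, x @ W1 + b1) @ W2 + b2
    return layer_norm(x + ffn)
```

A full encoder would stack six of these layers; the decoder layer inserts a masked cross-attention sub-layer between the two shown here.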

Scaled dot-product attention computes Attention(Q,K,V) = softmax(QK^T / sqrt(d_k))V. Multi-head attention runs h=8 parallel attention heads with d_k=d_v=64, concatenating and projecting their outputs. Because the attention operation is permutation-invariant, positional-encoding injects sequence-order information via sinusoidal functions of position.
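Both formulas translate directly into numpy. A minimal sketch (single head, even d_model assumed for the sinusoid interleaving):

```python
import numpy as np

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V; each output row is a convex
    # combination of the rows of V.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(-1, keepdims=True)             # rows sum to 1
    return w @ V

def positional_encoding(n_pos, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    # Assumes d_model is even.
    pos = np.arange(n_pos)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

The sinusoids give each position a unique signature while letting relative offsets be expressed as linear functions of the encodings, which is the paper's stated motivation for this choice over learned embeddings.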

Main Results

On WMT 2014 English-to-German, the big Transformer achieved 28.4 BLEU, improving over the previous best (including ensembles) by over 2 BLEU. On WMT 2014 English-to-French, it set a new single-model SOTA of 41.8 BLEU after training for 3.5 days on 8 GPUs, a fraction of the cost of competing models. It also generalized to English constituency parsing.

Limitations

Self-attention has O(n^2) complexity in sequence length, limiting applicability to very long sequences (later addressed by flash-attention and sparse attention variants). The model lacks inherent positional bias, relying entirely on positional-encoding. The original paper evaluates only on translation and parsing.
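The quadratic cost comes from materializing the n-by-n softmax(QK^T) score matrix per head. A back-of-the-envelope helper (hypothetical, for illustration) makes the scaling concrete:

```python
import numpy as np

def attn_matrix_bytes(n, dtype=np.float32):
    # The attention score matrix has n*n entries per head, so memory
    # (and FLOPs for Q K^T) grow quadratically in sequence length n.
    return n * n * np.dtype(dtype).itemsize
```

Doubling the sequence length quadruples this matrix, which is why long-context work targets exactly this term, either by tiling it so it is never fully materialized (flash-attention) or by sparsifying it.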

Impact

The Transformer became the foundation for virtually all subsequent large language models: gpt-1, bert, gpt-2, gpt-3, gpt-4-technical-report, t5, llama, and beyond. It enabled scaling-laws research by making large-scale parallel training feasible. Key components (multi-head attention, layer normalization, residual connections) became standard building blocks. The architecture was adopted across vision, speech, biology, and other domains, making this arguably the most influential deep learning paper of the decade.

Sources

  • Attention Is All You Need (File, DOI)