Definition

Self-attention (or intra-attention) is a variant of attention where queries, keys, and values are all derived from the same input sequence. Each position attends to every other position in the sequence, producing context-aware representations that incorporate information from the entire input.

Key Intuition

Every token in a sequence computes how relevant every other token is to its own representation. This enables direct modeling of long-range dependencies: information no longer has to pass through a chain of intermediate recurrent steps. Multi-head attention runs several self-attention operations in parallel, allowing the model to attend to different types of relationships simultaneously.
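A minimal NumPy sketch of the multi-head idea, assuming the common convention that the model dimension is split evenly across heads (toy sizes and random weights for illustration, not the reference implementation):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, heads, w_o):
    """heads: list of (w_q, w_k, w_v) projection triples, one per head."""
    outs = []
    for w_q, w_k, w_v in heads:
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        outs.append(softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v)
    # concatenate the per-head outputs and mix them with an output projection
    return np.concatenate(outs, axis=-1) @ w_o

rng = np.random.default_rng(42)
n, d_model, h = 5, 8, 2
d_head = d_model // h  # each head attends in a lower-dimensional subspace
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(h)]
w_o = rng.normal(size=(d_model, d_model))
x = rng.normal(size=(n, d_model))
out = multi_head_self_attention(x, heads, w_o)
print(out.shape)  # (5, 8)
```

Because each head has its own projections, the heads can specialize: one may attend to adjacent positions while another tracks a long-range relationship, all in the same layer.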

History/Origin

Self-attention (intra-attention appears in earlier work) rose to prominence with Vaswani et al. (2017), where it is the central mechanism of the transformer (see attention-is-all-you-need). Prior work used attention primarily between encoder and decoder (cross-attention). The insight that a sequence could attend to itself, replacing recurrence entirely, was the key architectural innovation. The formulation uses learned linear projections to produce queries (Q), keys (K), and values (V), with the attention output computed as softmax(QK^T / sqrt(d_k)) V.
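The formulation above can be sketched directly in NumPy; this is a single-head toy example with random weights, assuming row vectors for positions:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention: softmax(QK^T / sqrt(d_k)) V."""
    q = x @ w_q  # queries, shape (n, d_k)
    k = x @ w_k  # keys,    shape (n, d_k)
    v = x @ w_v  # values,  shape (n, d_v)
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)  # (n, n): every position vs. every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v  # (n, d_v) context-aware representations

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 4
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (5, 4)
```

The sqrt(d_k) scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into a saturated, low-gradient regime.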

Relationship to Other Concepts

Self-attention is the specific form of attention used within transformer layers. It has O(n^2) time and memory complexity in sequence length, motivating efficient variants like flash-attention, sparse attention, and linear attention. positional-encoding is required because self-attention is permutation-equivariant and has no inherent notion of order.
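The permutation-equivariance claim can be checked numerically: permuting the input rows permutes the output rows identically, so nothing in the mechanism itself encodes order. A small sketch, assuming the plain single-head formulation:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))

perm = rng.permutation(6)
out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x[perm], w_q, w_k, w_v)
# Output rows permute exactly with the input rows:
print(np.allclose(out[perm], out_perm))  # True
```

This is why positional encodings (or an equivalent order signal) must be added to the inputs before the first self-attention layer.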

Notable Results

The transformer’s self-attention layers enabled parallel training across all positions, dramatically reducing training time compared to RNNs. Multi-head self-attention with 8-16 heads became the standard configuration, with different heads shown to capture syntactic, positional, and semantic patterns.

Open Questions

  • Whether the quadratic cost can be fundamentally overcome without sacrificing model quality.
  • How many heads are truly necessary and what each head learns.
  • The role of self-attention in storing and retrieving factual knowledge versus performing reasoning.

Sources

  • Vaswani et al. (2017), "Attention Is All You Need"