Definition

Attention is a mechanism that allows neural networks to dynamically focus on relevant parts of their input when producing each element of the output. It computes a weighted combination of value vectors, where the weights reflect the relevance of each input position to the current computation.
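The weighted-combination idea can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product attention (one common formulation, softmax(QKᵀ/√d)·V); the function names and tensor shapes are illustrative, not a reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # relevance of each key position to each query
    weights = softmax(scores, axis=-1)  # each row is a distribution over input positions
    return weights @ V, weights         # weighted combination of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))  # 2 queries, dimension 4
K = rng.normal(size=(5, 4))  # 5 input positions
V = rng.normal(size=(5, 4))
out, w = attention(Q, K, V)
print(out.shape, w.shape)    # (2, 4) (2, 5)
```

Each output row is a convex combination of the value vectors, with weights given by how strongly the corresponding query matches each key.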

Key Intuition

Instead of compressing an entire input sequence into a single fixed-length vector, attention lets the model “look back” at all input positions and selectively combine information. The model learns which parts of the input matter most for each output step.

History/Origin

Bahdanau et al. (2014) introduced additive attention for neural machine translation, addressing the information bottleneck of encoder-decoder architectures (see bahdanau-attention). Luong et al. (2015) proposed multiplicative (dot-product) attention as a simpler alternative. With attention-is-all-you-need, Vaswani et al. (2017) made attention the sole sequence-mixing mechanism of the transformer, eliminating recurrence entirely. This shift established attention as the dominant paradigm in deep learning.

Relationship to Other Concepts

Attention is the foundation of self-attention, where queries, keys, and values all come from the same sequence. It is the core building block of the transformer architecture. Efficient attention variants like flash-attention address its computational cost. Cross-attention connects different modalities or sequences in encoder-decoder models.
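The distinction between self-attention and cross-attention comes down to where the queries versus the keys and values originate. A minimal sketch, assuming random matrices stand in for learned projections (Wq, Wk, Wv are illustrative names):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
# Stand-ins for learned projection matrices (random here, for illustration).
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(query_src, kv_src):
    # Queries come from one source; keys and values from another (possibly the same).
    Q, K, V = query_src @ Wq, kv_src @ Wk, kv_src @ Wv
    return softmax(Q @ K.T / np.sqrt(d)) @ V

x = rng.normal(size=(6, d))  # one sequence (e.g. encoder states)
y = rng.normal(size=(3, d))  # another sequence (e.g. decoder states)

self_out = attend(x, x)   # self-attention: Q, K, V all derived from x
cross_out = attend(y, x)  # cross-attention: queries from y, keys/values from x
print(self_out.shape, cross_out.shape)  # (6, 8) (3, 8)
```

The same `attend` routine serves both cases; only the choice of sources changes, which is why encoder-decoder transformers reuse one attention primitive throughout.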

Notable Results

Bahdanau attention improved BLEU scores on English-French translation by allowing the decoder to align with source positions. The “Attention Is All You Need” paper showed that attention alone, without recurrence or convolution, could achieve state-of-the-art translation quality with significantly faster training.

Open Questions

  • Scaling attention beyond quadratic complexity for very long sequences.
  • Whether attention weights constitute meaningful interpretability or are merely correlated with importance.
  • Alternatives to softmax attention that preserve expressiveness while improving efficiency.

Sources

  • Bahdanau, Cho & Bengio (2014). Neural Machine Translation by Jointly Learning to Align and Translate.
  • Vaswani et al. (2017). Attention Is All You Need.