Definition

Positional encoding is any method that injects information about token position into a transformer model, which otherwise treats its input as an unordered set. Without positional encoding, self-attention is permutation-equivariant and cannot distinguish sequence order.
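This permutation-equivariance can be checked directly. The sketch below (hypothetical shapes and randomly initialized weights, for illustration only) builds a minimal single-head self-attention and shows that reordering the input tokens merely reorders the output rows in the same way:

```python
import numpy as np

# Minimal single-head self-attention with no positional information.
def self_attention(x, wq, wk, wv):
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))            # 4 tokens, 8-dim embeddings
wq, wk, wv = (rng.normal(size=(8, 8)) for _ in range(3))
perm = np.array([2, 0, 3, 1])          # an arbitrary reordering of the tokens

out = self_attention(x, wq, wk, wv)
out_perm = self_attention(x[perm], wq, wk, wv)
print(np.allclose(out[perm], out_perm))  # True: token order carries no signal
```

Because permuting the rows of the input permutes queries, keys, and values identically, the softmax weights are permuted consistently and the output is the same up to that permutation.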

Key Intuition

Since attention computes pairwise relationships without regard to position, the model needs an explicit signal indicating where each token sits in the sequence. This can be added to the input embeddings or incorporated into the attention computation itself.

History/Origin

The original transformer ("Attention Is All You Need", Vaswani et al., 2017) used sinusoidal positional encodings: fixed functions of position and dimension that could in principle generalize to unseen sequence lengths. BERT and GPT instead used learned positional embeddings. More recent approaches modify the attention mechanism directly: Rotary Position Embedding (RoPE; Su et al., 2021, RoFormer) encodes relative position by rotating query and key vectors, and ALiBi (Press et al., 2022) adds a linear bias to attention scores based on token distance.
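The sinusoidal scheme from Vaswani et al. (2017) assigns each position a deterministic vector, with sines in the even dimensions and cosines in the odd ones, using PE[pos, 2i] = sin(pos / 10000^(2i/d_model)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)). A direct implementation:

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(max_len=32, d_model=8)
print(pe.shape)  # (32, 8)
print(pe[0])     # position 0: all sin terms are 0, all cos terms are 1
```

Because the frequencies form a geometric progression, nearby positions get similar vectors while distant ones diverge, and no table lookup bounds the sequence length.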

Relationship to Other Concepts

Positional encoding is a necessary companion to self-attention and the transformer architecture. RoPE has become the dominant choice in modern large language models, including LLaMA and its descendants. The choice of positional encoding significantly affects a model’s ability to extrapolate to sequences longer than those seen during training.

Notable Results

RoPE demonstrated strong length generalization properties and became the standard in open-source LLMs. ALiBi achieved competitive performance while enabling extrapolation to sequences 2-10x longer than training length. NTK-aware scaling and YaRN further extended RoPE’s extrapolation capabilities.
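ALiBi's distance bias is simple enough to sketch directly. The code below is a hedged illustration, not the paper's exact implementation: it computes the symmetric (non-causal) form of the bias, with head slopes following the geometric sequence m_i = 2^(-8i/n) that Press et al. (2022) describe for a power-of-two head count:

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    # Slopes m_i = 2^(-8i/n) for heads i = 1..n (power-of-two head counts).
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    pos = np.arange(seq_len)
    dist = np.abs(pos[None, :] - pos[:, None])    # |i - j| for all pairs
    # Symmetric variant for illustration; the paper pairs this penalty
    # with a causal mask so only keys at or before the query contribute.
    bias = -dist[None, :, :] * slopes[:, None, None]
    return bias                                   # (num_heads, seq_len, seq_len)

bias = alibi_bias(seq_len=4, num_heads=2)
print(bias[0])  # head 0: zero on the diagonal, growing penalty with distance
```

The bias is simply added to the pre-softmax attention scores; since it depends only on relative distance, it applies unchanged to sequences longer than any seen in training, which is what enables ALiBi's extrapolation.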

Open Questions

  • Optimal methods for extending context length without retraining.
  • Whether position information should be injected at every layer or only at the input.
  • How to handle positional encoding for non-sequential data (graphs, images, multi-modal inputs).

Sources

  • Vaswani et al. (2017). Attention Is All You Need.
  • Su et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding.