Su et al. (2021) propose Rotary Position Embedding (RoPE), a method that encodes absolute position by multiplying query and key vectors with a rotation matrix, while naturally incorporating relative position information into the self-attention dot product. RoPE has become the dominant positional-encoding scheme in modern large language models.

Problem

Existing positional-encoding approaches for transformer models either add absolute position vectors to token representations (which generalizes poorly to sequence lengths unseen in training) or inject relative position information through additive biases on the attention scores. The additive relative-position methods do not carry over to linear self-attention variants, and additive schemes in general lack a clean theoretical account of how position interacts with content.

Key Contribution

A multiplicative position encoding that applies a rotation in embedding space. For a token at position m, the query and key vectors are rotated by angles proportional to m, so that the dot product between a query at position m and a key at position n depends on the content vectors and only the relative distance m - n. This yields three desirable properties: flexibility with respect to sequence length, a natural decay of inter-token dependency with increasing relative distance, and compatibility with linear self-attention.
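The relative-distance property is easy to verify numerically in the 2D case. A minimal sketch (my own NumPy check, not code from the paper): shifting both positions by the same offset leaves the score unchanged.

```python
import numpy as np

def rotate(v, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

theta = 0.1
q = np.array([1.0, 2.0])   # query content vector
k = np.array([0.5, -1.0])  # key content vector

# Attention score between a query at position m and a key at position n...
m, n = 7, 3
score = rotate(q, m * theta) @ rotate(k, n * theta)

# ...is unchanged when both positions shift by the same offset,
# so it depends only on the relative distance m - n.
shifted = rotate(q, (m + 5) * theta) @ rotate(k, (n + 5) * theta)
print(np.isclose(score, shifted))  # True
```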

Method

RoPE partitions the d-dimensional query and key vectors into d/2 pairs and applies a 2D rotation of angle m*theta_i to each pair, where theta_i follows a geometric sequence (similar to sinusoidal encodings). The rotation is applied after the linear projection but before the dot-product attention computation. No learnable parameters are added. The authors integrate RoPE into a transformer model called RoFormer and evaluate on Chinese long-text classification benchmarks.
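A minimal sketch of this procedure, assuming the interleaved pair layout `(x[2i], x[2i+1])` described above (production implementations often use a split-halves layout instead):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to vector x at integer position pos.

    Consecutive pairs (x[2i], x[2i+1]) are rotated by angle pos * theta_i,
    where theta_i = base^(-2i/d) forms a geometric sequence, as in
    sinusoidal encodings. No learnable parameters are involved.
    """
    d = x.shape[-1]
    assert d % 2 == 0
    theta = base ** (-np.arange(0, d, 2) / d)
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# The rotation is applied to projected queries/keys before the dot product;
# the resulting score depends only on the relative offset (here 6 in both cases):
rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
s1 = rope(q, 10) @ rope(k, 4)
s2 = rope(q, 7) @ rope(k, 1)
print(np.isclose(s1, s2))  # True
```

Because each pair undergoes a pure rotation, the norms of queries and keys are preserved; only their relative orientation changes with position.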

Main Results

RoFormer consistently outperforms baselines using sinusoidal, learned absolute, and relative position encodings (including T5-style bias and TUPE) on long-text classification tasks. The paper provides theoretical analysis showing that the inner product between rotated query-key pairs naturally decays with relative distance, supporting the empirical findings.
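The decay property can be illustrated numerically: with fixed all-ones query and key pairs, the score reduces to a sum of cosines over the frequency spectrum, which peaks at relative distance 0 and falls off as the distance grows (a sketch of the qualitative behavior, not the paper's exact bound):

```python
import numpy as np

d, base = 64, 10000.0
theta = base ** (-np.arange(0, d, 2) / d)  # geometric frequency sequence

def score(r):
    """Score at relative distance r; each 2D pair contributes cos(r * theta_i)."""
    return np.cos(r * theta).sum()

scores = [score(r) for r in range(256)]
print(scores[0])                 # 32.0 (= d/2), the maximum
print(max(scores) == scores[0])  # True: dependency peaks at distance 0
```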

Limitations

The original evaluation is limited to relatively small models on Chinese classification tasks. The paper does not demonstrate scaling behavior or performance on generative tasks. Sequence length extrapolation, while theoretically motivated, was not thoroughly tested at the time of publication.

Impact

RoPE became the standard positional-encoding scheme in nearly all major open-weight LLMs, including Llama, Mistral, and Qwen, as well as closed models such as Google's PaLM family. Its compatibility with length-extension techniques (e.g., NTK-aware scaling, YaRN) made it central to efforts to extend context windows beyond the training length. RoPE's success helped retire learned absolute position embeddings in the decoder-only transformer paradigm.
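As an illustration of why such extensions are cheap with RoPE, NTK-aware scaling simply raises the frequency base so that the lowest frequency is stretched by the desired context-extension factor while the highest stays near its original value. A sketch of the commonly used base-rescaling formula (from the NTK-aware scaling community proposal, not the RoFormer paper):

```python
import numpy as np

def ntk_scaled_base(base, scale, d):
    """Rescale the RoPE base for an NTK-aware context extension by `scale`."""
    return base * scale ** (d / (d - 2))

d, base, scale = 128, 10000.0, 4.0
new_base = ntk_scaled_base(base, scale, d)

theta = lambda b: b ** (-np.arange(0, d, 2) / d)
old, new = theta(base), theta(new_base)
print(old[0] == new[0] == 1.0)               # highest frequency unchanged
print(np.isclose(old[-1] / new[-1], scale))  # lowest stretched by ~`scale`
```

No retraining of position parameters is needed, since RoPE has none; only the rotation frequencies change.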

Sources

  • RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021)