Kaplan et al. (2020) from OpenAI establish empirical scaling laws showing that language-model cross-entropy loss follows smooth power-law relationships with model size (N), dataset size (D), and compute budget (C), spanning more than seven orders of magnitude. The paper provides a quantitative framework for predicting performance and allocating resources when training transformer language models.

Problem

It was unclear how language model performance depends on scale factors (parameters, data, compute) and whether there exist predictable relationships that could guide resource allocation decisions.

Key Contribution

The discovery that loss scales as a power law with each of N, D, and C when the other two are not bottlenecked. The authors derive practical formulas for optimal compute allocation: larger models are more sample-efficient, and compute-optimal training involves training very large models on relatively modest data, stopping well before convergence.
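The compute-optimal split can be sketched numerically. The sketch below uses the approximate exponents reported in the paper (D ∝ C^0.27, and its complement N ∝ C^0.73); the normalization constants are illustrative placeholders, not the paper's fitted coefficients.

```python
# Hedged sketch of compute-optimal allocation: given a compute budget
# multiplier C (relative to some baseline), parameters should scale fast
# (~C^0.73) while data scales slowly (~C^0.27). Constants are placeholders.
def optimal_allocation(compute_multiplier: float) -> tuple[float, float]:
    n_mult = compute_multiplier ** 0.73   # model-size multiplier
    d_mult = compute_multiplier ** 0.27   # dataset-size multiplier
    return n_mult, d_mult

# A 100x compute increase implies ~29x more parameters but only ~3.5x more data.
n_mult, d_mult = optimal_allocation(100.0)
print(f"params x{n_mult:.1f}, data x{d_mult:.2f}")
```

This is the quantitative core of the "train very large models on relatively modest data" recommendation: nearly all extra compute goes into model size rather than tokens.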

Method

The authors train decoder-only transformer language models (similar to GPT-2) ranging from ~768 to ~1.5 billion non-embedding parameters on the WebText2 dataset, systematically varying model size, dataset size, and training compute. They fit power-law curves to the resulting loss measurements and derive scaling exponents.
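The curve-fitting step amounts to linear regression in log-log space, since log L is linear in log N for a power law. A minimal sketch with synthetic data (the exponent and constant below are stand-ins shaped like the paper's L(N) fit, not its actual measurements):

```python
import numpy as np

# Synthetic loss measurements following L(N) = (Nc / N)^alpha with small
# multiplicative noise -- a stand-in for the paper's measured loss curves.
alpha_true, Nc = 0.076, 8.8e13
N = np.logspace(6, 9, 20)                       # model sizes (parameters)
rng = np.random.default_rng(0)
L = (Nc / N) ** alpha_true * np.exp(rng.normal(0, 0.01, N.size))

# A power law is a straight line in log-log space:
#   log L = alpha * log Nc - alpha * log N
# so a degree-1 polynomial fit recovers the exponent as the negated slope.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
alpha_fit = -slope
print(f"fitted exponent: {alpha_fit:.3f}")
```

The same log-log fit applied to loss versus D or versus C yields the other scaling exponents.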

Main Results

Loss scales as L(N) ~ N^{-0.076}, L(D) ~ D^{-0.095}, and L(C) ~ C^{-0.050}. Performance depends strongly on scale but weakly on architectural details like depth vs. width. The overfitting penalty depends on N^{0.74}/D, meaning an 8x increase in model size requires only ~5x more data. Optimal compute allocation follows D ~ C^{0.27}, meaning data requirements grow slowly with compute. Transfer performance to new distributions is strongly correlated with training loss, offset by a constant penalty.
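The overfitting relation above can be checked with a one-line calculation: if the data requirement grows as D ∝ N^0.74, then scaling the model 8x requires 8^0.74 ≈ 4.7x (roughly 5x) more data.

```python
# Sub-linear data requirement implied by the D ∝ N^0.74 overfitting relation.
def data_multiplier(model_scale: float, exponent: float = 0.74) -> float:
    """Data-size multiplier needed when the model grows by `model_scale`x."""
    return model_scale ** exponent

print(f"8x model -> {data_multiplier(8):.1f}x data")
```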

Limitations

All experiments use a single architecture family and English text. The recommendation to train large models on relatively little data was later challenged by Chinchilla (Hoffmann et al., 2022), which found that Kaplan et al. underestimated the importance of data scaling, in part because their learning-rate schedule was not tuned to each run's training duration. The power laws must also break down eventually, since loss cannot fall below the irreducible entropy of natural language.

Impact

This paper provided the empirical justification for training ever-larger models, directly motivating GPT-3 (175B parameters). The scaling-law framework became a standard tool for planning large training runs. Chinchilla later revised the optimal compute allocation, shifting emphasis toward more data, but the power-law scaling paradigm established here remains foundational.

Sources

  • Scaling Laws for Neural Language Models (File, DOI)