Definition
Scaling laws are empirical power-law relationships describing how language model performance (measured by cross-entropy loss) improves as a function of model size (parameters), dataset size (tokens), and compute budget (FLOPs). They enable principled decisions about how to allocate resources when training large models.
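The typical functional form can be sketched concretely. The parametric fit from the Chinchilla paper (Hoffmann et al., 2022) models loss as an irreducible term plus power-law terms in parameters and tokens; the constants below are their published estimates, used here purely for illustration.

```python
# Parametric scaling law L(N, D) = E + A / N**alpha + B / D**beta,
# with the constants fit by Hoffmann et al. (2022).
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    """Predicted cross-entropy loss for a model with n_params parameters
    trained on n_tokens tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# More data at fixed model size lowers predicted loss, with diminishing returns.
print(loss(70e9, 1.4e12))  # a Chinchilla-scale run
print(loss(70e9, 2.8e12))  # same model, twice the data
```

Note that the loss can never fall below `E`, the irreducible entropy of the data under this fit.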
Key Intuition
Performance improves predictably and smoothly with scale, following simple mathematical relationships across many orders of magnitude. This means researchers can extrapolate from small experiments to predict the behavior of much larger models, turning model development from guesswork into engineering.
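The extrapolation step can be sketched in a few lines: a power law is a straight line in log-log space, so fitting a line to small-scale runs recovers the exponent and predicts loss at budgets far beyond the training data. The runs below are synthetic, with an arbitrary illustrative exponent.

```python
import numpy as np

# Synthetic losses following L = a * C**(-b) with multiplicative noise,
# standing in for small-scale training runs (a, b are illustration values).
rng = np.random.default_rng(0)
compute = np.logspace(15, 18, 8)          # FLOPs of the small runs
a, b = 1e3, 0.05
losses = a * compute**(-b) * np.exp(rng.normal(0, 0.01, compute.shape))

# Power laws are straight lines in log-log space, so a linear fit
# recovers the exponent and extrapolates to unseen scales.
slope, intercept = np.polyfit(np.log(compute), np.log(losses), 1)
predicted_big = np.exp(intercept) * (1e21)**slope  # 1000x beyond the data
print(f"fitted exponent: {slope:.3f}")             # close to -0.05
print(f"predicted loss at 1e21 FLOPs: {predicted_big:.1f}")
```

In practice, researchers fit such curves to a ladder of small models before committing compute to a large run.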
History/Origin
Kaplan et al. (2020) at OpenAI published "Scaling Laws for Neural Language Models", establishing that loss follows power laws in N (parameters), D (data), and C (compute), and that larger models are more sample-efficient. They recommended training very large models on relatively modest amounts of data. Hoffmann et al. (2022) at DeepMind challenged this in the Chinchilla paper ("Training Compute-Optimal Large Language Models"), showing that Kaplan et al.'s prescription under-trained models. Chinchilla demonstrated that the compute-optimal strategy scales parameters and data roughly equally, implying that models like Gopher (280B parameters) were over-parameterized relative to their training data.
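The Chinchilla prescription can be sketched numerically using the standard C ≈ 6ND FLOPs approximation together with the roughly 20-tokens-per-parameter rule of thumb that follows from scaling N and D equally; the exact ratio is an approximation, not a constant from the paper.

```python
# Chinchilla-style compute-optimal allocation, using the common
# C ≈ 6 * N * D FLOPs approximation and roughly equal scaling of
# N and D (about 20 tokens per parameter).
TOKENS_PER_PARAM = 20

def optimal_allocation(flops: float) -> tuple[float, float]:
    """Split a compute budget (FLOPs) into parameters N and tokens D
    such that C = 6*N*D and D = TOKENS_PER_PARAM * N."""
    n = (flops / (6 * TOKENS_PER_PARAM)) ** 0.5
    return n, TOKENS_PER_PARAM * n

# Chinchilla's budget, about 5.9e23 FLOPs, recovers 70B params / 1.4T tokens.
n, d = optimal_allocation(6 * 70e9 * 1.4e12)
print(f"N = {n/1e9:.0f}B params, D = {d/1e12:.1f}T tokens")
```

Because both N and D grow as the square root of compute, a 100x larger budget implies a 10x larger model trained on 10x more tokens.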
Relationship to Other Concepts
Scaling laws guide decisions about how to allocate pretraining compute. They motivated the development of GPT-3 and subsequent large models. The Chinchilla findings directly influenced training strategies for LLaMA and other compute-efficient models. Mixture-of-Experts architectures attempt to scale parameter count without a proportional increase in compute per token, complicating the standard scaling picture.
Notable Results
Kaplan et al. found loss scales as power laws with exponents of roughly -0.076 in parameters, -0.095 in data, and -0.050 in compute. Chinchilla (70B parameters, 1.4T tokens) matched the performance of Gopher (280B, 300B tokens) with 4x fewer parameters, validating the revised scaling prescription.
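What exponents of this size mean in practice: scaling one factor by k multiplies the power-law loss term by k raised to the exponent. The short sketch below uses the exponents quoted above to show the fractional loss reduction from a 10x scale-up in each factor.

```python
# Implication of Kaplan-style exponents: a 10x scale-up in one factor
# multiplies the power-law loss term by 10**exponent.
exponents = {"parameters": -0.076, "data": -0.095, "compute": -0.050}

for factor, exp in exponents.items():
    reduction = 1 - 10**exp   # fractional reduction from a 10x scale-up
    print(f"10x more {factor}: loss term falls by ~{reduction:.0%}")
```

The modest per-decade gains explain why frontier training runs span many orders of magnitude of compute: each 10x buys only a 10-20% improvement in the reducible loss.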
Open Questions
- Whether scaling laws for loss translate reliably to downstream task performance and emergent abilities.
- How scaling laws change for multimodal, code, or reasoning-focused training.
- Whether there are fundamental limits or phase transitions where power-law scaling breaks down.