Hoffmann et al. (2022) from DeepMind revisit compute-optimal scaling laws for transformer language models and find that model size and training tokens should be scaled in equal proportion, overturning the conclusions of Scaling Laws for Neural Language Models (Kaplan et al., 2020). The resulting 70B-parameter model, Chinchilla, outperforms much larger models trained on fewer tokens.

Problem

Prior scaling analysis (Kaplan et al., 2020) recommended growing model size much faster than training data as compute increases, leading the field to train very large models (175B-530B parameters) on only ~300B tokens, leaving those models significantly undertrained.

Key Contribution

Three independent empirical approaches all converge on the same conclusion: for every doubling of model size, the number of training tokens should also double. By this criterion, existing large models are substantially oversized for their compute budgets; the same compute is better spent training smaller models on more data.
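The doubling rule can be turned into a rough allocation recipe. A minimal sketch, assuming the standard C ≈ 6ND estimate of training FLOPs and the roughly 20-tokens-per-parameter ratio implied by the paper's results (both are rules of thumb, not exact constants):

```python
def optimal_allocation(flops_budget):
    """Split a compute budget C ~= 6*N*D between parameters N and
    tokens D, scaling both equally (N and D each grow as C**0.5).
    Uses the approximate Chinchilla ratio of ~20 tokens per parameter."""
    nd = flops_budget / 6.0        # N * D implied by C ~= 6*N*D
    n = (nd / 20.0) ** 0.5         # parameters
    d = 20.0 * n                   # training tokens
    return n, d

# A Gopher-scale budget of ~5.76e23 FLOPs lands near Chinchilla's config
n, d = optimal_allocation(5.76e23)
print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.2f}T")
```

Note how the equal-scaling rule recovers a configuration close to Chinchilla's actual 70B-parameter, 1.4T-token run.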

Method

The authors train over 400 language models ranging from 70M to 16B parameters on 5B to 400B+ tokens, varying both dimensions systematically. They estimate the compute-optimal frontier with three approaches: fixing model sizes and varying training tokens, fitting IsoFLOP profiles, and fitting a parametric form for the final pre-training loss L(N, D) as a function of parameter count N and token count D. All three yield closely matching optimal allocation curves.
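The parametric approach fits a loss of the form L(N, D) = E + A/N^α + B/D^β. A small sketch using the fitted constants reported in the paper (treat the exact values as approximate):

```python
def chinchilla_loss(n_params, n_tokens):
    """Parametric loss fit from the paper's third estimation approach:
    L(N, D) = E + A / N**alpha + B / D**beta.
    E is the irreducible loss; the other two terms shrink as the model
    grows and as it sees more tokens, respectively."""
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params ** alpha + B / n_tokens ** beta

# Predicted loss for a Chinchilla-sized run (70B params, 1.4T tokens)
print(f"{chinchilla_loss(70e9, 1.4e12):.3f}")
```

Minimizing this fit subject to a fixed compute budget C ≈ 6ND is what yields the equal-scaling prescription.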

Main Results

  • Chinchilla (70B params, 1.4T tokens) uses the same compute as Gopher (280B params, 300B tokens) but uniformly outperforms it.
  • 67.5% average accuracy on MMLU, a 7+ point improvement over Gopher.
  • Also outperforms GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) across downstream tasks.
  • Being 4x smaller than Gopher also makes inference and fine-tuning substantially cheaper.
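The compute parity between Chinchilla and Gopher can be sanity-checked with the standard C ≈ 6ND approximation; under it, the two budgets land within about 17% of each other:

```python
def train_flops(n_params, n_tokens):
    """Standard approximation: training compute C ~= 6 * N * D FLOPs."""
    return 6 * n_params * n_tokens

gopher = train_flops(280e9, 300e9)       # 280B params, 300B tokens
chinchilla = train_flops(70e9, 1.4e12)   # 70B params, 1.4T tokens
print(f"ratio = {chinchilla / gopher:.2f}")  # → ratio = 1.17
```

The rough equality is the point: shifting the same budget from parameters to tokens is what produces the quality gap.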

Limitations

The analysis focuses on dense transformer models and may not directly apply to mixture-of-experts architectures or retrieval-augmented models. The scaling laws also assume data is effectively unlimited; real-world data quality and availability constraints may shift the optimal frontier.

Impact

Chinchilla fundamentally redirected the scaling race from “biggest model” to “best-trained model.” It directly motivated the training strategy of LLaMA and other efficient open models, and shifted community investment toward larger, higher-quality datasets.

Sources

  • Training Compute-Optimal Large Language Models (Hoffmann et al., 2022)