Definition

Pretraining is the first phase of modern language model development: a model is trained with a self-supervised objective (no task-specific labels) on a large text corpus to learn general-purpose representations of language. The pretrained model is then adapted to specific tasks through fine-tuning or used directly via in-context-learning.

Key Intuition

By predicting text at massive scale, a model is forced to learn syntax, semantics, factual knowledge, and reasoning patterns. The pretraining objective serves as a proxy for understanding language broadly, and the resulting representations transfer to a wide variety of downstream tasks.
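The proxy objective for a causal LM can be made concrete: the corpus is turned into (prefix, next-token) pairs, and pretraining minimizes the average negative log-likelihood the model assigns to the true next token. A minimal sketch (toy tokens and a uniform baseline model, both illustrative, not any particular system's pipeline):

```python
import math

def next_token_pairs(tokens):
    """Causal-LM training pairs: predict each token from its prefix."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def cross_entropy(pairs, prob_fn):
    """Average negative log-likelihood of the true next token --
    the quantity causal LM pretraining minimizes."""
    return -sum(math.log(prob_fn(ctx, tgt)) for ctx, tgt in pairs) / len(pairs)

tokens = ["the", "cat", "sat", "on", "the", "mat"]
pairs = next_token_pairs(tokens)          # 5 pairs from a 6-token sequence
vocab = set(tokens)                       # 5 unique tokens

# A clueless model that spreads probability uniformly over the vocabulary:
uniform = lambda ctx, tgt: 1.0 / len(vocab)
loss = cross_entropy(pairs, uniform)      # = log |V| = log 5, about 1.609
```

Any model that uses the prefix (syntax, semantics, world knowledge) to beat this uniform baseline lowers the loss, which is why minimizing it forces those capabilities to emerge.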

History/Origin

Neural language model pretraining gained prominence in 2018. gpt-1 (Radford et al., 2018) used causal language modeling (predicting the next token) on BookCorpus, showing the pretrain-then-fine-tune paradigm worked across NLP tasks. bert (Devlin et al., 2018) introduced masked language modeling (MLM), where random tokens are masked and predicted from bidirectional context. gpt-2 scaled causal LM pretraining and demonstrated zero-shot task performance. gpt-3 showed that scaling pretraining to 175B parameters unlocked in-context-learning. T5 used a denoising (span corruption) objective, unifying pretraining approaches under a text-to-text framework.
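The MLM corruption BERT introduced follows a specific recipe: roughly 15% of positions are selected for prediction, and of those, 80% are replaced with a [MASK] token, 10% with a random token, and 10% left unchanged. A minimal sketch with a toy vocabulary (the token lists are illustrative):

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "dog", "sat", "on", "mat"]  # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style MLM corruption. Returns (corrupted, labels), where
    labels[i] is the original token the model must predict, or None
    for positions that are not part of the objective."""
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok              # predict the original token here
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK      # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)  # 10%: random token
            # remaining 10%: leave the token unchanged
    return corrupted, labels
```

Because the masked positions are predicted from both left and right context, the encoder learns bidirectional representations, unlike the left-to-right constraint of causal LM.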

Relationship to Other Concepts

Pretraining enables transfer-learning by producing general representations. It precedes fine-tuning in the standard pipeline. The choice of pretraining objective (causal LM vs. masked LM vs. denoising) shapes model capabilities. scaling-laws describe how pretraining loss relates to compute, data, and parameters.
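The scaling-laws relationship is often written as a parametric loss with an irreducible term plus power-law penalties for finite model size and data, L(N, D) = E + A/N^α + B/D^β. A minimal sketch; the constants below are illustrative, in the spirit of the Chinchilla fit, not authoritative values:

```python
def scaling_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla-style parametric pretraining loss.

    N: parameter count, D: training tokens.
    E is the irreducible loss of natural text; the other two terms
    decay as power laws in model size and data volume.
    Constants are illustrative, not a definitive fit.
    """
    return E + A / N**alpha + B / D**beta

# Scaling both parameters and data lowers the predicted loss,
# but it can never drop below the irreducible term E.
small = scaling_loss(1e9, 2e10)     # ~1B params, ~20B tokens
large = scaling_loss(7e10, 1.4e12)  # ~70B params, ~1.4T tokens
```

Fitting such a curve on small runs and extrapolating is what lets labs choose N and D for a fixed compute budget before launching a large pretraining run.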

Notable Results

GPT-3 demonstrated that sufficiently large pretrained models can perform many tasks from a few in-context examples, without any gradient updates. BERT pretraining produced representations that transferred across the major NLP benchmarks of its era, including GLUE and SQuAD. The Chinchilla study (chinchilla) showed that pretraining data volume had been systematically under-prioritized relative to model size: for a fixed compute budget, parameters and training tokens should be scaled roughly in proportion.

Open Questions

  • Optimal data mixtures and curricula for pretraining.
  • When to stop pretraining (diminishing returns vs. emergent capabilities).
  • Whether pretraining on code and multimodal data fundamentally changes the learned representations.

Sources

  • Improving Language Understanding by Generative Pre-Training (Radford et al., 2018)
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018)