Radford et al. (2019) from OpenAI show that a sufficiently large language model trained on diverse web text can perform downstream NLP tasks in a zero-shot setting, without any parameter updates or task-specific training. GPT-2, a 1.5B-parameter transformer decoder, achieves state-of-the-art results on 7 of 8 language modeling benchmarks and demonstrates emergent task-solving abilities, pointing toward in-context-learning and scaling-laws.
Problem
NLP systems are narrow experts, trained and evaluated on single tasks with task-specific labeled datasets. This makes them brittle and unable to generalize. Multitask learning is promising but requires impractical numbers of curated (dataset, objective) pairs to work well.
Key Contribution
The insight that language modeling on sufficiently diverse text implicitly learns to perform many tasks, since any NLP task can be framed as conditional text generation: p(output | input, task). A large enough model trained on broad data can infer tasks from natural language prompts at test time, eliminating the need for fine-tuning.
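This framing can be made concrete with a small sketch: each task becomes a text pattern, so a single language model only ever does next-token prediction. The prompt formats below follow the paper's examples (e.g. "TL;DR:" for summarization); the helper function itself is illustrative, not from the paper.

```python
def as_lm_prompt(task: str, text: str) -> str:
    """Frame a task as conditional generation: p(output | input, task),
    where the task specification is itself just text in the prompt."""
    if task == "summarize":
        return f"{text}\nTL;DR:"            # summarization cue used in the paper
    if task == "translate_en_fr":
        return f"english: {text} french:"   # translation framed as a text pattern
    if task == "qa":
        return f"{text}\nA:"                # answer is generated after the question
    raise ValueError(f"unknown task: {task}")
```

Sampling a continuation of the returned prompt then *is* performing the task, which is why no fine-tuning step is required.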
Method
GPT-2 uses the same transformer decoder architecture as gpt-1 with modifications: layer normalization moved before each sub-block, an additional layer norm after the final self-attention block, and vocabulary expanded to 50,257 tokens via byte-pair encoding. The model is trained on WebText, a new dataset of ~8 million web pages (40GB of text) curated by following outbound links from Reddit posts with 3+ karma. Four model sizes were trained: 117M, 345M, 762M, and 1.5B parameters.
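The pre-norm reordering can be sketched as follows. This is a minimal NumPy illustration of the sub-block ordering only; `attn` and `mlp` stand in for the real self-attention and feed-forward layers.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Standard layer normalization over the last (feature) axis."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_norm_block(x, attn, mlp):
    """GPT-2 ordering: LayerNorm is applied *before* each sub-block
    (rather than after, as in GPT-1), with residual connections outside."""
    x = x + attn(layer_norm(x))
    x = x + mlp(layer_norm(x))
    return x
```

Placing normalization on the residual-stream input is generally credited with stabilizing training as depth grows, which matters at the 1.5B scale.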
Main Results
GPT-2 (1.5B) achieved SOTA on 7 of 8 language modeling datasets in a zero-shot setting (e.g., perplexity of 35.76 on Penn Treebank vs. prior SOTA of 46.54). On CoQA reading comprehension, GPT-2 reached 55 F1 zero-shot, matching or exceeding 3 of 4 supervised baselines without using the 127,000+ training examples. Performance improved log-linearly with model size across all tasks, and even the largest model still underfit WebText, suggesting further gains from scale.
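The log-linear trend means performance is roughly linear in log(parameter count), which is easy to check with a least-squares fit. The parameter counts below are the four GPT-2 sizes from the paper; the perplexity values are illustrative placeholders, not the paper's numbers.

```python
import numpy as np

# The four GPT-2 model sizes (parameters).
sizes = np.array([117e6, 345e6, 762e6, 1.542e9])

# Illustrative perplexities only (NOT the paper's reported values),
# chosen to show the shape of a log-linear trend.
ppl = np.array([40.0, 33.0, 28.0, 24.0])

# Fit perplexity as a linear function of log10(parameters).
slope, intercept = np.polyfit(np.log10(sizes), ppl, 1)
```

A negative slope on this fit is what "log-linear improvement with scale" looks like; extrapolating it is exactly the motivation the paper gives for training larger models.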
Limitations
Zero-shot performance, while promising, still fell well short of supervised fine-tuned systems on most tasks. The model underfits its training data, indicating that even 1.5B parameters does not exhaust what WebText can teach. WebText’s Reddit-based curation introduces demographic and content biases. The model can generate convincing but factually incorrect text.
Impact
GPT-2 demonstrated that scale and data diversity drive emergent capabilities, directly foreshadowing gpt-3’s few-shot in-context-learning. The log-linear relationship between model size and performance motivated formal scaling-laws research (scaling-laws-neural-lm, chinchilla). GPT-2 also catalyzed the debate around responsible release of powerful language models. The architecture became the template for subsequent gpt family members including gpt-3 and gpt-4-technical-report.