Radford et al. (2018) at OpenAI demonstrate that generative pretraining of a transformer decoder on unlabeled text, followed by discriminative fine-tuning, yields a general-purpose model that achieves state-of-the-art results on a wide range of NLU tasks. This paper established the pretrain-then-finetune paradigm that defines the GPT family and modern NLP.
Problem
Labeled data for natural language understanding tasks is scarce, while unlabeled text is abundant. Prior approaches to leveraging unlabeled data (e.g., word embeddings, ELMo) transferred only word-level or shallow representations. There was no consensus on the best pretraining objective or transfer method.
Key Contribution
A two-stage framework: (1) unsupervised pretraining with a language modeling objective on a large text corpus, and (2) supervised fine-tuning on downstream tasks with minimal architectural modifications. Task-specific input transformations (traversal-style formatting of structured inputs as token sequences) enable a single model to handle diverse tasks.
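The traversal-style input transformations can be illustrated with a minimal sketch. The token strings below (`<s>`, `<$>`, `<e>`) are placeholders: in the paper, the start, delimiter, and extract tokens are randomly initialized embeddings, not literal strings, and the function names here are hypothetical.

```python
# Hypothetical special-token strings standing in for the paper's learned
# start, delimiter, and extract embeddings.
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"

def format_entailment(premise: str, hypothesis: str) -> str:
    """Entailment: concatenate premise and hypothesis with a delimiter."""
    return f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"

def format_similarity(text_a: str, text_b: str) -> list[str]:
    """Similarity: no inherent ordering, so the paper feeds both
    orderings and sums their representations downstream."""
    return [
        f"{START} {text_a} {DELIM} {text_b} {EXTRACT}",
        f"{START} {text_b} {DELIM} {text_a} {EXTRACT}",
    ]
```

Because every task is reduced to one or more token sequences consumed by the same transformer plus a linear head, no task-specific architecture is needed.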
Method
The model is a 12-layer transformer decoder with masked self-attention, trained on the BooksCorpus (~800M words) using a standard left-to-right language modeling objective. For fine-tuning, task inputs are reformatted into token sequences with delimiter tokens, and a linear output layer is added. An auxiliary language modeling loss is included during fine-tuning to improve generalization and convergence. The model uses learned positional embeddings rather than fixed sinusoidal encodings.
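In the paper's notation, the fine-tuning objective combines the supervised task loss with the auxiliary language modeling loss (the paper reports using a weight of 0.5 for the auxiliary term):

```latex
L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)
\quad \text{(pretraining LM objective)}

L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \dots, x^m)
\quad \text{(supervised task objective)}

L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})
\quad \text{(combined fine-tuning objective)}
```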
Main Results
The model outperformed discriminatively trained task-specific architectures, achieving SOTA on 9 of 12 benchmarks studied. Key improvements: 8.9% absolute on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), 1.5% on textual entailment (MultiNLI), and 5.5% on the GLUE benchmark. Zero-shot analysis showed the pretrained model acquired useful linguistic knowledge before any fine-tuning.
Limitations
The unidirectional (left-to-right) language model cannot attend to future context, which is suboptimal for tasks requiring bidirectional understanding (a limitation BERT later addressed). The model is relatively small by later standards. Fine-tuning is still required for each downstream task.
Impact
GPT-1 established the pretrain-then-finetune paradigm that became dominant in NLP. It directly led to GPT-2 (scaling up, zero-shot transfer), GPT-3 (in-context learning), and ultimately GPT-4. The paper also prompted BERT's key insight that bidirectional pretraining could outperform unidirectional approaches. The transformer decoder architecture became the basis for the entire GPT model family and most modern autoregressive language models.