Brown et al. (2020) at OpenAI train GPT-3, a 175-billion-parameter autoregressive transformer language model, 10x larger than any previous non-sparse model. The paper demonstrates that in-context learning, where the model conditions on a few examples at inference time without any gradient updates, improves dramatically with scale and can approach or match fine-tuned models on many NLP benchmarks.

Problem

The dominant pretraining-then-fine-tuning paradigm requires task-specific datasets of thousands of examples and produces models that may overfit to narrow distributions. Humans, by contrast, can perform new tasks from just a few examples or simple instructions.

Key Contribution

Demonstrating that sufficiently large language models can perform a wide range of NLP tasks via in-context learning alone (zero-shot, one-shot, and few-shot), without any parameter updates. Few-shot performance scales smoothly and steeply with model size, establishing a new paradigm for task-agnostic AI systems.

Method

GPT-3 uses nearly the same architecture as GPT-2 (a decoder-only transformer), except with alternating dense and locally banded sparse attention patterns, and scales to 175B parameters across 96 layers with a context window of 2048 tokens. Training uses a filtered Common Crawl corpus (~570GB after cleaning), plus WebText2, Books1, Books2, and English Wikipedia. Evaluation uses three settings: zero-shot (natural language instruction only), one-shot (one demonstration), and few-shot (typically 10 to 100 demonstrations in context).
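The three evaluation settings differ only in how the prompt string is assembled. A minimal sketch of that assembly (the English-to-French demonstrations follow the paper's running translation example; the helper function itself is a hypothetical illustration, not the paper's evaluation harness):

```python
def build_prompt(instruction, demonstrations, query):
    """Concatenate a task description, K demonstrations, and the test query.

    K = 0 gives the zero-shot setting, K = 1 one-shot, and K > 1 few-shot.
    The model simply continues the resulting string; no gradients are computed.
    """
    lines = [instruction]
    for source, target in demonstrations:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model completes from here
    return "\n".join(lines)

# Few-shot (K = 2) translation, as in the paper's illustrative figures.
demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
prompt = build_prompt("Translate English to French:", demos, "plush giraffe")
print(prompt)
```

Because the demonstrations live only in the context window, the same frozen weights can switch tasks by swapping the prompt.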

Main Results

GPT-3 achieves strong few-shot results across translation, question answering, cloze tasks, and reasoning. On SuperGLUE, few-shot GPT-3 approaches fine-tuned BERT-Large baselines. On closed-book TriviaQA, few-shot GPT-3 reaches 71.2% accuracy, surpassing fine-tuned open-domain baselines. It performs 2- and 3-digit arithmetic (100% few-shot accuracy on 2-digit addition) and generates news articles that human evaluators find difficult to distinguish from real articles (~52% detection accuracy, near chance). Performance improves smoothly and substantially from 1.3B to 175B parameters across all three settings.
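The arithmetic tasks are likewise posed as plain text completion. A sketch of how a few-shot 2-digit addition prompt could be built (the "Q: ... A:" phrasing follows the paper's task format; the generator function is an assumption for illustration):

```python
import random

def two_digit_addition_prompt(k=3, seed=0):
    """Build a k-shot prompt of solved 2-digit additions plus one open query."""
    rng = random.Random(seed)
    pairs = [(rng.randint(10, 99), rng.randint(10, 99)) for _ in range(k + 1)]
    # The first k problems come with answers; they are the demonstrations.
    lines = [f"Q: What is {a} plus {b}? A: {a + b}" for a, b in pairs[:-1]]
    a, b = pairs[-1]
    lines.append(f"Q: What is {a} plus {b}? A:")  # the model must emit a + b
    return "\n".join(lines), a + b

prompt, expected = two_digit_addition_prompt()
```

Scoring then reduces to checking whether the model's completion matches `expected`, with no arithmetic-specific machinery in the model itself.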

Limitations

GPT-3 struggles with tasks requiring comparison or bidirectional reasoning (e.g., ANLI). Text generation can be repetitive or lose coherence. The model raises concerns around bias, misuse, and energy consumption. The autoregressive architecture limits document-level understanding compared to bidirectional models like BERT.

Impact

GPT-3 catalyzed the modern era of large language models and prompted research into in-context learning, prompt engineering, and chain-of-thought reasoning. It motivated InstructGPT and RLHF-based alignment work. The GPT series continued with GPT-4. GPT-3 also spawned a commercial API, demonstrating the viability of language models as a service.

Sources

  • Language Models are Few-Shot Learners (File, DOI)