Raffel et al. (2020) introduce T5 (Text-to-Text Transfer Transformer), a unified framework that casts every NLP task as a text-to-text problem. By feeding task-specific prefixes (e.g., “translate English to German:”, “summarize:”) alongside the input, T5 uses the same encoder-decoder transformer model, loss function, and hyperparameters across all tasks. The paper is primarily a large-scale empirical survey of transfer-learning techniques, culminating in state-of-the-art results at up to 11 billion parameters.
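The prefix mechanism can be sketched in a few lines. This is a minimal illustration, not the paper's actual preprocessing code; the helper name and the task keys are ours, though the prefix strings follow the paper's examples:

```python
# Sketch of T5's text-to-text framing: every task becomes
# string-in, string-out by prepending a task prefix.
def to_text_to_text(task, text, label=None):
    prefixes = {
        "translate_en_de": "translate English to German: ",
        "summarize": "summarize: ",
    }
    source = prefixes[task] + text
    # Labels are also rendered as text (e.g. a class name rather
    # than an integer index), so one decoder serves every task.
    target = label
    return source, target

src, tgt = to_text_to_text("translate_en_de", "That is good.", "Das ist gut.")
# src == "translate English to German: That is good."
# tgt == "Das ist gut."
```

Because inputs and outputs are plain strings, the same maximum-likelihood loss and decoding procedure apply to translation, summarization, and classification alike.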
Problem
The rapid proliferation of pre-training objectives, architectures, and fine-tuning strategies made it difficult to compare approaches or understand which design choices matter most for NLP transfer learning.
Key Contribution
A unified text-to-text formulation that enables systematic apples-to-apples comparison of pre-training objectives (denoising, language modeling, prefix LM), architectures (encoder-decoder vs. decoder-only vs. encoder-only), unlabeled datasets, and scaling strategies. The authors also release the Colossal Clean Crawled Corpus (C4), a cleaned Common Crawl dataset of hundreds of gigabytes.
Method
T5 uses an encoder-decoder transformer with relative position embeddings, a simplified layer normalization (no additive bias), and shared input-output embeddings. Pre-training uses a span-corruption denoising objective on C4; the model is then fine-tuned on each downstream task in text-to-text format. Model sizes range from 60M to 11B parameters, trained on Cloud TPU Pods.
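The span-corruption objective replaces contiguous spans of the input with sentinel tokens and trains the model to reproduce the dropped spans. A simplified sketch, assuming word-level tokens and explicitly given spans (the paper instead samples spans with an average length of 3 at a 15% corruption rate); the function name is ours, but the sentinel format matches T5's vocabulary:

```python
# Sketch of T5-style span corruption: each (start, length) span is
# replaced by a sentinel in the input; the target lists the sentinels
# followed by the dropped tokens, closed by one final sentinel.
def span_corrupt(tokens, spans):
    inp, tgt = [], []
    pos = 0
    for i, (start, length) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp.extend(tokens[pos:start])   # keep uncorrupted tokens
        inp.append(sentinel)            # mark the dropped span
        tgt.append(sentinel)
        tgt.extend(tokens[start:start + length])
        pos = start + length
    inp.extend(tokens[pos:])
    tgt.append(f"<extra_id_{len(spans)}>")  # end-of-targets sentinel
    return inp, tgt

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, [(2, 2), (8, 1)])
# inp == ['Thank', 'you', '<extra_id_0>', 'me', 'to', 'your',
#         'party', '<extra_id_1>', 'week']
# tgt == ['<extra_id_0>', 'for', 'inviting', '<extra_id_1>',
#         'last', '<extra_id_2>']
```

Predicting only the dropped spans (rather than the full input, as in BERT-style denoising) keeps target sequences short, which the paper notes reduces pre-training cost.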
Main Results
The 11B-parameter T5 achieved state-of-the-art on SuperGLUE, SQuAD, CNN/DailyMail summarization, and other benchmarks. Key findings: encoder-decoder architectures outperform decoder-only at equivalent compute; denoising objectives beat language modeling for pre-training; and scaling model size, data, and training steps all help, with diminishing returns on data repetition.
Limitations
The systematic study is restricted to English. The text-to-text framing adds overhead for tasks where classification heads would suffice. Training the largest models required enormous compute resources.
Impact
T5 became a foundational model for subsequent work, including Switch Transformer and Flan-T5. The C4 dataset became a standard pre-training corpus. The text-to-text framing influenced how later models like GPT-3 handle multi-task learning via natural language prompts, and paved the way for instruction-tuned models.