Definition
Transfer learning is the practice of leveraging knowledge acquired from one task or domain to improve performance on a different but related task. In modern NLP, this takes the form of pretraining a model on a large general corpus, then fine-tuning it on a specific downstream task.
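The pretrain-then-fine-tune recipe can be sketched in miniature. Everything here is hypothetical: `pretrained_features` stands in for a frozen pretrained model, the four labeled strings stand in for a downstream dataset, and a perceptron-style linear head stands in for the task-specific layer that fine-tuning trains.

```python
# Hypothetical "pretrained" feature extractor: in a real system this would be
# a large model trained on a general corpus; here it is a fixed function whose
# parameters are frozen during fine-tuning.
def pretrained_features(text):
    # Toy representation: counts of a few general-purpose cues.
    return [text.count("great"), text.count("awful"), len(text.split())]

# Small labeled downstream dataset (hypothetical sentiment task).
train = [("great movie", 1), ("awful plot", 0),
         ("great acting, great script", 1), ("awful, simply awful", 0)]

# Fine-tuning: train only a lightweight linear head on top of the frozen
# features, using plain perceptron updates.
weights = [0.0, 0.0, 0.0]
bias = 0.0
for _ in range(20):
    for text, label in train:
        x = pretrained_features(text)
        pred = 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0
        err = label - pred
        weights = [w + 0.1 * err * xi for w, xi in zip(weights, x)]
        bias += 0.1 * err

def classify(text):
    x = pretrained_features(text)
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0
```

The point of the sketch is the division of labor: the expensive general-purpose representation is reused as-is, and only a small head is trained on the scarce task-specific labels.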
Key Intuition
Language has shared structure across tasks: syntax, semantics, world knowledge, and reasoning patterns. A model trained on a massive text corpus learns general-purpose representations that transfer to tasks with far less labeled data. This is analogous to how ImageNet pretraining revolutionized computer vision.
History/Origin
Transfer learning has roots in machine learning theory from the 1990s, but it became transformative for NLP in 2018. Howard and Ruder's ULMFiT (2018) demonstrated that pretraining a language model and carefully fine-tuning it could achieve state-of-the-art results. GPT-1 (Radford et al., 2018) showed that generative pretraining followed by discriminative fine-tuning worked across diverse tasks. BERT (Devlin et al., 2018) achieved dramatic improvements through masked language model pretraining. T5 (Raffel et al., 2019) unified all tasks into a text-to-text framework, systematically studying transfer learning design choices.
Relationship to Other Concepts
Transfer learning depends on pretraining to build reusable representations and on fine-tuning to adapt them. Word embeddings were an early, limited form of transfer. In-context learning represents a different transfer paradigm in which task adaptation happens at inference time without gradient updates. Low-rank adaptation (LoRA) enables parameter-efficient transfer.
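The low-rank adaptation idea mentioned above can be shown numerically. This is an illustrative sketch with made-up shapes and values, not any library's API: the pretrained weight W stays frozen, only a rank-r update B·A is trained, and the effective weight is W + (α/r)·B·A.

```python
# Frozen pretrained weight matrix W (4x4), kept fixed during adaptation.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

r, alpha = 1, 2.0  # rank of the update and a scaling hyperparameter

# Trainable low-rank factors B (4 x r) and A (r x 4). Only these 8 numbers
# would receive gradients, versus all 16 entries in full fine-tuning.
B = [[0.5], [0.0], [0.0], [0.0]]
A = [[0.0, 1.0, 0.0, 0.0]]

def matmul(X, Y):
    # Plain nested-loop matrix product, to keep the sketch dependency-free.
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

# Effective weight used at inference: W + (alpha / r) * B @ A.
delta = matmul(B, A)
W_eff = [[W[i][j] + (alpha / r) * delta[i][j] for j in range(4)]
         for i in range(4)]
```

The parameter savings grow with model size: for a d×d weight, full fine-tuning trains d² parameters while the low-rank update trains only 2·r·d.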
Notable Results
BERT fine-tuned on GLUE surpassed the human baseline on several of its component tasks. GPT-1 showed a single pretrained model could improve the state of the art on 9 of the 12 tasks studied, with minimal task-specific architecture changes. T5-11B achieved state-of-the-art results on SuperGLUE, SQuAD, and summarization benchmarks simultaneously.
Open Questions
- How to transfer effectively across languages and modalities.
- Whether negative transfer (where pretraining hurts) can be predicted and avoided.
- The degree to which scale alone drives transferability versus architectural or training choices.