Definition

Transfer learning is the practice of leveraging knowledge acquired from one task or domain to improve performance on a different but related task. In modern NLP, this takes the form of pretraining a model on a large general corpus, then fine-tuning it on a specific downstream task.
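The pretrain-then-fine-tune workflow can be sketched in miniature. The snippet below is a toy illustration, not any particular library's API: a fixed random projection stands in for a pretrained encoder, the downstream dataset is synthetic, and "fine-tuning" is reduced to its cheapest form, training a small task head on frozen features (linear probing). All names and shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained encoder: in practice these weights come from
# large-scale pretraining on a general corpus; here they are fixed random
# features purely for illustration (hypothetical).
W_pretrained = rng.normal(size=(16, 8))

def encode(x):
    """Frozen pretrained representation; never updated during fine-tuning."""
    return np.tanh(x @ W_pretrained)

# Tiny labeled downstream dataset, constructed to be separable in the
# pretrained feature space so the toy example trains cleanly.
X = rng.normal(size=(64, 16))
v_true = rng.normal(size=8)
y = (encode(X) @ v_true > 0).astype(float)

# "Fine-tuning" here trains only a small task head (logistic regression)
# on top of the frozen features.
H = encode(X)                  # features computed once; encoder stays frozen
w, b = np.zeros(8), 0.0
lr = 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(H @ w + b)))   # sigmoid probabilities
    grad = (p - y) / len(y)                  # log-loss gradient w.r.t. logits
    w -= lr * H.T @ grad
    b -= lr * grad.sum()

acc = (((H @ w + b) > 0) == (y > 0.5)).mean()
```

The key point the sketch preserves is the division of labor: the expensive representation (`W_pretrained`) is reused as-is, while only a tiny number of task-specific parameters are learned from the small labeled set. Full fine-tuning would additionally update the encoder weights with a small learning rate.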

Key Intuition

Language has shared structure across tasks: syntax, semantics, world knowledge, and reasoning patterns. A model trained on a massive text corpus learns general-purpose representations that transfer to tasks with far less labeled data. This is analogous to how ImageNet pretraining revolutionized computer vision.

History/Origin

Transfer learning has roots in machine learning theory from the 1990s, but became transformative for NLP in 2018. Howard and Ruder’s ULMFiT (2018) demonstrated that pretraining a language model and carefully fine-tuning it could achieve state-of-the-art results. GPT-1 (Radford et al., 2018) showed that generative pretraining followed by discriminative fine-tuning worked across diverse tasks. BERT (Devlin et al., 2018) achieved dramatic improvements through masked language model pretraining. T5 (Raffel et al., 2019) unified all tasks into a text-to-text framework and systematically studied transfer learning design choices.

Relationship to Other Concepts

Transfer learning depends on pretraining to build reusable representations and fine-tuning to adapt them. word-embeddings were an early, limited form of transfer. in-context-learning represents a different transfer paradigm where task adaptation happens at inference time without gradient updates. low-rank-adaptation enables parameter-efficient transfer.
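To make the "parameter-efficient" claim about low-rank adaptation concrete, the sketch below shows the arithmetic behind LoRA-style adapters: the pretrained weight stays frozen while two small low-rank factors are trained, and the adapted weight is the frozen weight plus a scaled low-rank update. The hidden size, rank, and scaling value are hypothetical choices for illustration, not prescribed defaults.

```python
import numpy as np

d, r = 1024, 8                       # hidden size and adapter rank (hypothetical)
alpha = 16                           # scaling hyperparameter for the update
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pretrained weight, never updated
A = rng.normal(size=(r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                 # B starts at zero, so training begins
                                     # exactly at the pretrained model

def effective_weight():
    # Fine-tuning updates only A and B; the adapted weight is
    # W + (alpha / r) * B @ A, a rank-r perturbation of W.
    return W + (alpha / r) * B @ A

full_params = W.size                 # parameters in full fine-tuning
lora_params = A.size + B.size        # parameters actually trained
```

At these shapes the adapter trains 2·d·r = 16,384 parameters versus d² ≈ 1.05 million for full fine-tuning, under 2% of the total, which is why many task-specific adapters can be stored and swapped against one shared pretrained backbone.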

Notable Results

BERT fine-tuned on GLUE surpassed human baselines on several benchmarks. GPT-1 showed that a single pretrained model, adapted with minimal architecture changes, could improve the state of the art on 9 of the 12 tasks studied. T5-11B achieved state-of-the-art on SuperGLUE, SQuAD, and summarization simultaneously.

Open Questions

  • How to transfer effectively across languages and modalities.
  • Whether negative transfer (where pretraining hurts) can be predicted and avoided.
  • The degree to which scale alone drives transferability versus architectural or training choices.

Sources

  • Improving Language Understanding by Generative Pre-Training (File, URL)
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (File, DOI)
  • Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (File, DOI)