Definition
Transfer learning is the practice of leveraging knowledge acquired from one task or domain to improve performance on a different but related task. In modern NLP, this takes the form of pretraining a model on a large general corpus, then fine-tuning it on a specific downstream task.
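The pretrain-then-fine-tune recipe can be sketched in miniature. Everything here is hypothetical: `pretrained_features` stands in for a frozen pretrained model, the four labeled strings stand in for a downstream dataset, and a perceptron-style linear head stands in for the task-specific layer that fine-tuning trains.

```python
# Hypothetical "pretrained" feature extractor: in a real system this would be
# a large model trained on a general corpus; here it is a fixed function whose
# parameters are frozen during fine-tuning.
def pretrained_features(text):
    # Toy representation: counts of a few general-purpose cues.
    return [text.count("great"), text.count("awful"), len(text.split())]

# Small labeled downstream dataset (hypothetical sentiment task).
train = [("great movie", 1), ("awful plot", 0),
         ("great acting, great script", 1), ("awful, simply awful", 0)]

# Fine-tuning: train only a lightweight linear head on top of the frozen
# features, using plain perceptron updates.
weights = [0.0, 0.0, 0.0]
bias = 0.0
for _ in range(20):
    for text, label in train:
        x = pretrained_features(text)
        pred = 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0
        err = label - pred
        weights = [w + 0.1 * err * xi for w, xi in zip(weights, x)]
        bias += 0.1 * err

def classify(text):
    x = pretrained_features(text)
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0
```

The point of the sketch is the division of labor: the expensive general-purpose representation is reused as-is, and only a small head is trained on the scarce task-specific labels.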
Key Intuition
Language has shared structure across tasks: syntax, semantics, world knowledge, and reasoning patterns. A model trained on a massive text corpus learns general-purpose representations that transfer to tasks with far less labeled data. This is analogous to how ImageNet pretraining revolutionized computer vision.
History/Origin
Transfer learning has roots in machine learning theory from the 1990s, but it became transformative for NLP in 2018. Howard and Ruder's ULMFiT (2018) demonstrated that pretraining a language model and carefully fine-tuning it could achieve state-of-the-art results. GPT-1 (Radford et al., 2018) showed that generative pretraining followed by discriminative fine-tuning worked across diverse tasks. BERT (Devlin et al., 2018) achieved dramatic improvements through masked language model pretraining. T5 (Raffel et al., 2019) unified all tasks into a text-to-text framework, systematically studying transfer learning design choices.
Relationship to Other Concepts
Transfer learning depends on pretraining to build reusable representations and on fine-tuning to adapt them. Word embeddings were an early, limited form of transfer. In-context learning represents a different transfer paradigm in which task adaptation happens at inference time without gradient updates. Low-rank adaptation (LoRA) enables parameter-efficient transfer.
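The low-rank adaptation idea mentioned above can be shown numerically. This is an illustrative sketch with made-up shapes and values, not any library's API: the pretrained weight W stays frozen, only a rank-r update B·A is trained, and the effective weight is W + (α/r)·B·A.

```python
# Frozen pretrained weight matrix W (4x4), kept fixed during adaptation.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

r, alpha = 1, 2.0  # rank of the update and a scaling hyperparameter

# Trainable low-rank factors B (4 x r) and A (r x 4). Only these 8 numbers
# would receive gradients, versus all 16 entries in full fine-tuning.
B = [[0.5], [0.0], [0.0], [0.0]]
A = [[0.0, 1.0, 0.0, 0.0]]

def matmul(X, Y):
    # Plain nested-loop matrix product, to keep the sketch dependency-free.
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

# Effective weight used at inference: W + (alpha / r) * B @ A.
delta = matmul(B, A)
W_eff = [[W[i][j] + (alpha / r) * delta[i][j] for j in range(4)]
         for i in range(4)]
```

The parameter savings grow with model size: for a d×d weight, full fine-tuning trains d² parameters while the low-rank update trains only 2·r·d.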
Notable Results
BERT fine-tuned on GLUE surpassed the human baseline on several of its component tasks. GPT-1 showed a single pretrained model could improve the state of the art on 9 of the 12 tasks studied, with minimal task-specific architecture changes. T5-11B achieved state-of-the-art results on SuperGLUE, SQuAD, and summarization benchmarks simultaneously.
Open Questions
- How to transfer effectively across languages and modalities.
- Whether negative transfer (where pretraining hurts) can be predicted and avoided.
- The degree to which scale alone drives transferability versus architectural or training choices.