Ouyang et al. (2022) from OpenAI demonstrate that fine-tuning GPT models with reinforcement learning from human feedback (RLHF) produces language models far better aligned with user intent than raw scale alone. The 1.3B-parameter InstructGPT model is preferred by human labelers over the 175B GPT-3, despite having over 100x fewer parameters.
Problem
Large language models trained on next-token prediction are misaligned with user intent: they fabricate facts, produce toxic text, and fail to follow instructions. Simply scaling model size does not resolve these issues.
Key Contribution
A three-step RLHF pipeline for aligning language models: (1) supervised fine-tuning (SFT) on human demonstrations, (2) reward model (RM) training on human preference rankings, and (3) reinforcement learning via PPO against the reward model. The resulting InstructGPT models set the template for subsequent alignment work.
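Step (2) trains the reward model on pairwise comparisons: given two sampled outputs, the RM should score the labeler-preferred one higher. A minimal sketch of the standard pairwise (Bradley-Terry-style) loss, with hypothetical function and argument names:

```python
import math

def rm_pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss for reward-model training:
    -log sigmoid(r_chosen - r_rejected).
    The loss shrinks as the RM scores the human-preferred output
    further above the rejected one."""
    margin = r_chosen - r_rejected
    # Numerically stable form of -log(sigmoid(margin)):
    # log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))
```

When the RM is indifferent (equal scores) the loss is log 2; a larger margin in the preferred direction drives it toward zero, which is what gradient descent on this loss encourages.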
Method
Starting from GPT-3 (1.3B, 6B, 175B variants), the authors collect labeler-written demonstrations to train an SFT model, then gather comparison data where labelers rank model outputs to train a reward model. The SFT model is further fine-tuned with PPO to maximize the learned reward. A variant (PPO-ptx) mixes PPO updates with pretraining log-likelihood updates to reduce the “alignment tax” on standard NLP benchmarks.
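The per-sample quantity PPO maximizes combines the RM score, a KL penalty keeping the policy close to the SFT model, and (for PPO-ptx) a pretraining log-likelihood bonus. A sketch under illustrative assumptions; the function name and default coefficients are placeholders, not the paper's values:

```python
def ppo_ptx_objective(rm_score: float,
                      logp_rl: float,
                      logp_sft: float,
                      pretrain_logp: float,
                      beta: float = 0.02,
                      gamma: float = 1.0) -> float:
    """Scalar objective for one sample (the paper optimizes an
    expectation of this with PPO).
    rm_score:      learned reward for the sampled response
    logp_rl/sft:   log-prob of the response under the RL policy / SFT model
    pretrain_logp: log-prob of a pretraining example under the RL policy
    beta, gamma:   illustrative placeholder coefficients
    """
    # Per-sequence KL estimate penalizing drift from the SFT policy.
    kl_penalty = beta * (logp_rl - logp_sft)
    # gamma > 0 gives the PPO-ptx variant; gamma = 0 recovers plain PPO.
    return (rm_score - kl_penalty) + gamma * pretrain_logp
```

Setting gamma to zero recovers plain PPO; the ptx term is what reduces the "alignment tax" on standard NLP benchmarks by keeping pretraining likelihoods high.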
Main Results
- 175B InstructGPT preferred over 175B GPT-3 85% of the time, and over few-shot GPT-3 71% of the time.
- TruthfulQA: InstructGPT generates truthful and informative answers roughly twice as often as GPT-3.
- Hallucination rate drops from 41% (GPT-3) to 21% (InstructGPT) on closed-domain tasks.
- 25% reduction in toxic outputs when prompted to be respectful.
Limitations
InstructGPT still makes simple mistakes, can fail to follow instructions, and does not significantly improve on bias benchmarks (Winogender, CrowS-Pairs). Alignment is to a specific labeler population, not a universal notion of human values.
Impact
InstructGPT established RLHF as the dominant paradigm for aligning LLMs, directly influencing the training of ChatGPT and motivating alternatives such as Constitutional AI and Direct Preference Optimization (DPO). The finding that a small aligned model can beat a 100x-larger unaligned model reshaped how the field thinks about the relationship between scaling laws and alignment.