Definition
Reinforcement Learning from Human Feedback (RLHF) is a technique for aligning language models with human preferences by training a reward model on human comparison data and then optimizing the language model against that reward using reinforcement learning (typically PPO).
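The reward-model step is typically trained with a pairwise (Bradley-Terry) loss on human comparisons: the model should score the preferred response above the rejected one. A minimal sketch, with illustrative scalar scores standing in for the reward model's outputs:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise Bradley-Terry loss on one human comparison:
    -log P(chosen preferred) = -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# Loss is small when the reward model ranks the preferred response higher,
# and large when the ranking is inverted.
print(round(reward_model_loss(2.0, 0.0), 4))  # 0.1269
print(round(reward_model_loss(0.0, 2.0), 4))  # 2.1269
```

In practice the scalars come from a learned head on a language model and the loss is averaged over a batch of comparisons; the shape of the objective is the same.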
Key Intuition
Language model pretraining optimizes for next-token prediction, not for being helpful, truthful, or safe. RLHF bridges this gap by learning what humans actually prefer and using that signal to steer the model’s behavior, going beyond what can be captured by static datasets alone.
History/Origin
Christiano et al. (2017) introduced RLHF for simple RL tasks. Stiennon et al. (2020) applied it to summarization. InstructGPT (Ouyang et al., 2022) at OpenAI demonstrated the full three-step pipeline at scale: (1) supervised fine-tuning (SFT) on demonstrations, (2) reward model training on human preference comparisons, and (3) policy optimization with Proximal Policy Optimization (PPO) against the reward model. The striking result was that outputs from a 1.3B RLHF-tuned model were preferred by human raters over those from the 175B base GPT-3.
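In step (3), the policy is not optimized against the raw reward-model score alone: a per-sample KL penalty toward the SFT/reference model keeps the policy from drifting into degenerate, reward-hacked outputs. A minimal sketch of that shaped reward, with a hypothetical coefficient `beta` (real systems tune it):

```python
def shaped_reward(rm_score: float,
                  logp_policy: float,
                  logp_ref: float,
                  beta: float = 0.1) -> float:
    """Reward the PPO step optimizes: the reward-model score minus a
    KL-style penalty, estimated here per sample as beta * (log pi(y|x)
    - log pi_ref(y|x)). Positive drift away from the reference is taxed."""
    return rm_score - beta * (logp_policy - logp_ref)

# Same reward-model score, but drifting from the reference is penalized.
print(round(shaped_reward(1.0, logp_policy=-2.0, logp_ref=-2.0), 2))  # 1.0
print(round(shaped_reward(1.0, logp_policy=-1.0, logp_ref=-3.0), 2))  # 0.8
```

In full implementations this penalty is accumulated per token along the generated sequence; the scalar version above shows only the structure of the objective.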
Relationship to Other Concepts
RLHF builds on fine-tuning (the SFT step) and pretraining (the base model). constitutional-ai replaces human labels with AI-generated critiques (RLAIF). direct-preference-optimization eliminates the explicit reward model and RL loop, offering a simpler alternative. RLHF is a core component of the alignment pipeline used in ChatGPT, Claude, and other deployed systems.
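How DPO "eliminates the explicit reward model" can be seen in its loss: the policy's own log-probability ratio against the reference model plays the role of an implicit reward, so the pairwise comparison is applied directly to the policy. A minimal sketch with illustrative log-probabilities:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss on one preference pair. The implicit reward is
    beta * log(pi/pi_ref); the loss is the same Bradley-Terry form
    RLHF uses for its reward model, but applied directly to the policy."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss falls as the policy raises the chosen response's likelihood
# (relative to the reference) above the rejected one's.
print(round(dpo_loss(-1.0, -3.0, -2.0, -2.0), 4))  # 0.5981
```

The numbers are hypothetical per-sequence log-probabilities; in practice they are sums over token log-probs from the policy and the frozen reference model.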
Notable Results
InstructGPT demonstrated that RLHF substantially improves instruction-following while reducing harmful outputs and hallucination. The 1.3B RLHF model outperforming the 175B base model showed that alignment training is far more parameter-efficient than scaling alone for user-facing quality.
Open Questions
- Reward model overoptimization (Goodhart’s law) and how to detect and prevent it.
- Whether RLHF truly aligns models with human values or merely with surface preferences.
- Scalable oversight: how to provide reliable human feedback on tasks that exceed human expertise.