Definition

Reinforcement Learning from Human Feedback (RLHF) is a technique for aligning language models with human preferences by training a reward model on human comparison data and then optimizing the language model against that reward using reinforcement learning (typically PPO).
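The RL step typically maximizes the learned reward minus a KL penalty that keeps the policy close to the supervised fine-tuned reference model. A minimal sketch of that shaped reward, in plain Python (the function name and scalar interface are illustrative, not from any specific library):

```python
def shaped_reward(rm_score: float,
                  logp_policy: float,
                  logp_ref: float,
                  beta: float = 0.1) -> float:
    """Reward signal used in the RL step of RLHF.

    rm_score:    scalar score from the learned reward model
    logp_policy: log-prob of the sampled response under the current policy
    logp_ref:    log-prob of the same response under the frozen SFT model
    beta:        KL-penalty coefficient keeping the policy near the reference
    """
    kl_estimate = logp_policy - logp_ref  # per-sample KL estimate
    return rm_score - beta * kl_estimate

# A response the policy favors far more than the reference is penalized:
shaped_reward(rm_score=2.0, logp_policy=-5.0, logp_ref=-9.0, beta=0.1)
# 2.0 - 0.1 * 4.0 = 1.6
```

The KL term is what prevents the policy from drifting into degenerate outputs that exploit the reward model, at the cost of limiting how far optimization can move from the SFT model.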

Key Intuition

Language model pretraining optimizes for next-token prediction, not for being helpful, truthful, or safe. RLHF bridges this gap by learning what humans actually prefer and using that signal to steer the model’s behavior, going beyond what can be captured by static datasets alone.

History/Origin

Christiano et al. (2017) introduced learning from human preferences for deep RL control tasks. Stiennon et al. (2020) applied it to summarization. InstructGPT (Ouyang et al., 2022) at OpenAI demonstrated the full three-step pipeline at scale: (1) supervised fine-tuning (SFT) on demonstrations, (2) reward model training on human preference comparisons, and (3) policy optimization with Proximal Policy Optimization (PPO) against the reward model. The striking result was that outputs from the 1.3B RLHF-tuned model were preferred by human raters over those of the 175B base GPT-3.
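Step (2) of the pipeline trains the reward model on pairwise comparisons with a Bradley-Terry style logistic loss. A minimal plain-Python sketch, where the scalars stand in for the reward model's outputs on the chosen and rejected responses:

```python
import math

def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Under the Bradley-Terry model,
    P(chosen > rejected) = sigmoid(score_chosen - score_rejected);
    the reward model is trained to minimize this loss over human comparisons.
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss shrinks as the reward model ranks the chosen response higher:
pairwise_loss(1.0, -1.0)   # small loss: correct ranking, margin 2
pairwise_loss(-1.0, 1.0)   # large loss: reversed ranking
```

Only the difference in scores matters, so the reward model's scale is pinned down by the KL penalty and normalization choices rather than by this loss itself.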

Relationship to Other Concepts

RLHF builds on pretraining (the base model) and fine-tuning (the SFT step). Constitutional AI replaces human labels with AI-generated critiques (RLAIF). Direct Preference Optimization (DPO) eliminates the explicit reward model and RL loop, offering a simpler alternative. RLHF is a core component of the alignment pipeline used in ChatGPT, Claude, and other deployed systems.
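The DPO simplification is visible in its loss: preference pairs are scored directly from policy and reference log-probabilities, with no separate reward model or RL loop. A hedged plain-Python sketch (scalar log-probs stand in for summed token log-probs over each response):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Direct Preference Optimization loss for one comparison pair.

    The implicit reward of a response is beta times its log-prob ratio
    against the frozen reference model; the loss is the same pairwise
    logistic loss used for reward models, applied to those implicit rewards.
    """
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Because the loss depends only on log-prob ratios, DPO training looks like supervised fine-tuning on comparison data while implicitly optimizing the same KL-regularized objective RLHF targets with PPO.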

Notable Results

InstructGPT showed that RLHF reduces toxic outputs and hallucination while substantially improving instruction-following. That raters preferred the 1.3B RLHF model over the 175B base model showed that alignment training is far more parameter-efficient than scaling alone for user-facing quality.

Open Questions

  • Reward model overoptimization (Goodhart’s law) and how to detect and prevent it.
  • Whether RLHF truly aligns models with human values or merely with surface preferences.
  • Scalable oversight: how to provide reliable human feedback on tasks that exceed human expertise.

Sources

  • Ouyang et al. (2022), "Training language models to follow instructions with human feedback"