Bai et al. (2022) from Anthropic propose Constitutional AI (CAI), a method for training harmless AI assistants using AI-generated feedback instead of human harm labels. The approach uses a small set of natural-language principles (a “constitution”) to guide both supervised self-critique/revision and RL from AI feedback (RLAIF), reducing the need for human oversight while producing models that are both harmless and non-evasive.
Problem
Standard RLHF for harmlessness requires large volumes of human preference labels and often creates a tension between helpfulness and harmlessness: models trained to avoid harm tend to become evasive and less useful. Collecting human feedback at scale is also expensive and slow.
Key Contribution
A two-phase training method governed by a constitution of ~10 principles: (1) a supervised learning (SL) phase where the model critiques and revises its own harmful outputs, and (2) an RL phase using AI-generated preference labels (RLAIF) rather than human labels. This achieves a Pareto improvement on the helpfulness-harmlessness frontier.
Method
In the SL phase, an initial helpful-only RLHF model generates responses to red-teaming prompts; the model then critiques and revises these responses according to sampled constitutional principles, and the revised outputs are used to fine-tune the original model (SL-CAI). In the RL phase, pairs of responses are generated and evaluated by a feedback model guided by the constitution to form AI preference labels. A preference model is trained on this data and used as the reward signal for PPO training (RL-CAI). Chain-of-thought-style reasoning improves the transparency of the AI judgments.
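The control flow of the two phases can be sketched as below. This is a minimal illustration, not the paper's implementation: `generate` is a hypothetical stand-in for a call to the underlying RLHF model, and the example principles only gesture at the paper's actual constitution.

```python
import random

# Hypothetical stub standing in for a call to the helpful RLHF model.
# In the real pipeline this would be an LLM sampling call.
def generate(prompt: str) -> str:
    return f"<model output for: {prompt[:40]}...>"

# Illustrative principles; the paper's constitution has ~10 of these.
CONSTITUTION = [
    "Choose the response that is least harmful or toxic.",
    "Choose the response that is most helpful, honest, and harmless.",
]

def critique_revise(prompt: str, response: str, n_rounds: int = 1) -> str:
    """SL phase: critique a draft against a randomly sampled principle,
    then revise it; the final revision becomes SL-CAI training data."""
    for _ in range(n_rounds):
        principle = random.choice(CONSTITUTION)
        critique = generate(
            f"Critique this response per the principle: {principle}\n"
            f"Prompt: {prompt}\nResponse: {response}"
        )
        response = generate(
            f"Revise the response to address this critique: {critique}\n"
            f"Original response: {response}"
        )
    return response

def ai_preference_label(prompt: str, resp_a: str, resp_b: str) -> str:
    """RL phase: ask the feedback model which response better satisfies
    a sampled principle; the resulting (prompt, pair, label) triples
    train the preference model used as the PPO reward signal."""
    principle = random.choice(CONSTITUTION)
    judgment = generate(
        f"Per the principle '{principle}', which response to "
        f"'{prompt}' is better?\n(A) {resp_a}\n(B) {resp_b}\nAnswer A or B."
    )
    return "A" if "A" in judgment else "B"
```

In practice both phases sample a fresh principle for each critique or comparison, so no single principle dominates the training signal.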
Main Results
- Crowdworkers prefer RL-CAI models over models trained with human harmlessness labels.
- Constitutional AI achieves a Pareto improvement: less harmful at a given level of helpfulness than standard RLHF models.
- Chain-of-thought reasoning in the AI feedback phase improves both performance and the transparency of decisions.
- The method relies on only ~10 natural-language principles, requiring far fewer human annotations.
Limitations
The constitution was chosen in an ad hoc manner; principled methods for selecting and validating constitutional principles remain open. AI feedback may inherit biases from the models that generate it. The approach was validated at 52B-parameter scale and may behave differently at other scales.
Impact
Constitutional AI introduced RLAIF as a practical alternative to human-labeled RLHF, influencing alignment approaches at Anthropic (Claude) and beyond. It demonstrated that AI self-supervision, guided by explicit principles, can replace large-scale human labeling for safety training, motivating further work on scalable oversight.