Reinforcement Learning from Human Feedback (RLHF)
A training technique in which human evaluators rate AI outputs and reinforcement learning optimises the model to produce responses that humans prefer.
Reinforcement learning from human feedback is the training technique that turns a raw language model, one that merely predicts the next word, into an assistant that genuinely tries to be helpful, harmless, and honest.
The three-stage training pipeline
Modern AI assistants like ChatGPT and Claude are built in stages:
- Pre-training: The model learns language by predicting the next token across billions of text examples. After this stage, the model can complete text fluently but has no notion of being helpful or safe.
- Supervised fine-tuning: Human trainers write example conversations showing how an ideal assistant should respond. The model learns to mimic this behaviour.
- RLHF: Human evaluators compare multiple model outputs for the same prompt and rank them by quality. A reward model learns from these rankings, then reinforcement learning optimises the language model to produce outputs the reward model scores highly.
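The three stages above can be sketched as a data flow. This is a toy illustration with hypothetical function names, not a real training loop: each stage is reduced to a stub so you can see what moves between stages (raw text in, demonstrations and comparisons layered on top).

```python
# Toy sketch of the three-stage pipeline. Each stage is a stub:
# the "model" is just a dictionary accumulating what each stage adds.

def pretrain(corpus):
    """Stage 1: learn next-token statistics from raw text (stubbed
    as collecting a vocabulary)."""
    return {"vocab": sorted({tok for text in corpus for tok in text.split()})}

def supervised_fine_tune(model, demonstrations):
    """Stage 2: imitate human-written (prompt, ideal response) pairs."""
    return dict(model, demonstrations=list(demonstrations))

def rlhf(model, comparisons):
    """Stage 3: fit a reward model on human comparisons, then optimise
    the language model against it (both steps stubbed here)."""
    reward_model = {"num_comparisons": len(comparisons)}
    return dict(model, reward_model=reward_model)

corpus = ["the cat sat", "the dog ran"]
demos = [("Say hi", "Hello! How can I help?")]
comparisons = [("prompt", "response A", "response B", "A preferred")]

# Stages are applied in order; each one builds on the previous.
assistant = rlhf(supervised_fine_tune(pretrain(corpus), demos), comparisons)
```

The point of the sketch is the ordering: comparisons and the reward model only enter at the final stage, after the model can already produce plausible responses.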
Why RLHF is necessary
Pre-training alone produces a model that is impressive at generating text but unreliable as an assistant. It might produce harmful content, confidently fabricate facts, or give unhelpful responses. Supervised fine-tuning helps but is limited by how many examples humans can write. RLHF scales the alignment process by training a reward model that can evaluate millions of outputs without requiring a human to review each one.
How the reward model works
Human evaluators are shown the same prompt with two or more model responses and asked which is better. From thousands of these comparisons, a reward model learns to predict which outputs humans prefer. This reward model then provides the signal for reinforcement learning: the language model is optimised to generate responses that the reward model scores highly.
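A common way to train a reward model from such comparisons is a pairwise (Bradley-Terry style) loss: the model is penalised when it fails to score the human-preferred response above the rejected one. A minimal sketch, with plain floats standing in for a neural network's scores:

```python
import math

def pairwise_loss(score_preferred, score_rejected):
    """Negative log probability that the preferred response wins:
    -log sigmoid(r_preferred - r_rejected)."""
    diff = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# The loss shrinks as the reward model ranks the preferred response
# further above the rejected one.
assert pairwise_loss(2.0, 0.0) < pairwise_loss(0.5, 0.0)
```

Note that the loss depends only on the *difference* between the two scores, which matches how the data is collected: evaluators give relative rankings, not absolute quality ratings.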
Limitations and challenges
- Reward hacking: The model may learn to produce outputs that score well on the reward model without genuinely being better: for example, writing longer responses because evaluators tend to prefer them.
- Evaluator disagreement: Different humans have different preferences, and controversial topics may have no clear "better" answer.
- Constitutional AI: To reduce reliance on human evaluation, Anthropic (the company behind Claude) developed an alternative approach called Constitutional AI, which uses a set of written principles to guide the model's behaviour.
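One widely used mitigation for the reward-hacking problem above is to penalise the policy for drifting too far from the supervised fine-tuned reference model during reinforcement learning. A hedged sketch of that shaped objective, with the per-response KL term approximated from token log-probabilities (the function name and values are illustrative, not from any specific library):

```python
def shaped_reward(reward_score, policy_logprobs, reference_logprobs, beta=0.1):
    """Subtract a KL-style penalty from the raw reward model score:
    roughly reward(x, y) - beta * KL(policy || reference)."""
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward_score - beta * kl_estimate

# If the policy assigns its own tokens much higher log-probability than
# the reference model does, the penalty eats into the raw reward.
drifted = shaped_reward(1.0, [-0.1, -0.2], [-1.0, -1.5], beta=0.1)
close = shaped_reward(1.0, [-1.0, -1.5], [-1.0, -1.5], beta=0.1)
assert close > drifted
```

The coefficient `beta` trades off reward against faithfulness to the reference model: too low and the policy can exploit the reward model, too high and it barely improves.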
The impact on AI behaviour
RLHF is the reason modern AI assistants are polite, refuse harmful requests, acknowledge uncertainty, and try to be helpful. It is arguably the most consequential training technique in making AI safe and useful for widespread deployment.
Why This Matters
RLHF explains why different AI models behave differently despite similar underlying technology. Understanding this technique helps you evaluate why certain models are more appropriate for business use, why AI companies emphasise safety and alignment, and why model behaviour can shift between versions.
Continue learning in Advanced
This topic is covered in our lesson: How AI Models Are Trained and Aligned