Reinforcement Learning from Human Feedback (RLHF)
A training technique in which human evaluators rate AI outputs and reinforcement learning optimises the model to produce responses that humans prefer.
Reinforcement learning from human feedback is the training technique that turns a raw language model, one that merely predicts the next word, into an assistant that genuinely tries to be helpful, harmless, and honest.
The three-stage training pipeline
Modern AI assistants like ChatGPT and Claude are built in stages:
- Pre-training: The model learns language by predicting the next token across billions of text examples. After this stage, the model can complete text fluently but has no notion of being helpful or safe.
- Supervised fine-tuning: Human trainers write example conversations showing how an ideal assistant should respond. The model learns to mimic this behaviour.
- RLHF: Human evaluators compare multiple model outputs for the same prompt and rank them by quality. A reward model learns from these rankings, then reinforcement learning optimises the language model to produce outputs the reward model scores highly.
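The three stages above can be sketched as a data flow. This is a toy illustration with hypothetical function names, not a real training loop: each stage is reduced to a stub so you can see what moves between stages (raw text in, demonstrations and comparisons layered on top).

```python
# Toy sketch of the three-stage pipeline. Each stage is a stub:
# the "model" is just a dictionary accumulating what each stage adds.

def pretrain(corpus):
    """Stage 1: learn next-token statistics from raw text (stubbed
    as collecting a vocabulary)."""
    return {"vocab": sorted({tok for text in corpus for tok in text.split()})}

def supervised_fine_tune(model, demonstrations):
    """Stage 2: imitate human-written (prompt, ideal response) pairs."""
    return dict(model, demonstrations=list(demonstrations))

def rlhf(model, comparisons):
    """Stage 3: fit a reward model on human comparisons, then optimise
    the language model against it (both steps stubbed here)."""
    reward_model = {"num_comparisons": len(comparisons)}
    return dict(model, reward_model=reward_model)

corpus = ["the cat sat", "the dog ran"]
demos = [("Say hi", "Hello! How can I help?")]
comparisons = [("prompt", "response A", "response B", "A preferred")]

# Stages are applied in order; each one builds on the previous.
assistant = rlhf(supervised_fine_tune(pretrain(corpus), demos), comparisons)
```

The point of the sketch is the ordering: comparisons and the reward model only enter at the final stage, after the model can already produce plausible responses.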
Why RLHF is necessary
Pre-training alone produces a model that is impressive at generating text but unreliable as an assistant. It might produce harmful content, confidently fabricate facts, or give unhelpful responses. Supervised fine-tuning helps but is limited by how many examples humans can write. RLHF scales the alignment process by training a reward model that can evaluate millions of outputs without requiring a human to review each one.
How the reward model works
Human evaluators are shown the same prompt with two or more model responses and asked which is better. From thousands of these comparisons, a reward model learns to predict which outputs humans prefer. This reward model then provides the signal for reinforcement learning: the language model is optimised to generate responses that the reward model scores highly.
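A common way to train a reward model from such comparisons is a pairwise (Bradley-Terry style) loss: the model is penalised when it fails to score the human-preferred response above the rejected one. A minimal sketch, with plain floats standing in for a neural network's scores:

```python
import math

def pairwise_loss(score_preferred, score_rejected):
    """Negative log probability that the preferred response wins:
    -log sigmoid(r_preferred - r_rejected)."""
    diff = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# The loss shrinks as the reward model ranks the preferred response
# further above the rejected one.
assert pairwise_loss(2.0, 0.0) < pairwise_loss(0.5, 0.0)
```

Note that the loss depends only on the *difference* between the two scores, which matches how the data is collected: evaluators give relative rankings, not absolute quality ratings.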
Limitations and challenges
- Reward hacking: The model may learn to produce outputs that score well on the reward model without genuinely being better: for example, writing longer responses because evaluators tend to prefer them.
- Evaluator disagreement: Different humans have different preferences, and controversial topics may have no clear "better" answer.
- Constitutional AI: To reduce reliance on human evaluation, Anthropic (the company behind Claude) developed an alternative approach called Constitutional AI, which uses a set of written principles to guide the model's behaviour.
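One widely used mitigation for the reward-hacking problem above is to penalise the policy for drifting too far from the supervised fine-tuned reference model during reinforcement learning. A hedged sketch of that shaped objective, with the per-response KL term approximated from token log-probabilities (the function name and values are illustrative, not from any specific library):

```python
def shaped_reward(reward_score, policy_logprobs, reference_logprobs, beta=0.1):
    """Subtract a KL-style penalty from the raw reward model score:
    roughly reward(x, y) - beta * KL(policy || reference)."""
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward_score - beta * kl_estimate

# If the policy assigns its own tokens much higher log-probability than
# the reference model does, the penalty eats into the raw reward.
drifted = shaped_reward(1.0, [-0.1, -0.2], [-1.0, -1.5], beta=0.1)
close = shaped_reward(1.0, [-1.0, -1.5], [-1.0, -1.5], beta=0.1)
assert close > drifted
```

The coefficient `beta` trades off reward against faithfulness to the reference model: too low and the policy can exploit the reward model, too high and it barely improves.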
The impact on AI behaviour
RLHF is the reason modern AI assistants are polite, refuse harmful requests, acknowledge uncertainty, and try to be helpful. It is arguably the most consequential training technique in making AI safe and useful for widespread deployment.
Why This Matters
RLHF explains why different AI models behave differently despite similar underlying technology. Understanding this technique helps you evaluate why certain models are more appropriate for business use, why AI companies emphasise safety and alignment, and why model behaviour can shift between versions.
Continue learning in Advanced
This topic is covered in our lesson: How AI Models Are Trained and Aligned