Reward Model


Last reviewed: April 2026

An AI model trained to predict human preferences, used to guide the training of language models toward producing outputs that humans rate as helpful and safe.

A reward model is a machine learning model that scores AI outputs based on predicted human preferences. It serves as an automated proxy for human judgement during the reinforcement learning phase of language model training.

The role in RLHF

Reinforcement learning from human feedback (RLHF) is the process that transforms a base language model into a helpful assistant. It requires a signal indicating which outputs are "good" and which are "bad." Collecting this signal from humans for every training example would be prohibitively expensive. The reward model solves this by learning to predict how humans would rate any given output, enabling millions of training iterations with automated feedback.

How reward models are trained

Human annotators are shown pairs of model outputs for the same prompt and asked to indicate which response they prefer. These preference pairs become the training data for the reward model. The reward model learns to assign higher scores to outputs that humans tend to prefer and lower scores to outputs they reject.

The training dataset typically includes thousands to millions of comparison pairs covering diverse topics, difficulty levels, and response styles.
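Concretely, reward models are commonly trained with a pairwise (Bradley-Terry) objective: the model should assign a higher score to the response humans preferred, and the loss is -log(sigmoid(score_chosen - score_rejected)). The sketch below is a minimal illustration of that objective using a toy linear reward model over made-up numeric features, not real text or a neural network:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward(w, features):
    # Toy linear reward model: a weighted sum of response features.
    return sum(wi * fi for wi, fi in zip(w, features))

# Hypothetical preference pairs: (features of the preferred response,
# features of the rejected response) for the same prompt.
pairs = [
    ([1.0, 0.2], [0.3, 0.9]),
    ([0.8, 0.1], [0.2, 0.7]),
    ([0.9, 0.3], [0.4, 0.8]),
]

w = [0.0, 0.0]
lr = 0.5
for _ in range(200):
    for chosen, rejected in pairs:
        margin = reward(w, chosen) - reward(w, rejected)
        # Gradient of -log(sigmoid(margin)) with respect to the margin.
        g = sigmoid(margin) - 1.0
        for i in range(len(w)):
            w[i] -= lr * g * (chosen[i] - rejected[i])

# After training, the preferred response in every pair scores higher.
for chosen, rejected in pairs:
    assert reward(w, chosen) > reward(w, rejected)
```

In a real system the linear function is replaced by a large neural network (often initialised from the language model itself), but the pairwise loss has the same shape.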

How reward models are used

Once trained, the reward model provides the optimization signal for reinforcement learning. The language model generates a response, the reward model scores it, and the RL algorithm adjusts the language model to produce responses that receive higher scores. This loop runs for many iterations, gradually improving the language model's outputs.
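The loop above can be approximated in miniature with best-of-n sampling, a simplified stand-in for full reinforcement learning: instead of updating the model's weights, we generate several candidate responses and keep the one the reward model scores highest. The scoring heuristic and candidate strings below are hypothetical placeholders for a learned reward model and real model outputs:

```python
def reward_model(response: str) -> float:
    # Hypothetical scoring rule: reward distinct informative words,
    # lightly penalise overall length.
    words = response.split()
    return len(set(words)) - 0.05 * len(words)

# Candidate responses to the prompt "What is the capital of France?"
candidates = [
    "Paris.",
    "The capital of France is Paris.",
    "The capital of France is Paris. Paris. Paris. Paris. Paris.",
]

# Keep the candidate the reward model prefers.
best = max(candidates, key=reward_model)
print(best)  # "The capital of France is Paris."
```

Real RLHF goes further: an RL algorithm such as PPO uses these scores to adjust the model's weights so that high-scoring responses become more likely in the first place.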

Challenges with reward models

  • Reward hacking: The language model may find unexpected ways to achieve high reward scores without actually being more helpful, for example producing verbose responses if the reward model slightly favors length.
  • Preference inconsistency: Different humans have different preferences, and the reward model must navigate these disagreements.
  • Distribution shift: The reward model was trained on outputs from an earlier version of the language model. As the language model improves, it may generate outputs the reward model has not seen, leading to unreliable scores.
  • Proxy alignment: The reward model captures an approximation of human values, not the values themselves. Over-optimizing against this proxy can produce unintended behaviours.
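The reward-hacking failure mode in the first bullet can be shown with a toy proxy reward that slightly favors length. Both the reward function and the responses here are hypothetical illustrations:

```python
def proxy_reward(response: str) -> float:
    # Hypothetical proxy: 1 point for containing the right answer,
    # plus a small bonus per word (an accidental length bias).
    correctness = 1.0 if "Paris" in response else 0.0
    return correctness + 0.01 * len(response.split())

concise = "The capital of France is Paris."
padded = concise + " To elaborate further, " + "indeed " * 50 + "Paris."

# Both answers are equally correct, but the padded one wins the proxy:
# optimisation has exploited the length bias rather than improved quality.
assert proxy_reward(padded) > proxy_reward(concise)
```

Optimising hard against such a proxy teaches the model to pad its answers, which is exactly why over-optimisation against the reward model must be constrained in practice.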

Why This Matters

Reward models are a key component in making AI assistants helpful and safe. Understanding how they work, and their limitations, helps you appreciate why different AI models behave differently and why alignment is an ongoing challenge rather than a solved problem.
