DPO (Direct Preference Optimisation)
A simpler alternative to RLHF that trains AI models to align with human preferences directly from comparison data, without needing a separate reward model.
Direct Preference Optimisation (DPO) is a technique for aligning AI models with human preferences that offers a simpler, more stable alternative to RLHF (Reinforcement Learning from Human Feedback). Introduced in 2023 by Stanford researchers, DPO achieves comparable alignment quality with a much simpler and cheaper training pipeline.
The problem with RLHF
RLHF, the standard method for making AI models helpful and safe, is a complex, multi-stage process:
- Collect human comparisons (which response is better?)
- Train a reward model to predict human preferences
- Use reinforcement learning (PPO) to optimise the language model against the reward model
- Carefully balance the reward signal to avoid reward hacking
Each stage introduces potential failure modes. The reward model can be inaccurate. The reinforcement learning is unstable and sensitive to hyperparameters. The entire pipeline is expensive and requires specialised expertise.
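To make the reward-model stage concrete: reward models are typically trained on the same pairwise comparisons using a Bradley-Terry objective, which pushes the score of the preferred response above the score of the rejected one. A minimal per-example sketch (the function name and inputs are illustrative, not from any particular library):

```python
import math

def reward_model_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss for training an RLHF reward model.

    `reward_chosen` and `reward_rejected` are the scalar scores the
    reward model assigns to the preferred and rejected responses.
    -log(sigmoid(x)) is written as log(1 + exp(-x)) for clarity.
    """
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))
```

The loss is log 2 when the two scores tie and shrinks towards zero as the preferred response is scored increasingly higher; it is this learned scorer that the PPO stage then optimises against.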
How DPO simplifies this
DPO skips the reward model and reinforcement learning stages entirely. Instead, it reformulates the alignment problem as a straightforward supervised learning task:
- Collect human comparisons (same as RLHF)
- Train the language model directly on these comparisons using a modified loss function
The key insight is mathematical: DPO shows that optimising the standard RLHF objective is equivalent to a simple classification task over preference pairs. Given a pair of responses, increase the probability of the preferred one and decrease the probability of the rejected one, measured relative to a frozen copy of the starting model (the reference model), which keeps the policy from drifting too far.
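The classification view above can be sketched as a per-example loss. The inputs are summed token log-probabilities of each response under the policy being trained and under the frozen reference model; `beta` controls how strongly the policy is anchored to the reference. Names and the default `beta` are illustrative:

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss from sequence log-probabilities (a sketch).

    Each log-prob is the summed token log-probability of a full response
    given the prompt. The implicit "reward" of a response is beta times
    its log-ratio against the reference model.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # -log(sigmoid(x)) == log(1 + exp(-x)): a binary classification loss
    # that pushes the chosen reward above the rejected reward.
    return math.log1p(math.exp(-(chosen_reward - rejected_reward)))
```

Note the structural similarity to the reward-model loss in RLHF: DPO reuses the same pairwise objective, but the "reward" is computed directly from the language model's own log-probabilities, so no separate reward model or RL loop is needed.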
Advantages of DPO
- Simplicity: One training stage instead of three. No reward model to train. No reinforcement learning to stabilise.
- Stability: Standard supervised learning is far more stable than reinforcement learning, requiring less hyperparameter tuning.
- Efficiency: Less compute required because there are fewer stages and no need to generate samples during training.
- Accessibility: Smaller teams without RL expertise can implement preference alignment.
Limitations and trade-offs
- Data efficiency: DPO may require more preference data than RLHF to achieve comparable results, because RLHF's reward model can generalise from fewer examples.
- Distribution drift: DPO training can drift from the base model's distribution, sometimes producing less diverse outputs.
- Online learning: DPO in its basic form cannot easily incorporate new preference data without retraining, while RLHF's reward model can be updated incrementally.
Variants and extensions
The success of DPO has spawned many variants:
- IPO (Identity Preference Optimisation): Addresses DPO's tendency to overfit to preference data.
- ORPO: Combines instruction tuning and preference alignment in a single training step.
- KTO: Works with binary feedback (thumbs up/down) rather than pairwise comparisons.
- SimPO: Simplifies DPO further by removing the need for a reference model.
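To illustrate how one of these variants departs from DPO, here is a sketch of a SimPO-style loss, assuming the published formulation (length-normalised log-probabilities in place of reference-model log-ratios, plus a target margin `gamma`); the function name and default hyperparameters are illustrative, not official code:

```python
import math

def simpo_loss(policy_chosen_logp: float, chosen_len: int,
               policy_rejected_logp: float, rejected_len: int,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    """Reference-free SimPO-style per-example loss (a sketch).

    Replaces DPO's log-ratios against a reference model with
    length-normalised sequence log-probs, so only one model is needed
    in memory during training.
    """
    chosen_reward = beta * policy_chosen_logp / chosen_len
    rejected_reward = beta * policy_rejected_logp / rejected_len
    # Require the chosen reward to beat the rejected one by at least gamma.
    return math.log1p(math.exp(-(chosen_reward - rejected_reward - gamma)))
```

Dropping the reference model roughly halves the memory footprint of training, which is the main practical appeal of this family of variants.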
Impact on the field
DPO has democratised AI alignment. Before DPO, alignment was primarily the domain of large AI labs with reinforcement learning expertise. Now, any team with preference data and standard training infrastructure can align a model to human preferences. This has accelerated the development of open-source aligned models significantly.
Why This Matters
DPO makes it far easier for organisations to create AI models aligned with their specific values and preferences. Understanding the evolving landscape of alignment techniques helps you evaluate model quality claims and appreciate why newer open-source models can rival proprietary ones in helpfulness and safety.