DPO (Direct Preference Optimisation)
A simpler alternative to RLHF that trains AI models to align with human preferences directly from comparison data, without needing a separate reward model.
Direct Preference Optimisation (DPO) is a technique for aligning AI models with human preferences that offers a simpler, more stable alternative to RLHF (Reinforcement Learning from Human Feedback). Introduced in 2023 by Stanford researchers, DPO achieves comparable alignment quality with a much simpler and cheaper training pipeline.
The problem with RLHF
RLHF, the standard method for making AI models helpful and safe, is a complex, multi-stage process:
- Collect human comparisons (which response is better?)
- Train a reward model to predict human preferences
- Use reinforcement learning (PPO) to optimise the language model against the reward model
- Carefully balance the reward signal to avoid reward hacking
Each stage introduces potential failure modes. The reward model can be inaccurate. The reinforcement learning is unstable and sensitive to hyperparameters. The entire pipeline is expensive and requires specialised expertise.
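To make the reward-model stage concrete: reward models are typically trained on the same pairwise comparisons using a Bradley-Terry objective, which pushes the score of the preferred response above the score of the rejected one. A minimal per-example sketch (the function name and inputs are illustrative, not from any particular library):

```python
import math

def reward_model_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss for training an RLHF reward model.

    `reward_chosen` and `reward_rejected` are the scalar scores the
    reward model assigns to the preferred and rejected responses.
    -log(sigmoid(x)) is written as log(1 + exp(-x)) for clarity.
    """
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))
```

The loss is log 2 when the two scores tie and shrinks towards zero as the preferred response is scored increasingly higher; it is this learned scorer that the PPO stage then optimises against.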
How DPO simplifies this
DPO skips the reward model and reinforcement learning stages entirely. Instead, it reformulates the alignment problem as a straightforward supervised learning task:
- Collect human comparisons (same as RLHF)
- Train the language model directly on these comparisons using a modified loss function
The key insight is mathematical: DPO shows that optimising the standard RLHF objective is equivalent to a simple classification task over preference pairs. Given a pair of responses, increase the probability of the preferred one and decrease the probability of the rejected one, measured relative to a frozen copy of the starting model (the reference model), which keeps the policy from drifting too far.
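The classification view above can be sketched as a per-example loss. The inputs are summed token log-probabilities of each response under the policy being trained and under the frozen reference model; `beta` controls how strongly the policy is anchored to the reference. Names and the default `beta` are illustrative:

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss from sequence log-probabilities (a sketch).

    Each log-prob is the summed token log-probability of a full response
    given the prompt. The implicit "reward" of a response is beta times
    its log-ratio against the reference model.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # -log(sigmoid(x)) == log(1 + exp(-x)): a binary classification loss
    # that pushes the chosen reward above the rejected reward.
    return math.log1p(math.exp(-(chosen_reward - rejected_reward)))
```

Note the structural similarity to the reward-model loss in RLHF: DPO reuses the same pairwise objective, but the "reward" is computed directly from the language model's own log-probabilities, so no separate reward model or RL loop is needed.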
Advantages of DPO
- Simplicity: One training stage instead of three. No reward model to train. No reinforcement learning to stabilise.
- Stability: Standard supervised learning is far more stable than reinforcement learning, requiring less hyperparameter tuning.
- Efficiency: Less compute required because there are fewer stages and no need to generate samples during training.
- Accessibility: Smaller teams without RL expertise can implement preference alignment.
Limitations and trade-offs
- Data efficiency: DPO may require more preference data than RLHF to achieve comparable results, because RLHF's reward model can generalise from fewer examples.
- Distribution drift: DPO training can drift from the base model's distribution, sometimes producing less diverse outputs.
- Online learning: DPO in its basic form cannot easily incorporate new preference data without retraining, while RLHF's reward model can be updated incrementally.
Variants and extensions
The success of DPO has spawned many variants:
- IPO (Identity Preference Optimisation): Addresses DPO's tendency to overfit to preference data.
- ORPO: Combines instruction tuning and preference alignment in a single training step.
- KTO: Works with binary feedback (thumbs up/down) rather than pairwise comparisons.
- SimPO: Simplifies DPO further by removing the need for a reference model.
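To illustrate how one of these variants departs from DPO, here is a sketch of a SimPO-style loss, assuming the published formulation (length-normalised log-probabilities in place of reference-model log-ratios, plus a target margin `gamma`); the function name and default hyperparameters are illustrative, not official code:

```python
import math

def simpo_loss(policy_chosen_logp: float, chosen_len: int,
               policy_rejected_logp: float, rejected_len: int,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    """Reference-free SimPO-style per-example loss (a sketch).

    Replaces DPO's log-ratios against a reference model with
    length-normalised sequence log-probs, so only one model is needed
    in memory during training.
    """
    chosen_reward = beta * policy_chosen_logp / chosen_len
    rejected_reward = beta * policy_rejected_logp / rejected_len
    # Require the chosen reward to beat the rejected one by at least gamma.
    return math.log1p(math.exp(-(chosen_reward - rejected_reward - gamma)))
```

Dropping the reference model roughly halves the memory footprint of training, which is the main practical appeal of this family of variants.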
Impact on the field
DPO has democratised AI alignment. Before DPO, alignment was primarily the domain of large AI labs with reinforcement learning expertise. Now, any team with preference data and standard training infrastructure can align a model to human preferences. This has accelerated the development of open-source aligned models significantly.
Why This Matters
DPO makes it far easier for organisations to create AI models aligned with their specific values and preferences. Understanding the evolving landscape of alignment techniques helps you evaluate model quality claims and appreciate why newer open-source models can rival proprietary ones in helpfulness and safety.