Model Alignment
The process of training an AI model to behave in accordance with human values, intentions, and safety requirements.
Model alignment is the process of ensuring that an AI system's behaviour matches what its creators and users intend: that it is helpful, harmless, and honest rather than producing outputs that are dangerous, deceptive, or contrary to human values.
The alignment problem
A base language model trained only on next-token prediction has no inherent goal to be helpful or safe. It has learned to produce statistically likely text continuations, which might include harmful instructions, biased content, or manipulative language, because all of these exist in its training data. Alignment steers the model away from harmful behaviour and toward helpful behaviour.
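The point about next-token prediction can be made concrete with a toy sketch. The "model" below is just a hand-written table of scores turned into probabilities with a softmax; the token names and numbers are illustrative assumptions, not real model outputs. The key observation is that the base objective ranks continuations purely by likelihood, so a potentially harmful continuation competes on equal footing with benign ones.

```python
import math

# Toy "base model": raw scores (logits) for the next token after an
# ambiguous prompt such as "how to". All tokens and values here are
# invented for illustration.
next_token_logits = {
    "cook": 2.0,
    "build": 1.5,
    "hack": 1.2,   # statistically plausible but potentially harmful
    "learn": 1.0,
}

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    z = sum(math.exp(v) for v in logits.values())
    return {tok: math.exp(v) / z for tok, v in logits.items()}

probs = softmax(next_token_logits)

# Nothing in the pretraining objective down-weights the harmful option;
# it gets probability mass simply because such text exists in the data.
for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{tok}: {p:.3f}")
```

Alignment techniques, described next, are what reshape this distribution so that harmful continuations become unlikely in practice.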
How alignment is achieved
- Supervised fine-tuning (SFT): The model is trained on curated examples of ideal assistant behaviour: helpful, safe, well-structured responses to a wide range of queries.
- Reinforcement learning from human feedback (RLHF): Human raters compare pairs of model outputs and indicate which is better. A reward model is trained on these preferences, and the language model is optimised to produce outputs the reward model scores highly.
- Constitutional AI (CAI): The model is given a set of principles (a "constitution") and trained to evaluate and revise its own outputs against these principles, reducing the need for extensive human feedback.
- Direct Preference Optimization (DPO): A simplified alternative to RLHF that directly optimises the model on preference data without training a separate reward model.
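Of the methods above, DPO is the easiest to show in a few lines, since its loss works directly on preference pairs. The sketch below is a minimal single-pair version of the published DPO objective; the numeric log-probabilities in the usage example are hypothetical, and a real implementation would compute them from the policy and reference models and average the loss over a batch.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    logp_* are total log-probabilities of the chosen/rejected responses
    under the policy being trained; ref_logp_* are the same quantities
    under the frozen reference model. beta controls how far the policy
    is allowed to drift from the reference.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): small when the policy prefers the chosen
    # response more strongly than the reference model does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical numbers: the policy already leans toward the chosen response,
# so the loss is modest; flipping the preference would increase it.
print(dpo_loss(-12.0, -15.0, -13.0, -14.0))
```

Note that no separate reward model appears anywhere: the preference signal is expressed entirely through the log-probability margins, which is the simplification DPO offers over RLHF.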
What alignment aims to achieve
- Helpfulness: The model genuinely tries to assist users with their tasks.
- Harmlessness: The model refuses to help with dangerous or unethical requests.
- Honesty: The model communicates uncertainty, avoids fabrication, and does not mislead.
- Instruction following: The model does what the user asks rather than what is merely statistically likely.
Alignment challenges
Perfect alignment is an unsolved problem. Models can be "jailbroken" with adversarial prompts. Defining "aligned" behaviour is culturally dependent. Over-alignment can make models excessively cautious and unhelpful. And as models become more capable, ensuring alignment becomes more critical and more difficult.
Why This Matters
Model alignment determines whether AI systems are trustworthy enough for real-world deployment. Understanding alignment helps you evaluate AI products, appreciate why different models behave differently, and participate in important conversations about AI governance in your organisation.
Continue learning in Practitioner
This topic is covered in our lesson: How AI Models Are Trained