Model Alignment
The process of training an AI model to behave in accordance with human values, intentions, and safety requirements.
Model alignment is the process of ensuring that an AI system's behaviour matches what its creators and users intend: that it is helpful, harmless, and honest rather than producing outputs that are dangerous, deceptive, or contrary to human values.
The alignment problem
A base language model trained only on next-token prediction has no inherent goal to be helpful or safe. It has learned to produce statistically likely text continuations, which might include harmful instructions, biased content, or manipulative language, because all of these exist in its training data. Alignment steers the model away from harmful behaviour and toward helpful behaviour.
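The point about next-token prediction can be made concrete with a toy sketch. The "model" below is just a hand-written table of scores turned into probabilities with a softmax; the token names and numbers are illustrative assumptions, not real model outputs. The key observation is that the base objective ranks continuations purely by likelihood, so a potentially harmful continuation competes on equal footing with benign ones.

```python
import math

# Toy "base model": raw scores (logits) for the next token after an
# ambiguous prompt such as "how to". All tokens and values here are
# invented for illustration.
next_token_logits = {
    "cook": 2.0,
    "build": 1.5,
    "hack": 1.2,   # statistically plausible but potentially harmful
    "learn": 1.0,
}

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    z = sum(math.exp(v) for v in logits.values())
    return {tok: math.exp(v) / z for tok, v in logits.items()}

probs = softmax(next_token_logits)

# Nothing in the pretraining objective down-weights the harmful option;
# it gets probability mass simply because such text exists in the data.
for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{tok}: {p:.3f}")
```

Alignment techniques, described next, are what reshape this distribution so that harmful continuations become unlikely in practice.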
How alignment is achieved
- Supervised fine-tuning (SFT): The model is trained on curated examples of ideal assistant behaviour: helpful, safe, well-structured responses to a wide range of queries.
- Reinforcement learning from human feedback (RLHF): Human raters compare pairs of model outputs and indicate which is better. A reward model is trained on these preferences, and the language model is optimised to produce outputs the reward model scores highly.
- Constitutional AI (CAI): The model is given a set of principles (a "constitution") and trained to evaluate and revise its own outputs against these principles, reducing the need for extensive human feedback.
- Direct Preference Optimization (DPO): A simplified alternative to RLHF that directly optimises the model on preference data without training a separate reward model.
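Of the methods above, DPO is the easiest to show in a few lines, since its loss works directly on preference pairs. The sketch below is a minimal single-pair version of the published DPO objective; the numeric log-probabilities in the usage example are hypothetical, and a real implementation would compute them from the policy and reference models and average the loss over a batch.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    logp_* are total log-probabilities of the chosen/rejected responses
    under the policy being trained; ref_logp_* are the same quantities
    under the frozen reference model. beta controls how far the policy
    is allowed to drift from the reference.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): small when the policy prefers the chosen
    # response more strongly than the reference model does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical numbers: the policy already leans toward the chosen response,
# so the loss is modest; flipping the preference would increase it.
print(dpo_loss(-12.0, -15.0, -13.0, -14.0))
```

Note that no separate reward model appears anywhere: the preference signal is expressed entirely through the log-probability margins, which is the simplification DPO offers over RLHF.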
What alignment aims to achieve
- Helpfulness: The model genuinely tries to assist users with their tasks.
- Harmlessness: The model refuses to help with dangerous or unethical requests.
- Honesty: The model communicates uncertainty, avoids fabrication, and does not mislead.
- Instruction following: The model does what the user asks rather than what is merely statistically likely.
Alignment challenges
Perfect alignment is an unsolved problem. Models can be "jailbroken" with adversarial prompts. Defining "aligned" behaviour is culturally dependent. Over-alignment can make models excessively cautious and unhelpful. And as models become more capable, ensuring alignment becomes more critical and more difficult.
Why This Matters
Model alignment determines whether AI systems are trustworthy enough for real-world deployment. Understanding alignment helps you evaluate AI products, appreciate why different models behave differently, and participate in important conversations about AI governance in your organisation.
Continue learning in Practitioner
This topic is covered in our lesson: How AI Models Are Trained