Constitutional AI (CAI)
An AI safety technique developed by Anthropic where the model is trained to follow a set of principles (a 'constitution') to self-correct harmful or unhelpful outputs.
Constitutional AI (CAI) is an approach to AI safety developed by Anthropic, the company behind Claude. Instead of relying entirely on human reviewers to flag harmful outputs, CAI gives the model a set of written principles (its "constitution") and trains it to critique and revise its own responses according to those principles.
The problem CAI solves
Traditional AI safety relies heavily on reinforcement learning from human feedback (RLHF), where human reviewers rate model outputs as helpful or harmful. This approach has limitations:
- Scale: Hiring enough human reviewers to evaluate millions of model outputs is expensive and slow.
- Consistency: Different reviewers may disagree about what counts as harmful.
- Coverage: Reviewers cannot anticipate every harmful scenario in advance.
- Transparency: The criteria used to judge outputs are implicit in the reviewers' judgements rather than explicitly stated.
How Constitutional AI works
The CAI process has two phases:
- Supervised learning phase: The model generates responses, then is asked to critique its own output against each principle in the constitution. It then revises its response based on its own critique. These revised responses become training data.
- Reinforcement learning phase: Instead of using human preferences to train the reward model, CAI uses the model's own judgements about which responses better satisfy the constitution. This is called RLAIF, reinforcement learning from AI feedback.
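The supervised phase above can be sketched as a critique-and-revise loop. This is a minimal illustration only: the `model` function is a stub standing in for real LLM calls, the prompt wording is hypothetical, and a production pipeline would batch these calls and filter the results.

```python
# Sketch of the CAI supervised phase: generate, critique, revise.
# Each principle in the constitution drives one critique/revision pass.

CONSTITUTION = [
    "Choose the response that is least likely to cause harm.",
    "Choose the response that is most honest.",
]

def model(prompt: str) -> str:
    # Stub standing in for an LLM call; returns canned text so the
    # sketch is runnable. A real system would query the model here.
    if "Critique" in prompt:
        return "The draft could be more cautious."
    if "Revise" in prompt:
        return "Revised, safer response."
    return "Initial draft response."

def critique_and_revise(user_prompt: str) -> dict:
    draft = model(user_prompt)
    for principle in CONSTITUTION:
        critique = model(
            f"Critique this response against the principle "
            f"'{principle}':\n{draft}"
        )
        draft = model(
            f"Revise the response to address this critique:\n"
            f"{critique}\n\nResponse:\n{draft}"
        )
    # The (prompt, final revision) pair becomes supervised training data.
    return {"prompt": user_prompt, "response": draft}

example = critique_and_revise("How do I pick a strong password?")
```

The key design point is that the same model produces, critiques, and revises the output; no human labels are needed to build the supervised dataset.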
What goes in the constitution
The constitution is a set of explicit principles, for example:
- Choose the response that is most helpful to the human
- Choose the response that is least likely to cause harm
- Choose the response that is most honest and does not present false information as fact
- Choose the response that best respects human autonomy and dignity
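In the RLAIF phase, principles like these are applied by an AI judge that compares two candidate responses. The sketch below is illustrative only: the judge is stubbed with a keyword heuristic, and the function names and data layout are assumptions, not Anthropic's implementation.

```python
# Sketch of RLAIF preference labeling: an AI judge decides which of two
# responses better satisfies each constitutional principle, and the
# resulting (chosen, rejected) pairs train the reward model.

def ai_judge(principle: str, response_a: str, response_b: str) -> str:
    # Stub: a real judge would be an LLM prompted with the principle.
    # Here we just penalize flagged words for the sake of a runnable demo.
    flagged = ("dangerous", "illegal")
    score = lambda r: -sum(word in r.lower() for word in flagged)
    return "A" if score(response_a) >= score(response_b) else "B"

def label_preference(prompt: str, a: str, b: str, principles: list) -> dict:
    votes = [ai_judge(p, a, b) for p in principles]
    preferred = "A" if votes.count("A") >= votes.count("B") else "B"
    chosen, rejected = (a, b) if preferred == "A" else (b, a)
    # These triples replace human preference labels in reward training.
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

pair = label_preference(
    "How can I get into a locked car?",
    "Call a licensed locksmith or roadside assistance.",
    "Here is a dangerous method...",
    ["Choose the response that is least likely to cause harm."],
)
```

Because the judgements come from the model itself, this step scales with compute rather than with the size of a human review team.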
Advantages of the constitutional approach
- Transparency: The principles are written down and can be reviewed, debated, and updated.
- Scalability: AI-generated feedback scales far more easily than human feedback.
- Consistency: The same principles are applied uniformly across all evaluations.
- Iterability: Updating the constitution is simpler than retraining human reviewers.
Limitations
CAI is not a complete solution. The model's ability to apply the constitution depends on its understanding of the principles, which may be imperfect. The constitution itself may contain gaps or conflicts. And there remains a need for human oversight to verify that the system is working as intended.
Why this matters beyond Anthropic
The constitutional approach influenced the broader AI safety field by demonstrating that AI models can meaningfully participate in their own alignment process. It moved the conversation from "how do we control AI externally" to "how do we build AI that wants to be safe."
Why This Matters
Constitutional AI represents one of the most practical approaches to making AI systems safer at scale. Understanding it helps you evaluate the safety claims of different AI providers and appreciate why some models handle sensitive topics more carefully than others.