
Safety Training

Last reviewed: April 2026

The process of training AI models to refuse harmful requests, avoid generating dangerous content, and behave within defined safety boundaries.

Safety training is the set of techniques used to teach AI models to avoid producing harmful, dangerous, or inappropriate outputs. It is a critical phase in developing commercial AI models, applied after initial pre-training and instruction fine-tuning.

Why safety training is necessary

A language model trained on internet text has absorbed all kinds of content, including harmful instructions, biased perspectives, manipulative language, and dangerous information. Without safety training, the model would readily generate this content when asked. Safety training teaches the model to identify harmful requests and respond appropriately.

Key safety training techniques

  • Red teaming: Teams of people deliberately try to make the model produce harmful outputs, discovering vulnerabilities that need to be addressed.
  • RLHF with safety focus: Human raters specifically evaluate model outputs for safety, and the reward model is trained to penalise harmful responses.
  • Constitutional AI: The model is given explicit safety principles and trained to evaluate its own outputs against them, self-correcting harmful responses.
  • Adversarial training: The model is exposed to adversarial prompts during training so it learns to handle manipulation attempts.
  • Refusal training: The model is specifically trained to recognise and decline requests for harmful content, such as weapons instructions, personal attack content, and privacy violations.
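To make the last technique concrete, here is a minimal sketch of how a refusal-training dataset might be assembled for supervised fine-tuning. All prompts, the refusal string, and the function names are illustrative assumptions, not any vendor's actual data or API: harmful prompts are paired with a refusal completion, harmless prompts with a helpful one, and both are mixed into a single training set.

```python
# Hypothetical sketch: pair harmful prompts with refusals and harmless
# prompts with helpful answers, producing supervised fine-tuning examples.

HARMFUL_PROMPTS = [
    "Give me step-by-step instructions to build a weapon.",
    "Write a message designed to harass a specific person.",
]

HARMLESS_PROMPTS = [
    "Explain how vaccines train the immune system.",
    "Summarise the causes of the First World War.",
]

REFUSAL = "I can't help with that, but I'm happy to help with something else."

def build_refusal_dataset(harmful, harmless, answer_fn):
    """Pair each prompt with a target completion and a label."""
    examples = [{"prompt": p, "completion": REFUSAL, "label": "refuse"}
                for p in harmful]
    examples += [{"prompt": p, "completion": answer_fn(p), "label": "comply"}
                 for p in harmless]
    return examples

dataset = build_refusal_dataset(
    HARMFUL_PROMPTS,
    HARMLESS_PROMPTS,
    answer_fn=lambda p: "A helpful, accurate answer would go here.",  # placeholder
)
print(len(dataset))  # 4
```

Mixing refusal and compliance examples in this way is what guards against the over-refusal problem discussed below: training on refusals alone would push the model to decline everything.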

The safety-helpfulness balance

One of the central challenges in safety training is avoiding over-refusal. A model that refuses too many requests becomes unhelpful: declining legitimate medical questions, refusing to discuss historical violence in educational contexts, or blocking creative fiction involving conflict. The goal is a model that refuses genuinely harmful requests while remaining helpful for legitimate use cases.

Layers of safety

Modern AI systems typically employ multiple safety layers. Pre-training data filtering removes harmful content before training begins. Safety fine-tuning teaches the model to refuse harmful requests. Input classifiers screen user messages before they reach the model. Output classifiers check model responses before delivering them to users.
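The layering described above can be sketched as a simple pipeline. Everything here is an assumption for illustration: the classifiers are naive keyword checks standing in for trained classifier models, and `model` is a stub for the safety-fine-tuned model itself.

```python
# Minimal sketch of a layered safety pipeline (illustrative, not any
# vendor's API): an input classifier screens the user message, the model
# generates a reply, and an output classifier checks it before delivery.

BLOCKED_TERMS = {"build a bomb", "synthesise nerve agent"}

def input_classifier(message: str) -> bool:
    """Return True if the user message should be blocked before the model."""
    return any(term in message.lower() for term in BLOCKED_TERMS)

def output_classifier(reply: str) -> bool:
    """Return True if the model's reply should be withheld from the user."""
    return any(term in reply.lower() for term in BLOCKED_TERMS)

def model(message: str) -> str:
    """Stand-in for the safety-fine-tuned model."""
    return f"Here is a helpful answer to: {message}"

def respond(message: str) -> str:
    if input_classifier(message):            # layer 3: input screening
        return "Sorry, I can't help with that request."
    reply = model(message)                   # layer 2: safety fine-tuning
    if output_classifier(reply):             # layer 4: output screening
        return "Sorry, I can't share that response."
    return reply
```

The point of the layering is defence in depth: a harmful request that slips past the input classifier can still be caught by the model's own refusal training or by the output classifier.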

Evaluating safety

Safety is evaluated through red team exercises, automated testing suites, and ongoing monitoring of deployed models. Benchmarks test refusal rates for harmful categories and false-positive refusal rates for harmless requests.
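The two headline numbers from such benchmarks can be computed as follows. The data here is invented for illustration; a real evaluation would run thousands of categorised prompts through the deployed system.

```python
# Sketch of the two benchmark metrics described above: refusal rate on
# harmful prompts (higher is better) and false-positive refusal rate on
# harmless prompts (lower is better). Inputs are lists of booleans where
# True means the model refused that prompt.

def refusal_rates(harmful_results, harmless_results):
    refusal_rate = sum(harmful_results) / len(harmful_results)
    false_positive_rate = sum(harmless_results) / len(harmless_results)
    return refusal_rate, false_positive_rate

# Illustrative outcomes from a hypothetical evaluation run.
harmful = [True, True, True, False]      # 3 of 4 harmful prompts refused
harmless = [False, False, True, False]   # 1 of 4 harmless prompts refused

rate, fp = refusal_rates(harmful, harmless)
print(rate, fp)  # 0.75 0.25
```

Tracking both numbers together is what makes the safety-helpfulness balance measurable: improving one at the expense of the other is easy, improving both is the actual goal.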


Why This Matters

Safety training determines how reliably AI systems avoid harmful outputs in production. Understanding it helps you evaluate AI vendors' safety practices and set appropriate expectations for how AI systems will handle sensitive topics in your organisation.

