
Safety Training

Last reviewed: April 2026

The process of training AI models to refuse harmful requests, avoid generating dangerous content, and behave within defined safety boundaries.

Safety training is the set of techniques used to teach AI models to avoid producing harmful, dangerous, or inappropriate outputs. It is a critical phase in developing commercial AI models, applied after initial pre-training and instruction fine-tuning.

Why safety training is necessary

A language model trained on internet text has absorbed all kinds of content, including harmful instructions, biased perspectives, manipulative language, and dangerous information. Without safety training, the model would readily generate this content when asked. Safety training teaches the model to identify harmful requests and respond appropriately.

Key safety training techniques

  • Red teaming: Teams of people deliberately try to make the model produce harmful outputs, discovering vulnerabilities that need to be addressed.
  • RLHF with safety focus: Human raters specifically evaluate model outputs for safety, and the reward model is trained to penalise harmful responses.
  • Constitutional AI: The model is given explicit safety principles and trained to evaluate its own outputs against them, self-correcting harmful responses.
  • Adversarial training: The model is exposed to adversarial prompts during training so it learns to handle manipulation attempts.
  • Refusal training: The model is specifically trained to recognise and decline requests for harmful content, such as weapons instructions, personal attack content, and privacy violations.
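To make the last technique concrete, here is a minimal sketch of how a refusal-training dataset might be assembled for supervised fine-tuning. All prompts, the refusal string, and the function names are illustrative assumptions, not any vendor's actual data or API: harmful prompts are paired with a refusal completion, harmless prompts with a helpful one, and both are mixed into a single training set.

```python
# Hypothetical sketch: pair harmful prompts with refusals and harmless
# prompts with helpful answers, producing supervised fine-tuning examples.

HARMFUL_PROMPTS = [
    "Give me step-by-step instructions to build a weapon.",
    "Write a message designed to harass a specific person.",
]

HARMLESS_PROMPTS = [
    "Explain how vaccines train the immune system.",
    "Summarise the causes of the First World War.",
]

REFUSAL = "I can't help with that, but I'm happy to help with something else."

def build_refusal_dataset(harmful, harmless, answer_fn):
    """Pair each prompt with a target completion and a label."""
    examples = [{"prompt": p, "completion": REFUSAL, "label": "refuse"}
                for p in harmful]
    examples += [{"prompt": p, "completion": answer_fn(p), "label": "comply"}
                 for p in harmless]
    return examples

dataset = build_refusal_dataset(
    HARMFUL_PROMPTS,
    HARMLESS_PROMPTS,
    answer_fn=lambda p: "A helpful, accurate answer would go here.",  # placeholder
)
print(len(dataset))  # 4
```

Mixing refusal and compliance examples in this way is what guards against the over-refusal problem discussed below: training on refusals alone would push the model to decline everything.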

The safety-helpfulness balance

One of the central challenges in safety training is avoiding over-refusal. A model that refuses too many requests becomes unhelpful: declining legitimate medical questions, refusing to discuss historical violence in educational contexts, or blocking creative fiction involving conflict. The goal is a model that refuses genuinely harmful requests while remaining helpful for legitimate use cases.

Layers of safety

Modern AI systems typically employ multiple safety layers. Pre-training data filtering removes harmful content before training begins. Safety fine-tuning teaches the model to refuse harmful requests. Input classifiers screen user messages before they reach the model. Output classifiers check model responses before delivering them to users.
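The layering described above can be sketched as a simple pipeline. Everything here is an assumption for illustration: the classifiers are naive keyword checks standing in for trained classifier models, and `model` is a stub for the safety-fine-tuned model itself.

```python
# Minimal sketch of a layered safety pipeline (illustrative, not any
# vendor's API): an input classifier screens the user message, the model
# generates a reply, and an output classifier checks it before delivery.

BLOCKED_TERMS = {"build a bomb", "synthesise nerve agent"}

def input_classifier(message: str) -> bool:
    """Return True if the user message should be blocked before the model."""
    return any(term in message.lower() for term in BLOCKED_TERMS)

def output_classifier(reply: str) -> bool:
    """Return True if the model's reply should be withheld from the user."""
    return any(term in reply.lower() for term in BLOCKED_TERMS)

def model(message: str) -> str:
    """Stand-in for the safety-fine-tuned model."""
    return f"Here is a helpful answer to: {message}"

def respond(message: str) -> str:
    if input_classifier(message):            # layer 3: input screening
        return "Sorry, I can't help with that request."
    reply = model(message)                   # layer 2: safety fine-tuning
    if output_classifier(reply):             # layer 4: output screening
        return "Sorry, I can't share that response."
    return reply
```

The point of the layering is defence in depth: a harmful request that slips past the input classifier can still be caught by the model's own refusal training or by the output classifier.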

Evaluating safety

Safety is evaluated through red team exercises, automated testing suites, and ongoing monitoring of deployed models. Benchmarks test refusal rates for harmful categories and false-positive refusal rates for harmless requests.
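The two headline numbers from such benchmarks can be computed as follows. The data here is invented for illustration; a real evaluation would run thousands of categorised prompts through the deployed system.

```python
# Sketch of the two benchmark metrics described above: refusal rate on
# harmful prompts (higher is better) and false-positive refusal rate on
# harmless prompts (lower is better). Inputs are lists of booleans where
# True means the model refused that prompt.

def refusal_rates(harmful_results, harmless_results):
    refusal_rate = sum(harmful_results) / len(harmful_results)
    false_positive_rate = sum(harmless_results) / len(harmless_results)
    return refusal_rate, false_positive_rate

# Illustrative outcomes from a hypothetical evaluation run.
harmful = [True, True, True, False]      # 3 of 4 harmful prompts refused
harmless = [False, False, True, False]   # 1 of 4 harmless prompts refused

rate, fp = refusal_rates(harmful, harmless)
print(rate, fp)  # 0.75 0.25
```

Tracking both numbers together is what makes the safety-helpfulness balance measurable: improving one at the expense of the other is easy, improving both is the actual goal.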


Why This Matters

Safety training determines how reliably AI systems avoid harmful outputs in production. Understanding it helps you evaluate AI vendors' safety practices and set appropriate expectations for how AI systems will handle sensitive topics in your organisation.

