Core AI

Adversarial Attack

Last reviewed: April 2026

A deliberate attempt to fool an AI model by crafting inputs specifically designed to cause incorrect or harmful outputs.

An adversarial attack is a technique where someone intentionally manipulates the input to an AI system to make it produce wrong, unexpected, or harmful results. These attacks exploit weaknesses in how models process information.

Why AI models are vulnerable

Machine learning models learn patterns from training data, but they do not understand the world the way humans do. A model that recognises stop signs might be fooled by a small sticker placed on the sign — imperceptible to humans but enough to change the model's classification. Language models can be tricked by carefully worded prompts that bypass safety guardrails.

Types of adversarial attacks

Evasion attacks: Modifying inputs at inference time to cause misclassification. Adding invisible perturbations to images or subtly rewording text to change a model's output.
Prompt injection: Inserting hidden instructions into text that an AI processes, hijacking its behaviour. For example, hiding "ignore previous instructions" within a document the AI is asked to summarize.
Data poisoning: Corrupting training data so the model learns incorrect patterns from the start.
Model extraction: Querying a model systematically to reconstruct its behaviour and find exploitable patterns.

Examples in practice

Researchers have shown that tiny pixel changes in images can make classifiers see a rifle as a turtle. Adversarial patches on road signs have fooled autonomous vehicle systems. Prompt injection attacks have made chatbots reveal system prompts or produce prohibited content.

Defending against adversarial attacks

Adversarial training: Including adversarial examples in the training data so the model learns to handle them.
Input validation: Filtering and sanitising inputs before they reach the model.
Safety training: Teaching language models to recognise and refuse manipulation attempts.
Red teaming: Deliberately attacking your own system to find vulnerabilities before bad actors do.

Why this is an ongoing challenge

Adversarial robustness is fundamentally difficult because the space of possible inputs is vast. Every defence creates a new surface for attackers to probe. This is why AI safety is an active research area with no permanent solution — it requires continuous vigilance.

Want to go deeper?

This topic is covered in our Advanced level. Access all 100+ lessons free.

Why This Matters

Adversarial attacks are a critical concern for any organisation deploying AI in production. Understanding these risks helps you build more robust systems, implement appropriate safeguards, and make informed decisions about where AI can be trusted in high-stakes applications.

Related Terms

Prompt Engineering

The skill of writing instructions to AI that consistently produce useful, accurate, high-quality output.

Hallucination

When AI generates confident but incorrect information. The AI is not lying — it is producing statistically plausible text that happens to be wrong.

Neural Network

A computing system loosely inspired by the human brain, made of layers of interconnected nodes that learn to recognise patterns in data.

Training Data

The dataset used to teach an AI model. The quality, size, and composition of training data directly determines what the AI can and cannot do well.

Learn More

Continue learning in Advanced

This topic is covered in our lesson: AI Safety and Risk Management

← Back to Glossary