Model Interpretability
The ability to understand and explain how an AI model arrives at its predictions or outputs.
Model interpretability is the degree to which humans can understand why an AI model produces a particular output. An interpretable model is one whose decision-making process can be examined, explained, and trusted, or corrected when it goes wrong.
The black box problem
Deep neural networks are often called "black boxes." They take inputs, perform billions of mathematical operations across millions of parameters, and produce outputs. Understanding exactly why a particular input led to a particular output is extremely difficult. This opacity creates problems in high-stakes applications where explanations are required: healthcare, finance, criminal justice, and hiring.
Levels of interpretability
- Transparent models: Simple models like decision trees and linear regression are inherently interpretable. You can trace exactly how each input feature contributes to the output.
- Post-hoc explanations: Techniques applied after a complex model makes a prediction to generate human-understandable explanations.
- Mechanistic interpretability: Research aimed at reverse-engineering what individual neurons, layers, and circuits within a neural network actually compute.
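To see why simple models count as transparent: in a linear model the prediction is just a weighted sum of the input features, so each feature's contribution can be read off directly. A minimal sketch (the feature names, weights, and bias below are invented for illustration):

```python
# Illustrative linear model: weights and bias are made-up values.
weights = {"age": 0.5, "income": 2.0, "tenure": -1.0}
bias = 1.0

def predict_with_explanation(features):
    # Each feature's contribution is simply weight * value,
    # so the "explanation" falls out of the model itself.
    contributions = {name: weights[name] * value for name, value in features.items()}
    prediction = bias + sum(contributions.values())
    return prediction, contributions

pred, contribs = predict_with_explanation({"age": 4.0, "income": 1.5, "tenure": 2.0})
# prediction = 1.0 + 2.0 + 3.0 - 2.0 = 4.0
```

No post-hoc technique is needed here: the contribution scores are exact, not approximations of the model's behaviour.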
Key interpretability techniques
- SHAP (SHapley Additive exPlanations): Assigns each input feature a contribution score for a specific prediction based on game theory.
- LIME (Local Interpretable Model-agnostic Explanations): Builds a simple, interpretable model that approximates the complex model's behaviour for a specific input.
- Attention visualization: Examining which parts of the input a transformer model attends to most when generating output.
- Probing: Training small classifiers on a model's internal representations to discover what information is encoded at different layers.
- Feature attribution: Identifying which input features most influenced a particular output.
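To make the idea behind SHAP concrete, here is a sketch that computes exact Shapley values by brute force for a toy two-feature model. Real SHAP implementations use efficient approximations; the model, input, and baseline below are invented for illustration:

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over every ordering in which features are switched from the baseline
    to their actual values. Feasible only for a handful of features."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        prev = model(current)
        for i in order:
            current[i] = x[i]          # reveal feature i
            new = model(current)
            phi[i] += new - prev       # marginal contribution in this ordering
            prev = new
    return [p / len(orders) for p in phi]

# Toy model with an interaction term (illustrative).
model = lambda v: v[0] + 2 * v[1] + v[0] * v[1]
x, baseline = [1.0, 1.0], [0.0, 0.0]
phi = shapley_values(model, x, baseline)
# Efficiency property: the attributions sum to f(x) - f(baseline).
```

The key game-theoretic guarantee shown here is efficiency: the per-feature scores always add up exactly to the difference between the model's output on the input and on the baseline.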
Interpretability for language models
For large language models, interpretability research focuses on understanding what concepts are represented in different parts of the model, why the model generates particular responses, and how to identify and correct problematic behaviours. This field, sometimes called "mechanistic interpretability", is one of the most active areas of AI safety research.
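The probing technique mentioned above can be illustrated with a toy example: fit a simple classifier on a model's hidden representations and check whether it predicts some property better than chance. Everything below is synthetic; the "hidden states" are random vectors whose first dimension weakly encodes a binary property, standing in for real layer activations:

```python
import random

random.seed(0)

# Synthetic "hidden states": 8-dimensional vectors whose first dimension
# weakly encodes a binary property (e.g. past vs. present tense).
def make_example(label):
    vec = [random.gauss(0.5 if label else -0.5, 1.0)] + \
          [random.gauss(0.0, 1.0) for _ in range(7)]
    return vec, label

train = [make_example(i % 2 == 0) for i in range(200)]
test = [make_example(i % 2 == 0) for i in range(100)]

# A minimal probe: classify by distance to per-class centroids.
def centroid(examples):
    vecs = [v for v, _ in examples]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

c_true = centroid([e for e in train if e[1]])
c_false = centroid([e for e in train if not e[1]])

def probe(vec):
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(vec, c))
    return dist(c_true) < dist(c_false)

accuracy = sum(probe(v) == y for v, y in test) / len(test)
# Accuracy well above 0.5 suggests the representation encodes the property.
```

In practice probes are trained on activations extracted from a real model, and high probe accuracy at a given layer is evidence (though not proof) that the layer represents the probed concept.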
Regulation and interpretability
Regulations like the EU AI Act increasingly require that AI systems used in high-risk applications provide explanations for their decisions. This is making interpretability not just a research interest but a business requirement.
Why This Matters
Model interpretability is critical for building trust in AI systems and meeting regulatory requirements. Understanding interpretability helps you assess where AI can be deployed responsibly and what questions to ask when an AI system makes a decision that affects people.