Perplexity (Metric)
A measurement of how well a language model predicts text: lower perplexity means the model is less surprised by the data and generally more capable.
Perplexity is a standard metric for evaluating language models. It measures how "surprised" a model is by a sequence of text. A model with lower perplexity is better at predicting what comes next in a text, which generally means it has a stronger understanding of language.
The intuition behind perplexity
Imagine you are playing a word-guessing game. After reading "The cat sat on the ___", you would confidently predict "mat" or "chair." You would be surprised by "quantum." Perplexity quantifies this surprise across an entire test dataset. A model that consistently makes good predictions has low perplexity.
The maths (simplified)
Perplexity is calculated as the exponential of the average negative log-likelihood. In simpler terms:
- For each word in the test text, the model assigns a probability to the correct next word.
- These probabilities are averaged across all words.
- The result is converted to a perplexity score.
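The steps above correspond to the standard formula, written here for a sequence of N words (or tokens), where each probability is conditioned on everything that came before:

```latex
\mathrm{PPL}(w_1, \dots, w_N) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(w_i \mid w_{<i})\right)
```

Note the averaging happens in log space, which is why the list says probabilities are "averaged": it is the geometric mean of the probabilities, inverted.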
A perplexity of 1 would mean the model perfectly predicts every word (impossible in practice). A perplexity of 100 means the model is, on average, as uncertain as if it were choosing between 100 equally likely options.
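The calculation can be sketched in a few lines. This is a minimal illustration, assuming you already have the probability the model assigned to each correct next token; the numbers below are made up for the example:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability
    the model assigned to each correct next token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that predicts every token with probability 1 is never surprised:
print(perplexity([1.0, 1.0, 1.0]))   # 1.0

# Uniform uncertainty over 100 equally likely options:
print(perplexity([0.01] * 5))        # ≈ 100.0
```

The second call illustrates the claim above: assigning probability 1/100 to every correct token yields a perplexity of 100, as if the model were choosing between 100 equally likely options at each step.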
What perplexity tells you
- Lower is better: A model with perplexity 15 on a test set is better at language modelling than one with perplexity 30.
- Relative comparisons: Perplexity is most useful for comparing models evaluated on the same test set.
- Not a complete picture: Low perplexity does not guarantee the model will be good at following instructions, reasoning, or generating useful content.
Limitations of perplexity
- Dataset dependent: Perplexity on one dataset may not reflect performance on another. A model with low perplexity on news articles might score poorly on code.
- Not a quality metric: A model can have low perplexity (predicts well) but still produce unhelpful or harmful responses.
- Not comparable across tokenisers: Perplexity is computed per token, so models whose tokenisers split the same text into different numbers of tokens cannot be compared directly.
- Gaming risk: Optimising solely for low perplexity can lead to models that are statistically accurate but not practically useful.
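On the tokeniser point above: when a cross-tokeniser comparison is unavoidable, one common workaround is to normalise by a tokeniser-independent unit such as bytes of text, giving bits per byte. A minimal sketch, where the total negative log-likelihood and the byte count are assumed to come from your own evaluation setup:

```python
import math

def bits_per_byte(total_nll_nats, byte_count):
    """Convert a corpus-level negative log-likelihood (in nats,
    summed over all tokens) into bits per UTF-8 byte, a unit that
    does not depend on the tokeniser's vocabulary."""
    return total_nll_nats / (byte_count * math.log(2))

# Example (made-up numbers): a model scores a 1,000-byte passage
# with a total NLL of 800 nats; the token count never enters:
print(bits_per_byte(800.0, 1000))  # ≈ 1.154 bits/byte
```

Because the denominator counts bytes rather than tokens, two models with different vocabularies can be compared on the same raw text.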
Perplexity in practice
When reading AI research papers or model evaluations, perplexity is one of several metrics you will encounter. It is a starting point, not a final verdict. For practical model selection, real-world task evaluations and human preference ratings are more informative.
Why This Matters
Perplexity appears frequently in AI research and model comparisons. Understanding it helps you read model evaluation reports and technical announcements without confusion. However, knowing its limitations is equally important: do not select a model based on perplexity alone.