Perplexity (Metric)
A measurement of how well a language model predicts text: lower perplexity means the model is less surprised by the data and generally more capable.
Perplexity is a standard metric for evaluating language models. It measures how "surprised" a model is by a sequence of text. A model with lower perplexity is better at predicting what comes next in a text, which generally means it has a stronger understanding of language.
The intuition behind perplexity
Imagine you are playing a word-guessing game. After reading "The cat sat on the ___", you would confidently predict "mat" or "chair." You would be surprised by "quantum." Perplexity quantifies this surprise across an entire test dataset. A model that consistently makes good predictions has low perplexity.
The maths (simplified)
Perplexity is calculated as the exponential of the average negative log-likelihood. In simpler terms:
- For each word in the test text, the model assigns a probability to the correct next word.
- These probabilities are averaged across all words.
- The result is converted to a perplexity score.
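The steps above correspond to the standard formula, written here for a sequence of N words (or tokens), where each probability is conditioned on everything that came before:

```latex
\mathrm{PPL}(w_1, \dots, w_N) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(w_i \mid w_{<i})\right)
```

Note the averaging happens in log space, which is why the list says probabilities are "averaged": it is the geometric mean of the probabilities, inverted.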
A perplexity of 1 would mean the model perfectly predicts every word (impossible in practice). A perplexity of 100 means the model is, on average, as uncertain as if it were choosing between 100 equally likely options.
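The calculation can be sketched in a few lines. This is a minimal illustration, assuming you already have the probability the model assigned to each correct next token; the numbers below are made up for the example:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability
    the model assigned to each correct next token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that predicts every token with probability 1 is never surprised:
print(perplexity([1.0, 1.0, 1.0]))   # 1.0

# Uniform uncertainty over 100 equally likely options:
print(perplexity([0.01] * 5))        # ≈ 100.0
```

The second call illustrates the claim above: assigning probability 1/100 to every correct token yields a perplexity of 100, as if the model were choosing between 100 equally likely options at each step.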
What perplexity tells you
- Lower is better: A model with perplexity 15 on a test set is better at language modelling than one with perplexity 30.
- Relative comparisons: Perplexity is most useful for comparing models evaluated on the same test set.
- Not a complete picture: Low perplexity does not guarantee the model will be good at following instructions, reasoning, or generating useful content.
Limitations of perplexity
- Dataset dependent: Perplexity on one dataset may not reflect performance on another. A model with low perplexity on news articles might score poorly on code.
- Not a quality metric: A model can have low perplexity (predicts well) but still produce unhelpful or harmful responses.
- Not comparable across tokenisers: Perplexity is computed per token, so models whose tokenisers split the same text into different numbers of tokens cannot be compared directly.
- Gaming risk: Optimising solely for low perplexity can lead to models that are statistically accurate but not practically useful.
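On the tokeniser point above: when a cross-tokeniser comparison is unavoidable, one common workaround is to normalise by a tokeniser-independent unit such as bytes of text, giving bits per byte. A minimal sketch, where the total negative log-likelihood and the byte count are assumed to come from your own evaluation setup:

```python
import math

def bits_per_byte(total_nll_nats, byte_count):
    """Convert a corpus-level negative log-likelihood (in nats,
    summed over all tokens) into bits per UTF-8 byte, a unit that
    does not depend on the tokeniser's vocabulary."""
    return total_nll_nats / (byte_count * math.log(2))

# Example (made-up numbers): a model scores a 1,000-byte passage
# with a total NLL of 800 nats; the token count never enters:
print(bits_per_byte(800.0, 1000))  # ≈ 1.154 bits/byte
```

Because the denominator counts bytes rather than tokens, two models with different vocabularies can be compared on the same raw text.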
Perplexity in practice
When reading AI research papers or model evaluations, perplexity is one of several metrics you will encounter. It is a starting point, not a final verdict. For practical model selection, real-world task evaluations and human preference ratings are more informative.
Why This Matters
Perplexity appears frequently in AI research and model comparisons. Understanding it helps you read model evaluation reports and technical announcements without confusion. However, knowing its limitations is equally important: do not select a model based on perplexity alone.