ROUGE Score
A set of metrics for evaluating the quality of AI-generated summaries by measuring how much they overlap with human-written reference summaries.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a family of metrics used to evaluate the quality of text generated by AI, particularly summaries. It works by measuring the overlap between the AI-generated text and one or more human-written reference texts.
How ROUGE works
ROUGE fundamentally measures how much of the reference text's content appears in the generated text. The basic approach is to count matching units (words, phrases, or sequences) between the generated and reference texts.
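As a minimal sketch of that counting step, the snippet below extracts n-grams and counts the matches between a candidate and a reference (the whitespace tokenisation and clipped-count convention are illustrative assumptions; real implementations add options such as stemming):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset of n-grams (as tuples) from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap_count(candidate, reference, n=1):
    """Count n-grams shared by candidate and reference, clipping repeats."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    return sum(min(count, ref[gram]) for gram, count in cand.items())

cand = "the cat sat on the mat".split()
ref = "the cat lay on the mat".split()
print(overlap_count(cand, ref, n=1))  # 5 shared unigrams
print(overlap_count(cand, ref, n=2))  # 3 shared bigrams
```

Dividing these match counts by the total n-grams in the reference or the candidate gives the recall and precision figures discussed below.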
ROUGE variants
- ROUGE-1: Measures the overlap of individual words (unigrams). If the reference contains 20 unique words and the generated summary shares 15 of them, ROUGE-1 recall is 75%.
- ROUGE-2: Measures the overlap of consecutive word pairs (bigrams). This captures some phrase-level similarity, not just individual words.
- ROUGE-L: Measures the longest common subsequence between the generated and reference texts. This captures sentence-level structure without requiring consecutive matches.
- ROUGE-S: Measures the overlap of skip-bigrams, pairs of words that appear in the same order but not necessarily consecutively.
Precision, recall, and F1 in ROUGE
Each ROUGE variant can be expressed as:
- Recall: What proportion of the reference text is captured by the generated text?
- Precision: What proportion of the generated text is relevant (appears in the reference)?
- F1: The harmonic mean of precision and recall, balancing both.
For summarisation, recall is often emphasised: we want the summary to capture the key content from the reference.
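Putting the pieces together, a minimal ROUGE-N scorer divides the clipped match count by the candidate length (precision) and the reference length (recall), then takes their harmonic mean (the `rouge_n` name and whitespace tokenisation are assumptions for this sketch):

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, F1) for ROUGE-N over token lists."""
    grams = lambda toks: Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    match = sum(min(c, ref[g]) for g, c in cand.items())     # clipped matches
    precision = match / max(sum(cand.values()), 1)           # share of candidate that matches
    recall = match / max(sum(ref.values()), 1)               # share of reference captured
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = rouge_n("the cat sat on the mat".split(),
                  "the cat lay quietly on the soft mat".split())
# p = 5/6, r = 5/8, f = 5/7
```

Production work would normally use an established implementation (for example the `rouge_score` Python package) rather than a hand-rolled scorer, but the arithmetic is the same.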
Limitations of ROUGE
ROUGE has significant limitations that are important to understand:
- Surface-level matching: ROUGE only measures word overlap, not meaning. Two sentences can express the same idea with completely different words, scoring poorly on ROUGE despite being semantically equivalent.
- Reference dependency: ROUGE requires human reference summaries. If the reference is poor, ROUGE scores are meaningless.
- No fluency assessment: A grammatically broken sentence can score highly on ROUGE if it contains the right words.
- Single correct answer assumption: There are many valid ways to summarise a text. ROUGE penalises valid alternatives that differ from the reference.
ROUGE in practice
Despite its limitations, ROUGE remains widely used because:
- It is fast, cheap, and reproducible, which is essential for comparing many models across many datasets.
- It correlates reasonably well with human judgements for factual summarisation tasks.
- It provides a standardised benchmark that allows comparison across research papers.
- It is a useful first filter: if ROUGE scores are very low, the model is likely performing poorly.
Beyond ROUGE
Modern evaluation increasingly supplements ROUGE with:
- BERTScore: Uses embeddings to measure semantic similarity rather than surface word overlap.
- Human evaluation: Still the gold standard for assessing quality, fluency, and factual accuracy.
- LLM-as-judge: Using a powerful language model to evaluate the output of another model.
Why This Matters
ROUGE scores appear frequently in AI product claims and research papers. Understanding what they measure, and what they miss, helps you critically evaluate claims about AI summarisation quality and avoid being misled by impressive-sounding numbers that may not reflect real-world performance.