ROUGE Score
A set of metrics for evaluating the quality of AI-generated summaries by measuring how much they overlap with human-written reference summaries.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a family of metrics used to evaluate the quality of text generated by AI, particularly summaries. It works by measuring the overlap between the AI-generated text and one or more human-written reference texts.
How ROUGE works
ROUGE fundamentally measures how much of the reference text's content appears in the generated text. The basic approach is to count matching units (words, phrases, or sequences) between the generated and reference texts.
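As a minimal sketch of that counting step, the snippet below extracts n-grams and counts the matches between a candidate and a reference (the whitespace tokenisation and clipped-count convention are illustrative assumptions; real implementations add options such as stemming):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset of n-grams (as tuples) from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap_count(candidate, reference, n=1):
    """Count n-grams shared by candidate and reference, clipping repeats."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    return sum(min(count, ref[gram]) for gram, count in cand.items())

cand = "the cat sat on the mat".split()
ref = "the cat lay on the mat".split()
print(overlap_count(cand, ref, n=1))  # 5 shared unigrams
print(overlap_count(cand, ref, n=2))  # 3 shared bigrams
```

Dividing these match counts by the total n-grams in the reference or the candidate gives the recall and precision figures discussed below.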
ROUGE variants
- ROUGE-1: Measures the overlap of individual words (unigrams). If the reference contains 20 unique words and the generated summary shares 15 of them, ROUGE-1 recall is 75%.
- ROUGE-2: Measures the overlap of consecutive word pairs (bigrams). This captures some phrase-level similarity, not just individual words.
- ROUGE-L: Measures the longest common subsequence between the generated and reference texts. This captures sentence-level structure without requiring consecutive matches.
- ROUGE-S: Measures the overlap of skip-bigrams, pairs of words that appear in the same order but not necessarily consecutively.
Precision, recall, and F1 in ROUGE
Each ROUGE variant can be expressed as:
- Recall: What proportion of the reference text is captured by the generated text?
- Precision: What proportion of the generated text is relevant (appears in the reference)?
- F1: The harmonic mean of precision and recall, balancing both.
For summarisation, recall is often emphasised: we want the summary to capture the key content from the reference.
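Putting the pieces together, a minimal ROUGE-N scorer divides the clipped match count by the candidate length (precision) and the reference length (recall), then takes their harmonic mean (the `rouge_n` name and whitespace tokenisation are assumptions for this sketch):

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, F1) for ROUGE-N over token lists."""
    grams = lambda toks: Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    match = sum(min(c, ref[g]) for g, c in cand.items())     # clipped matches
    precision = match / max(sum(cand.values()), 1)           # share of candidate that matches
    recall = match / max(sum(ref.values()), 1)               # share of reference captured
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = rouge_n("the cat sat on the mat".split(),
                  "the cat lay quietly on the soft mat".split())
# p = 5/6, r = 5/8, f = 5/7
```

Production work would normally use an established implementation (for example the `rouge_score` Python package) rather than a hand-rolled scorer, but the arithmetic is the same.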
Limitations of ROUGE
ROUGE has significant limitations that are important to understand:
- Surface-level matching: ROUGE only measures word overlap, not meaning. Two sentences can express the same idea with completely different words, scoring poorly on ROUGE despite being semantically equivalent.
- Reference dependency: ROUGE requires human reference summaries. If the reference is poor, ROUGE scores are meaningless.
- No fluency assessment: A grammatically broken sentence can score highly on ROUGE if it contains the right words.
- Single correct answer assumption: There are many valid ways to summarise a text. ROUGE penalises valid alternatives that differ from the reference.
ROUGE in practice
Despite its limitations, ROUGE remains widely used because:
- It is fast, cheap, and reproducible, which is essential for comparing many models across many datasets.
- It correlates reasonably well with human judgements for factual summarisation tasks.
- It provides a standardised benchmark that allows comparison across research papers.
- It is a useful first filter: if ROUGE scores are very low, the model is likely performing poorly.
Beyond ROUGE
Modern evaluation increasingly supplements ROUGE with:
- BERTScore: Uses embeddings to measure semantic similarity rather than surface word overlap.
- Human evaluation: Still the gold standard for assessing quality, fluency, and factual accuracy.
- LLM-as-judge: Using a powerful language model to evaluate the output of another model.
Why This Matters
ROUGE scores appear frequently in AI product claims and research papers. Understanding what they measure, and what they miss, helps you critically evaluate claims about AI summarisation quality and avoid being misled by impressive-sounding numbers that may not reflect real-world performance.