Semantic Similarity
A measure of how close in meaning two pieces of text are, regardless of whether they share the same words.
Semantic similarity measures how close in meaning two pieces of text are, regardless of the specific words used. "The cat sat on the mat" and "A feline rested on the rug" share no important words but have nearly identical meaning, so their semantic similarity is high.
Why it matters
Traditional text matching relies on shared words. If a customer searches for "laptop that doesn't overheat" but your product descriptions say "advanced thermal management," keyword search fails. Semantic similarity connects these because the underlying concepts are related.
How semantic similarity works
Modern semantic similarity uses embeddings: text is converted into numerical vectors (lists of numbers) that capture meaning. Texts with similar meanings end up as vectors that are close together in this high-dimensional space.
To measure similarity between two texts:
- Convert each text into an embedding vector using an embedding model.
- Calculate the distance between the vectors using a similarity metric.
- A high similarity score means the texts are semantically close.
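The steps above can be sketched in a few lines of NumPy. The three-dimensional vectors below are toy stand-ins for real embedding-model output (real embeddings typically have hundreds or thousands of dimensions); only the similarity calculation is shown.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for embeddings of two paraphrases and one unrelated sentence.
cat_on_mat = np.array([0.9, 0.1, 0.3])   # "The cat sat on the mat"
feline_rug = np.array([0.8, 0.2, 0.4])   # "A feline rested on the rug"
stock_news = np.array([0.1, 0.9, -0.2])  # "Shares fell sharply today"

print(cosine_similarity(cat_on_mat, feline_rug))  # close to 1: similar meaning
print(cosine_similarity(cat_on_mat, stock_news))  # much lower: unrelated
```

In a real pipeline, the vectors would come from an embedding model rather than being written by hand; the comparison step stays exactly the same.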
Common similarity metrics
- Cosine similarity: Measures the angle between two vectors, ignoring their length. Ranges from -1 (pointing in opposite directions) to 1 (pointing the same way); higher means more similar in meaning. The most widely used metric.
- Euclidean distance: Measures the straight-line distance between vectors. Smaller distance means more similar.
- Dot product: A fast computation that is equivalent to cosine similarity when the vectors are normalised to unit length.
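The three metrics can be compared on the same pair of vectors. The sketch below also demonstrates why the dot product works on normalised vectors: after scaling each vector to unit length, the dot product and cosine similarity are the same number.

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)  # smaller distance = more similar
dot = np.dot(a, b)

print(cosine, euclidean, dot)

# Normalise to unit length: now the dot product equals cosine similarity,
# which is why it works well as a fast metric on normalised embeddings.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
assert np.isclose(np.dot(a_unit, b_unit), cosine)
```

Many embedding providers return unit-length vectors precisely so that the cheap dot product can be used in place of full cosine similarity.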
Applications
- Semantic search: Finding documents that match the meaning of a query, not just the keywords.
- Duplicate detection: Identifying similar content or support tickets even when worded differently.
- Recommendation systems: Suggesting content similar to what a user has engaged with.
- Clustering: Grouping related documents, reviews, or feedback automatically.
- Plagiarism detection: Finding paraphrased content that keyword matching would miss.
- FAQ matching: Routing customer questions to the most relevant existing answer.
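Several of these applications reduce to the same operation: embed a query, then find the stored item with the highest similarity. A minimal FAQ-matching sketch, again using hand-written toy vectors in place of real embedding-model output:

```python
import numpy as np

def best_match(query_vec: np.ndarray, faq_vecs: np.ndarray) -> int:
    """Return the index of the FAQ entry whose embedding is closest to the query."""
    # Normalise rows so the matrix-vector product yields cosine similarities.
    faq_unit = faq_vecs / np.linalg.norm(faq_vecs, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    scores = faq_unit @ query_unit
    return int(np.argmax(scores))

# Toy embeddings standing in for three FAQ answers; a real system would
# precompute these once with an embedding model and store them.
faq_embeddings = np.array([
    [0.9, 0.1, 0.0],  # "How do I reset my password?"
    [0.1, 0.8, 0.2],  # "How do I cancel my subscription?"
    [0.0, 0.2, 0.9],  # "Why is my laptop overheating?"
])
query = np.array([0.1, 0.1, 0.8])  # "my machine gets too hot"

print(best_match(query, faq_embeddings))  # index 2: the overheating answer
```

At scale, the brute-force `argmax` is replaced by an approximate nearest-neighbour index, but the principle is unchanged.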
Choosing an embedding model
Different embedding models are optimised for different tasks. Some excel at short text (sentences), others at long documents. Some are general-purpose, others are fine-tuned for specific domains. Popular choices include OpenAI's text-embedding models, Cohere's embed models, and open-source options like BGE and E5.
Limitations
- Semantic similarity captures meaning overlap but may miss nuance, negation, or conditional statements.
- Performance depends heavily on the quality of the embedding model.
- Cross-lingual semantic similarity requires multilingual embedding models.
Why This Matters
Semantic similarity is the foundation of modern search, recommendation, and knowledge retrieval systems. Understanding it helps you design better search experiences, build more effective RAG systems, and evaluate whether an AI tool truly understands meaning or merely matches keywords.
Continue learning in Essentials
This topic is covered in our lesson: How AI Understands Meaning