Cosine Similarity
A mathematical measure of how similar two vectors are by calculating the cosine of the angle between them, widely used in AI to compare documents, images, and search queries.
Cosine similarity is a mathematical measure that quantifies how similar two vectors are by computing the cosine of the angle between them. In AI, it is the standard method for comparing embeddings, the numerical representations that models use to encode the meaning of text, images, and other data.
How it works
Two vectors can point in similar or different directions, and cosine similarity quantifies how closely those directions align:
- A cosine similarity of 1 means the vectors point in exactly the same direction: the items are maximally similar.
- A cosine similarity of 0 means the vectors are perpendicular: the items are unrelated.
- A cosine similarity of -1 means the vectors point in opposite directions: the items are maximally dissimilar.
The beauty of cosine similarity is that it ignores magnitude (the length of the vectors) and focuses purely on direction. A short document and a long document about the same topic will have similar directions even though their magnitudes differ.
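This definition can be sketched in a few lines of NumPy. The toy vectors below are illustrative, not real embeddings; the point is that scaling a vector changes its magnitude but not its cosine similarity to anything else.

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction, twice the magnitude
c = np.array([-1.0, -2.0, -3.0])  # exactly opposite direction

print(cosine_similarity(a, b))  # 1.0: doubling the length changes nothing
print(cosine_similarity(a, c))  # -1.0: opposite directions
```

The "long document" here is just the "short document" scaled up, and the score stays at 1.0, which is exactly the magnitude-invariance described above.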
Why AI uses cosine similarity
When an AI model converts text into an embedding vector, semantically similar texts end up with similar vectors. Cosine similarity provides a fast, efficient way to measure that similarity numerically.
Common applications include:
- Semantic search: When you search a knowledge base, your query is embedded into a vector, and cosine similarity finds the documents with the most similar vectors.
- Recommendation systems: Finding products, articles, or content similar to what a user has previously engaged with.
- Duplicate detection: Identifying near-duplicate documents, support tickets, or customer enquiries.
- Clustering: Grouping similar items together based on their vector representations.
- RAG (Retrieval Augmented Generation): The retrieval step in RAG typically uses cosine similarity to find the most relevant documents to include in the AI's context.
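The retrieval pattern behind several of these applications, semantic search and RAG in particular, reduces to "score every document against the query, return the top k". A minimal sketch, assuming the documents and query have already been embedded into a NumPy matrix and vector (the function name `top_k` is illustrative):

```python
import numpy as np

def top_k(query_vec, doc_matrix, k=3):
    """Return indices of the k documents most similar to the query."""
    # Normalising every vector to unit length means a plain dot
    # product between rows and the query IS the cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = docs @ q                    # one cosine score per document
    return np.argsort(scores)[::-1][:k]  # indices of highest scores first

# Toy 2-D "embeddings": document 0 matches the query exactly,
# document 2 is close, document 1 is orthogonal (unrelated).
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
query = np.array([1.0, 0.0])
print(top_k(query, docs, k=2))  # [0 2]
```

Real systems replace the brute-force scan with an approximate nearest-neighbour index, but the scoring step is the same computation.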
Cosine similarity versus other distance metrics
- Euclidean distance measures the straight-line distance between two points. It is sensitive to magnitude, which can be a disadvantage when comparing documents of different lengths.
- Dot product is faster to compute but conflates similarity with magnitude. Useful when magnitude carries meaning (e.g., document importance).
- Cosine similarity is generally preferred for text and semantic comparisons because normalising for magnitude usually produces more meaningful similarity scores.
Practical considerations
When building semantic search or recommendation systems, the choice of similarity metric can significantly affect results. Cosine similarity is the default choice for most text-based applications, but the best metric depends on how the embeddings were trained. Always check the model documentation for the recommended similarity measure.
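One quick sanity check worth running alongside the documentation: many embedding models return unit-normalised vectors, in which case the dot product and cosine similarity give identical rankings. A small helper (the name `is_unit_normalised` is our own) can confirm this for any batch of embeddings:

```python
import numpy as np

def is_unit_normalised(embeddings, tol=1e-3):
    # If every row has norm ~1, dot product and cosine similarity coincide.
    norms = np.linalg.norm(embeddings, axis=1)
    return bool(np.all(np.abs(norms - 1.0) < tol))

print(is_unit_normalised(np.array([[0.6, 0.8], [1.0, 0.0]])))  # True
print(is_unit_normalised(np.array([[2.0, 0.0]])))              # False
```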
Why This Matters
Cosine similarity is the engine behind semantic search, document matching, and AI-powered recommendations. Understanding it helps you evaluate and troubleshoot these systems: when search results seem off, the issue often lies in how similarity is being measured.