Text Embedding
The process of converting text into numerical vectors that capture its meaning, enabling machines to measure similarity and perform semantic search.
A text embedding is a numerical representation of text: a list of numbers (a vector) that captures the meaning of a word, sentence, or document. These vectors enable machines to compare, search, and organise text based on meaning rather than just matching keywords.
From words to numbers
Computers cannot understand words directly. They need numbers. An embedding model converts text into a vector, typically a list of 256 to 3,072 numbers. Each number represents a dimension of meaning. While individual dimensions are not human-interpretable, the overall pattern captures semantic content.
The crucial property: texts with similar meanings produce similar vectors (close together in the vector space), and texts with different meanings produce dissimilar vectors (far apart).
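Closeness in the vector space is usually measured with cosine similarity, which scores the angle between two vectors. A minimal sketch in Python, using hand-picked toy vectors rather than output from a real embedding model:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors:
    # 1.0 = same direction (similar meaning), near 0 = unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (illustrative values only).
cat = [0.9, 0.1, 0.8, 0.2]
kitten = [0.85, 0.15, 0.75, 0.25]
invoice = [0.1, 0.9, 0.05, 0.7]

print(cosine_similarity(cat, kitten))   # high: similar meanings
print(cosine_similarity(cat, invoice))  # low: different meanings
```

Real models produce vectors with hundreds or thousands of dimensions, but the comparison works exactly the same way.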
How embeddings are created
Modern embedding models are typically transformer-based neural networks trained on massive text corpora. During training, the model learns to produce vectors where:
- Synonyms cluster together.
- Related concepts are nearby.
- Unrelated concepts are distant.
- Relationships are preserved (king − man + woman ≈ queen in the vector space).
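The analogy in the last point can be sketched with hand-made toy vectors; real models learn this kind of structure from data across many more dimensions:

```python
# Toy 3-D vectors chosen by hand so each dimension has an obvious
# reading: [royalty, maleness, femaleness]. Illustrative only.
king  = [0.9, 0.9, 0.1]
man   = [0.1, 0.9, 0.1]
woman = [0.1, 0.1, 0.9]
queen = [0.9, 0.1, 0.9]

# king - man + woman, computed per dimension.
result = [k - m + w for k, m, w in zip(king, man, woman)]
print(result)  # approximately equals the "queen" vector
```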
Types of text embeddings
- Word embeddings: A vector for each individual word. Early approaches such as Word2Vec and GloVe revolutionised NLP.
- Sentence embeddings: A vector for an entire sentence, capturing its overall meaning. Better for most practical tasks.
- Document embeddings: A vector for a full document, used for document-level similarity and retrieval.
- Passage embeddings: Vectors for chunks of text, commonly used in RAG applications.
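For passage embeddings, documents are first split into chunks, each of which is embedded separately. A minimal character-based chunking sketch (production systems often split on tokens or sentence boundaries instead):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    # Split text into overlapping windows; the overlap helps avoid
    # cutting a relevant passage in half at a chunk boundary.
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

document = "word " * 100          # a 500-character stand-in document
passages = chunk_text(document)   # each passage would then be embedded
print(len(passages))
```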
Practical applications
- Semantic search: Find documents matching the meaning of a query, not just keywords.
- RAG: Retrieve relevant context for LLM generation based on semantic similarity.
- Clustering: Group similar documents, support tickets, or feedback automatically.
- Classification: Use embeddings as features for downstream classification models.
- Anomaly detection: Identify text that is semantically unusual compared to a baseline.
- Recommendation: Suggest content similar to what a user has engaged with.
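Semantic search, the first application above, reduces to ranking corpus embeddings by similarity to the query embedding. A sketch with toy vectors (document IDs and values are illustrative, not from a real index):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def semantic_search(query_vec, corpus, top_k=2):
    # corpus: list of (doc_id, embedding) pairs.
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in corpus]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

corpus = [
    ("refund-policy",  [0.9, 0.1, 0.2]),
    ("shipping-times", [0.2, 0.9, 0.1]),
    ("returns-howto",  [0.7, 0.3, 0.1]),
]
query = [0.85, 0.15, 0.25]  # e.g. the embedded query "how do I get my money back?"
print(semantic_search(query, corpus))
```

At scale, the brute-force scan here is replaced by an approximate nearest-neighbour index, but the ranking principle is the same.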
Choosing an embedding model
Key considerations:
- Dimension size: Higher dimensions capture more nuance but use more storage and compute.
- Training data: Models trained on your domain's language perform better.
- Speed vs quality: Smaller models embed faster; larger models produce better representations.
- Multilingual support: If you work across languages, choose a model trained on multilingual data.
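The storage cost of dimension size is easy to estimate: stored as 32-bit floats, each dimension costs 4 bytes per vector.

```python
def index_size_bytes(num_vectors, dims, bytes_per_value=4):
    # Raw vector storage only; real indexes add metadata and
    # index-structure overhead on top of this.
    return num_vectors * dims * bytes_per_value

# One million vectors at the two ends of the typical dimension range.
print(index_size_bytes(1_000_000, 256) / 1e9)   # 1.024 GB
print(index_size_bytes(1_000_000, 3072) / 1e9)  # 12.288 GB
```

A 12x difference in dimensions is a 12x difference in storage and in the arithmetic needed per similarity comparison, which is why dimension size is a genuine trade-off rather than "bigger is better".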
Why This Matters
Text embeddings are the foundation of modern AI search, retrieval, and organisation. Understanding them is essential for building effective RAG systems, semantic search features, and any application that needs to compare or organise text by meaning. They turn the abstract concept of "meaning" into something measurable and computable.
Continue learning in Essentials
This topic is covered in our lesson: How AI Understands Meaning