Text Embedding
The process of converting text into numerical vectors that capture its meaning, enabling machines to measure similarity and perform semantic search.
A text embedding is a numerical representation of text: a list of numbers (a vector) that captures the meaning of a word, sentence, or document. These vectors enable machines to compare, search, and organise text based on meaning rather than just matching keywords.
From words to numbers
Computers cannot understand words directly. They need numbers. An embedding model converts text into a vector, typically a list of 256 to 3,072 numbers. Each number represents a dimension of meaning. While individual dimensions are not human-interpretable, the overall pattern captures semantic content.
The crucial property: texts with similar meanings produce similar vectors (close together in the vector space), and texts with different meanings produce dissimilar vectors (far apart).
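Closeness in the vector space is usually measured with cosine similarity, which scores the angle between two vectors. A minimal sketch in Python, using hand-picked toy vectors rather than output from a real embedding model:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors:
    # 1.0 = same direction (similar meaning), near 0 = unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (illustrative values only).
cat = [0.9, 0.1, 0.8, 0.2]
kitten = [0.85, 0.15, 0.75, 0.25]
invoice = [0.1, 0.9, 0.05, 0.7]

print(cosine_similarity(cat, kitten))   # high: similar meanings
print(cosine_similarity(cat, invoice))  # low: different meanings
```

Real models produce vectors with hundreds or thousands of dimensions, but the comparison works exactly the same way.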
How embeddings are created
Modern embedding models are typically transformer-based neural networks trained on massive text corpora. During training, the model learns to produce vectors where:
- Synonyms cluster together.
- Related concepts are nearby.
- Unrelated concepts are distant.
- Relationships are preserved (king − man + woman ≈ queen in the vector space).
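The analogy in the last point can be sketched with hand-made toy vectors; real models learn this kind of structure from data across many more dimensions:

```python
# Toy 3-D vectors chosen by hand so each dimension has an obvious
# reading: [royalty, maleness, femaleness]. Illustrative only.
king  = [0.9, 0.9, 0.1]
man   = [0.1, 0.9, 0.1]
woman = [0.1, 0.1, 0.9]
queen = [0.9, 0.1, 0.9]

# king - man + woman, computed per dimension.
result = [k - m + w for k, m, w in zip(king, man, woman)]
print(result)  # approximately equals the "queen" vector
```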
Types of text embeddings
- Word embeddings: A vector for each individual word. Early approaches such as Word2Vec and GloVe revolutionised NLP.
- Sentence embeddings: A vector for an entire sentence, capturing its overall meaning. Better for most practical tasks.
- Document embeddings: A vector for a full document, used for document-level similarity and retrieval.
- Passage embeddings: Vectors for chunks of text, commonly used in RAG applications.
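For passage embeddings, documents are first split into chunks, each of which is embedded separately. A minimal character-based chunking sketch (production systems often split on tokens or sentence boundaries instead):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    # Split text into overlapping windows; the overlap helps avoid
    # cutting a relevant passage in half at a chunk boundary.
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

document = "word " * 100          # a 500-character stand-in document
passages = chunk_text(document)   # each passage would then be embedded
print(len(passages))
```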
Practical applications
- Semantic search: Find documents matching the meaning of a query, not just keywords.
- RAG: Retrieve relevant context for LLM generation based on semantic similarity.
- Clustering: Group similar documents, support tickets, or feedback automatically.
- Classification: Use embeddings as features for downstream classification models.
- Anomaly detection: Identify text that is semantically unusual compared to a baseline.
- Recommendation: Suggest content similar to what a user has engaged with.
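Semantic search, the first application above, reduces to ranking corpus embeddings by similarity to the query embedding. A sketch with toy vectors (document IDs and values are illustrative, not from a real index):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def semantic_search(query_vec, corpus, top_k=2):
    # corpus: list of (doc_id, embedding) pairs.
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in corpus]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

corpus = [
    ("refund-policy",  [0.9, 0.1, 0.2]),
    ("shipping-times", [0.2, 0.9, 0.1]),
    ("returns-howto",  [0.7, 0.3, 0.1]),
]
query = [0.85, 0.15, 0.25]  # e.g. the embedded query "how do I get my money back?"
print(semantic_search(query, corpus))
```

At scale, the brute-force scan here is replaced by an approximate nearest-neighbour index, but the ranking principle is the same.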
Choosing an embedding model
Key considerations:
- Dimension size: Higher dimensions capture more nuance but use more storage and compute.
- Training data: Models trained on your domain's language perform better.
- Speed vs quality: Smaller models embed faster; larger models produce better representations.
- Multilingual support: If you work across languages, choose a model trained on multilingual data.
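The storage cost of dimension size is easy to estimate: stored as 32-bit floats, each dimension costs 4 bytes per vector.

```python
def index_size_bytes(num_vectors, dims, bytes_per_value=4):
    # Raw vector storage only; real indexes add metadata and
    # index-structure overhead on top of this.
    return num_vectors * dims * bytes_per_value

# One million vectors at the two ends of the typical dimension range.
print(index_size_bytes(1_000_000, 256) / 1e9)   # 1.024 GB
print(index_size_bytes(1_000_000, 3072) / 1e9)  # 12.288 GB
```

A 12x difference in dimensions is a 12x difference in storage and in the arithmetic needed per similarity comparison, which is why dimension size is a genuine trade-off rather than "bigger is better".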
Why This Matters
Text embeddings are the foundation of modern AI search, retrieval, and organisation. Understanding them is essential for building effective RAG systems, semantic search features, and any application that needs to compare or organise text by meaning. They turn the abstract concept of "meaning" into something measurable and computable.
Continue learning in Essentials
This topic is covered in our lesson: How AI Understands Meaning