Semantic Chunking
A technique for splitting documents into meaningful segments based on topic or meaning rather than arbitrary character counts, improving the quality of AI retrieval and analysis.
Semantic chunking is a document processing technique that splits text into segments based on meaning and topic rather than arbitrary character or token counts. It is a critical component of retrieval-augmented generation (RAG) systems, where the quality of chunks directly affects the quality of AI responses.
The problem with naive chunking
The simplest approach to chunking is splitting text at fixed intervals, say every 500 tokens. This is fast and easy to implement, but it creates several problems:
- Broken context: A paragraph explaining a concept might be split in the middle, with the setup in one chunk and the conclusion in another.
- Mixed topics: A single chunk might contain the end of one section and the beginning of another, so it matches queries about either topic poorly.
- Lost structure: Document structure (headers, sections, lists) is ignored, losing valuable organisational information.
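To make the failure mode concrete, here is a minimal sketch of naive fixed-size chunking (the function name and sizes are illustrative, not from any particular library):

```python
def naive_chunk(text: str, chunk_size: int = 500) -> list[str]:
    """Split text into fixed-size character chunks, ignoring meaning entirely.

    A sentence or paragraph that straddles a boundary is cut in half,
    leaving its setup in one chunk and its conclusion in the next.
    """
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


doc = "Revenue grew strongly this quarter. It increased by 40% over baseline."
chunks = naive_chunk(doc, chunk_size=40)
# The second sentence can end up split from the first, so "It" loses its referent.
```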
How semantic chunking works
Semantic chunking uses meaning to determine where to split:
- Embedding-based splitting: Compute embeddings for successive sentences or paragraphs. When the cosine similarity between consecutive segments drops below a threshold, insert a split: the topic has changed.
- Structure-aware splitting: Use document structure (headings, paragraphs, section breaks) as natural split points, keeping related content together.
- LLM-based splitting: Use a language model to identify topic boundaries and determine optimal split points.
- Hybrid approaches: Combine structural and semantic signals: split at paragraph boundaries, but merge short paragraphs about the same topic and split long paragraphs that cover multiple topics.
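The embedding-based strategy above can be sketched in a few lines. This toy version uses a bag-of-words vector in place of a real embedding model (a production system would call a sentence-embedding model instead); the function names and the 0.2 threshold are illustrative assumptions:

```python
import math
from collections import Counter


def embed(sentence: str) -> Counter:
    # Toy bag-of-words "embedding"; stands in for a neural sentence encoder.
    return Counter(sentence.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def semantic_split(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk whenever similarity to the previous sentence
    drops below the threshold, signalling a likely topic change."""
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append([cur])       # topic changed: open a new chunk
        else:
            chunks[-1].append(cur)     # same topic: extend the current chunk
    return chunks
```

With real embeddings the same control flow applies; only `embed` and the threshold change.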
Why chunk quality matters for RAG
In a RAG system, the retrieval step finds chunks that are relevant to the user's query. If chunks are poorly constructed:
- A relevant answer might be split across multiple chunks, none of which individually matches the query well enough to be retrieved.
- Irrelevant information mixed into a chunk adds noise that confuses the model.
- Missing context makes it impossible for the model to formulate a complete answer.
Chunk size considerations
- Too small: Individual sentences lose context. "It increased by 40%" means nothing without knowing what "it" refers to.
- Too large: Chunks contain too much irrelevant information, diluting the relevant content and consuming context window space.
- Sweet spot: Typically 200-500 tokens, though this varies by document type and use case.
Advanced techniques
- Overlapping chunks: Create chunks that overlap by a percentage, ensuring that information near chunk boundaries is captured in multiple chunks.
- Hierarchical chunking: Create chunks at multiple granularities (paragraph, section, document) and retrieve at the most appropriate level.
- Parent-child chunking: Index small chunks for precise retrieval but return the parent chunk (which includes surrounding context) to the model.
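The overlapping-chunks idea above reduces to a sliding window. A minimal sketch, assuming token lists as input and illustrative window sizes:

```python
def overlapping_chunks(tokens: list[str], size: int = 200, overlap: int = 50) -> list[list[str]]:
    """Slide a window of `size` tokens, stepping by `size - overlap`,
    so content near a boundary appears in two consecutive chunks."""
    step = size - overlap
    chunks = []
    for i in range(0, len(tokens), step):
        chunks.append(tokens[i:i + size])
        if i + size >= len(tokens):  # last window already reaches the end
            break
    return chunks
```

Larger overlaps give more redundancy (better boundary coverage) at the cost of index size and duplicated retrieval results.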
Why this matters
Semantic chunking is often the difference between a RAG system that gives accurate, contextual answers and one that produces vague or incomplete responses. Understanding chunking strategies helps you diagnose and fix quality issues in AI-powered knowledge systems.