Tokenization
The process of breaking text into smaller units called tokens that an AI model can process, forming the fundamental input representation for language models.
Tokenization is the process of breaking text into smaller pieces called tokens, the atomic units that a language model processes. Before an AI model can understand your prompt, it must first convert your text into a sequence of tokens. This seemingly simple step has profound implications for how AI models work, what they cost, and where they struggle.
How tokenization works
Tokenizers do not simply split text into words. They use algorithms that break text into sub-word units:
- Common words often become single tokens: "the," "and," "hello"
- Less common words are split into pieces: "tokenization" might become "token" + "ization"
- Rare words are broken into smaller fragments: "Enigmatica" might become "En" + "igm" + "atica"
- Spaces, punctuation, and special characters are also tokens
A typical English word averages about 1.3 tokens. One token is roughly four characters of English text.
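As a toy illustration of the splitting behaviour described above, here is a greedy longest-match sub-word tokenizer over a tiny hand-made vocabulary. The function and vocabulary are hypothetical sketches; production tokenizers learn their vocabularies from data rather than using a fixed list like this.

```python
def tokenize(text, vocab):
    """Greedy longest-match sub-word tokenizer (toy illustration)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # Unknown character: fall back to a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

# A tiny hand-made vocabulary (hypothetical, for illustration only).
vocab = {"token", "ization", "the", " ", "hello"}
print(tokenize("the tokenization", vocab))
# → ['the', ' ', 'token', 'ization']
```

Note how "tokenization" is not in the vocabulary, yet it is still representable: it is assembled from the pieces "token" and "ization", exactly as in the example above.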
Why sub-word tokenization
The alternative, having a separate token for every possible word, would require an impossibly large vocabulary. Sub-word tokenization strikes a balance:
- Common words are efficient (one token each)
- Rare words can still be represented (assembled from smaller pieces)
- New words and misspellings are handled gracefully
- The vocabulary stays manageable (typically 30,000-100,000 tokens)
Popular tokenization methods
- Byte-Pair Encoding (BPE): Used by GPT models. Starts with individual characters and iteratively merges the most frequent pairs.
- WordPiece: Used by BERT. Similar to BPE but uses a different merging criterion.
- SentencePiece: Treats the input as a raw stream (whitespace included) rather than pre-splitting it into words, making it language-agnostic. Used by many multilingual models.
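The core BPE training step, counting adjacent symbol pairs and merging the most frequent pair everywhere, can be sketched in a few lines. The `bpe_merge_step` helper and the word frequencies below are illustrative, not any library's API:

```python
from collections import Counter

def bpe_merge_step(words):
    """One BPE training step: find the most frequent adjacent symbol
    pair across the corpus and merge it everywhere (minimal sketch)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    if not pairs:
        return words, None
    best = max(pairs, key=pairs.get)
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                out.append(symbols[i] + symbols[i + 1])  # merge the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged, best

# Toy corpus: each word as a character sequence with a frequency.
words = {tuple("lower"): 5, tuple("lowest"): 2}
words, pair = bpe_merge_step(words)
print(pair)   # → ('l', 'o')
```

Repeating this step thousands of times is what builds up multi-character tokens like "token" or "the" from individual characters.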
Why tokenization matters for costs
AI API pricing is typically per token, for both input and output. Understanding tokenization helps you estimate costs:
- A 500-word email is roughly 650-750 tokens
- A 10-page report might be 7,000-10,000 tokens
- Code tends to use more tokens per line than prose
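A back-of-the-envelope estimator built on the rules of thumb above (~4 characters per token for English prose). The function names and the per-1K-token price are hypothetical; real counts vary by tokenizer and by content type, so treat this as a rough sketch:

```python
def estimate_tokens(text):
    """Rough token estimate using the ~4 characters per token
    rule of thumb for English text (a heuristic, not an exact count)."""
    return max(1, round(len(text) / 4))

def estimate_cost_usd(text, usd_per_1k_tokens):
    # usd_per_1k_tokens is a placeholder: check your provider's pricing.
    return estimate_tokens(text) / 1000 * usd_per_1k_tokens

prompt = "Summarize the attached report in three bullet points."
print(estimate_tokens(prompt))  # → 13
```

For precise counts you would run the provider's actual tokenizer over your text rather than a character-count heuristic, since code, non-English text, and unusual formatting all skew the ratio.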
Tokenization quirks
Tokenization explains several AI behaviours that seem odd:
- AI struggles to count letters in words (because it processes tokens, not characters)
- Code uses more tokens than you might expect (whitespace, brackets, and operators are all tokens)
- Non-English languages often require more tokens per word, making API calls more expensive
- AI may split names, numbers, or technical terms in unexpected ways
Why This Matters
Tokenization directly affects AI costs, context window limits, and model behaviour. Understanding how text is tokenized helps you write more cost-efficient prompts, estimate API expenses accurately, and diagnose strange AI behaviour that stems from how the model sees your input.
Continue learning in Essentials
This topic is covered in our lesson: How Large Language Models Actually Work