Tokenization
The process of breaking text into smaller units called tokens that an AI model can process, forming the fundamental input representation for language models.
Tokenization is the process of breaking text into smaller pieces called tokens, the atomic units that a language model processes. Before an AI model can understand your prompt, it must first convert your text into a sequence of tokens. This seemingly simple step has profound implications for how AI models work, what they cost, and where they struggle.
How tokenization works
Tokenizers do not simply split text into words. They use algorithms that break text into sub-word units:
- Common words often become single tokens: "the," "and," "hello"
- Less common words are split into pieces: "tokenization" might become "token" + "ization"
- Rare words are broken into smaller fragments: "Enigmatica" might become "En" + "igm" + "atica"
- Spaces, punctuation, and special characters are also tokens
A typical English word averages about 1.3 tokens. One token is roughly four characters of English text.
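As a toy illustration of the splitting behaviour described above, here is a greedy longest-match sub-word tokenizer over a tiny hand-made vocabulary. The function and vocabulary are hypothetical sketches; production tokenizers learn their vocabularies from data rather than using a fixed list like this.

```python
def tokenize(text, vocab):
    """Greedy longest-match sub-word tokenizer (toy illustration)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # Unknown character: fall back to a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

# A tiny hand-made vocabulary (hypothetical, for illustration only).
vocab = {"token", "ization", "the", " ", "hello"}
print(tokenize("the tokenization", vocab))
# → ['the', ' ', 'token', 'ization']
```

Note how "tokenization" is not in the vocabulary, yet it is still representable: it is assembled from the pieces "token" and "ization", exactly as in the example above.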
Why sub-word tokenization
The alternative, having a separate token for every possible word, would require an impossibly large vocabulary. Sub-word tokenization strikes a balance:
- Common words are efficient (one token each)
- Rare words can still be represented (assembled from smaller pieces)
- New words and misspellings are handled gracefully
- The vocabulary stays manageable (typically 30,000-100,000 tokens)
Popular tokenization methods
- Byte-Pair Encoding (BPE): Used by GPT models. Starts with individual characters and iteratively merges the most frequent pairs.
- WordPiece: Used by BERT. Similar to BPE but uses a different merging criterion.
- SentencePiece: Treats the input as a raw stream (whitespace included) rather than pre-splitting it into words, making it language-agnostic. Used by many multilingual models.
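The core BPE training step, counting adjacent symbol pairs and merging the most frequent pair everywhere, can be sketched in a few lines. The `bpe_merge_step` helper and the word frequencies below are illustrative, not any library's API:

```python
from collections import Counter

def bpe_merge_step(words):
    """One BPE training step: find the most frequent adjacent symbol
    pair across the corpus and merge it everywhere (minimal sketch)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    if not pairs:
        return words, None
    best = max(pairs, key=pairs.get)
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                out.append(symbols[i] + symbols[i + 1])  # merge the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged, best

# Toy corpus: each word as a character sequence with a frequency.
words = {tuple("lower"): 5, tuple("lowest"): 2}
words, pair = bpe_merge_step(words)
print(pair)   # → ('l', 'o')
```

Repeating this step thousands of times is what builds up multi-character tokens like "token" or "the" from individual characters.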
Why tokenization matters for costs
AI API pricing is typically per token, for both input and output. Understanding tokenization helps you estimate costs:
- A 500-word email is roughly 650-750 tokens
- A 10-page report might be 7,000-10,000 tokens
- Code tends to use more tokens per line than prose
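A back-of-the-envelope estimator built on the rules of thumb above (~4 characters per token for English prose). The function names and the per-1K-token price are hypothetical; real counts vary by tokenizer and by content type, so treat this as a rough sketch:

```python
def estimate_tokens(text):
    """Rough token estimate using the ~4 characters per token
    rule of thumb for English text (a heuristic, not an exact count)."""
    return max(1, round(len(text) / 4))

def estimate_cost_usd(text, usd_per_1k_tokens):
    # usd_per_1k_tokens is a placeholder: check your provider's pricing.
    return estimate_tokens(text) / 1000 * usd_per_1k_tokens

prompt = "Summarize the attached report in three bullet points."
print(estimate_tokens(prompt))  # → 13
```

For precise counts you would run the provider's actual tokenizer over your text rather than a character-count heuristic, since code, non-English text, and unusual formatting all skew the ratio.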
Tokenization quirks
Tokenization explains several AI behaviours that seem odd:
- AI struggles to count letters in words (because it processes tokens, not characters)
- Code uses more tokens than you might expect (whitespace, brackets, and operators are all tokens)
- Non-English languages often require more tokens per word, making API calls more expensive
- AI may split names, numbers, or technical terms in unexpected ways
Why This Matters
Tokenization directly affects AI costs, context window limits, and model behaviour. Understanding how text is tokenized helps you write more cost-efficient prompts, estimate API expenses accurately, and diagnose strange AI behaviour that stems from how the model sees your input.
Continue learning in Essentials
This topic is covered in our lesson: How Large Language Models Actually Work