Tokenizer (Tokeniser)
The component that converts text into tokens — the numerical units an AI model processes. Different models use different tokenisers, which affects how they handle text.
A tokeniser (also spelled tokenizer) is the software component that converts human-readable text into the numerical tokens that an AI model can process. It sits between you and the model — transforming your words into numbers on the way in, and numbers back into words on the way out. Every AI interaction begins and ends with the tokeniser.
Why tokenisers exist
AI models do not understand text. They process numbers. The tokeniser's job is to create a bridge between human language and mathematical computation:
- Text to tokens (encoding): Your prompt "How do I improve my marketing?" is converted into a sequence of numerical token IDs — perhaps [2437, 466, 314, 4916, 616, 8661, 30]
- Tokens to text (decoding): The model's numerical output is converted back into readable text
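The round trip above can be sketched in a few lines. This is a toy illustration only: the vocabulary below is hand-made for the example sentence, and the token IDs are arbitrary — real tokenisers learn their vocabularies from data and use far more sophisticated matching.

```python
# Toy encode/decode bridge. VOCAB is invented for this one sentence;
# real tokenisers have learned vocabularies of tens of thousands of entries.
VOCAB = {"How": 0, " do": 1, " I": 2, " improve": 3, " my": 4, " marketing": 5, "?": 6}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}

def encode(text: str, vocab: dict) -> list:
    """Greedily match the longest known token at each position."""
    ids, pos = [], 0
    while pos < len(text):
        match = max((t for t in vocab if text.startswith(t, pos)), key=len, default=None)
        if match is None:
            raise ValueError(f"no token matches text at position {pos}")
        ids.append(vocab[match])
        pos += len(match)
    return ids

def decode(ids: list, id_to_token: dict) -> str:
    """Map token IDs back to strings and concatenate."""
    return "".join(id_to_token[i] for i in ids)

ids = encode("How do I improve my marketing?", VOCAB)
print(ids)                        # [0, 1, 2, 3, 4, 5, 6]
print(decode(ids, ID_TO_TOKEN))   # round-trips back to the original text
```

The model only ever sees the ID sequence; the text on either side exists purely for the humans.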
How tokenisers work
Modern tokenisers use a technique called subword tokenisation. Rather than treating each word as a token (too many unique words) or each character as a token (too many tokens per sentence), they break text into frequently occurring subword units:
- Common words become single tokens: "the" → [1]
- Less common words are split into subwords: "tokenisation" → ["token", "isation"]
- Rare words are broken into smaller pieces: "pneumonia" → ["pn", "eum", "onia"]
The tokeniser learns these splits from a large text corpus, identifying which subword units appear most frequently. This creates a vocabulary — typically 32,000 to 100,000 tokens — that efficiently represents any text.
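The learning step described above is the core of byte-pair encoding (BPE), one common subword algorithm: repeatedly find the most frequent adjacent pair of symbols in the corpus and merge it into a new vocabulary entry. A minimal sketch, using a tiny invented corpus of words with frequencies:

```python
from collections import Counter

def pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge(pair, corpus):
    """Merge every occurrence of the pair into a single symbol."""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in corpus.items()}

# Words as space-separated characters, mapped to corpus frequencies
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges = []
for _ in range(3):
    counts = pair_counts(corpus)
    best = max(counts, key=counts.get)   # most frequent adjacent pair
    merges.append(best)
    corpus = merge(best, corpus)

print(merges)   # [('e', 's'), ('es', 't'), ('l', 'o')]
print(corpus)   # 'est' has become a single subword symbol
```

Run for thousands of merges on a large corpus, this produces exactly the behaviour described: frequent words end up as single tokens, while rare words decompose into learned subword pieces.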
Different models, different tokenisers
Each AI model family uses its own tokeniser with its own vocabulary:
- GPT models: Use OpenAI's tiktoken encodings — GPT-4's cl100k_base encoding has a vocabulary of roughly 100,000 tokens, and newer models use the larger o200k_base
- Claude: Uses Anthropic's own proprietary tokeniser
- Llama: Llama 2 uses a SentencePiece tokeniser with a ~32,000 token vocabulary; Llama 3 moved to a BPE tokeniser with a vocabulary of roughly 128,000 tokens
This means the same text produces different token counts depending on the model. A sentence that is 10 tokens in GPT-4 might be 12 tokens in Llama. This affects context window usage and pricing.
Why tokenisation matters for your work
Understanding tokenisation has practical implications:
- Cost estimation: Since AI APIs charge per token, knowing how text translates to tokens helps you estimate costs. Roughly: 750 words ≈ 1,000 tokens.
- Context window management: Your prompt, the conversation history, and the AI's response all consume tokens from the context window. Efficient tokenisation means more room for content.
- Language considerations: Non-English languages often use more tokens per word than English. Chinese, Japanese, and Korean text can use 2-3x more tokens for the same content, affecting both cost and context limits.
- Code and numbers: Code, mathematical expressions, and structured data (JSON, XML) can be tokenised less efficiently than prose, consuming more tokens than you might expect.
- Prompt optimisation: Concise prompts use fewer tokens. "Summarise the key points" uses fewer tokens than "Please go ahead and provide me with a comprehensive summary of the key and important points."
Token counting tools
Most AI platforms provide tools to count tokens before sending prompts:
- OpenAI's tiktoken library (Python/JavaScript)
- Anthropic's token counting API
- Online token counters for quick estimates
For rough estimation without tools, remember: 1 token ≈ 4 characters in English, or about 0.75 words.
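That rule of thumb is easy to turn into a helper. This sketch simply averages the two heuristics stated above (4 characters per token, 0.75 words per token); it is a rough guide for English prose only, not a substitute for a real token counter:

```python
def estimate_tokens(text: str) -> int:
    """Rough English-text estimate: ~4 characters or ~0.75 words per token."""
    if not text:
        return 0
    by_chars = len(text) / 4
    by_words = len(text.split()) / 0.75
    return round((by_chars + by_words) / 2)  # average the two heuristics
```

Expect it to undercount for code, non-English text, and heavy punctuation, all of which tokenise less efficiently.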
Tokenisation quirks
Tokenisers can produce unexpected results:
- Numbers can be split into inconsistent chunks (for example, "12345" might tokenise as "123" + "45"), which makes simple arithmetic harder for the model
- Punctuation and special characters consume their own tokens
- URLs and technical notation can be surprisingly token-heavy
- Trailing whitespace and formatting characters all count as tokens
Why This Matters
Tokenisation is the hidden mechanism that determines how much your AI interactions cost, how much context you can include in a conversation, and why AI sometimes behaves unexpectedly with numbers or non-English text. Understanding tokenisation helps you estimate costs accurately, optimise prompt length, and diagnose issues when AI behaviour seems inconsistent. It is the practical knowledge that turns AI budgeting from guesswork into informed planning.
Continue learning in Foundations
This topic is covered in our lesson: How Large Language Models Actually Work