Tokenizer (Tokeniser)
The component that converts text into tokens — the numerical units an AI model processes. Different models use different tokenisers, which affects how they handle text.
A tokeniser (also spelled tokenizer) is the software component that converts human-readable text into the numerical tokens that an AI model can process. It sits between you and the model — transforming your words into numbers on the way in, and numbers back into words on the way out. Every AI interaction begins and ends with the tokeniser.
Why tokenisers exist
AI models do not understand text. They process numbers. The tokeniser's job is to create a bridge between human language and mathematical computation:
- Text to tokens (encoding): Your prompt "How do I improve my marketing?" is converted into a sequence of numerical token IDs — perhaps [2437, 466, 314, 4916, 616, 8661, 30]
- Tokens to text (decoding): The model's numerical output is converted back into readable text
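The round trip above can be sketched in a few lines. This is a toy illustration only: the vocabulary below is hand-made for the example sentence, and the token IDs are arbitrary — real tokenisers learn their vocabularies from data and use far more sophisticated matching.

```python
# Toy encode/decode bridge. VOCAB is invented for this one sentence;
# real tokenisers have learned vocabularies of tens of thousands of entries.
VOCAB = {"How": 0, " do": 1, " I": 2, " improve": 3, " my": 4, " marketing": 5, "?": 6}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}

def encode(text: str, vocab: dict) -> list:
    """Greedily match the longest known token at each position."""
    ids, pos = [], 0
    while pos < len(text):
        match = max((t for t in vocab if text.startswith(t, pos)), key=len, default=None)
        if match is None:
            raise ValueError(f"no token matches text at position {pos}")
        ids.append(vocab[match])
        pos += len(match)
    return ids

def decode(ids: list, id_to_token: dict) -> str:
    """Map token IDs back to strings and concatenate."""
    return "".join(id_to_token[i] for i in ids)

ids = encode("How do I improve my marketing?", VOCAB)
print(ids)                        # [0, 1, 2, 3, 4, 5, 6]
print(decode(ids, ID_TO_TOKEN))   # round-trips back to the original text
```

The model only ever sees the ID sequence; the text on either side exists purely for the humans.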
How tokenisers work
Modern tokenisers use a technique called subword tokenisation. Rather than treating each word as a token (too many unique words) or each character as a token (too many tokens per sentence), they break text into frequently occurring subword units:
- Common words become single tokens: "the" → [1]
- Less common words are split into subwords: "tokenisation" → ["token", "isation"]
- Rare words are broken into smaller pieces: "pneumonia" → ["pn", "eum", "onia"]
The tokeniser learns these splits from a large text corpus, identifying which subword units appear most frequently. This creates a vocabulary — typically 32,000 to 100,000 tokens — that efficiently represents any text.
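The learning step described above is the core of byte-pair encoding (BPE), one common subword algorithm: repeatedly find the most frequent adjacent pair of symbols in the corpus and merge it into a new vocabulary entry. A minimal sketch, using a tiny invented corpus of words with frequencies:

```python
from collections import Counter

def pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge(pair, corpus):
    """Merge every occurrence of the pair into a single symbol."""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in corpus.items()}

# Words as space-separated characters, mapped to corpus frequencies
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges = []
for _ in range(3):
    counts = pair_counts(corpus)
    best = max(counts, key=counts.get)   # most frequent adjacent pair
    merges.append(best)
    corpus = merge(best, corpus)

print(merges)   # [('e', 's'), ('es', 't'), ('l', 'o')]
print(corpus)   # 'est' has become a single subword symbol
```

Run for thousands of merges on a large corpus, this produces exactly the behaviour described: frequent words end up as single tokens, while rare words decompose into learned subword pieces.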
Different models, different tokenisers
Each AI model family uses its own tokeniser with its own vocabulary:
- GPT models: Use OpenAI's tiktoken encodings — GPT-4's cl100k_base encoding has a vocabulary of roughly 100,000 tokens, and newer models use the larger o200k_base
- Claude: Uses Anthropic's own proprietary tokeniser
- Llama: Llama 2 uses a SentencePiece tokeniser with a ~32,000 token vocabulary; Llama 3 moved to a BPE tokeniser with a vocabulary of roughly 128,000 tokens
This means the same text produces different token counts depending on the model. A sentence that is 10 tokens in GPT-4 might be 12 tokens in Llama. This affects context window usage and pricing.
Why tokenisation matters for your work
Understanding tokenisation has practical implications:
- Cost estimation: Since AI APIs charge per token, knowing how text translates to tokens helps you estimate costs. Roughly: 750 words ≈ 1,000 tokens.
- Context window management: Your prompt, the conversation history, and the AI's response all consume tokens from the context window. Efficient tokenisation means more room for content.
- Language considerations: Non-English languages often use more tokens per word than English. Chinese, Japanese, and Korean text can use 2-3x more tokens for the same content, affecting both cost and context limits.
- Code and numbers: Code, mathematical expressions, and structured data (JSON, XML) can be tokenised less efficiently than prose, consuming more tokens than you might expect.
- Prompt optimisation: Concise prompts use fewer tokens. "Summarise the key points" uses fewer tokens than "Please go ahead and provide me with a comprehensive summary of the key and important points."
Token counting tools
Most AI platforms provide tools to count tokens before sending prompts:
- OpenAI's tiktoken library (Python/JavaScript)
- Anthropic's token counting API
- Online token counters for quick estimates
For rough estimation without tools, remember: 1 token ≈ 4 characters in English, or about 0.75 words.
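That rule of thumb is easy to turn into a helper. This sketch simply averages the two heuristics stated above (4 characters per token, 0.75 words per token); it is a rough guide for English prose only, not a substitute for a real token counter:

```python
def estimate_tokens(text: str) -> int:
    """Rough English-text estimate: ~4 characters or ~0.75 words per token."""
    if not text:
        return 0
    by_chars = len(text) / 4
    by_words = len(text.split()) / 0.75
    return round((by_chars + by_words) / 2)  # average the two heuristics
```

Expect it to undercount for code, non-English text, and heavy punctuation, all of which tokenise less efficiently.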
Tokenisation quirks
Tokenisers can produce unexpected results:
- Numbers can be split into inconsistent chunks (for example, "12345" might tokenise as "123" + "45"), which makes simple arithmetic harder for the model
- Punctuation and special characters consume their own tokens
- URLs and technical notation can be surprisingly token-heavy
- Trailing whitespace and formatting characters all count as tokens
Why This Matters
Tokenisation is the hidden mechanism that determines how much your AI interactions cost, how much context you can include in a conversation, and why AI sometimes behaves unexpectedly with numbers or non-English text. Understanding tokenisation helps you estimate costs accurately, optimise prompt length, and diagnose issues when AI behaviour seems inconsistent. It is the practical knowledge that turns AI budgeting from guesswork into informed planning.
Continue learning in Foundations
This topic is covered in our lesson: How Large Language Models Actually Work