
Tokenizer (Tokeniser)

Last reviewed: April 2026

The component that converts text into tokens — the numerical units an AI model processes. Different models use different tokenisers, which affects how they handle text.

A tokeniser (also spelled tokenizer) is the software component that converts human-readable text into the numerical tokens that an AI model can process. It sits between you and the model — transforming your words into numbers on the way in, and numbers back into words on the way out. Every AI interaction begins and ends with the tokeniser.

Why tokenisers exist

AI models do not understand text. They process numbers. The tokeniser's job is to create a bridge between human language and mathematical computation:

  1. Text to tokens (encoding): Your prompt "How do I improve my marketing?" is converted into a sequence of numerical token IDs — perhaps [2437, 466, 314, 4916, 616, 8661, 30]
  2. Tokens to text (decoding): The model's numerical output is converted back into readable text
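The round trip above can be sketched with a toy vocabulary. The pieces and IDs below are invented for illustration (reusing the example IDs from step 1); real tokenisers use learned subword vocabularies tens of thousands of entries large:

```python
# Toy illustration of the encode/decode round trip.
# This vocabulary and its IDs are invented for illustration only.
VOCAB = {"How": 2437, " do": 466, " I": 314, " improve": 4916,
         " my": 616, " marketing": 8661, "?": 30}
ID_TO_PIECE = {i: piece for piece, i in VOCAB.items()}

def encode(text):
    """Greedy longest-match encoding against the toy vocabulary."""
    ids, i = [], 0
    while i < len(text):
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece, i):
                ids.append(VOCAB[piece])
                i += len(piece)
                break
        else:
            raise ValueError(f"no token covers position {i}")
    return ids

def decode(ids):
    """Map token IDs back to text pieces and join them."""
    return "".join(ID_TO_PIECE[i] for i in ids)

ids = encode("How do I improve my marketing?")
print(ids)          # [2437, 466, 314, 4916, 616, 8661, 30]
print(decode(ids))  # How do I improve my marketing?
```

Note that spaces belong to the tokens themselves (" do", " my"), which is how real subword tokenisers typically handle word boundaries.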

How tokenisers work

Modern tokenisers use a technique called subword tokenisation. Rather than treating each word as a token (too many unique words) or each character as a token (too many tokens per sentence), they break text into frequently occurring subword units:

  • Common words become single tokens: "the" → [1]
  • Less common words are split into subwords: "tokenisation" → ["token", "isation"]
  • Rare words are broken into smaller pieces: "pneumonia" → ["pn", "eum", "onia"]

The tokeniser learns these splits from a large text corpus, identifying which subword units appear most frequently. This creates a vocabulary — typically 32,000 to 100,000 tokens — that efficiently represents any text.
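The learning step can be sketched in the style of byte-pair encoding (BPE): repeatedly find the most frequent adjacent pair of symbols in the corpus and merge it into a new vocabulary entry. The tiny corpus below is invented, and real training runs this loop tens of thousands of times:

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across a symbolised corpus."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Invented corpus: word -> frequency, each word starting as characters
corpus = {tuple("token"): 10, tuple("tokens"): 6, tuple("take"): 3}
for _ in range(3):
    corpus = merge(corpus, most_frequent_pair(corpus))
print(corpus)  # frequent fragments like "toke" emerge as single symbols
```

After three merges, "token" collapses to ("toke", "n") while the rarer "take" remains mostly character-level, mirroring the common/rare split in the bullets above.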

Different models, different tokenisers

Each AI model family uses its own tokeniser with its own vocabulary:

  • GPT models: Use OpenAI's tiktoken encodings — roughly a 100,000-token vocabulary for GPT-4, with newer models using a larger (~200,000-token) encoding
  • Claude: Uses Anthropic's own proprietary tokeniser, whose details are not publicly documented
  • Llama: Llama 2 uses a SentencePiece tokeniser with a ~32,000-token vocabulary; Llama 3 moved to a much larger (~128,000-token) vocabulary

This means the same text produces different token counts depending on the model. A sentence that is 10 tokens in GPT-4 might be 12 tokens in Llama. This affects context window usage and pricing.
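The effect can be illustrated by running greedy longest-match against two toy vocabularies (both invented; in practice the differences come from each model's learned vocabulary, not hand-chosen pieces):

```python
def greedy_encode(text, vocab):
    """Greedy longest-match split of `text` into pieces from `vocab`."""
    pieces, i = [], 0
    while i < len(text):
        for piece in sorted(vocab, key=len, reverse=True):
            if text.startswith(piece, i):
                pieces.append(piece)
                i += len(piece)
                break
        else:  # fall back to a single character
            pieces.append(text[i])
            i += 1
    return pieces

# Two invented vocabularies covering the same word differently
vocab_a = {"token", "isation"}
vocab_b = {"tok", "en", "is", "ation"}

word = "tokenisation"
print(greedy_encode(word, vocab_a))  # ['token', 'isation']      -> 2 tokens
print(greedy_encode(word, vocab_b))  # ['tok', 'en', 'is', 'ation'] -> 4 tokens
```

Same text, double the token count — which is exactly why the same prompt can cost different amounts on different models.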

Why tokenisation matters for your work

Understanding tokenisation has practical implications:

  • Cost estimation: Since AI APIs charge per token, knowing how text translates to tokens helps you estimate costs. Roughly: 750 words ≈ 1,000 tokens.
  • Context window management: Your prompt, the conversation history, and the AI's response all consume tokens from the context window. Efficient tokenisation means more room for content.
  • Language considerations: Non-English languages often use more tokens per word than English. Chinese, Japanese, and Korean text can use 2-3x more tokens for the same content, affecting both cost and context limits.
  • Code and numbers: Code, mathematical expressions, and structured data (JSON, XML) can be tokenised less efficiently than prose, consuming more tokens than you might expect.
  • Prompt optimisation: Concise prompts use fewer tokens. "Summarise the key points" uses fewer tokens than "Please go ahead and provide me with a comprehensive summary of the key and important points."
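The cost and context-window points above can be combined into a rough budgeting helper using the ~0.75-words-per-token heuristic. The default window size and price below are hypothetical placeholders, not any provider's real figures:

```python
WORDS_PER_TOKEN = 0.75  # rough heuristic for English prose

def estimate_tokens(word_count):
    """Convert a word count to an approximate token count."""
    return round(word_count / WORDS_PER_TOKEN)

def budget(prompt_words, history_words, reserved_output_tokens,
           context_window=128_000, price_per_million=3.00):
    """Rough fit check and input-cost estimate.
    The window size and price are placeholders -- check your
    provider's actual limits and rates."""
    input_tokens = estimate_tokens(prompt_words + history_words)
    used = input_tokens + reserved_output_tokens
    cost = input_tokens / 1_000_000 * price_per_million
    return {"input_tokens": input_tokens,
            "fits": used <= context_window,
            "estimated_input_cost": round(cost, 4)}

# e.g. a 1,500-word prompt on top of 6,000 words of history,
# reserving 4,000 tokens for the response
print(budget(prompt_words=1_500, history_words=6_000,
             reserved_output_tokens=4_000))
```

Note that the reserved output tokens count against the window but not the input cost, since providers price input and output separately.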

Token counting tools

Most AI platforms provide tools to count tokens before sending prompts:

  • OpenAI's tiktoken library (Python/JavaScript)
  • Anthropic's token counting API
  • Online token counters for quick estimates

For rough estimation without tools, remember: 1 token ≈ 4 characters in English, or about 0.75 words.
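Those two rules of thumb can be turned into a quick estimator by averaging the character-based and word-based guesses (a heuristic only; exact counts require the model's actual tokeniser):

```python
def rough_token_estimate(text):
    """Average the two rule-of-thumb estimates.
    Heuristic only -- use the model's real tokeniser for exact counts."""
    by_chars = len(text) / 4             # ~4 characters per token
    by_words = len(text.split()) / 0.75  # ~0.75 words per token
    return round((by_chars + by_words) / 2)

print(rough_token_estimate("How do I improve my marketing?"))
```

Expect the estimate to drift for code, URLs, or non-English text, where both heuristics break down.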

Tokenisation quirks

Tokenisers can produce unexpected results:

  • Numbers can be split into arbitrary chunks (e.g. "12345" may become "123" + "45"), which makes digit-level arithmetic harder for the model
  • Punctuation and special characters often consume their own tokens
  • URLs and technical notation can be surprisingly token-heavy
  • Trailing whitespace and formatting characters all count as tokens

Why This Matters

Tokenisation is the hidden mechanism that determines how much your AI interactions cost, how much context you can include in a conversation, and why AI sometimes behaves unexpectedly with numbers or non-English text. Understanding tokenisation helps you estimate costs accurately, optimise prompt length, and diagnose issues when AI behaviour seems inconsistent. It is the practical knowledge that turns AI budgeting from guesswork into informed planning.
