
Token Budget

Last reviewed: April 2026

The maximum number of tokens allocated for an AI interaction, encompassing both the input (prompt and context) and the output (generated response).

A token budget is the total number of tokens available for an AI interaction β€” covering both the tokens you send in (your prompt, context, and conversation history) and the tokens the model generates in response. Managing your token budget effectively is key to controlling costs and getting the best results.

How token budgets work

Every AI model has a maximum context window β€” say 128,000 tokens. This entire window must accommodate your input and the model's output. If your input uses 120,000 tokens, only 8,000 tokens remain for the response. If you set a maximum output length of 4,000 tokens, your input budget is 124,000 tokens.
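The arithmetic above can be sketched in a few lines. This is a minimal illustration using the example's 128,000-token window; the function name is hypothetical, not a provider API:

```python
# Hypothetical helper: tokens left for input once output is reserved.
CONTEXT_WINDOW = 128_000  # example window size from the text

def input_budget(context_window: int, max_output_tokens: int) -> int:
    """Return the tokens available for prompt + context."""
    return context_window - max_output_tokens

print(input_budget(CONTEXT_WINDOW, 4_000))  # 124000 tokens for input
```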

AI providers charge per token β€” both input tokens and output tokens (with output tokens typically costing more). Your token budget therefore directly determines the cost of each interaction.

Components of the token budget

  • System prompt: Instructions that define the model's behaviour. Can range from a few hundred to several thousand tokens.
  • Conversation history: Previous messages in a multi-turn conversation. Grows with each turn.
  • Retrieved context: Documents provided via RAG for grounding. Often the largest component.
  • User message: The current query or instruction.
  • Model response: The generated output, bounded by the max_tokens setting.
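To see how these components add up against the context window, here is a sketch with assumed token counts (the numbers are illustrative, not measurements):

```python
# Assumed token counts for each budget component.
components = {
    "system_prompt": 800,
    "conversation_history": 3_500,
    "retrieved_context": 12_000,  # often the largest component
    "user_message": 150,
}
max_output = 2_000  # reserved for the model response

input_tokens = sum(components.values())
total = input_tokens + max_output

# The whole interaction must fit inside the context window.
assert total <= 128_000
print(input_tokens, total)  # 16450 18450
```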

Managing token budgets

  • Prompt optimization: Write concise system prompts. Remove unnecessary instructions and examples.
  • Context selection: Only retrieve and include the most relevant documents, not everything potentially related.
  • Conversation management: Summarize or trim conversation history when it grows too long.
  • Output limits: Set appropriate max_tokens to prevent unnecessarily long responses.
  • Model selection: Use smaller, cheaper models for simple tasks and reserve larger models for complex ones.
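One of these tactics, trimming conversation history, can be sketched as a simple loop that drops the oldest turns until the history fits. This is a hypothetical helper, assuming the caller supplies a tokenizer function; real applications often summarize dropped turns instead of discarding them:

```python
def trim_history(messages, max_tokens, count_tokens):
    """Drop the oldest non-system messages until the history fits max_tokens.

    messages: list of {"role": ..., "content": ...} dicts
    count_tokens: caller-supplied function mapping text to a token count
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Remove oldest turns first; always keep the system prompt.
    while rest and sum(count_tokens(m["content"]) for m in system + rest) > max_tokens:
        rest.pop(0)
    return system + rest
```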

Token budget and cost

Understanding token budgets is essential for cost management. A workflow that processes 100 documents at 10,000 tokens each, with a 2,000-token response for each, uses 1.2 million tokens. At typical API prices, this could cost anywhere from a few cents to several dollars depending on the model.
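The cost calculation above can be made concrete. The per-million-token prices below are assumed for illustration only; actual prices vary widely by provider and model:

```python
docs = 100
input_per_doc = 10_000   # tokens of input per document
output_per_doc = 2_000   # tokens of response per document

total_input = docs * input_per_doc    # 1,000,000 tokens
total_output = docs * output_per_doc  # 200,000 tokens (1.2M total)

# Assumed prices in USD per 1M tokens; output priced higher than input.
price_in, price_out = 0.50, 1.50
cost = total_input / 1e6 * price_in + total_output / 1e6 * price_out
print(f"${cost:.2f}")  # $0.80 at these assumed prices
```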

Common mistakes

Teams often over-stuff context with marginally relevant information, eating into the response budget and increasing cost without improving quality. Others set output limits too high, paying for verbose responses when concise ones would suffice.


Why this matters

Token budgets directly determine the cost and effectiveness of AI interactions. Understanding how to manage them helps you build AI applications that are both high-quality and economically sustainable, avoiding common pitfalls that waste money on unnecessary tokens.
