AI Latency Budget
The maximum acceptable response time for an AI-powered feature, broken down across its components (model inference, retrieval, network, and processing) to guide performance optimisation.
An AI latency budget is the maximum acceptable response time for an AI-powered feature, broken down into its constituent components. Like a financial budget allocates spending across categories, a latency budget allocates acceptable delay across the pipeline stages that contribute to total response time.
Why latency budgets matter
Users have specific expectations for response times:
- Conversational AI: Under 1 second for the first token; 2-5 seconds for a complete response
- Search and retrieval: Under 500 milliseconds
- Document processing: Seconds to minutes, depending on document length
- Background analysis: Minutes to hours, depending on complexity
If total latency exceeds user expectations, the feature feels broken regardless of how good the output is. A latency budget ensures every component stays within bounds.
Breaking down the budget
A typical AI-powered feature has several latency components:
- Network latency (50-200ms): Round-trip time between your server and the AI provider
- Input processing (10-100ms): Tokenisation, prompt assembly, context preparation
- Retrieval (50-500ms): Vector search, database queries, knowledge base lookup (for RAG systems)
- Model inference (200-5000ms): The actual AI model processing, typically the largest component
- Output processing (10-50ms): Parsing, formatting, safety filtering
- Application logic (10-100ms): Business logic, state management, response assembly
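As an illustration, the components above can be written down as explicit allocations and checked against the overall budget. The figures below are assumptions for a hypothetical 2-second conversational feature, not recommendations:

```python
# Illustrative per-component allocations, in milliseconds (assumed figures).
TOTAL_BUDGET_MS = 2000

budget_ms = {
    "network": 150,
    "input_processing": 50,
    "retrieval": 300,
    "model_inference": 1400,
    "output_processing": 40,
    "application_logic": 60,
}

def check_budget(budget: dict[str, int], total_ms: int) -> bool:
    """Return True if the component allocations fit within the total budget."""
    allocated = sum(budget.values())
    if allocated > total_ms:
        print(f"Over budget: {allocated}ms allocated vs {total_ms}ms total")
        return False
    print(f"{allocated}ms allocated, {total_ms - allocated}ms headroom")
    return True

check_budget(budget_ms, TOTAL_BUDGET_MS)
```

Keeping the allocation explicit like this makes trade-offs visible: if retrieval needs 500ms, something else must give it up.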
Optimising each component
- Network: Choose AI providers with edge deployments close to your users. Use streaming to show tokens as they are generated.
- Retrieval: Use efficient vector indices (HNSW), cache frequent queries, pre-compute embeddings.
- Inference: Choose smaller models for simple tasks, use prompt caching, enable speculative decoding if available.
- Processing: Optimise code, cache intermediate results, parallelise independent steps.
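Caching is one of the cheapest wins on this list. A minimal sketch using Python's `functools.lru_cache` to memoise a hypothetical `embed` call, so repeated queries skip the expensive step (the embedding function here is a stand-in, not a real API):

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def embed(query: str) -> tuple[float, ...]:
    """Hypothetical embedding call; a real system would hit a model or service."""
    # Stand-in computation so the sketch is self-contained.
    return tuple(float(ord(c)) for c in query[:8])

embed("refund policy")  # computed on first call
embed("refund policy")  # served from the in-process cache
```

The same pattern applies to prompt prefixes and frequent retrieval queries; the cache key just needs to capture everything that affects the result.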
Streaming and perceived latency
Streaming responses, which show text as it is generated rather than waiting for the complete response, dramatically improve perceived latency even when total latency is unchanged. A 5-second response that starts appearing after 500 milliseconds feels much faster than a 3-second response with a blank screen.
Most modern AI APIs support streaming, and implementing it is one of the highest-impact latency optimisations.
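The effect is easy to measure: record time-to-first-token (TTFT) separately from total latency. A minimal sketch, with a fake token generator standing in for a real streaming API:

```python
import time
from typing import Iterator

def fake_token_stream(tokens: list[str], delay_s: float = 0.01) -> Iterator[str]:
    """Stand-in for a streaming model API: yields tokens with a small delay."""
    for token in tokens:
        time.sleep(delay_s)
        yield token

def consume_with_ttft(stream: Iterator[str]) -> tuple[str, float]:
    """Collect a streamed response and record time-to-first-token."""
    start = time.monotonic()
    first_token_at = 0.0
    parts: list[str] = []
    for token in stream:
        if not parts:
            first_token_at = time.monotonic() - start
        parts.append(token)
    return "".join(parts), first_token_at

text, ttft = consume_with_ttft(fake_token_stream(["Hello", ", ", "world"]))
```

TTFT, not total duration, is what governs the "blank screen" feeling, so it deserves its own budget line and its own alert threshold.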
Latency budgets for multi-step agents
AI agents that use tools and make multiple model calls face compounding latency:
- Step 1: Planning call (2 seconds)
- Step 2: Tool execution (1 second)
- Step 3: Response generation (2 seconds)
- Total: 5 seconds minimum
For agents, latency budgets must account for the number of expected steps and include strategies for keeping users informed during processing (progress indicators, intermediate updates).
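The compounding above is plain arithmetic: sequential steps set a hard floor on total latency. A sketch using the illustrative step timings from the example:

```python
# Illustrative step latencies for a multi-step agent (assumed figures, seconds).
steps_s = {"planning": 2.0, "tool_execution": 1.0, "response_generation": 2.0}

def agent_floor_latency(steps: dict[str, float]) -> float:
    """Sequential steps compound: the floor is the sum of per-step latencies."""
    return sum(steps.values())

print(agent_floor_latency(steps_s))  # 5.0 seconds minimum
```

If any steps are independent, running them concurrently replaces their sum with their maximum, which is often the single biggest lever for agent latency.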
Monitoring and alerting
Track actual latency against your budget in production:
- Set alerts when p50 (median) or p95 (95th percentile) latency exceeds budget thresholds
- Break down latency by component to identify bottlenecks
- Track latency trends over time to catch gradual degradation
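A minimal monitoring sketch, computing p50 and p95 from recorded samples with Python's `statistics` module and comparing them against assumed budget thresholds:

```python
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Compute the median (p50) and 95th percentile (p95) of latency samples."""
    cut_points = statistics.quantiles(samples_ms, n=100)
    return {"p50": statistics.median(samples_ms), "p95": cut_points[94]}

def budget_alerts(samples_ms: list[float],
                  p50_budget_ms: float,
                  p95_budget_ms: float) -> list[str]:
    """Return a human-readable alert for each percentile that exceeds budget."""
    p = latency_percentiles(samples_ms)
    alerts = []
    if p["p50"] > p50_budget_ms:
        alerts.append(f"p50 {p['p50']:.0f}ms exceeds budget {p50_budget_ms:.0f}ms")
    if p["p95"] > p95_budget_ms:
        alerts.append(f"p95 {p['p95']:.0f}ms exceeds budget {p95_budget_ms:.0f}ms")
    return alerts
```

In production the samples would come from tracing or metrics infrastructure rather than an in-memory list, but the budget comparison is the same.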
Why this matters
Response time directly affects user adoption and satisfaction with AI features. A latency budget gives you a structured approach to performance optimisation, ensuring that every component of your AI pipeline contributes to a responsive user experience.