AI Latency Budget
The maximum acceptable response time for an AI-powered feature, broken down across its components (model inference, retrieval, network, and processing) to guide performance optimisation.
An AI latency budget is the maximum acceptable response time for an AI-powered feature, broken down into its constituent components. Like a financial budget allocates spending across categories, a latency budget allocates acceptable delay across the pipeline stages that contribute to total response time.
Why latency budgets matter
Users have specific expectations for response times:
- Conversational AI: Under 1 second for the first token; 2-5 seconds for a complete response
- Search and retrieval: Under 500 milliseconds
- Document processing: Seconds to minutes, depending on document length
- Background analysis: Minutes to hours, depending on complexity
If total latency exceeds user expectations, the feature feels broken regardless of how good the output is. A latency budget ensures every component stays within bounds.
Breaking down the budget
A typical AI-powered feature has several latency components:
- Network latency (50-200ms): Round-trip time between your server and the AI provider
- Input processing (10-100ms): Tokenisation, prompt assembly, context preparation
- Retrieval (50-500ms): Vector search, database queries, knowledge base lookup (for RAG systems)
- Model inference (200-5000ms): The actual AI model processing, typically the largest component
- Output processing (10-50ms): Parsing, formatting, safety filtering
- Application logic (10-100ms): Business logic, state management, response assembly
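As an illustration, the components above can be written down as explicit allocations and checked against the overall budget. The figures below are assumptions for a hypothetical 2-second conversational feature, not recommendations:

```python
# Illustrative per-component allocations, in milliseconds (assumed figures).
TOTAL_BUDGET_MS = 2000

budget_ms = {
    "network": 150,
    "input_processing": 50,
    "retrieval": 300,
    "model_inference": 1400,
    "output_processing": 40,
    "application_logic": 60,
}

def check_budget(budget: dict[str, int], total_ms: int) -> bool:
    """Return True if the component allocations fit within the total budget."""
    allocated = sum(budget.values())
    if allocated > total_ms:
        print(f"Over budget: {allocated}ms allocated vs {total_ms}ms total")
        return False
    print(f"{allocated}ms allocated, {total_ms - allocated}ms headroom")
    return True

check_budget(budget_ms, TOTAL_BUDGET_MS)
```

Keeping the allocation explicit like this makes trade-offs visible: if retrieval needs 500ms, something else must give it up.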
Optimising each component
- Network: Choose AI providers with edge deployments close to your users. Use streaming to show tokens as they are generated.
- Retrieval: Use efficient vector indices (HNSW), cache frequent queries, pre-compute embeddings.
- Inference: Choose smaller models for simple tasks, use prompt caching, enable speculative decoding if available.
- Processing: Optimise code, cache intermediate results, parallelise independent steps.
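Caching is one of the cheapest wins on this list. A minimal sketch using Python's `functools.lru_cache` to memoise a hypothetical `embed` call, so repeated queries skip the expensive step (the embedding function here is a stand-in, not a real API):

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def embed(query: str) -> tuple[float, ...]:
    """Hypothetical embedding call; a real system would hit a model or service."""
    # Stand-in computation so the sketch is self-contained.
    return tuple(float(ord(c)) for c in query[:8])

embed("refund policy")  # computed on first call
embed("refund policy")  # served from the in-process cache
```

The same pattern applies to prompt prefixes and frequent retrieval queries; the cache key just needs to capture everything that affects the result.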
Streaming and perceived latency
Streaming responses, which show text as it is generated rather than waiting for the complete response, dramatically improve perceived latency even when total latency is unchanged. A 5-second response that starts appearing after 500 milliseconds feels much faster than a 3-second response with a blank screen.
Most modern AI APIs support streaming, and implementing it is one of the highest-impact latency optimisations.
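The effect is easy to measure: record time-to-first-token (TTFT) separately from total latency. A minimal sketch, with a fake token generator standing in for a real streaming API:

```python
import time
from typing import Iterator

def fake_token_stream(tokens: list[str], delay_s: float = 0.01) -> Iterator[str]:
    """Stand-in for a streaming model API: yields tokens with a small delay."""
    for token in tokens:
        time.sleep(delay_s)
        yield token

def consume_with_ttft(stream: Iterator[str]) -> tuple[str, float]:
    """Collect a streamed response and record time-to-first-token."""
    start = time.monotonic()
    first_token_at = 0.0
    parts: list[str] = []
    for token in stream:
        if not parts:
            first_token_at = time.monotonic() - start
        parts.append(token)
    return "".join(parts), first_token_at

text, ttft = consume_with_ttft(fake_token_stream(["Hello", ", ", "world"]))
```

TTFT, not total duration, is what governs the "blank screen" feeling, so it deserves its own budget line and its own alert threshold.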
Latency budgets for multi-step agents
AI agents that use tools and make multiple model calls face compounding latency:
- Step 1: Planning call (2 seconds)
- Step 2: Tool execution (1 second)
- Step 3: Response generation (2 seconds)
- Total: 5 seconds minimum
For agents, latency budgets must account for the number of expected steps and include strategies for keeping users informed during processing (progress indicators, intermediate updates).
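The compounding above is plain arithmetic: sequential steps set a hard floor on total latency. A sketch using the illustrative step timings from the example:

```python
# Illustrative step latencies for a multi-step agent (assumed figures, seconds).
steps_s = {"planning": 2.0, "tool_execution": 1.0, "response_generation": 2.0}

def agent_floor_latency(steps: dict[str, float]) -> float:
    """Sequential steps compound: the floor is the sum of per-step latencies."""
    return sum(steps.values())

print(agent_floor_latency(steps_s))  # 5.0 seconds minimum
```

If any steps are independent, running them concurrently replaces their sum with their maximum, which is often the single biggest lever for agent latency.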
Monitoring and alerting
Track actual latency against your budget in production:
- Set alerts when p50 (median) or p95 (95th percentile) latency exceeds budget thresholds
- Break down latency by component to identify bottlenecks
- Track latency trends over time to catch gradual degradation
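A minimal monitoring sketch, computing p50 and p95 from recorded samples with Python's `statistics` module and comparing them against assumed budget thresholds:

```python
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Compute the median (p50) and 95th percentile (p95) of latency samples."""
    cut_points = statistics.quantiles(samples_ms, n=100)
    return {"p50": statistics.median(samples_ms), "p95": cut_points[94]}

def budget_alerts(samples_ms: list[float],
                  p50_budget_ms: float,
                  p95_budget_ms: float) -> list[str]:
    """Return a human-readable alert for each percentile that exceeds budget."""
    p = latency_percentiles(samples_ms)
    alerts = []
    if p["p50"] > p50_budget_ms:
        alerts.append(f"p50 {p['p50']:.0f}ms exceeds budget {p50_budget_ms:.0f}ms")
    if p["p95"] > p95_budget_ms:
        alerts.append(f"p95 {p['p95']:.0f}ms exceeds budget {p95_budget_ms:.0f}ms")
    return alerts
```

In production the samples would come from tracing or metrics infrastructure rather than an in-memory list, but the budget comparison is the same.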
Why this matters
Response time directly affects user adoption and satisfaction with AI features. A latency budget gives you a structured approach to performance optimisation, ensuring that every component of your AI pipeline contributes to a responsive user experience.