Latency
The time delay between sending a request to an AI model and receiving the first part of the response. Lower latency means faster, more responsive AI interactions.
Latency in AI refers to the time delay between sending a prompt and receiving a response — specifically, the time until the first token of the response appears (called time to first token, or TTFT). In everyday terms, it is how long you wait before the AI starts responding.
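Measured concretely, TTFT is just the wall-clock time from issuing a request until the first streamed token arrives. A minimal Python sketch, using a hypothetical `fake_model_stream` generator as a stand-in for a real streaming API call:

```python
import time

def time_to_first_token(stream):
    """Measure TTFT: elapsed time until the stream yields its first token."""
    start = time.perf_counter()
    first = next(stream)  # blocks until the first token arrives
    ttft = time.perf_counter() - start
    return first, ttft

# Hypothetical stand-in for a streaming model response.
def fake_model_stream(delay_s=0.05):
    time.sleep(delay_s)   # simulates network, queueing, and prompt processing
    yield "Hello"
    yield ", world"

token, ttft = time_to_first_token(fake_model_stream())
print(f"first token {token!r} after {ttft * 1000:.0f} ms")
```

In production you would time a real provider's streaming iterator the same way; the pattern is identical regardless of API.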
Why latency matters
Latency directly affects user experience and productivity:
- Interactive conversations: High latency makes AI chat feel sluggish and frustrating. Sub-second TTFT feels responsive and natural.
- Real-time applications: AI features in live applications (autocomplete, search, chatbots) need very low latency to feel seamless. Users notice delays over 200-300 milliseconds.
- Batch processing: For automated workflows processing thousands of requests, latency per request compounds. Shaving 500ms per request across 10,000 requests saves roughly 83 minutes.
- Agentic workflows: AI agents that take multiple sequential steps are particularly sensitive to latency — each step's delay adds up across the entire chain.
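The batch-processing arithmetic above is easy to verify:

```python
# Savings from shaving 500 ms off each of 10,000 batched requests.
requests = 10_000
saved_per_request_s = 0.5
total_saved_s = requests * saved_per_request_s
print(f"{total_saved_s / 60:.0f} minutes saved")  # 5,000 s ≈ 83 minutes
```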
What causes latency
Several factors contribute to AI latency:
- Network time: The round-trip time for your request to reach the AI provider's servers and for the response to come back. Geographic distance matters.
- Queue time: If the provider's servers are busy, your request may wait in a queue before processing begins.
- Model size: Larger models with more parameters take longer to generate each token. A 7-billion-parameter model responds faster than a 400-billion-parameter one.
- Prompt length: Longer prompts take longer to process. The model must read and process every input token before generating output.
- Generation length: Longer responses take longer to complete (though streaming means you see the first tokens quickly).
- Infrastructure: The type and configuration of hardware (GPU model, memory, networking) affects processing speed.
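The contributions above can be combined into a rough back-of-envelope estimate of TTFT. All numbers below are illustrative assumptions, not provider benchmarks:

```python
def estimate_ttft_ms(network_rtt_ms: float, queue_ms: float,
                     prompt_tokens: int, prefill_tokens_per_s: float) -> float:
    """Rough TTFT model: network round trip + queueing + prompt processing."""
    prefill_ms = prompt_tokens / prefill_tokens_per_s * 1000
    return network_rtt_ms + queue_ms + prefill_ms

# e.g. 80 ms round trip, 20 ms queueing, 2,000-token prompt
# processed at an assumed 10,000 tokens/second
print(f"{estimate_ttft_ms(80, 20, 2000, 10_000):.0f} ms")
```

The model deliberately ignores factors such as batching and hardware variation; its point is to show how prompt length feeds directly into the delay before the first token.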
Latency vs throughput
These two metrics are related but different:
- Latency: How long until you get a response (measured in milliseconds or seconds)
- Throughput: How many requests the system can handle per second (measured in requests/second or tokens/second)
A system can have low latency (fast individual responses) but low throughput (cannot handle many simultaneous users), or vice versa. AI providers optimise for both.
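The distinction shows up in simple arithmetic: with parallelism, throughput is not just the inverse of latency. A sketch with assumed numbers:

```python
latency_s = 0.8           # each individual request takes 800 ms
concurrent_requests = 16  # server processes 16 requests in parallel

# Serial throughput would be 1 / latency; parallelism multiplies it
# without making any single request faster.
serial_rps = 1 / latency_s
parallel_rps = concurrent_requests / latency_s
print(f"serial: {serial_rps:.2f} req/s, parallel: {parallel_rps:.1f} req/s")
```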
Reducing latency in practice
If latency is a concern for your application:
- Choose smaller models for simple tasks: A fast model for classification, a powerful model for complex reasoning
- Use streaming: Most AI APIs support streaming, where tokens are sent as they are generated rather than waiting for the complete response. This reduces perceived latency.
- Optimise prompt length: Shorter prompts process faster. Remove unnecessary context.
- Use cached or pre-computed results: For common queries, cache AI responses instead of generating them each time
- Choose nearby regions: Use AI providers with servers close to your users
- Consider edge deployment: For extremely low-latency needs, run smaller models on local hardware
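Caching in particular is easy to add in Python with `functools.lru_cache`. A minimal sketch, where `model_call` is a hypothetical stand-in for a real AI API request:

```python
import functools
import time

calls = 0  # counts how many times the "model" is actually invoked

def model_call(prompt: str) -> str:
    """Hypothetical stand-in for a real AI API request."""
    global calls
    calls += 1
    time.sleep(0.05)       # simulated network + generation latency
    return prompt.upper()  # placeholder response

@functools.lru_cache(maxsize=1024)
def cached_model_call(prompt: str) -> str:
    return model_call(prompt)

cached_model_call("classify this ticket")  # first call hits the model
cached_model_call("classify this ticket")  # repeat is served from cache
print(calls)  # → 1
```

An in-process cache like this is lost on restart; production systems more often use a shared cache (e.g. Redis) keyed on a normalised prompt.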
Latency benchmarks
Typical latency ranges for major AI APIs (time to first token):
- Fast models: 100-300ms (smaller models, optimised for speed)
- Standard models: 300ms-1 second (general-purpose models)
- Large models: 1-3 seconds (frontier models with complex reasoning)
- Self-hosted models: Varies widely based on hardware
For most business applications, latency under 2 seconds is acceptable for interactive use. For real-time features, under 500ms is the target.
Why This Matters
Latency determines whether AI feels like a helpful assistant or a frustrating bottleneck. When evaluating AI tools and APIs for your organisation, latency should be a key consideration alongside accuracy and cost. A model that is 5% more accurate but takes 3x longer to respond may not be the right choice for interactive applications. Understanding latency trade-offs helps you match AI tools to the speed requirements of each use case.