Latency
The time delay between sending a request to an AI model and receiving the first part of the response. Lower latency means faster, more responsive AI interactions.
Latency in AI refers to the time delay between sending a prompt and receiving a response — specifically, the time until the first token of the response appears (called time to first token, or TTFT). In everyday terms, it is how long you wait before the AI starts responding.
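Measured concretely, TTFT is just the wall-clock time from issuing a request until the first streamed token arrives. A minimal Python sketch, using a hypothetical `fake_model_stream` generator as a stand-in for a real streaming API call:

```python
import time

def time_to_first_token(stream):
    """Measure TTFT: elapsed time until the stream yields its first token."""
    start = time.perf_counter()
    first = next(stream)  # blocks until the first token arrives
    ttft = time.perf_counter() - start
    return first, ttft

# Hypothetical stand-in for a streaming model response.
def fake_model_stream(delay_s=0.05):
    time.sleep(delay_s)   # simulates network, queueing, and prompt processing
    yield "Hello"
    yield ", world"

token, ttft = time_to_first_token(fake_model_stream())
print(f"first token {token!r} after {ttft * 1000:.0f} ms")
```

In production you would time a real provider's streaming iterator the same way; the pattern is identical regardless of API.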
Why latency matters
Latency directly affects user experience and productivity:
- Interactive conversations: High latency makes AI chat feel sluggish and frustrating. Sub-second TTFT feels responsive and natural.
- Real-time applications: AI features in live applications (autocomplete, search, chatbots) need very low latency to feel seamless. Users notice delays over 200-300 milliseconds.
- Batch processing: For automated workflows processing thousands of requests, latency per request compounds. Shaving 500ms per request across 10,000 requests saves roughly 83 minutes.
- Agentic workflows: AI agents that take multiple sequential steps are particularly sensitive to latency — each step's delay adds up across the entire chain.
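The batch-processing arithmetic above is easy to verify:

```python
# Savings from shaving 500 ms off each of 10,000 batched requests.
requests = 10_000
saved_per_request_s = 0.5
total_saved_s = requests * saved_per_request_s
print(f"{total_saved_s / 60:.0f} minutes saved")  # 5,000 s ≈ 83 minutes
```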
What causes latency
Several factors contribute to AI latency:
- Network time: The round-trip time for your request to reach the AI provider's servers and for the response to come back. Geographic distance matters.
- Queue time: If the provider's servers are busy, your request may wait in a queue before processing begins.
- Model size: Larger models with more parameters take longer to generate each token. A 7-billion-parameter model responds faster than a 400-billion-parameter one.
- Prompt length: Longer prompts take longer to process. The model must read and process every input token before generating output.
- Generation length: Longer responses take longer to complete (though streaming means you see the first tokens quickly).
- Infrastructure: The type and configuration of hardware (GPU model, memory, networking) affects processing speed.
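The contributions above can be combined into a rough back-of-envelope estimate of TTFT. All numbers below are illustrative assumptions, not provider benchmarks:

```python
def estimate_ttft_ms(network_rtt_ms: float, queue_ms: float,
                     prompt_tokens: int, prefill_tokens_per_s: float) -> float:
    """Rough TTFT model: network round trip + queueing + prompt processing."""
    prefill_ms = prompt_tokens / prefill_tokens_per_s * 1000
    return network_rtt_ms + queue_ms + prefill_ms

# e.g. 80 ms round trip, 20 ms queueing, 2,000-token prompt
# processed at an assumed 10,000 tokens/second
print(f"{estimate_ttft_ms(80, 20, 2000, 10_000):.0f} ms")
```

The model deliberately ignores factors such as batching and hardware variation; its point is to show how prompt length feeds directly into the delay before the first token.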
Latency vs throughput
These two metrics are related but different:
- Latency: How long until you get a response (measured in milliseconds or seconds)
- Throughput: How many requests the system can handle per second (measured in requests/second or tokens/second)
A system can have low latency (fast individual responses) but low throughput (cannot handle many simultaneous users), or vice versa. AI providers optimise for both.
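The distinction shows up in simple arithmetic: with parallelism, throughput is not just the inverse of latency. A sketch with assumed numbers:

```python
latency_s = 0.8           # each individual request takes 800 ms
concurrent_requests = 16  # server processes 16 requests in parallel

# Serial throughput would be 1 / latency; parallelism multiplies it
# without making any single request faster.
serial_rps = 1 / latency_s
parallel_rps = concurrent_requests / latency_s
print(f"serial: {serial_rps:.2f} req/s, parallel: {parallel_rps:.1f} req/s")
```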
Reducing latency in practice
If latency is a concern for your application:
- Choose smaller models for simple tasks: A fast model for classification, a powerful model for complex reasoning
- Use streaming: Most AI APIs support streaming, where tokens are sent as they are generated rather than waiting for the complete response. This reduces perceived latency.
- Optimise prompt length: Shorter prompts process faster. Remove unnecessary context.
- Use cached or pre-computed results: For common queries, cache AI responses instead of generating them each time
- Choose nearby regions: Use AI providers with servers close to your users
- Consider edge deployment: For extremely low-latency needs, run smaller models on local hardware
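Caching in particular is easy to add in Python with `functools.lru_cache`. A minimal sketch, where `model_call` is a hypothetical stand-in for a real AI API request:

```python
import functools
import time

calls = 0  # counts how many times the "model" is actually invoked

def model_call(prompt: str) -> str:
    """Hypothetical stand-in for a real AI API request."""
    global calls
    calls += 1
    time.sleep(0.05)       # simulated network + generation latency
    return prompt.upper()  # placeholder response

@functools.lru_cache(maxsize=1024)
def cached_model_call(prompt: str) -> str:
    return model_call(prompt)

cached_model_call("classify this ticket")  # first call hits the model
cached_model_call("classify this ticket")  # repeat is served from cache
print(calls)  # → 1
```

An in-process cache like this is lost on restart; production systems more often use a shared cache (e.g. Redis) keyed on a normalised prompt.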
Latency benchmarks
Typical latency ranges for major AI APIs (time to first token):
- Fast models: 100-300ms (smaller models, optimised for speed)
- Standard models: 300ms-1 second (general-purpose models)
- Large models: 1-3 seconds (frontier models with complex reasoning)
- Self-hosted models: Varies widely based on hardware
For most business applications, latency under 2 seconds is acceptable for interactive use. For real-time features, under 500ms is the target.
Why This Matters
Latency determines whether AI feels like a helpful assistant or a frustrating bottleneck. When evaluating AI tools and APIs for your organisation, latency should be a key consideration alongside accuracy and cost. A model that is 5% more accurate but takes 3x longer to respond may not be the right choice for interactive applications. Understanding latency trade-offs helps you match AI tools to the speed requirements of each use case.