Throughput
The volume of data an AI system can process in a given time period — typically measured in tokens per second or requests per minute. Higher throughput means more work done faster.
Throughput in AI refers to the volume of work a system can process in a given time period. It is typically measured in tokens per second (for a single request) or requests per minute (for a system). While latency asks "how fast is each individual response?", throughput asks "how much total work can the system handle?"
Why throughput matters
Throughput becomes critical when AI use scales beyond individual conversations:
- Batch processing: Processing 10,000 customer feedback entries, thousands of documents, or large datasets requires high throughput to finish in a reasonable time
- Production applications: An AI-powered chatbot serving 1,000 simultaneous users needs enough throughput to handle all conversations without degradation
- Data pipelines: AI integrated into automated workflows must process items as fast as they arrive
- Cost efficiency: Higher throughput often means lower cost per unit of work, as fixed infrastructure costs are spread across more requests
Throughput metrics
Throughput is measured differently depending on context:
- Tokens per second (TPS): How many tokens the model generates per second. A model producing 100 tokens per second generates a 500-word response (roughly 650 tokens, at about 1.3 tokens per English word) in roughly 7 seconds.
- Requests per minute (RPM): How many separate API requests the system can handle per minute. This is often a rate limit imposed by the AI provider.
- Tokens per minute (TPM): The total number of tokens (input + output) the system can process per minute. This is another common rate limit.
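These metrics are simple arithmetic over a measurement window. A minimal sketch, using entirely hypothetical traffic numbers, showing how RPM, TPM, and TPS relate:

```python
# Hypothetical measurements from one minute of traffic (illustrative only).
requests_completed = 120   # finished API calls in the window
input_tokens = 90_000      # prompt tokens across all requests
output_tokens = 60_000     # generated tokens across all requests
window_seconds = 60

# Requests per minute (RPM): completed requests, normalised to a 60 s window.
rpm = requests_completed * 60 / window_seconds

# Tokens per minute (TPM): input + output tokens, the usual rate-limit unit.
tpm = (input_tokens + output_tokens) * 60 / window_seconds

# Tokens per second (TPS): generation speed, counting output tokens only.
tps = output_tokens / window_seconds

print(rpm, tpm, tps)  # 120.0 150000.0 1000.0
```

Note that TPM counts both input and output tokens, so a prompt-heavy workload can hit a TPM limit long before it hits an RPM limit.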
Factors affecting throughput
- Hardware: More GPUs and faster GPUs mean higher throughput. Cloud AI providers scale hardware to meet demand.
- Model size: Smaller models generally have higher throughput than larger models. A 7B parameter model can generate tokens much faster than a 400B parameter model.
- Batching: Processing multiple requests simultaneously rather than one at a time dramatically increases system throughput. Most AI providers do this automatically.
- Quantisation: Running models at lower numerical precision (e.g., 4-bit instead of 16-bit) increases throughput with some quality trade-off.
- Provider capacity: Each AI provider allocates different throughput limits based on your pricing tier.
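The effect of batching can be seen with a toy cost model. This sketch assumes, hypothetically, that a batch of B requests takes only slightly longer than a single request because the GPU processes them together; the latency and overhead figures are invented for illustration:

```python
def system_throughput(batch_size, single_latency=1.0, batch_overhead=0.1):
    """Requests completed per second under a toy batching model.

    Assumes a batch of `batch_size` requests finishes in
    single_latency + batch_overhead * (batch_size - 1) seconds.
    All parameters here are illustrative, not measured.
    """
    batch_latency = single_latency + batch_overhead * (batch_size - 1)
    return batch_size / batch_latency

print(system_throughput(1))  # 1.0 request/s — one request at a time
print(system_throughput(8))  # ~4.7 requests/s — same hardware, batched
```

Even in this crude model, batching nearly quintuples system throughput on the same hardware, which is why providers batch automatically.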
Throughput and cost
Throughput directly impacts AI operating costs:
- Higher throughput per GPU means lower cost per token
- AI providers with better throughput optimisation can offer lower prices
- For batch workloads, maximising throughput minimises the time (and therefore cost) of infrastructure usage
- Rate limits on API plans often define throughput caps — exceeding them requires upgrading
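The throughput-to-cost relationship is straightforward arithmetic. A sketch with hypothetical GPU pricing and throughput figures (real numbers vary widely by hardware and model):

```python
# Illustrative cost model — all numbers are hypothetical.
gpu_cost_per_hour = 4.00   # $ per GPU-hour (assumed)
throughput_tps = 2_000     # tokens per second per GPU (assumed)

tokens_per_hour = throughput_tps * 3600
cost_per_million_tokens = gpu_cost_per_hour / tokens_per_hour * 1_000_000

print(round(cost_per_million_tokens, 3))  # 0.556 ($ per million tokens)
```

Doubling throughput on the same GPU halves the cost per token, which is why throughput optimisations like batching and quantisation translate directly into lower prices.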
Throughput tiers from AI providers
Most AI API providers offer different throughput tiers:
- Free tier: Limited RPM and TPM — suitable for experimentation
- Standard tier: Higher limits — suitable for production applications with moderate traffic
- Enterprise tier: Custom limits — suitable for high-volume applications
- Batch API: Some providers offer batch endpoints with higher throughput at lower cost, with the trade-off of higher latency (results delivered later, not in real time)
Optimising for throughput
If throughput is a bottleneck for your use case:
- Use batch APIs: When real-time responses are not needed, batch APIs offer higher throughput at lower cost
- Parallel requests: Send multiple requests simultaneously rather than sequentially (within rate limits)
- Right-size your model: Use the smallest model that meets your quality requirements
- Optimise output length: Request only the output you need — shorter responses mean more responses per second
- Cache common queries: Avoid reprocessing identical or similar requests
- Use multiple providers: Distribute workload across providers to aggregate throughput
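The parallel-requests point above can be sketched with Python's standard-library thread pool. The `call_model` function here is a hypothetical stand-in (a `time.sleep` simulating network and generation latency), not a real client; in practice you would cap `max_workers` to stay within your provider's rate limits:

```python
import concurrent.futures
import time

def call_model(prompt):
    """Hypothetical stand-in for an API call; replace with your real client."""
    time.sleep(0.1)  # simulate network + generation latency
    return f"response to: {prompt}"

prompts = [f"prompt {i}" for i in range(20)]

# Sequential: 20 calls x ~0.1 s each, roughly 2 s total.
start = time.time()
sequential = [call_model(p) for p in prompts]
seq_elapsed = time.time() - start

# Parallel: up to 10 in flight at once — keep this under your rate limit.
start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    parallel = list(pool.map(call_model, prompts))
par_elapsed = time.time() - start

print(f"sequential: {seq_elapsed:.1f}s, parallel: {par_elapsed:.1f}s")
```

Because each call spends most of its time waiting on I/O, threads overlap that waiting and total wall-clock time drops sharply, even though per-request latency is unchanged.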
Why this matters
Throughput determines the practical limits of AI at scale in your organisation. An AI tool that works beautifully for one person may become a bottleneck when deployed to a team of 100. Understanding throughput helps you plan for AI scaling, negotiate API contracts, choose the right pricing tier, and architect systems that can handle your actual workload. It is the difference between AI as a personal tool and AI as business infrastructure.