Throughput
The volume of data an AI system can process in a given time period — typically measured in tokens per second or requests per minute. Higher throughput means more work done faster.
Throughput in AI refers to the volume of work a system can process in a given time period. It is typically measured in tokens per second (for a single request) or requests per minute (for a system). While latency asks "how fast is each individual response?", throughput asks "how much total work can the system handle?"
Why throughput matters
Throughput becomes critical when AI use scales beyond individual conversations:
- Batch processing: Processing 10,000 customer feedback entries, thousands of documents, or large datasets requires high throughput to finish in a reasonable time
- Production applications: An AI-powered chatbot serving 1,000 simultaneous users needs enough throughput to handle all conversations without degradation
- Data pipelines: AI integrated into automated workflows must process items as fast as they arrive
- Cost efficiency: Higher throughput often means lower cost per unit of work, as fixed infrastructure costs are spread across more requests
Throughput metrics
Throughput is measured differently depending on context:
- Tokens per second (TPS): How many tokens the model generates per second. A model producing 100 tokens per second generates a 500-word response (roughly 650 tokens, at about 1.3 tokens per English word) in roughly 7 seconds.
- Requests per minute (RPM): How many separate API requests the system can handle per minute. This is often a rate limit imposed by the AI provider.
- Tokens per minute (TPM): The total number of tokens (input + output) the system can process per minute. This is another common rate limit.
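These metrics are simple arithmetic over a measurement window. A minimal sketch, using entirely hypothetical traffic numbers, showing how RPM, TPM, and TPS relate:

```python
# Hypothetical measurements from one minute of traffic (illustrative only).
requests_completed = 120   # finished API calls in the window
input_tokens = 90_000      # prompt tokens across all requests
output_tokens = 60_000     # generated tokens across all requests
window_seconds = 60

# Requests per minute (RPM): completed requests, normalised to a 60 s window.
rpm = requests_completed * 60 / window_seconds

# Tokens per minute (TPM): input + output tokens, the usual rate-limit unit.
tpm = (input_tokens + output_tokens) * 60 / window_seconds

# Tokens per second (TPS): generation speed, counting output tokens only.
tps = output_tokens / window_seconds

print(rpm, tpm, tps)  # 120.0 150000.0 1000.0
```

Note that TPM counts both input and output tokens, so a prompt-heavy workload can hit a TPM limit long before it hits an RPM limit.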
Factors affecting throughput
- Hardware: More GPUs and faster GPUs mean higher throughput. Cloud AI providers scale hardware to meet demand.
- Model size: Smaller models generally have higher throughput than larger models. A 7B parameter model can generate tokens much faster than a 400B parameter model.
- Batching: Processing multiple requests simultaneously rather than one at a time dramatically increases system throughput. Most AI providers do this automatically.
- Quantisation: Running models at lower numerical precision (e.g., 4-bit instead of 16-bit) increases throughput with some quality trade-off.
- Provider capacity: Each AI provider allocates different throughput limits based on your pricing tier.
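The effect of batching can be seen with a toy cost model. This sketch assumes, hypothetically, that a batch of B requests takes only slightly longer than a single request because the GPU processes them together; the latency and overhead figures are invented for illustration:

```python
def system_throughput(batch_size, single_latency=1.0, batch_overhead=0.1):
    """Requests completed per second under a toy batching model.

    Assumes a batch of `batch_size` requests finishes in
    single_latency + batch_overhead * (batch_size - 1) seconds.
    All parameters here are illustrative, not measured.
    """
    batch_latency = single_latency + batch_overhead * (batch_size - 1)
    return batch_size / batch_latency

print(system_throughput(1))  # 1.0 request/s — one request at a time
print(system_throughput(8))  # ~4.7 requests/s — same hardware, batched
```

Even in this crude model, batching nearly quintuples system throughput on the same hardware, which is why providers batch automatically.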
Throughput and cost
Throughput directly impacts AI operating costs:
- Higher throughput per GPU means lower cost per token
- AI providers with better throughput optimisation can offer lower prices
- For batch workloads, maximising throughput minimises the time (and therefore cost) of infrastructure usage
- Rate limits on API plans often define throughput caps — exceeding them requires upgrading
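The throughput-to-cost relationship is straightforward arithmetic. A sketch with hypothetical GPU pricing and throughput figures (real numbers vary widely by hardware and model):

```python
# Illustrative cost model — all numbers are hypothetical.
gpu_cost_per_hour = 4.00   # $ per GPU-hour (assumed)
throughput_tps = 2_000     # tokens per second per GPU (assumed)

tokens_per_hour = throughput_tps * 3600
cost_per_million_tokens = gpu_cost_per_hour / tokens_per_hour * 1_000_000

print(round(cost_per_million_tokens, 3))  # 0.556 ($ per million tokens)
```

Doubling throughput on the same GPU halves the cost per token, which is why throughput optimisations like batching and quantisation translate directly into lower prices.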
Throughput tiers from AI providers
Most AI API providers offer different throughput tiers:
- Free tier: Limited RPM and TPM — suitable for experimentation
- Standard tier: Higher limits — suitable for production applications with moderate traffic
- Enterprise tier: Custom limits — suitable for high-volume applications
- Batch API: Some providers offer batch endpoints with higher throughput at lower cost, with the trade-off of higher latency (results delivered later, not in real time)
Optimising for throughput
If throughput is a bottleneck for your use case:
- Use batch APIs: When real-time responses are not needed, batch APIs offer higher throughput at lower cost
- Parallel requests: Send multiple requests simultaneously rather than sequentially (within rate limits)
- Right-size your model: Use the smallest model that meets your quality requirements
- Optimise output length: Request only the output you need — shorter responses mean more responses per second
- Cache common queries: Avoid reprocessing identical or similar requests
- Use multiple providers: Distribute workload across providers to aggregate throughput
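The parallel-requests point above can be sketched with Python's standard-library thread pool. The `call_model` function here is a hypothetical stand-in (a `time.sleep` simulating network and generation latency), not a real client; in practice you would cap `max_workers` to stay within your provider's rate limits:

```python
import concurrent.futures
import time

def call_model(prompt):
    """Hypothetical stand-in for an API call; replace with your real client."""
    time.sleep(0.1)  # simulate network + generation latency
    return f"response to: {prompt}"

prompts = [f"prompt {i}" for i in range(20)]

# Sequential: 20 calls x ~0.1 s each, roughly 2 s total.
start = time.time()
sequential = [call_model(p) for p in prompts]
seq_elapsed = time.time() - start

# Parallel: up to 10 in flight at once — keep this under your rate limit.
start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    parallel = list(pool.map(call_model, prompts))
par_elapsed = time.time() - start

print(f"sequential: {seq_elapsed:.1f}s, parallel: {par_elapsed:.1f}s")
```

Because each call spends most of its time waiting on I/O, threads overlap that waiting and total wall-clock time drops sharply, even though per-request latency is unchanged.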
Why this matters
Throughput determines the practical limits of AI at scale in your organisation. An AI tool that works beautifully for one person may become a bottleneck when deployed to a team of 100. Understanding throughput helps you plan for AI scaling, negotiate API contracts, choose the right pricing tier, and architect systems that can handle your actual workload. It is the difference between AI as a personal tool and AI as business infrastructure.