
Rate Limiting

Last reviewed: April 2026

Controls imposed by AI providers that restrict how many requests you can make per minute or per day, preventing overuse and ensuring fair access.

Rate limiting is a control mechanism used by AI API providers to restrict the number of requests a user or application can make within a given time period. If you exceed your rate limit, your requests are rejected until the limit resets.

Why providers rate-limit

AI inference is computationally expensive. Every request consumes GPU time and memory. Without rate limits:

  • A single heavy user could monopolise shared resources, degrading performance for everyone.
  • A misconfigured application could send millions of requests accidentally, creating enormous bills.
  • Denial-of-service attacks could overwhelm the service.

Rate limits protect both the provider and the user.

Common rate limit types

  • Requests per minute (RPM): The number of API calls allowed per minute. A limit of 60 RPM means one request per second on average.
  • Tokens per minute (TPM): The total number of input and output tokens allowed per minute. This accounts for the fact that large requests consume more resources.
  • Requests per day (RPD): A daily cap on total requests.
  • Concurrent requests: The number of requests you can have in-flight simultaneously.
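As an illustration of the RPM arithmetic above, a client can derive a minimum spacing between requests from its RPM limit and pace itself to stay under it. This is a minimal sketch, not any provider's official client; the `Pacer` class and its names are hypothetical:

```python
import time

def min_interval(rpm: int) -> float:
    """Minimum seconds between requests to stay under an RPM limit."""
    return 60.0 / rpm

class Pacer:
    """Sleeps just long enough between calls to respect a requests-per-minute cap."""

    def __init__(self, rpm: int):
        self.interval = min_interval(rpm)
        self.last_call = 0.0

    def wait(self) -> None:
        """Block until enough time has passed since the previous call."""
        now = time.monotonic()
        elapsed = now - self.last_call
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self.last_call = time.monotonic()
```

A 60 RPM limit works out to `min_interval(60) == 1.0`, i.e. one request per second on average; client-side pacing like this smooths bursts but does not replace server-side handling of 429 responses.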

How rate limits are communicated

Providers typically include rate limit information in API response headers:

  • `x-ratelimit-limit`: Your maximum allowed requests.
  • `x-ratelimit-remaining`: How many requests you have left in the current window.
  • `x-ratelimit-reset`: When the current window resets, typically expressed as a Unix timestamp or as seconds remaining.

When you hit the limit, you receive an HTTP 429 "Too Many Requests" response.

Handling rate limits in your application

  • Exponential backoff: When you receive a 429 response, wait an increasing amount of time before retrying: first retry after 1 second, then 2, then 4, and so on. Adding random jitter to each delay prevents many clients from retrying in lockstep.
  • Request queuing: Buffer requests in a queue and release them at a controlled rate.
  • Batching: Combine multiple small operations into fewer, larger requests where possible.
  • Load distribution: Spread requests across multiple API keys or providers.
  • Caching: Store and reuse responses for identical or similar requests.
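The exponential-backoff strategy from the list above can be sketched as a small retry wrapper. `RateLimitError` is a hypothetical exception standing in for whatever your client library raises on a 429 response:

```python
import random
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for a client library's HTTP 429 exception."""

def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    """Retry `call` on RateLimitError with exponential backoff and jitter.

    Delays grow as base_delay * 2**attempt (1s, 2s, 4s, ...), plus a small
    random jitter so concurrent clients don't all retry at the same instant.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)
```

In practice you would wrap each API call, e.g. `with_backoff(lambda: client.create(...))`, and cap the total delay so a long outage fails fast instead of stalling the application.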

Rate limits by tier

Most providers offer different rate limits based on your plan and usage history. Free tiers have the lowest limits; enterprise tiers have the highest. Spending more and maintaining a good track record typically increases your limits over time.


Why This Matters

Rate limits are a practical constraint that every AI-powered application must design around. Ignoring them leads to dropped requests, poor user experiences, and unexpected failures under load. Understanding rate limits early prevents embarrassing production issues when your application scales.
