Prompt Caching
A feature that reuses previously processed prompt content across API calls, reducing latency and cost when the same system prompt or context is sent repeatedly.
Prompt caching is an API-level optimisation that stores processed prompt prefixes so they do not need to be recomputed on every request. When you send the same system prompt, instructions, or context across multiple API calls, the cached portion is processed once and reused — reducing both cost and response time.
How it works
Every time you call an AI API, the model processes your entire prompt from scratch — system prompt, conversation history, context documents, and your new message. For applications that use the same system prompt across thousands of requests, this is wasteful.
With prompt caching, the provider stores the processed representation of the static portion (typically the system prompt and any fixed context). Subsequent requests that share the same prefix skip the processing of the cached portion, paying only for the new content.
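The mechanism above can be sketched as a toy simulation: the provider keys the cache on the static prefix, so a second request with an identical prefix reuses the stored representation while the new content is still processed fresh. The `PrefixCache` class is purely illustrative, not any provider's actual implementation.

```python
import hashlib

class PrefixCache:
    """Toy model of provider-side prompt caching (illustration only)."""

    def __init__(self):
        self._store = {}  # prefix hash -> stored processed representation

    def process(self, static_prefix: str, dynamic_suffix: str) -> bool:
        """Handle one request; return True if the prefix was served from cache."""
        key = hashlib.sha256(static_prefix.encode()).hexdigest()
        hit = key in self._store
        if not hit:
            # Cold start: the static prefix is processed once and stored.
            self._store[key] = f"<processed {len(static_prefix)} chars>"
        # The dynamic suffix is always processed fresh; only the prefix is reused.
        _ = f"<processed {len(dynamic_suffix)} chars>"
        return hit

cache = PrefixCache()
system_prompt = "You are a support agent for Acme Corp. Policies follow..."
cache.process(system_prompt, "How do I reset my password?")   # cold start
cache.process(system_prompt, "What is your refund policy?")   # prefix reused
```

Real providers differ in the details (minimum cacheable prefix length, cache lifetime, whether caching is automatic or opt-in), but the prefix-matching principle is the same: requests only benefit when the opening portion of the prompt is byte-for-byte identical.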
Cost impact
Cached input tokens typically cost 50-90% less than uncached input tokens, depending on the provider. For applications with long system prompts (common in agent deployments), this can reduce total API costs by 30-50%.
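A back-of-the-envelope calculation makes the saving concrete. The figures below assume an illustrative $3.00 per million input tokens and a 90% cache discount; actual pricing varies by provider and model.

```python
def request_cost(prompt_tokens: int, cached_tokens: int,
                 price_per_mtok: float = 3.00,
                 cache_discount: float = 0.90) -> float:
    """Input cost of a single request in dollars (illustrative pricing)."""
    cached = cached_tokens * price_per_mtok * (1 - cache_discount) / 1_000_000
    fresh = (prompt_tokens - cached_tokens) * price_per_mtok / 1_000_000
    return cached + fresh

# A 12,000-token prompt where 10,000 tokens (system prompt + fixed docs)
# are served from cache:
uncached = request_cost(12_000, 0)          # $0.036 per request
with_cache = request_cost(12_000, 10_000)   # $0.009 per request, 75% cheaper
```

Per-request input cost drops from $0.036 to $0.009 here; across thousands of requests the saving compounds, which is where the 30-50% reduction in total API spend comes from once output tokens are included.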
Latency impact
Skipping the processing of cached tokens reduces time to first token. For prompts with 10,000+ tokens of cached context, this can shave 1-3 seconds off response time.
When it matters
Prompt caching is relevant primarily to developers building AI-powered applications via API; if you use ChatGPT or Claude through their web interfaces, caching is handled for you. It becomes important when you are running AI agents with long system prompts, serving many users from the same base instructions, or processing batch workflows that reuse the same context.
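Because providers generally cache on an exact prefix match, request structure matters for the use cases above: put content that never changes first and per-request content last. A minimal sketch of cache-friendly ordering follows; the request shape and field names are hypothetical, not any specific provider's schema.

```python
def build_request(static_system: str, shared_docs: list, user_msg: str) -> dict:
    """Assemble a request with the longest possible shared, cacheable prefix."""
    return {
        "system": static_system,                      # identical across all calls
        "context": "\n\n".join(shared_docs),          # fixed reference documents
        "messages": [{"role": "user", "content": user_msg}],  # varies per call
    }

r1 = build_request("You are a billing assistant.", ["Refund policy v3"],
                   "Can I get a refund on order 123?")
r2 = build_request("You are a billing assistant.", ["Refund policy v3"],
                   "Please cancel order 456.")
# r1 and r2 share an identical system + context prefix, so the second
# request can reuse the cached prefix; only the final message differs.
```

The anti-pattern to avoid is interleaving dynamic content (timestamps, user names, request IDs) into the system prompt or context, which breaks the exact-match prefix and forces a cache miss on every call.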
Why This Matters
For organisations deploying AI at scale — hundreds or thousands of API calls per day — prompt caching is often the single most impactful cost optimisation available. Understanding when and how to leverage it can reduce AI infrastructure costs by 30-50%, which directly affects the ROI calculations that determine whether AI projects get continued funding.