Prompt Caching vs Semantic Caching: Which One Do You Actually Need?
Prompt caching saves input tokens. Semantic caching eliminates the call entirely. Here’s when to use each, with real pricing and a decision framework.
A team paid for caching twice, and one of the caches sat idle on every hit
A team using our proxy had an AI agent with a 6,000-token system prompt, a set of 15 tool definitions, and a customer support workflow that handled roughly 12,000 requests per month. They enabled Anthropic’s prompt caching for the system prompt prefix. Input costs dropped 40%. Good.
Then they enabled semantic caching through Govyn for the user queries. Hit rate climbed to 54%. Cost dropped another 48% on top of the prompt caching savings. Total reduction: 72%.
Here is the part they did not expect: on cache hits, the prompt caching discount was irrelevant. When semantic caching returned a stored response, no LLM call happened at all. Zero input tokens. Zero output tokens. The prompt cache was never consulted. The $0.30/M token cache read fee from Anthropic was never charged.
The two systems do not compete. They do not even operate at the same layer. But most teams think they are choosing between them. They are not. They are choosing where to start — and the answer depends on something nobody is talking about: output tokens.
What is prompt caching?
Prompt caching is a provider-side optimization that stores the processed token state (KV cache) from a model’s prefill phase. When the same prefix appears in a subsequent request, the model skips recomputing those tokens and loads the cached state directly. This reduces time-to-first-token (TTFT) by 39-65% and input token costs by 50-90%, depending on the provider.
Prompt caching requires an exact prefix match. Even a single token difference in the prefix invalidates the cache. It is deterministic: there is no similarity threshold, no false positive risk, no wrong answer served. The tradeoff is that it only works for structurally identical prefixes.
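A toy illustration of that exact-match property. Providers key the KV cache on the exact token prefix, not on a hash like the sketch below, but the effect is the same: one changed character anywhere in the prefix means a cold cache.

```python
# Toy illustration only: real providers key on token state, not a hash.
import hashlib

def prefix_key(system_prompt: str, tools_json: str) -> str:
    return hashlib.sha256((system_prompt + tools_json).encode()).hexdigest()

base = prefix_key("You are a support agent.", '[{"name": "lookup_order"}]')
same = prefix_key("You are a support agent.", '[{"name": "lookup_order"}]')
changed = prefix_key("You are a support agent!", '[{"name": "lookup_order"}]')

assert same == base       # identical prefix: warm cache
assert changed != base    # one character differs: full-price recompute
```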
Each provider implements prompt caching differently:
| | Anthropic | OpenAI | Google |
|---|---|---|---|
| Activation | Explicit breakpoints or auto flag | Fully automatic | Implicit (auto) or explicit (API) |
| Min tokens | 1,024-4,096 (varies by model) | 1,024 | 1,024-4,096 (varies by model) |
| Cache duration | 5 min or 1 hour | 5-10 min (up to 24h for GPT-5.x) | 1 hour default (configurable) |
| Input discount | 90% (cache read vs base) | 50-90% (varies by model) | 90% |
| Cache write cost | 1.25x base (5m) or 2x base (1h) | No surcharge | No surcharge (storage fee/hr) |
| Output tokens | Full price | Full price | Full price |
| Manual control | Yes (breakpoints, TTL) | No | Yes (explicit caching API) |
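For Anthropic's explicit breakpoints, the request body looks roughly like this. The payload shape follows Anthropic's Messages API; the model id is a placeholder, and actually sending it requires the `anthropic` SDK and an API key, so this sketch only assembles the body.

```python
# Sketch of an Anthropic-style cache breakpoint. The system prompt must
# exceed the model's minimum cacheable length (1,024-4,096 tokens).
LONG_SYSTEM_PROMPT = "You are a support agent. " * 300  # stand-in long prompt

request_body = {
    "model": "claude-sonnet-4-5",  # placeholder model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Breakpoint: everything up to and including this block is cached.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "How do I reset my password?"}],
}

assert request_body["system"][0]["cache_control"] == {"type": "ephemeral"}
```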
The critical row is output tokens. Prompt caching does not reduce output costs at all. The model still generates every token in the response from scratch. For models with high output pricing — Claude Opus at $25/M tokens, GPT-4o at $10/M — this means prompt caching leaves the most expensive part of the bill untouched.
What is semantic caching?
Semantic caching is an infrastructure-side optimization that stores complete LLM responses and matches new requests by meaning rather than by exact input. It converts each request into an embedding vector, compares it against cached vectors using cosine similarity, and returns the stored response if similarity exceeds a threshold (typically 0.85-0.95).
Unlike prompt caching, semantic caching eliminates the LLM call entirely on a hit. Zero input tokens. Zero output tokens. Zero inference latency. Response time drops from 2-5 seconds to under 50 milliseconds. For a detailed look at how this works with AI agents specifically, see our post on semantic caching for AI agents.
The tradeoff is risk. Semantic caching matches by meaning, not by exact content. Two requests that an embedding model considers “similar enough” might actually require different responses. This is why threshold tuning, cache key composition, and false positive monitoring are essential.
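The embed-compare-threshold loop can be sketched in a few lines. Everything here is a stand-in: a real deployment would call an embedding model (something like text-embedding-3-small) and a vector store, and the bag-of-words "embedding" exists only to make the example self-contained.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Minimal in-memory sketch of threshold-based semantic matching."""

    def __init__(self, embed, threshold=0.90):
        self.embed, self.threshold = embed, threshold
        self.entries = []  # list of (vector, cached_response)

    def get(self, query):
        v = self.embed(query)
        scored = [(cosine(v, vec), resp) for vec, resp in self.entries]
        if scored:
            score, resp = max(scored, key=lambda s: s[0])
            if score >= self.threshold:
                return resp
        return None  # below threshold: treat as a miss

    def put(self, query, response):
        self.entries.append((self.embed(query), response))

# Toy embedding: bag-of-words over a tiny vocabulary (illustration only).
VOCAB = ["reset", "password", "forgot", "login", "refund"]
def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

cache = SemanticCache(toy_embed, threshold=0.7)
cache.put("how do I reset my password", "Go to Settings > Security > Reset.")
assert cache.get("reset password please") == "Go to Settings > Security > Reset."
assert cache.get("what is your refund policy") is None  # different intent: miss
```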
| | Prompt caching | Semantic caching |
|---|---|---|
| What it caches | Tokenized prefix state (KV cache) | Complete LLM response |
| Where it runs | Provider-side (Anthropic, OpenAI, Google) | Your infrastructure (proxy, gateway) |
| Match method | Exact prefix match | Embedding similarity |
| Input token savings | 50-90% | 100% (no call made) |
| Output token savings | 0% | 100% (no call made) |
| Latency reduction | 39-65% TTFT improvement | 95-99% (ms vs seconds) |
| Risk of wrong answer | None | Possible (tunable) |
| Provider lock-in | Yes (per-provider cache) | No (works across any LLM) |
Why do output tokens matter more than input tokens for caching?
Most caching comparisons focus on input token savings. This is misleading because output tokens are often the larger cost driver, and only one caching method addresses them.
Consider a Claude Sonnet 4.5 request with a 4,000-token system prompt, 500-token user query, and 800-token response:
With prompt caching only

- System prompt (cached read): 4,000 tokens x $0.30/M = $0.0012
- User query (uncached): 500 tokens x $3.00/M = $0.0015
- Response (output): 800 tokens x $15.00/M = $0.0120
- Total: $0.0147

With semantic caching (on hit)

- Input: $0.00 (no API call)
- Output: $0.00 (no API call)
- Embedding: ~$0.0001
- Total: ~$0.0001
Prompt caching cut input costs by 80% (from $0.0135 to $0.0027), but the total bill was still $0.0147. Semantic caching reduced it to $0.0001. The difference is the output tokens — $0.012 per request that prompt caching cannot touch.
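The per-request arithmetic above, worked through in code (Claude Sonnet 4.5 pricing, dollars per 1M tokens):

```python
# A semantic-cache hit replaces the entire call with an embedding lookup.
M = 1_000_000
cached_read, base_input, output_price = 0.30, 3.00, 15.00  # $ per 1M tokens

prompt_cached = (4_000 * cached_read + 500 * base_input + 800 * output_price) / M
semantic_hit = 0.0001  # embedding + lookup overhead, no LLM tokens

assert abs(prompt_cached - 0.0147) < 1e-9
assert round(prompt_cached / semantic_hit) == 147  # a hit is ~147x cheaper
```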
At 12,000 requests/month with a 54% semantic cache hit rate:
| | No caching | Prompt caching only | Semantic caching only | Both |
|---|---|---|---|---|
| Monthly cost | $220 | $143 | $102 | $67 |
| Savings | — | 35% | 54% | 70% |
The “both” scenario works because semantic caching handles the 54% of requests that are semantically similar (eliminating them entirely), and prompt caching reduces costs on the remaining 46% that still go to the LLM.
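The layering logic generalizes to a simple monthly-cost model. The per-request figures below come from the Sonnet example earlier in this post, which assumes an illustrative token mix rather than the case-study team's exact workload, so the absolute dollars differ from the table; the ordering is what matters.

```python
N, hit_rate = 12_000, 0.54
no_cache, prompt_cached, semantic_hit = 0.0255, 0.0147, 0.0001  # $ per request

monthly = {
    "none": N * no_cache,
    "prompt_only": N * prompt_cached,
    "semantic_only": N * (hit_rate * semantic_hit + (1 - hit_rate) * no_cache),
    "both": N * (hit_rate * semantic_hit + (1 - hit_rate) * prompt_cached),
}
# Stacking wins because the misses still get the prompt-cache discount:
assert monthly["both"] < monthly["semantic_only"] < monthly["prompt_only"] < monthly["none"]
```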
When is prompt caching the right choice?
Prompt caching is the correct optimization when your requests share structural repetition — the same prefix appears across many requests, but the queries themselves are unique.
Large system prompts. An agent with 8,000 tokens of instructions, persona definition, and few-shot examples reuses that prefix on every call. Prompt caching saves 90% on those 8,000 tokens per request. If the agent handles 10,000 requests/month on Claude Sonnet, that is $216/month saved on input tokens alone.
Tool definitions. An agent with 15 tools and their JSON schemas might add 3,000-5,000 tokens to every request. These are identical across calls. Prompt caching eliminates reprocessing them.
Multi-turn conversations. Each turn in a conversation appends to the message history. The previous turns form a growing prefix that is identical to the last request. Prompt caching accelerates every turn after the first.
Document analysis. A user uploads a 50-page PDF and asks multiple questions about it. The document tokens are the same prefix for every question. Prompt caching avoids re-ingesting the document on each query.
Zero implementation effort (OpenAI). OpenAI’s prompt caching is fully automatic. No code changes, no cache breakpoints, no configuration. Point your requests at the API and the caching happens. If your stack is OpenAI-only, you get prompt caching for free.
When is semantic caching the right choice?
Semantic caching is the correct optimization when your requests have semantic repetition — different users ask the same question in different words.
Customer support. “How do I reset my password?”, “I forgot my password”, “Can’t log in, need password help” — three different strings, one intent. Semantic caching catches all three. In our measurements, support triage agents achieved 68% cache hit rates, reducing costs from $285/month to $78/month.
FAQ and documentation search. Users searching a knowledge base ask variations of the same questions repeatedly. Hit rates of 40-60% are common for documentation agents.
Multi-provider architectures. If your system routes requests to different LLMs based on complexity (GPT-4o-mini for simple queries, Claude Opus for complex ones), prompt caching is siloed per provider. Each provider maintains its own cache. Semantic caching at the proxy layer works across all providers — a cached response from a Claude request can serve a future GPT request if the queries are semantically identical.
Cost-dominated workloads. When output tokens are the primary cost driver (code generation, long-form content, detailed analysis), prompt caching’s input-only savings leave the majority of the bill intact. Semantic caching eliminates both input and output costs on every hit.
Latency-sensitive applications. Prompt caching reduces TTFT by 39-65% but the model still generates the full response. Semantic caching returns the cached response in under 50ms. For applications where sub-second response time matters, this is the difference between “faster” and “instant.”
When should you use both?
The two caching mechanisms operate at different layers and address different patterns. Using both is not redundant — it is defense in depth for your API bill.
The architecture:
User request
|
v
[Semantic cache layer] -- proxy/gateway
|-- HIT: Return cached response (0 tokens, <50ms)
|-- MISS: Forward to provider
|
v
[Prompt cache layer] -- provider-side
|-- Prefix cached: Reduced input cost, faster TTFT
|-- Prefix not cached: Full price, full latency
|
v
LLM generates response
|
v
Cache response in semantic cache for future hits
The semantic cache acts as the first gate. If it hits, the request never reaches the provider and prompt caching is irrelevant. If it misses, prompt caching kicks in as the second line of defense, reducing the cost of the LLM call that has to happen.
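The gate logic above can be sketched as a single function. Both collaborators are stand-ins: the exact-match `DictCache` keeps the example short (a real first gate would be the semantic cache), and `call_provider` represents the LLM call where provider-side prompt caching applies.

```python
class DictCache:
    """Exact-match stand-in for the semantic cache layer."""
    def __init__(self):
        self.store = {}
    def get(self, query):
        return self.store.get(query)
    def put(self, query, response):
        self.store[query] = response

def handle_request(query, cache, call_provider):
    cached = cache.get(query)
    if cached is not None:
        return cached, "semantic_hit"      # 0 tokens, provider never called
    response = call_provider(query)        # miss: prompt caching applies here
    cache.put(query, response)             # prime the cache for future hits
    return response, "miss"

cache = DictCache()
first = handle_request("how do refunds work", cache, lambda q: "Refunds take 5 days.")
second = handle_request("how do refunds work", cache, lambda q: "Refunds take 5 days.")
assert first == ("Refunds take 5 days.", "miss")
assert second == ("Refunds take 5 days.", "semantic_hit")
```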
This pattern is most valuable when:
- Agent workloads with large tool definitions. The tools are the same prefix (prompt caching), but many user queries repeat (semantic caching). The semantic cache catches the repeats. The misses still benefit from the cached tool prefix.
- Enterprise chatbots with long system prompts. The system prompt is static (prompt caching), but customers ask the same questions (semantic caching). Common questions are served instantly. Novel questions still benefit from the cached system prompt.
- RAG pipelines with stable knowledge bases. The retrieved context may differ per query (limiting prompt caching), but the user intent often repeats (semantic caching). Semantic caching short-circuits the entire RAG pipeline on popular queries.
How do you decide which caching to use?
| Your situation | Start with | Why |
|---|---|---|
| Large system prompt, unique queries | Prompt caching | Structural repetition, no semantic overlap |
| Small system prompt, repeated queries | Semantic caching | Semantic repetition dominates |
| Large prompt + repeated queries | Both | Each layer catches a different pattern |
| Multi-turn conversations | Prompt caching | Growing prefix, unique per session |
| Multi-provider routing | Semantic caching | Provider-agnostic, cross-model hits |
| Cost is mostly output tokens | Semantic caching | Only option that eliminates output costs |
| Zero implementation effort | Prompt caching (OpenAI) | Automatic, no configuration |
| Need sub-100ms responses | Semantic caching | Only option that skips inference entirely |
If you are unsure, start with prompt caching. It is risk-free (no false positives), often automatic, and requires no infrastructure. Then measure your query patterns. If more than 25% of your queries have semantic overlap, add semantic caching on top. The two layers compose cleanly.
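A cheap way to estimate semantic overlap in an existing query log before investing in embeddings: count queries whose token set closely matches an earlier query. Jaccard similarity is a rough stand-in for embedding similarity, not a real measure, but it gives a quick lower bound.

```python
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def overlap_rate(queries, threshold=0.6):
    """Fraction of queries that near-duplicate an earlier one."""
    seen, hits = [], 0
    for q in queries:
        toks = set(q.lower().split())
        if any(jaccard(toks, s) >= threshold for s in seen):
            hits += 1
        else:
            seen.append(toks)
    return hits / len(queries)

log = [
    "how do i reset my password",
    "reset my password how do i",   # same tokens, different order
    "what is the refund policy",
    "how do i reset my password?",  # near-duplicate punctuation variant
]
rate = overlap_rate(log)
assert rate == 0.5  # 50% overlap, well above the 25% rule of thumb
```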
Calculate your own savings
The table above gives the general framework. To see the exact dollar impact for your workload, plug in your model, request volume, and token counts. The calculator compares all four scenarios side by side with real provider pricing.
Open the LLM Caching Cost Calculator →

How much does each caching method cost?
Prompt caching cost per 1M input tokens
| Provider | Model | Base input | Cached read | You save |
|---|---|---|---|---|
| Anthropic | Claude Sonnet 4.5 | $3.00 | $0.30 | $2.70 (90%) |
| Anthropic | Claude Opus 4.6 | $5.00 | $0.50 | $4.50 (90%) |
| Anthropic | Claude Haiku 4.5 | $1.00 | $0.10 | $0.90 (90%) |
| OpenAI | GPT-4o | $2.50 | $1.25 | $1.25 (50%) |
| OpenAI | GPT-4.1 | $2.00 | $0.50 | $1.50 (75%) |
| OpenAI | GPT-5 | $1.25 | $0.125 | $1.125 (90%) |
| Google | Gemini 2.5 Pro | $1.25 | $0.125 | $1.125 (90%) |
| Google | Gemini 2.5 Flash | $0.30 | $0.03 | $0.27 (90%) |
Semantic caching cost per request (hit)
| Component | Cost | Notes |
|---|---|---|
| Embedding call | ~$0.0001 | text-embedding-3-small, 500 tokens |
| Vector search | ~$0.00001 | Vectorize/Pinecone per query |
| Storage | Negligible | Cached response stored in KV |
| LLM call | $0.00 | Skipped entirely |
| Total per hit | ~$0.0001 | vs $0.01-0.05 per LLM call |
The math is simple. A semantic cache hit costs roughly $0.0001 regardless of which model the request would have gone to. A prompt-cached Claude Opus request still costs $0.50/M input + $25.00/M output. The more expensive your model, the more valuable semantic caching becomes.
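That Opus comparison in numbers, using the same 4,000/500/800 token shape as the earlier Sonnet example (cached read $0.50/M, base input $5.00/M, output $25.00/M):

```python
M = 1_000_000
opus_prompt_cached = (4_000 * 0.50 + 500 * 5.00 + 800 * 25.00) / M
semantic_hit = 0.0001

assert abs(opus_prompt_cached - 0.0245) < 1e-9
assert round(opus_prompt_cached / semantic_hit) == 245  # a hit is ~245x cheaper
```

The output line dominates: $0.02 of the $0.0245 is output tokens that no prompt cache can discount.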
What are the limitations of prompt caching?
Three limitations that are not bugs — they are architectural constraints of how prompt caching works.
1. It cannot reduce output costs. The model generates every output token from scratch. For code generation, long-form analysis, or any task with substantial output, the majority of your bill is untouched by prompt caching.
2. It cannot work across providers. Anthropic’s prompt cache lives on Anthropic’s servers. OpenAI’s lives on OpenAI’s. If you route 60% of requests to Claude and 40% to GPT, you maintain two separate cache pools with no cross-pollination. Semantic caching at the proxy layer caches responses regardless of which model generated them.
3. It cannot match different phrasings. “What is our refund policy?” with a prompt-cached prefix still misses if the next request is “How do refunds work?” with the same prefix. The prefix is cached, but the user query portion is different, so the model generates a new response. Semantic caching would recognize these as the same intent and return the cached response.
These are not criticisms of prompt caching. They are the boundaries of what exact-prefix caching can optimize. Semantic caching extends optimization beyond those boundaries.
Key takeaways
- Prompt caching and semantic caching are not competitors. They operate at different layers (provider vs infrastructure) and catch different patterns (structural vs semantic repetition).
- Output tokens are the hidden cost. Prompt caching only reduces input costs. Semantic caching eliminates both input and output costs on every hit. For expensive models, this difference is 5-10x.
- Start with prompt caching if you are unsure. It is risk-free, often automatic, and requires no infrastructure. Add semantic caching when you measure semantic repetition above 25%.
- Multi-provider architectures need semantic caching. Provider-side prompt caches are siloed. A proxy-level semantic cache works across all providers.
- Use both for maximum savings. Semantic cache catches the repeats. Prompt cache reduces the misses. The combined savings exceed either alone by 15-30%.
FAQ
Does prompt caching work with semantic caching?
Yes. They compose naturally. The semantic cache sits in front of the provider. On a miss, the request goes to the provider where prompt caching reduces the input cost. On a hit, the provider is never called. There is no conflict between the two.
Which saves more money?
Semantic caching, if your hit rate exceeds 20-30%. A 50% semantic cache hit rate eliminates half of all LLM costs (input + output). Prompt caching at 90% input discount on the remaining half saves another 35-40% of input costs. But prompt caching is risk-free and effortless — it’s a strictly better starting point when you have no data on query patterns.
Can prompt caching serve wrong answers?
No. Prompt caching requires an exact prefix match. It does not serve cached responses — it accelerates the computation of tokens the model has already processed. The model still generates a fresh response for every request. There is no false positive risk.
Is semantic caching safe for production?
Yes, with proper implementation. The risks — false positives, cross-model contamination, stale responses — are all addressable with correct cache key composition, threshold tuning, and monitoring. We run semantic caching in production across thousands of agent workloads. The key is starting in observe mode and monitoring false positive rates before enforcing.
Do I need to change my code for either?
For prompt caching: no changes with OpenAI (automatic), minimal changes with Anthropic (add cache_control breakpoints), and API calls for Google. For semantic caching through a proxy like Govyn: no code changes — point your agent at the proxy URL instead of the provider URL. The proxy handles caching transparently.
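The proxy swap is literally one URL. A hypothetical sketch with only the standard library; `https://govyn.example/v1` is a placeholder proxy address, not a real endpoint, and the request is assembled but never sent.

```python
import json
import urllib.request

PROVIDER_URL = "https://api.openai.com/v1/chat/completions"
PROXY_URL = "https://govyn.example/v1/chat/completions"  # placeholder proxy address

def build_request(url, api_key, payload):
    # Same payload, same headers; only the destination changes.
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
    )

req = build_request(PROXY_URL, "sk-placeholder", {"model": "gpt-4o", "messages": []})
assert req.full_url == "https://govyn.example/v1/chat/completions"
```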
Govyn is an open-source API proxy for AI agent governance. Semantic caching, prompt caching passthrough, model routing, and budget enforcement. MIT licensed. Self-host or cloud-hosted.