Prompt Caching vs Semantic Caching: Which One Do You Actually Need?
Prompt caching saves input tokens. Semantic caching eliminates the call entirely. Here’s when to use each, with real pricing and a decision framework.
A team paid for caching twice, and one of the caches sat idle on every hit
A team using our proxy had an AI agent with a 6,000-token system prompt, a set of 15 tool definitions, and a customer support workflow that handled roughly 12,000 requests per month. They enabled Anthropic’s prompt caching for the system prompt prefix. Input costs dropped 40%. Good.
Then they enabled semantic caching through Govyn for the user queries. Hit rate climbed to 54%. Cost dropped another 48% on top of the prompt caching savings. Total reduction: 72%.
Here is the part they did not expect: on cache hits, the prompt caching discount was irrelevant. When semantic caching returned a stored response, no LLM call happened at all. Zero input tokens. Zero output tokens. The prompt cache was never consulted. The $0.30/M token cache read fee from Anthropic was never charged.
The two systems do not compete. They do not even operate at the same layer. But most teams think they are choosing between them. They are not. They are choosing where to start — and the answer depends on something nobody is talking about: output tokens.
What is prompt caching?
Prompt caching is a provider-side optimization that stores the processed token state (KV cache) from a model’s prefill phase. When the same prefix appears in a subsequent request, the model skips recomputing those tokens and loads the cached state directly. This reduces time-to-first-token (TTFT) by 39-65% and input token costs by 50-90%, depending on the provider.
Prompt caching requires an exact prefix match. Even a single token difference in the prefix invalidates the cache. It is deterministic: there is no similarity threshold, no false positive risk, no wrong answer served. The tradeoff is that it only works for structurally identical prefixes.
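A toy illustration of that exact-match property. Providers key the KV cache on the exact token prefix, not on a hash like the sketch below, but the effect is the same: one changed character anywhere in the prefix means a cold cache.

```python
# Toy illustration only: real providers key on token state, not a hash.
import hashlib

def prefix_key(system_prompt: str, tools_json: str) -> str:
    return hashlib.sha256((system_prompt + tools_json).encode()).hexdigest()

base = prefix_key("You are a support agent.", '[{"name": "lookup_order"}]')
same = prefix_key("You are a support agent.", '[{"name": "lookup_order"}]')
changed = prefix_key("You are a support agent!", '[{"name": "lookup_order"}]')

assert same == base       # identical prefix: warm cache
assert changed != base    # one character differs: full-price recompute
```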
Each provider implements prompt caching differently:
| | Anthropic | OpenAI | Google |
|---|---|---|---|
| Activation | Explicit breakpoints or auto flag | Fully automatic | Implicit (auto) or explicit (API) |
| Min tokens | 1,024-4,096 (varies by model) | 1,024 | 1,024-4,096 (varies by model) |
| Cache duration | 5 min or 1 hour | 5-10 min (up to 24h for GPT-5.x) | 1 hour default (configurable) |
| Input discount | 90% (cache read vs base) | 50-90% (varies by model) | 90% |
| Cache write cost | 1.25x base (5m) or 2x base (1h) | No surcharge | No surcharge (storage fee/hr) |
| Output tokens | Full price | Full price | Full price |
| Manual control | Yes (breakpoints, TTL) | No | Yes (explicit caching API) |
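For Anthropic's explicit breakpoints, the request body looks roughly like this. The payload shape follows Anthropic's Messages API; the model id is a placeholder, and actually sending it requires the `anthropic` SDK and an API key, so this sketch only assembles the body.

```python
# Sketch of an Anthropic-style cache breakpoint. The system prompt must
# exceed the model's minimum cacheable length (1,024-4,096 tokens).
LONG_SYSTEM_PROMPT = "You are a support agent. " * 300  # stand-in long prompt

request_body = {
    "model": "claude-sonnet-4-5",  # placeholder model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Breakpoint: everything up to and including this block is cached.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "How do I reset my password?"}],
}

assert request_body["system"][0]["cache_control"] == {"type": "ephemeral"}
```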
The critical row is output tokens. Prompt caching does not reduce output costs at all. The model still generates every token in the response from scratch. For models with high output pricing — Claude Opus at $25/M tokens, GPT-4o at $10/M — this means prompt caching leaves the most expensive part of the bill untouched.
What is semantic caching?
Semantic caching is an infrastructure-side optimization that stores complete LLM responses and matches new requests by meaning rather than by exact input. It converts each request into an embedding vector, compares it against cached vectors using cosine similarity, and returns the stored response if similarity exceeds a threshold (typically 0.85-0.95).
Unlike prompt caching, semantic caching eliminates the LLM call entirely on a hit. Zero input tokens. Zero output tokens. Zero inference latency. Response time drops from 2-5 seconds to under 50 milliseconds. For a detailed look at how this works with AI agents specifically, see our post on semantic caching for AI agents.
The tradeoff is risk. Semantic caching matches by meaning, not by exact content. Two requests that an embedding model considers “similar enough” might actually require different responses. This is why threshold tuning, cache key composition, and false positive monitoring are essential.
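The embed-compare-threshold loop can be sketched in a few lines. Everything here is a stand-in: a real deployment would call an embedding model (something like text-embedding-3-small) and a vector store, and the bag-of-words "embedding" exists only to make the example self-contained.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Minimal in-memory sketch of threshold-based semantic matching."""

    def __init__(self, embed, threshold=0.90):
        self.embed, self.threshold = embed, threshold
        self.entries = []  # list of (vector, cached_response)

    def get(self, query):
        v = self.embed(query)
        scored = [(cosine(v, vec), resp) for vec, resp in self.entries]
        if scored:
            score, resp = max(scored, key=lambda s: s[0])
            if score >= self.threshold:
                return resp
        return None  # below threshold: treat as a miss

    def put(self, query, response):
        self.entries.append((self.embed(query), response))

# Toy embedding: bag-of-words over a tiny vocabulary (illustration only).
VOCAB = ["reset", "password", "forgot", "login", "refund"]
def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

cache = SemanticCache(toy_embed, threshold=0.7)
cache.put("how do I reset my password", "Go to Settings > Security > Reset.")
assert cache.get("reset password please") == "Go to Settings > Security > Reset."
assert cache.get("what is your refund policy") is None  # different intent: miss
```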
| | Prompt caching | Semantic caching |
|---|---|---|
| What it caches | Tokenized prefix state (KV cache) | Complete LLM response |
| Where it runs | Provider-side (Anthropic, OpenAI, Google) | Your infrastructure (proxy, gateway) |
| Match method | Exact prefix match | Embedding similarity |
| Input token savings | 50-90% | 100% (no call made) |
| Output token savings | 0% | 100% (no call made) |
| Latency reduction | 39-65% TTFT improvement | 95-99% (ms vs seconds) |
| Risk of wrong answer | None | Possible (tunable) |
| Provider lock-in | Yes (per-provider cache) | No (works across any LLM) |
Why do output tokens matter more than input tokens for caching?
Most caching comparisons focus on input token savings. This is misleading because output tokens are often the larger cost driver, and only one caching method addresses them.
Consider a Claude Sonnet 4.5 request with a 4,000-token system prompt, 500-token user query, and 800-token response:
With prompt caching only

- System prompt (cached read): 4,000 tokens x $0.30/M = $0.0012
- User query (uncached): 500 tokens x $3.00/M = $0.0015
- Response (output): 800 tokens x $15.00/M = $0.0120
- Total: $0.0147

With semantic caching (on hit)

- Input: $0.00 (no API call)
- Output: $0.00 (no API call)
- Embedding: ~$0.0001
- Total: ~$0.0001
Prompt caching cut input costs by 80% (from $0.0135 to $0.0027), but the total bill was still $0.0147. Semantic caching reduced it to $0.0001. The difference is the output tokens — $0.012 per request that prompt caching cannot touch.
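The per-request arithmetic above, worked through in code (Claude Sonnet 4.5 pricing, dollars per 1M tokens):

```python
# A semantic-cache hit replaces the entire call with an embedding lookup.
M = 1_000_000
cached_read, base_input, output_price = 0.30, 3.00, 15.00  # $ per 1M tokens

prompt_cached = (4_000 * cached_read + 500 * base_input + 800 * output_price) / M
semantic_hit = 0.0001  # embedding + lookup overhead, no LLM tokens

assert abs(prompt_cached - 0.0147) < 1e-9
assert round(prompt_cached / semantic_hit) == 147  # a hit is ~147x cheaper
```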
At 12,000 requests/month with a 54% semantic cache hit rate:
| | No caching | Prompt caching only | Semantic caching only | Both |
|---|---|---|---|---|
| Monthly cost | $220 | $143 | $102 | $67 |
| Savings | — | 35% | 54% | 70% |
The “both” scenario works because semantic caching handles the 54% of requests that are semantically similar (eliminating them entirely), and prompt caching reduces costs on the remaining 46% that still go to the LLM.
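The layering logic generalizes to a simple monthly-cost model. The per-request figures below come from the Sonnet example earlier in this post, which assumes an illustrative token mix rather than the case-study team's exact workload, so the absolute dollars differ from the table; the ordering is what matters.

```python
N, hit_rate = 12_000, 0.54
no_cache, prompt_cached, semantic_hit = 0.0255, 0.0147, 0.0001  # $ per request

monthly = {
    "none": N * no_cache,
    "prompt_only": N * prompt_cached,
    "semantic_only": N * (hit_rate * semantic_hit + (1 - hit_rate) * no_cache),
    "both": N * (hit_rate * semantic_hit + (1 - hit_rate) * prompt_cached),
}
# Stacking wins because the misses still get the prompt-cache discount:
assert monthly["both"] < monthly["semantic_only"] < monthly["prompt_only"] < monthly["none"]
```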
When is prompt caching the right choice?
Prompt caching is the correct optimization when your requests share structural repetition — the same prefix appears across many requests, but the queries themselves are unique.
Large system prompts. An agent with 8,000 tokens of instructions, persona definition, and few-shot examples reuses that prefix on every call. Prompt caching saves 90% on those 8,000 tokens per request. If the agent handles 10,000 requests/month on Claude Sonnet, that is $216/month saved on input tokens alone.
Tool definitions. An agent with 15 tools and their JSON schemas might add 3,000-5,000 tokens to every request. These are identical across calls. Prompt caching eliminates reprocessing them.
Multi-turn conversations. Each turn in a conversation appends to the message history. The previous turns form a growing prefix that is identical to the last request. Prompt caching accelerates every turn after the first.
Document analysis. A user uploads a 50-page PDF and asks multiple questions about it. The document tokens are the same prefix for every question. Prompt caching avoids re-ingesting the document on each query.
Zero implementation effort (OpenAI). OpenAI’s prompt caching is fully automatic. No code changes, no cache breakpoints, no configuration. Point your requests at the API and the caching happens. If your stack is OpenAI-only, you get prompt caching for free.
When is semantic caching the right choice?
Semantic caching is the correct optimization when your requests have semantic repetition — different users ask the same question in different words.
Customer support. “How do I reset my password?”, “I forgot my password”, “Can’t log in, need password help” — three different strings, one intent. Semantic caching catches all three. In our measurements, support triage agents achieved 68% cache hit rates, reducing costs from $285/month to $78/month.
FAQ and documentation search. Users searching a knowledge base ask variations of the same questions repeatedly. Hit rates of 40-60% are common for documentation agents.
Multi-provider architectures. If your system routes requests to different LLMs based on complexity (GPT-4o-mini for simple queries, Claude Opus for complex ones), prompt caching is siloed per provider. Each provider maintains its own cache. Semantic caching at the proxy layer works across all providers — a cached response from a Claude request can serve a future GPT request if the queries are semantically identical.
Cost-dominated workloads. When output tokens are the primary cost driver (code generation, long-form content, detailed analysis), prompt caching’s input-only savings leave the majority of the bill intact. Semantic caching eliminates both input and output costs on every hit.
Latency-sensitive applications. Prompt caching reduces TTFT by 39-65% but the model still generates the full response. Semantic caching returns the cached response in under 50ms. For applications where sub-second response time matters, this is the difference between “faster” and “instant.”
When should you use both?
The two caching mechanisms operate at different layers and address different patterns. Using both is not redundant — it is defense in depth for your API bill.
The architecture:
User request
|
v
[Semantic cache layer] -- proxy/gateway
|-- HIT: Return cached response (0 tokens, <50ms)
|-- MISS: Forward to provider
|
v
[Prompt cache layer] -- provider-side
|-- Prefix cached: Reduced input cost, faster TTFT
|-- Prefix not cached: Full price, full latency
|
v
LLM generates response
|
v
Cache response in semantic cache for future hits
The semantic cache acts as the first gate. If it hits, the request never reaches the provider and prompt caching is irrelevant. If it misses, prompt caching kicks in as the second line of defense, reducing the cost of the LLM call that has to happen.
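The gate logic above can be sketched as a single function. Both collaborators are stand-ins: the exact-match `DictCache` keeps the example short (a real first gate would be the semantic cache), and `call_provider` represents the LLM call where provider-side prompt caching applies.

```python
class DictCache:
    """Exact-match stand-in for the semantic cache layer."""
    def __init__(self):
        self.store = {}
    def get(self, query):
        return self.store.get(query)
    def put(self, query, response):
        self.store[query] = response

def handle_request(query, cache, call_provider):
    cached = cache.get(query)
    if cached is not None:
        return cached, "semantic_hit"      # 0 tokens, provider never called
    response = call_provider(query)        # miss: prompt caching applies here
    cache.put(query, response)             # prime the cache for future hits
    return response, "miss"

cache = DictCache()
first = handle_request("how do refunds work", cache, lambda q: "Refunds take 5 days.")
second = handle_request("how do refunds work", cache, lambda q: "Refunds take 5 days.")
assert first == ("Refunds take 5 days.", "miss")
assert second == ("Refunds take 5 days.", "semantic_hit")
```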
This pattern is most valuable when:
- Agent workloads with large tool definitions. The tools are the same prefix (prompt caching), but many user queries repeat (semantic caching). The semantic cache catches the repeats. The misses still benefit from the cached tool prefix.
- Enterprise chatbots with long system prompts. The system prompt is static (prompt caching), but customers ask the same questions (semantic caching). Common questions are served instantly. Novel questions still benefit from the cached system prompt.
- RAG pipelines with stable knowledge bases. The retrieved context may differ per query (limiting prompt caching), but the user intent often repeats (semantic caching). Semantic caching short-circuits the entire RAG pipeline on popular queries.
How do you decide which caching to use?
| Your situation | Start with | Why |
|---|---|---|
| Large system prompt, unique queries | Prompt caching | Structural repetition, no semantic overlap |
| Small system prompt, repeated queries | Semantic caching | Semantic repetition dominates |
| Large prompt + repeated queries | Both | Each layer catches a different pattern |
| Multi-turn conversations | Prompt caching | Growing prefix, unique per session |
| Multi-provider routing | Semantic caching | Provider-agnostic, cross-model hits |
| Cost is mostly output tokens | Semantic caching | Only option that eliminates output costs |
| Zero implementation effort | Prompt caching (OpenAI) | Automatic, no configuration |
| Need sub-100ms responses | Semantic caching | Only option that skips inference entirely |
If you are unsure, start with prompt caching. It is risk-free (no false positives), often automatic, and requires no infrastructure. Then measure your query patterns. If more than 25% of your queries have semantic overlap, add semantic caching on top. The two layers compose cleanly.
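A cheap way to estimate semantic overlap in an existing query log before investing in embeddings: count queries whose token set closely matches an earlier query. Jaccard similarity is a rough stand-in for embedding similarity, not a real measure, but it gives a quick lower bound.

```python
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def overlap_rate(queries, threshold=0.6):
    """Fraction of queries that near-duplicate an earlier one."""
    seen, hits = [], 0
    for q in queries:
        toks = set(q.lower().split())
        if any(jaccard(toks, s) >= threshold for s in seen):
            hits += 1
        else:
            seen.append(toks)
    return hits / len(queries)

log = [
    "how do i reset my password",
    "reset my password how do i",   # same tokens, different order
    "what is the refund policy",
    "how do i reset my password?",  # near-duplicate punctuation variant
]
rate = overlap_rate(log)
assert rate == 0.5  # 50% overlap, well above the 25% rule of thumb
```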
Calculate your own savings
The table above gives the general framework. To see the exact dollar impact for your workload, plug in your model, request volume, and token counts. The calculator compares all four scenarios side by side with real provider pricing.
Open the LLM Caching Cost Calculator →

How much does each caching method cost?
Prompt caching cost per 1M input tokens
| Provider | Model | Base input | Cached read | You save |
|---|---|---|---|---|
| Anthropic | Claude Sonnet 4.5 | $3.00 | $0.30 | $2.70 (90%) |
| Anthropic | Claude Opus 4.6 | $5.00 | $0.50 | $4.50 (90%) |
| Anthropic | Claude Haiku 4.5 | $1.00 | $0.10 | $0.90 (90%) |
| OpenAI | GPT-4o | $2.50 | $1.25 | $1.25 (50%) |
| OpenAI | GPT-4.1 | $2.00 | $0.50 | $1.50 (75%) |
| OpenAI | GPT-5 | $1.25 | $0.125 | $1.125 (90%) |
| Google | Gemini 2.5 Pro | $1.25 | $0.125 | $1.125 (90%) |
| Google | Gemini 2.5 Flash | $0.30 | $0.03 | $0.27 (90%) |
Semantic caching cost per request (hit)
| Component | Cost | Notes |
|---|---|---|
| Embedding call | ~$0.0001 | text-embedding-3-small, 500 tokens |
| Vector search | ~$0.00001 | Vectorize/Pinecone per query |
| Storage | Negligible | Cached response stored in KV |
| LLM call | $0.00 | Skipped entirely |
| Total per hit | ~$0.0001 | vs $0.01-0.05 per LLM call |
The math is simple. A semantic cache hit costs roughly $0.0001 regardless of which model the request would have gone to. A prompt-cached Claude Opus request still costs $0.50/M input + $25.00/M output. The more expensive your model, the more valuable semantic caching becomes.
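That Opus comparison in numbers, using the same 4,000/500/800 token shape as the earlier Sonnet example (cached read $0.50/M, base input $5.00/M, output $25.00/M):

```python
M = 1_000_000
opus_prompt_cached = (4_000 * 0.50 + 500 * 5.00 + 800 * 25.00) / M
semantic_hit = 0.0001

assert abs(opus_prompt_cached - 0.0245) < 1e-9
assert round(opus_prompt_cached / semantic_hit) == 245  # a hit is ~245x cheaper
```

The output line dominates: $0.02 of the $0.0245 is output tokens that no prompt cache can discount.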
What are the limitations of prompt caching?
Three limitations that are not bugs — they are architectural constraints of how prompt caching works.
1. It cannot reduce output costs. The model generates every output token from scratch. For code generation, long-form analysis, or any task with substantial output, the majority of your bill is untouched by prompt caching.
2. It cannot work across providers. Anthropic’s prompt cache lives on Anthropic’s servers. OpenAI’s lives on OpenAI’s. If you route 60% of requests to Claude and 40% to GPT, you maintain two separate cache pools with no cross-pollination. Semantic caching at the proxy layer caches responses regardless of which model generated them.
3. It cannot match different phrasings. “What is our refund policy?” with a prompt-cached prefix still misses if the next request is “How do refunds work?” with the same prefix. The prefix is cached, but the user query portion is different, so the model generates a new response. Semantic caching would recognize these as the same intent and return the cached response.
These are not criticisms of prompt caching. They are the boundaries of what exact-prefix caching can optimize. Semantic caching extends optimization beyond those boundaries.
Key takeaways
- Prompt caching and semantic caching are not competitors. They operate at different layers (provider vs infrastructure) and catch different patterns (structural vs semantic repetition).
- Output tokens are the hidden cost. Prompt caching only reduces input costs. Semantic caching eliminates both input and output costs on every hit. For expensive models, this difference is 5-10x.
- Start with prompt caching if you are unsure. It is risk-free, often automatic, and requires no infrastructure. Add semantic caching when you measure semantic repetition above 25%.
- Multi-provider architectures need semantic caching. Provider-side prompt caches are siloed. A proxy-level semantic cache works across all providers.
- Use both for maximum savings. Semantic cache catches the repeats. Prompt cache reduces the misses. The combined savings exceed either alone by 15-30%.
FAQ
Does prompt caching work with semantic caching?
Yes. They compose naturally. The semantic cache sits in front of the provider. On a miss, the request goes to the provider where prompt caching reduces the input cost. On a hit, the provider is never called. There is no conflict between the two.
Which saves more money?
Semantic caching, if your hit rate exceeds 20-30%. A 50% semantic cache hit rate eliminates half of all LLM costs (input + output). Prompt caching at 90% input discount on the remaining half saves another 35-40% of input costs. But prompt caching is risk-free and effortless — it’s a strictly better starting point when you have no data on query patterns.
Can prompt caching serve wrong answers?
No. Prompt caching requires an exact prefix match. It does not serve cached responses — it accelerates the computation of tokens the model has already processed. The model still generates a fresh response for every request. There is no false positive risk.
Is semantic caching safe for production?
Yes, with proper implementation. The risks — false positives, cross-model contamination, stale responses — are all addressable with correct cache key composition, threshold tuning, and monitoring. We run semantic caching in production across thousands of agent workloads. The key is starting in observe mode and monitoring false positive rates before enforcing.
Do I need to change my code for either?
For prompt caching: no changes with OpenAI (automatic), minimal changes with Anthropic (add cache_control breakpoints), and API calls for Google. For semantic caching through a proxy like Govyn: no code changes — point your agent at the proxy URL instead of the provider URL. The proxy handles caching transparently.
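The proxy swap is literally one URL. A hypothetical sketch with only the standard library; `https://govyn.example/v1` is a placeholder proxy address, not a real endpoint, and the request is assembled but never sent.

```python
import json
import urllib.request

PROVIDER_URL = "https://api.openai.com/v1/chat/completions"
PROXY_URL = "https://govyn.example/v1/chat/completions"  # placeholder proxy address

def build_request(url, api_key, payload):
    # Same payload, same headers; only the destination changes.
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
    )

req = build_request(PROXY_URL, "sk-placeholder", {"model": "gpt-4o", "messages": []})
assert req.full_url == "https://govyn.example/v1/chat/completions"
```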
Govyn is an open-source API proxy for AI agent governance. Semantic caching, prompt caching passthrough, model routing, and budget enforcement. MIT licensed. Self-host or cloud-hosted.