Semantic Caching for AI APIs
Every team running AI agents at scale hits the same wall: LLM API costs grow linearly with request volume, even when a significant portion of those requests are semantically identical. An agent answering customer questions will encounter the same intent phrased dozens of different ways. A code review agent will see similar diffs across pull requests. A knowledge base search agent will field the same conceptual query with different wording every day.
Semantic caching solves this by storing LLM responses indexed by the meaning of the prompt, not its exact text. When a new request arrives at the Govyn proxy, the system generates a lightweight embedding of the prompt and checks the cache for stored responses with similar embeddings above a configurable similarity threshold. If a match is found, the cached response is returned in under 10 milliseconds — no provider API call, no token consumption, no latency penalty.
The result is straightforward: teams running workloads with repetitive semantic patterns typically see 40-60% cost reduction and sub-10ms response times on cache hits, with zero code changes to their agents.
The problem with traditional caching
Traditional caching systems use exact-match keys. If you cache the response to "What is the capital of France?", the cache only helps when the exact same string appears again. But users and agents rarely repeat themselves verbatim. The same question arrives as "What's France's capital?", "Tell me the capital city of France", "capital of france?", or "Which city is the capital of France?".
Exact-match caching misses all of these. Each variation triggers a full LLM inference call, consuming tokens and adding latency, even though the answer is identical every time. In production workloads, we observe that exact-match caching typically achieves hit rates below 5%, because natural language queries almost never repeat character-for-character.
Some teams attempt to normalize prompts before caching — lowercasing, stripping punctuation, removing whitespace. This helps marginally but still fails on rephrased queries, synonyms, and structural variations. The fundamental problem is that text similarity is not semantic similarity. Two prompts can share zero words in common and still mean exactly the same thing.
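To make this concrete, here is a minimal sketch of a normalized exact-match key (the helper function and prompts are illustrative, not part of Govyn). Even after lowercasing and stripping punctuation, each rephrasing produces a different key, so none of them hit the cache:

```python
import hashlib
import string

def normalized_key(prompt: str) -> str:
    """Exact-match cache key: lowercase, strip punctuation, collapse whitespace."""
    cleaned = prompt.lower().translate(str.maketrans("", "", string.punctuation))
    return hashlib.sha256(" ".join(cleaned.split()).encode()).hexdigest()

variants = [
    "What is the capital of France?",
    "What's France's capital?",
    "Tell me the capital city of France",
]

# Every rephrasing still produces a different key, so each one is a cache miss.
print({normalized_key(v) for v in variants})  # three distinct hashes
```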
Provider-level prompt caching (like Anthropic's prompt caching or OpenAI's cached completions) operates at a different layer entirely. These systems cache the internal KV-cache state for long shared prefixes, reducing processing time when the same system prompt or context prefix is reused. They do not cache full responses based on semantic similarity, and they are provider-specific — a cached prefix on Anthropic provides no benefit for OpenAI requests. Provider caching and semantic caching are complementary, not competing, approaches.
How semantic caching works
Govyn's semantic caching operates at the proxy layer, transparently intercepting requests before they reach the upstream LLM provider. The process follows a consistent flow for every request:
- Request arrives — an agent sends an LLM request to the Govyn proxy. The proxy extracts the prompt content from the request body (the messages array for chat completions, or the prompt field for legacy completions).
- Embedding generation — the proxy generates a compact vector embedding of the prompt content using a lightweight embedding model that runs locally within the proxy process. This step adds approximately 2-5ms of latency.
- Cache lookup — the proxy performs a nearest-neighbor search against the cache index, looking for stored embeddings with cosine similarity above the configured threshold (default: 0.95).
- Cache hit — if a match is found, the stored response is returned immediately. The response is delivered in the same format the provider would use, including proper SSE framing for streaming requests. Total response time: under 10ms.
- Cache miss — if no match exceeds the threshold, the request is forwarded to the upstream provider normally. When the provider response completes, the prompt embedding and full response are stored in the cache with the configured TTL.
The embedding model runs inside the proxy process — no external API calls are required for embedding generation. This keeps the overhead minimal and avoids adding another external dependency or cost center. The cache index uses an efficient approximate nearest-neighbor (ANN) structure that scales to millions of entries while keeping lookups under a millisecond.
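A condensed sketch of this lookup-and-store flow is below. The class, method names, and linear scan are illustrative stand-ins, not Govyn's internals; a production proxy would use an ANN index rather than a list:

```python
import math
import time
from typing import Callable, Optional

class SemanticCache:
    """Illustrative semantic cache: linear scan over stored embeddings."""

    def __init__(self, embed: Callable[[str], list[float]],
                 threshold: float = 0.95, ttl: float = 3600.0):
        self.embed = embed          # local embedding model, e.g. a small sentence encoder
        self.threshold = threshold  # cosine-similarity cutoff for a hit
        self.ttl = ttl
        self.entries: list[tuple[list[float], str, float]] = []  # (embedding, response, stored_at)

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    def lookup(self, prompt: str) -> Optional[str]:
        """Return a cached response if any live entry clears the threshold."""
        query = self.embed(prompt)
        now = time.time()
        best, best_score = None, 0.0
        for emb, response, stored_at in self.entries:
            if now - stored_at > self.ttl:
                continue  # expired entry: ignored, eventually evicted
            score = self._cosine(query, emb)
            if score >= self.threshold and score > best_score:
                best, best_score = response, score
        return best  # None means cache miss: forward to the provider

    def store(self, prompt: str, response: str) -> None:
        """Called after a cache miss, once the provider response completes."""
        self.entries.append((self.embed(prompt), response, time.time()))

# Usage (with any local embedding function you supply):
#   cache = SemanticCache(embed=my_embedding_model, threshold=0.95, ttl=3600)
#   hit = cache.lookup(prompt)            # embed + nearest-neighbor search
#   if hit is None:
#       response = call_provider(prompt)  # cache miss: normal provider call
#       cache.store(prompt, response)
```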
Cache entries are keyed by the combination of prompt embedding, model identifier, and relevant request parameters (temperature, max tokens, etc.). This means a request for gpt-4o and a request for gpt-4o-mini with the same prompt are cached separately, since different models produce different outputs.
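Continuing the sketch above, the namespace part of the key might look like the following. The request shape and the exact parameter set included here are assumptions for illustration:

```python
def cache_namespace(request: dict) -> tuple:
    """Bucket cache entries per model and sampling parameters, so a gpt-4o
    response is never served for a gpt-4o-mini request with the same prompt."""
    return (
        request["model"],                 # e.g. "gpt-4o"
        request.get("temperature", 1.0),  # sampling parameters are part of the key
        request.get("max_tokens"),
    )

# One similarity index per namespace; semantic matching only happens among
# entries that were stored under the same namespace.
indexes: dict[tuple, SemanticCache] = {}

def index_for(request: dict, embed) -> SemanticCache:
    """Get or create the per-namespace index (reusing the SemanticCache sketch above)."""
    return indexes.setdefault(cache_namespace(request), SemanticCache(embed=embed))
```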
Cost savings analysis
The cost impact of semantic caching depends on two factors: your cache hit rate (what percentage of requests match a cached response) and your average cost per request. Here is the math for common scenarios:
Scenario 1: Customer support agent
A customer support triage agent handles 2,000 requests per day. Support queries are highly repetitive — customers ask the same questions with different wording. With semantic caching at a 0.93 threshold, the cache hit rate is approximately 55%.
- Average cost per request: $0.025 (GPT-4o, ~800 input tokens, ~400 output tokens)
- Daily requests: 2,000
- Cache hit rate: 55% (1,100 requests served from cache)
- Daily savings: 1,100 x $0.025 = $27.50/day
- Monthly savings: $825/month
Scenario 2: Code review agent
A code review agent processes 500 pull requests per day, making an average of 3 LLM calls per PR (summary, issues, suggestions). Many diffs are structurally similar — dependency updates, formatting changes, boilerplate additions. Cache hit rate at 0.95 threshold: approximately 35%.
- Average cost per request: $0.04 (Claude Sonnet, ~2,000 input tokens, ~600 output tokens)
- Daily requests: 1,500
- Cache hit rate: 35% (525 requests served from cache)
- Daily savings: 525 x $0.04 = $21.00/day
- Monthly savings: $630/month
Scenario 3: FAQ and knowledge base agent
An internal FAQ agent answers employee questions about company policies, benefits, and IT procedures. The question space is relatively bounded and highly repetitive. Cache hit rate at 0.92 threshold: approximately 65%.
- Average cost per request: $0.015 (GPT-4o-mini, ~500 input tokens, ~300 output tokens)
- Daily requests: 3,000
- Cache hit rate: 65% (1,950 requests served from cache)
- Daily savings: 1,950 x $0.015 = $29.25/day
- Monthly savings: $877.50/month
In all three scenarios, the monthly savings exceed the cost of the Govyn Team plan ($99/month) by a significant margin. The cache pays for itself within the first few days of operation.
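The arithmetic behind these scenarios reduces to one formula (daily savings = daily requests x hit rate x cost per request). A quick sketch reproduces the numbers above, assuming a 30-day month:

```python
scenarios = {
    "support triage": {"daily_requests": 2_000, "hit_rate": 0.55, "cost_per_request": 0.025},
    "code review":    {"daily_requests": 1_500, "hit_rate": 0.35, "cost_per_request": 0.04},
    "faq bot":        {"daily_requests": 3_000, "hit_rate": 0.65, "cost_per_request": 0.015},
}

for name, s in scenarios.items():
    daily = s["daily_requests"] * s["hit_rate"] * s["cost_per_request"]
    print(f"{name}: ${daily:.2f}/day, ${daily * 30:.2f}/month")

# support triage: $27.50/day, $825.00/month
# code review: $21.00/day, $630.00/month
# faq bot: $29.25/day, $877.50/month
```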
Latency improvements
LLM inference is inherently slow. A typical chat completion request takes 1-5 seconds for the full response, depending on the model, prompt length, and output length. Streaming helps perceived latency (the first token arrives sooner), but the total response time remains measured in seconds.
Semantic cache hits bypass the inference step entirely. The cached response is retrieved from the local cache index and returned to the client in under 10 milliseconds — typically 3-7ms. This represents a 100-500x latency improvement compared to a full provider round-trip.
For interactive applications, this difference is transformative. A customer support chatbot that responds in 5ms feels instantaneous. A code review agent that returns cached feedback immediately can process pull requests in batch without blocking developer workflows. An FAQ bot that answers common questions before the user finishes typing creates a fundamentally better user experience.
Even the embedding generation step (2-5ms) and cache lookup (under 1ms) are negligible. The total overhead on a cache miss — the time added before forwarding to the provider — is under 10ms. For a 2-second LLM inference call, this represents less than 0.5% additional latency.
Configuring semantic caching
Semantic caching is configured through YAML policies in your Govyn configuration file. The policy defines which agents, models, or request patterns should have caching enabled, along with threshold and TTL settings.
Here is a basic configuration that enables semantic caching for all agents:
```yaml
policies:
  - name: semantic-cache-global
    type: semantic_cache
    enabled: true
    config:
      similarity_threshold: 0.95
      ttl: 3600           # Cache entries expire after 1 hour
      max_entries: 50000  # Maximum cache size
    models:
      - gpt-4o
      - gpt-4o-mini
      - claude-sonnet-4-20250514
      - claude-haiku-4-20250414
```

You can also configure caching per agent or per model, with different thresholds for different use cases:
```yaml
policies:
  # Aggressive caching for the FAQ bot
  - name: faq-cache
    type: semantic_cache
    enabled: true
    agents:
      - faq-bot
      - support-triage
    config:
      similarity_threshold: 0.92
      ttl: 86400  # 24-hour TTL for stable FAQ content
      max_entries: 100000

  # Conservative caching for code review
  - name: code-review-cache
    type: semantic_cache
    enabled: true
    agents:
      - code-reviewer
    config:
      similarity_threshold: 0.97
      ttl: 1800  # 30-minute TTL for fresher results
      max_entries: 20000

  # Disable caching for creative writing agents
  - name: no-cache-creative
    type: semantic_cache
    enabled: false
    agents:
      - content-writer
      - marketing-copy
```
The similarity_threshold parameter controls how similar two prompts must be to trigger a cache hit. Values range from 0.0 (match everything) to 1.0 (exact match only). The sweet spot for most production workloads is 0.92-0.97.
The ttl parameter sets the time-to-live for cache entries in seconds. After this period, entries are evicted and subsequent requests trigger fresh provider calls. Set shorter TTLs for rapidly changing content and longer TTLs for stable reference data.
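To make the threshold's effect concrete, suppose three incoming prompts score the following (hypothetical) similarities against one cached entry. Raising the threshold trades hit rate for precision:

```python
# Hypothetical cosine similarities against a single cached entry:
candidates = {
    "What's France's capital?": 0.97,              # clear paraphrase
    "Which city is the capital of France?": 0.94,  # paraphrase, slightly looser
    "What is the capital of Germany?": 0.88,       # related topic, different answer
}

for threshold in (0.92, 0.95, 0.97):
    hits = [q for q, score in candidates.items() if score >= threshold]
    print(threshold, hits)

# 0.92 -> both paraphrases hit (higher hit rate, more risk on borderline matches)
# 0.95 -> only the closest paraphrase hits
# 0.97 -> only the closest paraphrase hits (0.97 >= 0.97)
```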
Use cases
Semantic caching delivers the highest value in workloads where the same conceptual questions recur with different phrasing. Here are the use cases where teams see the strongest results:
FAQ and knowledge base bots
FAQ bots operate over a bounded question space. Customers and employees ask the same core questions with endless variations in phrasing, grammar, and specificity. A well-configured semantic cache can serve 50-70% of requests from cache, dramatically reducing costs and response times. The cache effectively learns the question space over time, getting more efficient as it accumulates entries.
Customer support triage
Support triage agents classify incoming tickets by category, urgency, and routing destination. Since support issues cluster around common patterns (billing questions, password resets, feature requests, bug reports), a large fraction of triage classifications are semantically redundant. Caching these classifications reduces costs and ensures consistent routing.
Code review agents
Code review agents analyze pull requests for issues, style violations, and improvement suggestions. Many diffs are structurally similar — dependency version bumps, auto-formatted changes, boilerplate additions, and test file updates follow predictable patterns. Caching reviews for similar diffs avoids redundant inference on routine changes.
Data analysis queries
Agents that translate natural language to SQL, summarize datasets, or generate charts often receive semantically equivalent requests. "Show me revenue by month" and "Monthly revenue breakdown" produce the same analysis. Caching prevents redundant computation on equivalent analytical requests.
Content moderation
Moderation agents evaluate content against policy guidelines. Many content items are similar in nature — the same types of spam, the same categories of policy violations. Semantic caching allows the moderation system to instantly classify content that resembles previously evaluated items.
Document summarization
When multiple users request summaries of the same document or similar documents, semantic caching prevents redundant summarization calls. This is especially effective for shared knowledge bases, internal documentation, and report generation where the source material changes infrequently.
Supported providers
Semantic caching operates at the proxy layer, making it provider-agnostic. The cache is keyed by prompt semantics rather than by provider-specific request formats, so the same caching layer works in front of every provider Govyn routes to, with no provider-side setup.
Caching is supported for all providers that Govyn routes to:
- OpenAI — GPT-4o, GPT-4o-mini, o1, o3, and all chat completion models
- Anthropic — Claude Opus, Sonnet, Haiku (all versions)
- Azure OpenAI — all deployed models via Azure endpoints
- Google Gemini — Gemini Pro, Flash, and Ultra
- Mistral — Mistral Large, Medium, and Small
- Cohere — Command R, Command R+
- Ollama — any locally-hosted model
- Any OpenAI-compatible API — vLLM, llama.cpp, LocalAI, and others
Cache entries are segregated by model identifier. A response cached from gpt-4o will not be served for a claude-sonnet-4-20250514 request with the same prompt, since different models produce different outputs. If you want cross-model caching (useful for smart model routing scenarios), you can configure a shared cache namespace in the policy.
See the full list of supported providers on the integrations page.
Traditional caching vs semantic caching
The following comparison illustrates how semantic caching differs from exact-match caching and provider-level prompt caching:
| Capability | Exact Match | Provider Prompt Caching | Semantic Caching (Govyn) |
|---|---|---|---|
| Cache hit rate | Under 5% | Varies (prefix-dependent) | 40-65% |
| Handles rephrased queries | No | No | Yes |
| Cost savings | Minimal | Up to 90% on input tokens | 40-60% on total costs |
| Latency on hit | Under 1ms | Reduced, not eliminated | Under 10ms |
| Cross-provider | No | No | Yes |
| Requires code changes | Yes (cache key logic) | Provider-dependent (may need request annotations) | No (proxy-managed) |
| Configurable threshold | No (binary match) | No | Yes (0.0-1.0) |
| Full response caching | Yes | No (reduces computation only) | Yes |
| TTL control | Yes | Provider-managed | Yes |
| Works with any provider | Implementation-dependent | Provider-specific | Yes |
Provider-level prompt caching and Govyn semantic caching can be used together. Provider caching reduces input token processing costs on the provider side, while semantic caching eliminates entire API calls for semantically equivalent requests. The two approaches are complementary.
Cache management
Effective caching requires visibility and control. Govyn provides tools for monitoring cache performance and managing cache lifecycle.
TTL settings
Every cache entry has a time-to-live (TTL) measured in seconds. When the TTL expires, the entry is evicted and the next semantically similar request triggers a fresh provider call. Set TTL based on how frequently the underlying information changes:
- Static reference data (company policies, product specs): 86,400 seconds (24 hours) or longer
- Semi-stable content (documentation, FAQs): 3,600-14,400 seconds (1-4 hours)
- Dynamic content (news summaries, market data): 300-900 seconds (5-15 minutes)
- Rapidly changing data (real-time feeds): disable caching or use very short TTLs
Cache invalidation
Beyond TTL-based expiration, you can invalidate cache entries manually through the dashboard or API. Invalidation options include clearing entries for a specific agent, clearing entries for a specific model, clearing entries older than a specified date, and flushing the entire cache. Manual invalidation is useful when you know the underlying data has changed — for example, after updating your knowledge base or product documentation.
Monitoring cache hit rates
The Govyn dashboard displays real-time cache metrics including overall cache hit rate (as a percentage of total requests), cache hit rate by agent and by model, estimated cost savings from cache hits, average similarity score on cache hits, cache size and utilization, and latency distribution for cached versus uncached responses. These metrics help you tune similarity thresholds and TTL values for optimal performance.
Limitations and trade-offs
Semantic caching is not appropriate for every workload. Understanding its limitations helps you configure it correctly and avoid returning stale or inappropriate cached responses.
Real-time data queries
If your agent queries for current stock prices, live sports scores, breaking news, or any data that changes by the minute, caching will return stale results. Either disable caching for these agents or use very short TTLs (under 60 seconds) to limit staleness.
Creative and generative tasks
When variation is the goal — creative writing, brainstorming, marketing copy generation — caching defeats the purpose. Users expect different outputs for the same prompt. Disable caching for agents whose value comes from output diversity.
Long context windows
Semantic similarity is computed on the prompt content. For very long prompts (50,000+ tokens), the embedding may not capture all relevant details, and two prompts that are 95% similar may differ in a critical detail buried deep in the context. Use higher similarity thresholds (0.97+) for long-context workloads, or disable caching entirely for prompts above a certain length.
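If you take the length-based route, a simple guard in front of the lookup is enough. The cutoff and helper below are illustrative (reusing the SemanticCache sketch from earlier), not a built-in Govyn option:

```python
MAX_CACHEABLE_CHARS = 40_000  # illustrative cutoff; tune for your workload

def maybe_lookup(cache: SemanticCache, prompt: str):
    """Skip the semantic cache for very long prompts, where a single embedding
    may miss a critical detail buried deep in the context."""
    if len(prompt) > MAX_CACHEABLE_CHARS:
        return None  # treat as a cache miss and go straight to the provider
    return cache.lookup(prompt)
```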
Personalized responses
If the LLM response depends on user-specific context that is not part of the prompt (e.g., user preferences stored in a system prompt that varies per user), two semantically similar user prompts may require different responses. Ensure that all context-dependent information is included in the cache key, or scope caching to agents where responses are user-independent.
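One way to guard against this, sketched below under the assumption that your agent injects per-user context into the system prompt, is to make that context part of the text that gets embedded, or to fold a user identifier into the cache namespace. These helpers are illustrative, not a Govyn API:

```python
def cacheable_text(user_system_prompt: str, user_message: str) -> str:
    """Include the per-user system prompt in the embedded text, so two users
    with different preferences never collide on the same cache entry."""
    return f"{user_system_prompt}\n---\n{user_message}"

# Alternatively, widen the namespace instead of the embedding input:
def personalized_namespace(request: dict, user_id: str) -> tuple:
    return (request["model"], user_id)  # per-user buckets trade hit rate for safety
```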
Embedding model accuracy
The quality of semantic caching depends on the embedding model's ability to capture prompt meaning. Highly technical, domain-specific, or multi-lingual prompts may have lower embedding quality than general English text. Monitor cache hit quality (are cached responses actually appropriate?) and adjust thresholds accordingly.