Semantic Caching for AI APIs
Every team running AI agents at scale hits the same wall: LLM API costs grow linearly with request volume, even when a significant portion of those requests are semantically identical. An agent answering customer questions will encounter the same intent phrased dozens of different ways. A code review agent will see similar diffs across pull requests. A knowledge base search agent will field the same conceptual query with different wording every day.
Semantic caching solves this by storing LLM responses indexed by the meaning of the prompt, not its exact text. When a new request arrives at the Govyn proxy, the system generates a lightweight embedding of the prompt and checks the cache for stored responses with similar embeddings above a configurable similarity threshold. If a match is found, the cached response is returned in under 10 milliseconds — no provider API call, no token consumption, no latency penalty.
The result is straightforward: teams running workloads with repetitive semantic patterns typically see 40-60% cost reduction and sub-10ms response times on cache hits, with zero code changes to their agents.
The problem with traditional caching
Traditional caching systems use exact-match keys. If you cache the response to "What is the capital of France?", the cache only helps when the exact same string appears again. But users and agents rarely repeat themselves verbatim. The same question arrives as "What's France's capital?", "Tell me the capital city of France", "capital of france?", or "Which city is the capital of France?".
Exact-match caching misses all of these. Each variation triggers a full LLM inference call, consuming tokens and adding latency, even though the answer is identical every time. In production workloads, we observe that exact-match caching typically achieves hit rates below 5%, because natural language queries almost never repeat character-for-character.
Some teams attempt to normalize prompts before caching — lowercasing, stripping punctuation, removing whitespace. This helps marginally but still fails on rephrased queries, synonyms, and structural variations. The fundamental problem is that text similarity is not semantic similarity. Two prompts can share zero words in common and still mean exactly the same thing.
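To make this concrete, here is a minimal sketch of a normalized exact-match key (the helper function and prompts are illustrative, not part of Govyn). Even after lowercasing and stripping punctuation, each rephrasing produces a different key, so none of them hit the cache:

```python
import hashlib
import string

def normalized_key(prompt: str) -> str:
    """Exact-match cache key: lowercase, strip punctuation, collapse whitespace."""
    cleaned = prompt.lower().translate(str.maketrans("", "", string.punctuation))
    return hashlib.sha256(" ".join(cleaned.split()).encode()).hexdigest()

variants = [
    "What is the capital of France?",
    "What's France's capital?",
    "Tell me the capital city of France",
]

# Every rephrasing still produces a different key, so each one is a cache miss.
print({normalized_key(v) for v in variants})  # three distinct hashes
```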
Provider-level prompt caching (like Anthropic's prompt caching or OpenAI's cached completions) operates at a different layer entirely. These systems cache the internal KV-cache state for long shared prefixes, reducing processing time when the same system prompt or context prefix is reused. They do not cache full responses based on semantic similarity, and they are provider-specific — a cached prefix on Anthropic provides no benefit for OpenAI requests. Provider caching and semantic caching are complementary, not competing, approaches.
How semantic caching works
Govyn's semantic caching operates at the proxy layer, transparently intercepting requests before they reach the upstream LLM provider. The process follows a consistent flow for every request:
- Request arrives — an agent sends an LLM request to the Govyn proxy. The proxy extracts the prompt content from the request body (the messages array for chat completions, or the prompt field for legacy completions).
- Embedding generation — the proxy generates a compact vector embedding of the prompt content using a lightweight embedding model that runs locally within the proxy process. This step adds approximately 2-5ms of latency.
- Cache lookup — the proxy performs a nearest-neighbor search against the cache index, looking for stored embeddings with cosine similarity above the configured threshold (default: 0.95).
- Cache hit — if a match is found, the stored response is returned immediately. The response is delivered in the same format the provider would use, including proper SSE framing for streaming requests. Total response time: under 10ms.
- Cache miss — if no match exceeds the threshold, the request is forwarded to the upstream provider normally. When the provider response completes, the prompt embedding and full response are stored in the cache with the configured TTL.
The embedding model runs inside the proxy process — no external API calls are required for embedding generation. This keeps the overhead minimal and avoids adding another external dependency or cost center. The cache index uses an efficient approximate nearest-neighbor (ANN) structure that scales to millions of entries while keeping lookups under a millisecond.
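A condensed sketch of this lookup-and-store flow is below. The class, method names, and linear scan are illustrative stand-ins, not Govyn's internals; a production proxy would use an ANN index rather than a list:

```python
import math
import time
from typing import Callable, Optional

class SemanticCache:
    """Illustrative semantic cache: linear scan over stored embeddings."""

    def __init__(self, embed: Callable[[str], list[float]],
                 threshold: float = 0.95, ttl: float = 3600.0):
        self.embed = embed          # local embedding model, e.g. a small sentence encoder
        self.threshold = threshold  # cosine-similarity cutoff for a hit
        self.ttl = ttl
        self.entries: list[tuple[list[float], str, float]] = []  # (embedding, response, stored_at)

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    def lookup(self, prompt: str) -> Optional[str]:
        """Return a cached response if any live entry clears the threshold."""
        query = self.embed(prompt)
        now = time.time()
        best, best_score = None, 0.0
        for emb, response, stored_at in self.entries:
            if now - stored_at > self.ttl:
                continue  # expired entry: ignored, eventually evicted
            score = self._cosine(query, emb)
            if score >= self.threshold and score > best_score:
                best, best_score = response, score
        return best  # None means cache miss: forward to the provider

    def store(self, prompt: str, response: str) -> None:
        """Called after a cache miss, once the provider response completes."""
        self.entries.append((self.embed(prompt), response, time.time()))

# Usage (with any local embedding function you supply):
#   cache = SemanticCache(embed=my_embedding_model, threshold=0.95, ttl=3600)
#   hit = cache.lookup(prompt)            # embed + nearest-neighbor search
#   if hit is None:
#       response = call_provider(prompt)  # cache miss: normal provider call
#       cache.store(prompt, response)
```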
Cache entries are keyed by the combination of prompt embedding, model identifier, and relevant request parameters (temperature, max tokens, etc.). This means a request for gpt-4o and a request for gpt-4o-mini with the same prompt are cached separately, since different models produce different outputs.
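Continuing the sketch above, the namespace part of the key might look like the following. The request shape and the exact parameter set included here are assumptions for illustration:

```python
def cache_namespace(request: dict) -> tuple:
    """Bucket cache entries per model and sampling parameters, so a gpt-4o
    response is never served for a gpt-4o-mini request with the same prompt."""
    return (
        request["model"],                 # e.g. "gpt-4o"
        request.get("temperature", 1.0),  # sampling parameters are part of the key
        request.get("max_tokens"),
    )

# One similarity index per namespace; semantic matching only happens among
# entries that were stored under the same namespace.
indexes: dict[tuple, SemanticCache] = {}

def index_for(request: dict, embed) -> SemanticCache:
    """Get or create the per-namespace index (reusing the SemanticCache sketch above)."""
    return indexes.setdefault(cache_namespace(request), SemanticCache(embed=embed))
```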
Cost savings analysis
The cost impact of semantic caching depends on two factors: your cache hit rate (what percentage of requests match a cached response) and your average cost per request. Here is the math for common scenarios:
Scenario 1: Customer support agent
A customer support triage agent handles 2,000 requests per day. Support queries are highly repetitive — customers ask the same questions with different wording. With semantic caching at a 0.93 threshold, the cache hit rate is approximately 55%.
- Average cost per request: $0.025 (GPT-4o, ~800 input tokens, ~400 output tokens)
- Daily requests: 2,000
- Cache hit rate: 55% (1,100 requests served from cache)
- Daily savings: 1,100 x $0.025 = $27.50/day
- Monthly savings: $825/month
Scenario 2: Code review agent
A code review agent processes 500 pull requests per day, making an average of 3 LLM calls per PR (summary, issues, suggestions). Many diffs are structurally similar — dependency updates, formatting changes, boilerplate additions. Cache hit rate at 0.95 threshold: approximately 35%.
- Average cost per request: $0.04 (Claude Sonnet, ~2,000 input tokens, ~600 output tokens)
- Daily requests: 1,500
- Cache hit rate: 35% (525 requests served from cache)
- Daily savings: 525 x $0.04 = $21.00/day
- Monthly savings: $630/month
Scenario 3: FAQ and knowledge base agent
An internal FAQ agent answers employee questions about company policies, benefits, and IT procedures. The question space is relatively bounded and highly repetitive. Cache hit rate at 0.92 threshold: approximately 65%.
- Average cost per request: $0.015 (GPT-4o-mini, ~500 input tokens, ~300 output tokens)
- Daily requests: 3,000
- Cache hit rate: 65% (1,950 requests served from cache)
- Daily savings: 1,950 x $0.015 = $29.25/day
- Monthly savings: $877.50/month
In all three scenarios, the monthly savings exceed the cost of the Govyn Team plan ($99/month) by a significant margin. The cache pays for itself within the first few days of operation.
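The arithmetic behind these scenarios reduces to one formula (daily savings = daily requests x hit rate x cost per request). A quick sketch reproduces the numbers above, assuming a 30-day month:

```python
scenarios = {
    "support triage": {"daily_requests": 2_000, "hit_rate": 0.55, "cost_per_request": 0.025},
    "code review":    {"daily_requests": 1_500, "hit_rate": 0.35, "cost_per_request": 0.04},
    "faq bot":        {"daily_requests": 3_000, "hit_rate": 0.65, "cost_per_request": 0.015},
}

for name, s in scenarios.items():
    daily = s["daily_requests"] * s["hit_rate"] * s["cost_per_request"]
    print(f"{name}: ${daily:.2f}/day, ${daily * 30:.2f}/month")

# support triage: $27.50/day, $825.00/month
# code review: $21.00/day, $630.00/month
# faq bot: $29.25/day, $877.50/month
```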
Latency improvements
LLM inference is inherently slow. A typical chat completion request takes 1-5 seconds for the full response, depending on the model, prompt length, and output length. Streaming helps perceived latency (the first token arrives sooner), but the total response time remains measured in seconds.
Semantic cache hits bypass the inference step entirely. The cached response is retrieved from the local cache index and returned to the client in under 10 milliseconds — typically 3-7ms. This represents a 100-500x latency improvement compared to a full provider round-trip.
For interactive applications, this difference is transformative. A customer support chatbot that responds in 5ms feels instantaneous. A code review agent that returns cached feedback immediately can process pull requests in batch without blocking developer workflows. An FAQ bot that answers common questions before the user finishes typing creates a fundamentally better user experience.
Even the embedding generation step (2-5ms) and cache lookup (under 1ms) are negligible. The total overhead on a cache miss — the time added before forwarding to the provider — is under 10ms. For a 2-second LLM inference call, this represents less than 0.5% additional latency.
Configuring semantic caching
Semantic caching is configured through YAML policies in your Govyn configuration file. The policy defines which agents, models, or request patterns should have caching enabled, along with threshold and TTL settings.
Here is a basic configuration that enables semantic caching for all agents:
```yaml
policies:
  - name: semantic-cache-global
    type: semantic_cache
    enabled: true
    config:
      similarity_threshold: 0.95
      ttl: 3600           # Cache entries expire after 1 hour
      max_entries: 50000  # Maximum cache size
    models:
      - gpt-4o
      - gpt-4o-mini
      - claude-sonnet-4-20250514
      - claude-haiku-4-20250414
```

You can also configure caching per agent or per model, with different thresholds for different use cases:
```yaml
policies:
  # Aggressive caching for the FAQ bot
  - name: faq-cache
    type: semantic_cache
    enabled: true
    agents:
      - faq-bot
      - support-triage
    config:
      similarity_threshold: 0.92
      ttl: 86400  # 24-hour TTL for stable FAQ content
      max_entries: 100000

  # Conservative caching for code review
  - name: code-review-cache
    type: semantic_cache
    enabled: true
    agents:
      - code-reviewer
    config:
      similarity_threshold: 0.97
      ttl: 1800  # 30-minute TTL for fresher results
      max_entries: 20000

  # Disable caching for creative writing agents
  - name: no-cache-creative
    type: semantic_cache
    enabled: false
    agents:
      - content-writer
      - marketing-copy
```
The similarity_threshold parameter controls how similar two prompts must be to trigger a cache hit. Values range from 0.0 (match everything) to 1.0 (exact match only). The sweet spot for most production workloads is 0.92-0.97.
The ttl parameter sets the time-to-live for cache entries in seconds. After this period, entries are evicted and subsequent requests trigger fresh provider calls. Set shorter TTLs for rapidly changing content and longer TTLs for stable reference data.
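To make the threshold's effect concrete, suppose three incoming prompts score the following (hypothetical) similarities against one cached entry. Raising the threshold trades hit rate for precision:

```python
# Hypothetical cosine similarities against a single cached entry:
candidates = {
    "What's France's capital?": 0.97,              # clear paraphrase
    "Which city is the capital of France?": 0.94,  # paraphrase, slightly looser
    "What is the capital of Germany?": 0.88,       # related topic, different answer
}

for threshold in (0.92, 0.95, 0.97):
    hits = [q for q, score in candidates.items() if score >= threshold]
    print(threshold, hits)

# 0.92 -> both paraphrases hit (higher hit rate, more risk on borderline matches)
# 0.95 -> only the closest paraphrase hits
# 0.97 -> only the closest paraphrase hits (0.97 >= 0.97)
```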
Use cases
Semantic caching delivers the highest value in workloads where the same conceptual questions recur with different phrasing. Here are the use cases where teams see the strongest results:
FAQ and knowledge base bots
FAQ bots operate over a bounded question space. Customers and employees ask the same core questions with endless variations in phrasing, grammar, and specificity. A well-configured semantic cache can serve 50-70% of requests from cache, dramatically reducing costs and response times. The cache effectively learns the question space over time, getting more efficient as it accumulates entries.
Customer support triage
Support triage agents classify incoming tickets by category, urgency, and routing destination. Since support issues cluster around common patterns (billing questions, password resets, feature requests, bug reports), a large fraction of triage classifications are semantically redundant. Caching these classifications reduces costs and ensures consistent routing.
Code review agents
Code review agents analyze pull requests for issues, style violations, and improvement suggestions. Many diffs are structurally similar — dependency version bumps, auto-formatted changes, boilerplate additions, and test file updates follow predictable patterns. Caching reviews for similar diffs avoids redundant inference on routine changes.
Data analysis queries
Agents that translate natural language to SQL, summarize datasets, or generate charts often receive semantically equivalent requests. "Show me revenue by month" and "Monthly revenue breakdown" produce the same analysis. Caching prevents redundant computation on equivalent analytical requests.
Content moderation
Moderation agents evaluate content against policy guidelines. Many content items are similar in nature — the same types of spam, the same categories of policy violations. Semantic caching allows the moderation system to instantly classify content that resembles previously evaluated items.
Document summarization
When multiple users request summaries of the same document or similar documents, semantic caching prevents redundant summarization calls. This is especially effective for shared knowledge bases, internal documentation, and report generation where the source material changes infrequently.
Supported providers
Semantic caching operates at the proxy layer, making it provider-agnostic. The cache is keyed by prompt semantics rather than by provider-specific request formats, so the same caching layer works in front of every provider Govyn routes to, with no provider-side setup.
Caching is supported for all providers that Govyn routes to:
- OpenAI — GPT-4o, GPT-4o-mini, o1, o3, and all chat completion models
- Anthropic — Claude Opus, Sonnet, Haiku (all versions)
- Azure OpenAI — all deployed models via Azure endpoints
- Google Gemini — Gemini Pro, Flash, and Ultra
- Mistral — Mistral Large, Medium, and Small
- Cohere — Command R, Command R+
- Ollama — any locally-hosted model
- Any OpenAI-compatible API — vLLM, llama.cpp, LocalAI, and others
Cache entries are segregated by model identifier. A response cached from gpt-4o will not be served for a claude-sonnet-4-20250514 request with the same prompt, since different models produce different outputs. If you want cross-model caching (useful for smart model routing scenarios), you can configure a shared cache namespace in the policy.
See the full list of supported providers on the integrations page.
Traditional caching vs semantic caching
The following comparison illustrates how semantic caching differs from exact-match caching and provider-level prompt caching:
| Capability | Exact Match | Provider Prompt Caching | Semantic Caching (Govyn) |
|---|---|---|---|
| Cache hit rate | Under 5% | Varies (prefix-dependent) | 40-65% |
| Handles rephrased queries | No | No | Yes |
| Cost savings | Minimal | Up to 90% on input tokens | 40-60% on total costs |
| Latency on hit | Under 1ms | Reduced, not eliminated | Under 10ms |
| Cross-provider | No | No | Yes |
| Requires code changes | Yes (cache key logic) | Provider-dependent (may need request annotations) | No (proxy-managed) |
| Configurable threshold | No (binary match) | No | Yes (0.0-1.0) |
| Full response caching | Yes | No (reduces computation only) | Yes |
| TTL control | Yes | Provider-managed | Yes |
| Works with any provider | Implementation-dependent | Provider-specific | Yes |
Provider-level prompt caching and Govyn semantic caching can be used together. Provider caching reduces input token processing costs on the provider side, while semantic caching eliminates entire API calls for semantically equivalent requests. The two approaches are complementary.
Cache management
Effective caching requires visibility and control. Govyn provides tools for monitoring cache performance and managing cache lifecycle.
TTL settings
Every cache entry has a time-to-live (TTL) measured in seconds. When the TTL expires, the entry is evicted and the next semantically similar request triggers a fresh provider call. Set TTL based on how frequently the underlying information changes:
- Static reference data (company policies, product specs): 86,400 seconds (24 hours) or longer
- Semi-stable content (documentation, FAQs): 3,600-14,400 seconds (1-4 hours)
- Dynamic content (news summaries, market data): 300-900 seconds (5-15 minutes)
- Rapidly changing data (real-time feeds): disable caching or use very short TTLs
Cache invalidation
Beyond TTL-based expiration, you can invalidate cache entries manually through the dashboard or API. Invalidation options include clearing entries for a specific agent, clearing entries for a specific model, clearing entries older than a specified date, and flushing the entire cache. Manual invalidation is useful when you know the underlying data has changed — for example, after updating your knowledge base or product documentation.
Monitoring cache hit rates
The Govyn dashboard displays real-time cache metrics including overall cache hit rate (as a percentage of total requests), cache hit rate by agent and by model, estimated cost savings from cache hits, average similarity score on cache hits, cache size and utilization, and latency distribution for cached versus uncached responses. These metrics help you tune similarity thresholds and TTL values for optimal performance.
Limitations and trade-offs
Semantic caching is not appropriate for every workload. Understanding its limitations helps you configure it correctly and avoid returning stale or inappropriate cached responses.
Real-time data queries
If your agent queries for current stock prices, live sports scores, breaking news, or any data that changes by the minute, caching will return stale results. Either disable caching for these agents or use very short TTLs (under 60 seconds) to limit staleness.
Creative and generative tasks
When variation is the goal — creative writing, brainstorming, marketing copy generation — caching defeats the purpose. Users expect different outputs for the same prompt. Disable caching for agents whose value comes from output diversity.
Long context windows
Semantic similarity is computed on the prompt content. For very long prompts (50,000+ tokens), the embedding may not capture all relevant details, and two prompts that are 95% similar may differ in a critical detail buried deep in the context. Use higher similarity thresholds (0.97+) for long-context workloads, or disable caching entirely for prompts above a certain length.
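If you take the length-based route, a simple guard in front of the lookup is enough. The cutoff and helper below are illustrative (reusing the SemanticCache sketch from earlier), not a built-in Govyn option:

```python
MAX_CACHEABLE_CHARS = 40_000  # illustrative cutoff; tune for your workload

def maybe_lookup(cache: SemanticCache, prompt: str):
    """Skip the semantic cache for very long prompts, where a single embedding
    may miss a critical detail buried deep in the context."""
    if len(prompt) > MAX_CACHEABLE_CHARS:
        return None  # treat as a cache miss and go straight to the provider
    return cache.lookup(prompt)
```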
Personalized responses
If the LLM response depends on user-specific context that is not part of the prompt (e.g., user preferences stored in a system prompt that varies per user), two semantically similar user prompts may require different responses. Ensure that all context-dependent information is included in the cache key, or scope caching to agents where responses are user-independent.
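One way to guard against this, sketched below under the assumption that your agent injects per-user context into the system prompt, is to make that context part of the text that gets embedded, or to fold a user identifier into the cache namespace. These helpers are illustrative, not a Govyn API:

```python
def cacheable_text(user_system_prompt: str, user_message: str) -> str:
    """Include the per-user system prompt in the embedded text, so two users
    with different preferences never collide on the same cache entry."""
    return f"{user_system_prompt}\n---\n{user_message}"

# Alternatively, widen the namespace instead of the embedding input:
def personalized_namespace(request: dict, user_id: str) -> tuple:
    return (request["model"], user_id)  # per-user buckets trade hit rate for safety
```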
Embedding model accuracy
The quality of semantic caching depends on the embedding model's ability to capture prompt meaning. Highly technical, domain-specific, or multi-lingual prompts may have lower embedding quality than general English text. Monitor cache hit quality (are cached responses actually appropriate?) and adjust thresholds accordingly.