We Cut Our AI API Bill by 73% Without Changing a Single Line of Agent Code


How smart model routing through a proxy cut our OpenAI and Anthropic bill from $2,140/mo to $578/mo. Zero code changes. Just YAML.


The bill that started it

We were running a team of AI agents — customer support triage, code review, internal docs search, daily reporting, Slack summarization. All pointing at GPT-4o and Claude Sonnet. The standard setup.

Month one: $1,860. Month two: $2,140. Month three was trending higher.

We were not doing anything exotic. Five agents, normal workloads. But every request — whether it was summarizing a three-line Slack thread or analyzing a 200-file pull request — went to the same premium model. A two-sentence classification task cost the same per-token as a complex multi-step reasoning chain.

We looked at the request logs. The breakdown was striking:

  • 42% of requests were under 500 input tokens — simple classifications, yes/no questions, short summaries
  • 31% were 500-4,000 tokens — moderate tasks, standard analysis, formatting
  • 27% were 4,000+ tokens — complex reasoning, long-context analysis, multi-step planning

Nearly three-quarters of our requests did not need a premium model. They needed a model that could read a short prompt and return a short answer. The difference in quality between GPT-4o and GPT-4o-mini for “Is this support ticket urgent? Yes or no” is negligible. The difference in cost is 16x.
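The gap is easy to verify with list prices (assumed here: $2.50 per million input tokens for GPT-4o versus $0.15 for GPT-4o-mini; rates change, so check the current pricing pages):

```python
# Illustrative list prices in dollars per 1M input tokens (verify current rates).
PRICE_PER_M = {"gpt-4o": 2.50, "gpt-4o-mini": 0.15}

def input_cost(model: str, tokens: int) -> float:
    """Input-token cost in dollars for one request."""
    return tokens / 1_000_000 * PRICE_PER_M[model]

# A 500-token classification prompt on each model:
premium = input_cost("gpt-4o", 500)       # $0.00125
mini = input_cost("gpt-4o-mini", 500)     # $0.000075
print(f"{premium / mini:.1f}x")           # ~16.7x on input tokens
```

Output quality on a yes/no classification barely moves; per-request cost moves by an order of magnitude.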


Smart model routing

The concept is simple: inspect each LLM request before it reaches the provider and route it to the cheapest model that can handle it. This is what smart model routing does at the proxy level.

Short, simple requests go to a mini/haiku-class model. Medium-complexity requests go to a mid-tier model. Only the genuinely complex requests reach the premium model.

Here is the Govyn routing config that did it:

routing:
  rules:
    - name: short_to_mini
      condition:
        input_tokens_below: 500
      route_to: gpt-4o-mini

    - name: medium_to_sonnet
      condition:
        input_tokens_below: 4000
      route_to: claude-sonnet-4-6

    - name: default
      route_to: gpt-4o

Three rules, evaluated top to bottom with the first match winning. Requests under 500 input tokens go to GPT-4o-mini. Requests under 4,000 go to Claude Sonnet. Everything else falls through to the GPT-4o default.
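First-match-wins threshold routing is simple enough to sketch in a few lines. This is a toy re-implementation mirroring the YAML above, not Govyn's actual code:

```python
# Toy first-match-wins router; rule shapes mirror the YAML config above.
RULES = [
    {"name": "short_to_mini", "input_tokens_below": 500, "route_to": "gpt-4o-mini"},
    {"name": "medium_to_sonnet", "input_tokens_below": 4000, "route_to": "claude-sonnet-4-6"},
    {"name": "default", "route_to": "gpt-4o"},  # no condition: always matches
]

def route(input_tokens: int) -> str:
    """Return the model this request should be rewritten to."""
    for rule in RULES:
        limit = rule.get("input_tokens_below")
        if limit is None or input_tokens < limit:
            return rule["route_to"]
    return "gpt-4o"  # unreachable while a default rule is present
```

Rule order matters: swap the first two rules and every short request would match the 4,000-token rule first and land on Sonnet.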

The agents do not know this is happening. They send every request to the proxy URL, thinking they are calling GPT-4o. The proxy rewrites the model field and forwards to the appropriate provider. The response comes back in the same format the agent expects. Transparent.


The before and after

Here is the monthly cost comparison across our five agents:

Before (all requests to GPT-4o / Claude Sonnet)

Agent            Requests/mo   Avg tokens   Monthly cost
Support triage   12,400        380          $285
Code review      3,200         6,800        $740
Docs search      8,600         1,200        $390
Daily reports    1,800         4,200        $410
Slack summary    6,200         520          $315
Total            32,200        —            $2,140

After (smart routing via Govyn proxy)

Agent            Routed to mini   Routed to mid   Routed to premium   Monthly cost
Support triage   89%              11%             0%                  $34
Code review      5%               22%             73%                 $572
Docs search      32%              61%             7%                  $108
Daily reports    8%               45%             47%                 $238
Slack summary    78%              20%             2%                  $52
Total            —                —               —                   $1,004

That is a 53% reduction from routing alone. But we were not done.
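The arithmetic is easy to sanity-check (figures copied from the two tables above):

```python
# Monthly cost per agent before and after routing, from the tables above.
before = {"support_triage": 285, "code_review": 740, "docs_search": 390,
          "daily_reports": 410, "slack_summary": 315}
after = {"support_triage": 34, "code_review": 572, "docs_search": 108,
         "daily_reports": 238, "slack_summary": 52}

total_before = sum(before.values())   # 2,140
total_after = sum(after.values())     # 1,004
reduction = 1 - total_after / total_before
print(f"${total_before} -> ${total_after}: {reduction:.0%} saved")  # 53% saved
```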


Adding budget caps and loop detection

Routing was the biggest win, but the proxy gave us two more levers.

Loop detection caught our code review agent retrying failed parse operations. It would hit a malformed diff, fail to parse it, retry with a slightly different prompt, fail again, and loop. Before the proxy, these loops ran until the context window filled up. With loop detection, the proxy identified five near-identical requests in 60 seconds and blocked subsequent calls.

agents:
  code_review:
    loop_detection:
      enabled: true
      window: 60s
      max_identical_requests: 5
      similarity_threshold: 0.85
      action: block
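Conceptually, this kind of detection keeps a sliding window of recent prompts per agent and blocks once too many near-duplicates accumulate. A toy version (not Govyn's implementation) using `difflib` for the similarity score:

```python
import time
from collections import deque
from difflib import SequenceMatcher

class LoopDetector:
    """Block when too many near-identical prompts arrive inside a time window."""

    def __init__(self, window_s=60, max_identical=5, threshold=0.85):
        self.window_s = window_s
        self.max_identical = max_identical
        self.threshold = threshold
        self.recent = deque()  # (timestamp, prompt) pairs

    def allow(self, prompt, now=None):
        """Return False if this request should be blocked as part of a loop."""
        now = time.monotonic() if now is None else now
        # Evict entries older than the detection window.
        while self.recent and now - self.recent[0][0] > self.window_s:
            self.recent.popleft()
        similar = sum(
            1 for _, p in self.recent
            if SequenceMatcher(None, p, prompt).ratio() >= self.threshold
        )
        self.recent.append((now, prompt))
        # Counting this request itself, block at the configured limit.
        return similar + 1 < self.max_identical
```

With the defaults above, the fifth near-identical prompt inside 60 seconds is refused, which matches the behavior described for the code review agent.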

Budget caps caught our Slack summary agent occasionally receiving a dump of an entire channel history. One 50,000-token request to GPT-4o costs more than 500 normal requests to GPT-4o-mini. Daily budget caps kept these outliers from blowing the monthly budget.

agents:
  slack_summary:
    budget:
      daily: $5.00
      alert_at: 80%
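Proxy-side budget enforcement is just an accumulator checked before each request is forwarded. A minimal sketch of the logic (hypothetical, not Govyn internals):

```python
class DailyBudget:
    """Track one agent's daily spend; refuse requests past the cap, warn near it."""

    def __init__(self, cap_usd: float, alert_at: float = 0.80):
        self.cap = cap_usd
        self.alert_at = alert_at
        self.spent = 0.0

    def charge(self, cost_usd: float) -> str:
        """Return 'block', 'alert', or 'ok' for a request costing cost_usd."""
        if self.spent + cost_usd > self.cap:
            return "block"            # would exceed the daily cap; do not forward
        self.spent += cost_usd
        if self.spent >= self.cap * self.alert_at:
            return "alert"            # past the alert threshold, still allowed
        return "ok"
```

A $5.00 cap with an 80% alert threshold starts warning at $4.00 and refuses any request that would push the day past $5.00, so a single 50,000-token outlier gets stopped instead of silently billed.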

With routing, loop detection, and budget caps combined:

                           Before          After   Savings
Monthly cost               $2,140          $578    $1,562 (73%)
Wasted loop requests       ~1,800/mo       0       100% eliminated
Budget overrun incidents   3 in 2 months   0       100% eliminated

Seventy-three percent reduction. Same agents. Same code. Same prompts. Same outputs. The agents did not know anything changed.


Why only a proxy can do this

An SDK wrapper cannot transparently rewrite the model field in an outgoing request. Here is why:

The SDK wrapper runs inside the agent process. The agent calls openai.chat.completions.create(model="gpt-4o", ...). The wrapper can intercept this call, but the agent controls the HTTP client. If the agent uses a different client, imports the library directly, or makes a raw HTTP request, the wrapper is bypassed. The model field arrives at OpenAI exactly as the agent specified.

A proxy intercepts the HTTP request at the network layer. The request arrives at the proxy with "model": "gpt-4o". The proxy rewrites it to "model": "gpt-4o-mini" and forwards. The agent’s HTTP client, library version, and language are irrelevant. The proxy controls the request after it leaves the agent process and before it reaches the provider. For a deeper dive into this architecture, see our post on proxy vs SDK governance.

This is the same reason proxy-level budget enforcement works where SDK-level enforcement does not. The SDK tracks spending inside the process that holds the API key. The proxy tracks spending outside the process, and the process has no key.

For cost optimization specifically, the proxy model has another advantage: centralized metrics. When five agents route through one proxy, you get a single dashboard showing cost per agent, cost per model, cost per hour, and cost trends. With SDK wrappers, each agent tracks independently. Correlating costs across agents means aggregating logs from five different processes.
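With every request flowing through one chokepoint, those rollups are a plain aggregation over the proxy's request log. The record shape below is illustrative, not Govyn's actual log schema:

```python
from collections import defaultdict

# Illustrative proxy log: one record per forwarded request (assumed schema).
log = [
    {"agent": "support_triage", "model": "gpt-4o-mini", "cost": 0.0004},
    {"agent": "support_triage", "model": "gpt-4o-mini", "cost": 0.0003},
    {"agent": "code_review", "model": "gpt-4o", "cost": 0.0480},
    {"agent": "code_review", "model": "claude-sonnet-4-6", "cost": 0.0210},
]

# One pass yields cost per agent and cost per model simultaneously.
cost_by_agent = defaultdict(float)
cost_by_model = defaultdict(float)
for rec in log:
    cost_by_agent[rec["agent"]] += rec["cost"]
    cost_by_model[rec["model"]] += rec["cost"]
```

Doing the same with SDK wrappers means shipping and joining log files from every agent process before you can ask "which agent spent the most yesterday?"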


The config in full

Here is the complete Govyn configuration that achieved the 73% reduction. This is the entire file:

# govyn.yaml
proxy:
  listen: "0.0.0.0:4000"

providers:
  openai:
    api_key: "${OPENAI_API_KEY}"
  anthropic:
    api_key: "${ANTHROPIC_API_KEY}"

routing:
  rules:
    - name: short_to_mini
      condition:
        input_tokens_below: 500
      route_to: gpt-4o-mini
    - name: medium_to_sonnet
      condition:
        input_tokens_below: 4000
      route_to: claude-sonnet-4-6
    - name: default
      route_to: gpt-4o

agents:
  code_review:
    loop_detection:
      enabled: true
      window: 60s
      max_identical_requests: 5
      similarity_threshold: 0.85
      action: block
  slack_summary:
    budget:
      daily: $5.00
      alert_at: 80%

global:
  budget:
    monthly: $800.00
    alert_at: 80%

Under forty lines of YAML. No code changes. No agent modifications. No library upgrades. Point your agents at the proxy URL instead of the provider URL. Done.


Try it

Install Govyn, paste the config above (swap in your API keys), and point one agent at the proxy. Run it for a day. Compare the cost to the previous day.

The numbers will speak for themselves.


Govyn is an open-source API proxy for AI agent governance. MIT licensed. Self-host or cloud-hosted.

Start saving →
