Trust but Verify: How to Detect Token Count Manipulation in AI API Pipelines

How to independently verify provider-reported token counts using BPE estimation, catch discrepancies before they inflate your AI bill, and build cost integrity into your pipeline.


The invoice nobody questioned

A team running 40 AI agents across three providers noticed something in their March invoice. Token usage for their Anthropic agents was up 23% month-over-month. Request volume was flat. Prompt lengths had not changed. Output quality was the same. But the bill was $4,200 higher than February.

They investigated. Anthropic had updated their tokenizer between billing periods. The same prompt that tokenized to 1,200 tokens in February now reported 1,480 tokens. The responses were identical. The work was identical. The cost was not.

Nobody on the team caught it for 31 days. They had no independent way to verify the token counts their providers reported.

This is the cost integrity problem. You are billed based on numbers that your provider calculates, using a tokenizer that your provider controls, reported through an API that your provider operates. You have no second opinion. You accept the usage field in the API response as ground truth, multiply by the published per-token price, and pay the invoice.

For a 10-agent operation spending $5,000/month on LLM calls, a silent 15% overcount costs $9,000/year. At scale — hundreds of agents, multiple providers — the exposure is significant. Not because providers are malicious. Because systems have bugs, tokenizers get updated, streaming parsers drop or duplicate chunks, and nobody is checking.


Why you cannot blindly trust provider-reported token counts

Provider-reported token counts are the only input to your AI bill. Every dollar you pay is a function of input_tokens * input_price + output_tokens * output_price. You control none of those variables.

There are at least five categories of discrepancy:

Tokenizer updates. Providers periodically update their tokenizers. OpenAI has shipped at least three encoding schemes (r50k_base, cl100k_base, o200k_base) since GPT-3. Each one changes how the same text maps to tokens. An encoding change does not change the quality of the response. It changes the price.

Streaming duplication. When using server-sent events (SSE) for streaming responses, the usage data arrives as a separate event at the end of the stream. But some provider implementations double-count partial chunks, especially when the stream is interrupted and retried. The usage field says 2,400 output tokens. Your independently counted output is 1,100 tokens.

Overhead tokens. Providers add system tokens that are not part of your prompt or response — conversation delimiters, role markers, tool-use framing, safety preambles. These are real tokens consumed by the model, but they are invisible to you. The question is whether you should pay for them. At minimum, you should know they exist.

Provider bugs. In 2024, a widely-reported OpenAI API issue caused streaming responses to report usage objects with inflated completion_tokens for certain function-calling prompts. The responses were correct. The billing was wrong. It was fixed within days. Teams without independent verification paid the inflated rate until the fix shipped.

Compromised endpoints. If your agent traffic routes through a compromised intermediary (a misconfigured proxy, a malicious cache layer, a man-in-the-middle on your egress), the usage field in the response can be rewritten. The response content looks correct. The token counts are fabricated. Your cost data is fiction.

None of these scenarios require malice. They require only that you have a single source of truth for a number that determines your bill. The fix is a second source.


How token counting actually works: BPE tokenization

Before we can verify token counts, we need to understand what a token count is.

Large language models do not process text character by character. They process it in chunks called tokens. The mapping from text to tokens is determined by a tokenizer — specifically, a Byte Pair Encoding (BPE) tokenizer.

BPE works by iteratively merging the most frequent pair of bytes (or characters) in a training corpus until a target vocabulary size is reached. The result is a lookup table — called a rank table — that maps byte sequences to token IDs. Common words become single tokens. Rare words get split into subword pieces. The word “tokenization” might become two tokens: “token” + “ization”. The word “defenestration” might become three: “def” + “en” + “estration”.

The key insight for cost verification: BPE is deterministic. Given the same rank table, the same input text always produces the same token sequence. There is no randomness. If you have the rank table, you can reproduce the token count exactly.
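A toy version of the merge loop makes that determinism concrete. The rank table below is invented for illustration; real tables such as o200k_base are learned from a corpus and contain roughly 200,000 entries:

```typescript
// Toy BPE encoder: repeatedly merge the lowest-ranked adjacent pair
// until no mergeable pair remains. Ranks here are invented for
// illustration; real rank tables are learned from a training corpus.
const toyRanks = new Map<string, number>([
  ["en", 0],
  ["ken", 1],
  ["oken", 2],
  ["token", 3],
]);

function toyEncode(text: string): string[] {
  let parts = text.split(""); // start from single characters
  for (;;) {
    // Find the adjacent pair with the lowest (earliest-learned) rank.
    let bestIdx = -1;
    let bestRank = Infinity;
    for (let i = 0; i < parts.length - 1; i++) {
      const rank = toyRanks.get(parts[i] + parts[i + 1]);
      if (rank !== undefined && rank < bestRank) {
        bestRank = rank;
        bestIdx = i;
      }
    }
    if (bestIdx === -1) break; // no pair left in the rank table
    parts = [
      ...parts.slice(0, bestIdx),
      parts[bestIdx] + parts[bestIdx + 1],
      ...parts.slice(bestIdx + 2),
    ];
  }
  return parts;
}

// With this table, "token" merges all the way down to a single token,
// while "tokenization" becomes "token" plus leftover single characters.
```

Run it twice on the same input and you get the same sequence every time. That reproducibility is exactly the property the verification strategy relies on.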

OpenAI publishes their rank tables. The current one for GPT-4o, GPT-4.1, and o-series models is called o200k_base — a vocabulary of roughly 200,000 tokens. The previous generation (GPT-4, GPT-3.5-turbo) used cl100k_base with ~100,000 tokens.

Anthropic does not publish their tokenizer rank table. Claude models use a proprietary tokenizer. You cannot reproduce Claude’s exact token count from the outside. But you can estimate it. More on that in the threshold section below.


The approach: independent estimation with js-tiktoken

The verification strategy is straightforward: for every LLM request your system processes, independently estimate the token count using the same (or comparable) BPE encoding, then compare your estimate to the provider-reported count. If they diverge significantly, flag it.

The implementation uses js-tiktoken, a JavaScript port of OpenAI’s tiktoken library. It ships the rank tables as static data and performs BPE encoding in-process. No network calls. No external dependencies at runtime.

import { getEncoding, type Tiktoken } from "js-tiktoken";

// Singleton encoder -- created once, reused across requests
let encoder: Tiktoken | null = null;

function getTokenEncoder(): Tiktoken {
  if (!encoder) {
    encoder = getEncoding("o200k_base");
  }
  return encoder;
}

export function estimateTokenCount(text: string): number {
  if (!text) return 0;
  return getTokenEncoder().encode(text).length;
}

Three design decisions matter here.

Singleton pattern

The BPE rank table for o200k_base is approximately 4MB of data. Parsing it and constructing the encoder takes measurable time — roughly 50-100ms on a modern server. If you create a new encoder per request, a proxy handling 500 requests per second wastes 25-50 seconds of CPU time per second on encoder initialization alone.

The singleton pattern initializes the encoder once on first use and reuses it for every subsequent request. The Tiktoken encoder is stateless after initialization — encode() is a pure function of the input text. There are no concurrency issues with sharing it across requests.

Why o200k_base specifically

o200k_base is OpenAI’s current encoding for their latest model family (GPT-4o, GPT-4.1, o1, o3). It supersedes cl100k_base (GPT-4, GPT-3.5-turbo) and r50k_base (GPT-3). For OpenAI models, o200k_base gives you an exact match to the provider’s token count.

For non-OpenAI models — Claude, Gemini, Mistral, Llama — o200k_base is an approximation. Different tokenizers produce different counts for the same text. But the difference is bounded and predictable. We use this property to set our detection threshold. One encoder covers all providers with a single, well-characterized margin of error.

Input reconstruction

The estimator does not have access to the raw bytes that the provider tokenized. It reconstructs the input from the request body’s messages array and estimates the output from the response text. This means it misses overhead tokens (role markers, tool schemas, system framing) that the provider counts. This is intentional — it is another reason the threshold is set wide.


Token count verification flow


Discrepancy detection logic

The detection function runs after every proxied request completes, in both streaming and non-streaming code paths. It compares the provider-reported total (input + output tokens) against the independently estimated total.

function detectUsageDiscrepancy(
  providerTokensIn: number,
  providerTokensOut: number,
  requestBody: Record<string, unknown>,
  responseText: string,
): Record<string, unknown> | undefined {
  if (providerTokensIn === 0 && providerTokensOut === 0) return undefined;

  // Estimate input from request messages
  const messages = requestBody["messages"];
  const inputText = Array.isArray(messages)
    ? messages
        .filter((m): m is Record<string, unknown> =>
          typeof m === "object" && m !== null)
        .map((m) => {
          const c = m["content"];
          return typeof c === "string" ? c : "";
        })
        .join("\n")
    : "";

  const estimatedIn = estimateTokenCount(inputText);
  const estimatedOut = estimateTokenCount(responseText);
  const estimatedTotal = estimatedIn + estimatedOut;
  const providerTotal = providerTokensIn + providerTokensOut;

  const maxVal = Math.max(providerTotal, estimatedTotal);
  if (maxVal === 0) return undefined;

  const relDiff = Math.abs(providerTotal - estimatedTotal) / maxVal;
  if (relDiff <= 0.5) return undefined;

  return {
    discrepancy: true,
    provider: providerTotal,
    estimated: estimatedTotal,
    relativeDeviation: Math.round(relDiff * 100) / 100,
  };
}

The formula is:

relativeDeviation = |providerTotal - estimatedTotal| / max(providerTotal, estimatedTotal)

This is a relative difference normalized by the larger value. It measures the percentage gap between what the provider reported and what we independently estimated. A value of 0.0 means exact match. A value of 1.0 means one side reported tokens and the other reported zero. A value of 0.5 means one count is double the other.

When the deviation exceeds the threshold (0.5, or 50%), the function returns a metadata object describing the discrepancy. When it does not, it returns undefined — no metadata, no flag, no overhead.
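The boundary cases can be checked with the formula in isolation (a minimal sketch; the numbers are illustrative):

```typescript
// Relative difference normalized by the larger of the two totals --
// the same formula used inside detectUsageDiscrepancy.
function relativeDeviation(providerTotal: number, estimatedTotal: number): number {
  const maxVal = Math.max(providerTotal, estimatedTotal);
  return maxVal === 0 ? 0 : Math.abs(providerTotal - estimatedTotal) / maxVal;
}

const exact = relativeDeviation(1200, 1200);   // 0   -- exact match
const doubled = relativeDeviation(2400, 1200); // 0.5 -- one count double the other
const oneSided = relativeDeviation(1500, 0);   // 1   -- one side reports zero
```

Note that a clean 2x gap lands at exactly 0.5, the threshold boundary, so only deviations strictly beyond it are flagged.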


Why 50%? Cross-encoding tolerance analysis

The 50% threshold is not arbitrary. It accounts for three known sources of legitimate divergence between provider-reported and estimated token counts.

Source 1: Cross-tokenizer variance

Different providers use different BPE vocabularies. The same English sentence tokenized with o200k_base (OpenAI) versus Claude’s tokenizer versus Gemini’s tokenizer will produce different token counts. Empirical testing across common prompt patterns shows:

| Text type            | o200k_base vs cl100k_base | o200k_base vs Claude (estimated) |
|----------------------|---------------------------|----------------------------------|
| English prose        | 5-15% difference          | 10-25% difference                |
| JSON/structured data | 10-20% difference         | 15-30% difference                |
| Code (Python/JS)     | 8-18% difference          | 12-28% difference                |
| Mixed multilingual   | 15-30% difference         | 20-40% difference                |

The worst case for legitimate cross-tokenizer variance is approximately 40% for mixed multilingual content with heavy use of non-Latin scripts. A 50% threshold sits above this worst case.

Source 2: Overhead tokens

Providers inject tokens that are not visible in the API request or response. System prompts get framing tokens. Tool-use requests include schema serialization. Multi-turn conversations include conversation separators. These overhead tokens are part of the provider’s count but absent from our estimate.

For typical requests, overhead is 5-15% of total tokens. For tool-heavy requests with large schemas, it can reach 20-30%.

Source 3: Streaming edge cases

In streaming mode, our estimate is based on the concatenated content chunks. If the stream is interrupted, retried, or if the provider’s usage event includes tokens from a retry that we did not see, the counts diverge. This is uncommon but can add 5-10% variance.

Adding these sources: 40% (cross-tokenizer) + 15% (overhead) = 55% in the absolute worst case, but those extremes rarely co-occur in a single request. A 50% threshold sits above any realistic combination of legitimate variance while avoiding false positives on normal cross-provider differences. In practice, genuine overcharging or bugs produce deviations of 80-300%, well above the threshold.

The threshold is deliberately conservative. A false negative (missed discrepancy) costs money. A false positive (flagged normal request) costs investigation time. At this stage, we optimize for signal quality — every flag should be worth investigating.


Metadata storage: JSONB with GIN indexing

Discrepancy metadata is stored in a JSONB column on the action log table. This is the schema:

ALTER TABLE "action_logs" ADD COLUMN "metadata" JSONB;

CREATE INDEX "action_logs_metadata_idx"
  ON "action_logs" USING GIN ("metadata" jsonb_path_ops);

The jsonb_path_ops operator class is specific to the @> containment operator. It builds a smaller, faster index than the default jsonb_ops class, at the cost of not supporting key-existence (?) or top-level-key (?|, ?&) queries. Since discrepancy queries always use containment, this is the right trade-off.

Querying discrepancies

Find all flagged requests:

SELECT id, "agentIdentifier", model, "tokensIn", "tokensOut",
       metadata->>'relativeDeviation' AS deviation,
       metadata->>'provider' AS provider_total,
       metadata->>'estimated' AS estimated_total,
       "createdAt"
FROM action_logs
WHERE metadata @> '{"discrepancy": true}'
ORDER BY "createdAt" DESC;

Find discrepancies exceeding a specific threshold (e.g., 80%):

SELECT id, "agentIdentifier", model,
       (metadata->>'relativeDeviation')::float AS deviation
FROM action_logs
WHERE metadata @> '{"discrepancy": true}'
  AND (metadata->>'relativeDeviation')::float > 0.8
ORDER BY deviation DESC;

Aggregate discrepancy rate by model:

SELECT model,
  COUNT(*) FILTER (WHERE metadata @> '{"discrepancy": true}') AS flagged,
  COUNT(*) AS total,
  ROUND(
    COUNT(*) FILTER (WHERE metadata @> '{"discrepancy": true}')::numeric
    / NULLIF(COUNT(*), 0) * 100, 2
  ) AS discrepancy_pct
FROM action_logs
WHERE "createdAt" > NOW() - INTERVAL '7 days'
GROUP BY model
ORDER BY discrepancy_pct DESC;

The GIN index makes the @> containment query fast regardless of table size. On a table with 10 million action logs, the index scan for metadata @> '{"discrepancy": true}' touches only the rows that have the flag, not the entire table.

Why JSONB instead of dedicated columns

Dedicated columns (has_discrepancy BOOLEAN, deviation FLOAT, estimated_tokens INT) would be simpler to query and more storage-efficient. But the metadata column serves more than discrepancy detection. It is an extensible envelope for any per-request structured data: cache performance metrics, policy evaluation traces, routing decisions, latency breakdowns.

Adding a new metric requires zero schema changes. Write a new key to the JSONB object. Query it with @>. The GIN index covers it automatically. For a fast-moving system where the set of per-request metadata evolves frequently, this is the right trade-off between query performance and schema flexibility.
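One lightweight way to keep the envelope disciplined is a TypeScript type per metadata family. This is a sketch; the cache keys are hypothetical examples of sibling data sharing the column:

```typescript
// Core keys written by discrepancy detection into the JSONB column.
interface DiscrepancyMetadata {
  discrepancy: true;
  provider: number;          // provider-reported total tokens
  estimated: number;         // independently estimated total tokens
  relativeDeviation: number; // rounded to two decimal places
}

// Hypothetical sibling family sharing the same column.
interface CacheMetadata {
  cacheHit: boolean;
  cacheLatencyMs: number;
}

// A row's metadata holds any subset of the known families.
type ActionLogMetadata = Partial<DiscrepancyMetadata> & Partial<CacheMetadata>;

const example: ActionLogMetadata = {
  discrepancy: true,
  provider: 2400,
  estimated: 1100,
  relativeDeviation: 0.54,
};
```

Adding a family is a type change, not a migration, which preserves the zero-schema-change property while keeping writers honest.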


Why fire-and-forget: never block on estimation

The discrepancy detection runs after the response has already been sent to the client. It does not block the request. It does not delay the response. It does not add latency to the critical path.

// Response already sent to client
res.end();

// Fire-and-forget telemetry (includes discrepancy metadata)
logTelemetry({
  orgId,
  agentIdentifier,
  model,
  tokensIn,
  tokensOut,
  costUsd,
  latencyMs,
  policyResult: "ALLOWED",
  metadata: streamDiscrepancy,  // undefined if no discrepancy
}).catch(() => {});

This is a deliberate architectural choice, not a shortcut. The reasons:

Estimation is not authoritative. Our BPE estimate is an approximation, especially for non-OpenAI models. Blocking a request based on an approximation would create false-positive denials — paying customers unable to use the service because our estimate disagreed with the provider’s count. The cure would be worse than the disease.

Latency budget is zero. The proxy sits in the hot path of every AI agent request. Adding even 5ms of synchronous processing to every request compounds across thousands of requests per minute. Token estimation itself is fast (~1ms for typical prompts), but the write to the database is not. Fire-and-forget with .catch(() => {}) ensures the telemetry write cannot slow down or crash the proxy.

Detection is for investigation, not enforcement. The purpose of discrepancy detection is to create an auditable record that a human or automated system reviews periodically. It is observability, not policy enforcement. You review the flags, investigate patterns, and take action at the business level (contacting the provider, adjusting budgets, fixing the streaming parser). You do not auto-block based on token count estimates.

This philosophy — detect and log, never block — is fundamental to building cost integrity monitoring that operators actually trust. A system that blocks real requests based on approximate data gets disabled the first time it causes a false-positive outage. A system that quietly logs anomalies for review stays on forever.


Real-world scenarios this catches

Scenario 1: Tokenizer version change

A provider ships a new tokenizer with a larger vocabulary. Common English words that were previously two tokens become one token. But the per-token price stays the same. Your effective cost per word goes down — except the provider does not advertise this, and some models on the old tokenizer do not benefit. You are paying two different rates for the same quality of output, and you cannot tell from the invoice.

Detection: Your estimate, still using the old encoding, suddenly overpredicts for models on the new tokenizer, producing a one-directional drift in the deviation data and, if the change is large enough, a spike in flags. A pattern of one-directional discrepancy across model versions is a strong signal of a tokenizer change.

Scenario 2: Streaming usage duplication

A load balancer between your proxy and the provider retries a timed-out streaming request. The provider processes both requests but only returns one response. The usage event reports tokens for both attempts. You are billed for 2x the actual work.

Detection: Output token count from the provider is roughly double your estimate. A clean 2x duplication puts the relativeDeviation at exactly 0.5, right at the threshold boundary, and overhead tokens or partial retries push it toward 0.67, a clear flag. Consistent flags on the same agent during specific time windows point to a networking issue.

Scenario 3: Compromised intermediary

A DNS misconfiguration routes your agent traffic through a third-party proxy that rewrites the usage fields in API responses. The response content is correct (forwarded from the real provider), but the token counts are inflated by 40%. The intermediary bills you for the inflated count, pays the provider for the real count, and pockets the difference.

Detection: Consistent 40%+ deviation across all requests through the affected route. The discrepancy is systematic and one-directional (always over-reporting). Cross-referencing with your provider’s billing dashboard reveals the gap.

Scenario 4: Function calling overhead spike

An agent starts using a new tool with a large JSON schema. The schema is serialized into the prompt by the provider’s API but is not part of the messages array you send. Token count jumps by 800 tokens per request. Legitimate, but expensive and invisible unless you are watching.

Detection: Discrepancy flags appear specifically for requests with tool_choice parameters. The deviation is consistently in the “provider higher” direction. This is a true cost — not overcharging — but it surfaces a cost driver you would otherwise miss.

Scenario 5: Misconfigured caching layer

A cache between your proxy and the provider serves a cached response but reports the original (uncached) usage. You pay full-price tokens for a response that came from cache. Or worse: the cache reports zero tokens, but you used real tokens on the first (uncached) request and the cache never stored the usage correctly.

Detection: Requests with identical prompts show wildly different token counts. Some report zero (cache hit with no usage forwarding), others report full counts. The pattern is visible as a bimodal distribution in the discrepancy data.
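Offline, the bimodal pattern is straightforward to spot if you can group logged requests by a prompt hash. A rough sketch (the LoggedRequest shape is assumed, not part of the system above):

```typescript
// Flags prompt groups whose reported token counts split into a zero
// cluster and a full-count cluster -- the signature of a cache layer
// that sometimes drops usage forwarding.
interface LoggedRequest {
  promptHash: string;    // hash of the normalized prompt text
  providerTotal: number; // provider-reported total tokens
}

function findBimodalPrompts(logs: LoggedRequest[]): string[] {
  const byPrompt = new Map<string, number[]>();
  for (const log of logs) {
    const counts = byPrompt.get(log.promptHash) ?? [];
    counts.push(log.providerTotal);
    byPrompt.set(log.promptHash, counts);
  }
  const flagged: string[] = [];
  for (const [hash, counts] of byPrompt) {
    const zeros = counts.filter((c) => c === 0).length;
    const fulls = counts.filter((c) => c > 0).length;
    if (zeros > 0 && fulls > 0) flagged.push(hash); // both modes present
  }
  return flagged;
}
```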


Database hygiene: periodic cache cleanup

Cost integrity monitoring generates data. Discrepancy metadata lives on action logs. Cache entries accumulate. Without cleanup, storage grows unbounded.

The cleanup strategy uses batched deletion with WAL-friendly patterns:

import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

const BATCH_SIZE = 1000;
const BATCH_PAUSE_MS = 100;

export async function runCacheCleanup(): Promise<number> {
  let totalDeleted = 0;

  while (true) {
    const deleted = await prisma.$executeRaw`
      DELETE FROM cache_entries
      WHERE id IN (
        SELECT id FROM cache_entries
        WHERE "expiresAt" < NOW()
        LIMIT ${BATCH_SIZE}
      )
    `;

    totalDeleted += deleted;
    if (deleted < BATCH_SIZE) break;

    // Brief pause between batches to avoid WAL bloat
    await new Promise((resolve) => setTimeout(resolve, BATCH_PAUSE_MS));
  }

  return totalDeleted;
}

Three things make this production-safe.

Batched deletion. Deleting 100,000 expired rows in a single DELETE statement locks the table for the duration of the transaction. On PostgreSQL, this also generates a single large WAL (Write-Ahead Log) entry that must be replicated to standby nodes. Batching into 1,000-row chunks keeps each transaction small, each lock brief, and each WAL entry manageable.

Inter-batch pause. The 100ms pause between batches gives the WAL writer time to flush, replication time to catch up, and concurrent queries time to acquire locks. Without the pause, back-to-back batch deletes can still overwhelm replication on busy databases.

Hourly schedule. Cleanup runs once per hour. Cache entries have TTLs measured in minutes to hours. Running hourly means the maximum accumulation of expired entries is one hour’s worth — a manageable batch size that completes in seconds, not minutes.

On Neon PostgreSQL (the database backing Govyn), this pattern is particularly important. Neon’s branching architecture means WAL entries are shared across branches. Large single-transaction deletes affect all branches. Batched deletes do not.


Operationalizing cost integrity monitoring

Having discrepancy data is step one. Making it actionable requires three layers.

Layer 1: Dashboard visibility

Surface discrepancy rates on your cost dashboard. The metrics that matter:

  • Discrepancy rate: percentage of requests flagged in the last 24h / 7d / 30d
  • Mean deviation: average relativeDeviation of flagged requests
  • Model breakdown: which models have the highest discrepancy rates
  • Agent breakdown: which agents trigger the most flags
  • Direction: is the provider consistently over-reporting or under-reporting

A discrepancy rate below 1% is normal noise. Between 1-5%, investigate the top offenders. Above 5%, something systemic is wrong.
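Those bands can be encoded as a small triage helper (a sketch; the tier names are invented):

```typescript
type DiscrepancyTier = "normal" | "investigate" | "systemic";

// Classifies a time window's discrepancy rate against the
// 1% / 5% bands described above.
function classifyDiscrepancyRate(flagged: number, total: number): DiscrepancyTier {
  if (total === 0) return "normal";
  const rate = flagged / total;
  if (rate < 0.01) return "normal";       // background noise
  if (rate <= 0.05) return "investigate"; // look at top offenders
  return "systemic";                      // something is wrong
}
```

Feed it the `flagged` and `total` counts from the hourly SQL query and route the "systemic" tier to your alerting channel.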

Layer 2: Alerting

Set alerts on discrepancy rate thresholds:

-- Alert if discrepancy rate exceeds 5% in the last hour
SELECT
  COUNT(*) FILTER (WHERE metadata @> '{"discrepancy": true}') AS flagged,
  COUNT(*) AS total,
  ROUND(
    COUNT(*) FILTER (WHERE metadata @> '{"discrepancy": true}')::numeric
    / NULLIF(COUNT(*), 0) * 100, 2
  ) AS rate
FROM action_logs
WHERE "createdAt" > NOW() - INTERVAL '1 hour'
HAVING COUNT(*) FILTER (WHERE metadata @> '{"discrepancy": true}')::numeric
  / NULLIF(COUNT(*), 0) > 0.05;

Pair this with your existing alerting infrastructure (PagerDuty, Slack, email). An alert on discrepancy rate is more actionable than an alert on absolute cost — it tells you something changed in the provider’s behavior, not just that you are spending more.

Layer 3: Investigation workflow

When a discrepancy alert fires:

  1. Scope it. Is it one model, one agent, or global? Query by model and agentIdentifier.
  2. Direction. Is the provider reporting higher or lower than your estimate? Consistently higher suggests overcharging or overhead tokens. Consistently lower suggests your estimate is wrong (check if you upgraded to a model with a new tokenizer).
  3. Timing. Did it start suddenly or gradually? Sudden onset suggests a provider change, deployment, or infrastructure issue. Gradual increase suggests a shift in prompt patterns that affect tokenizer divergence.
  4. Cross-reference. Compare the flagged period’s total cost against your provider’s billing dashboard. Does their invoice match their API-reported usage? If their own billing disagrees with their API, that is a provider bug. File a support ticket with your discrepancy data as evidence.
  5. Resolve. If it is a real overcharge, dispute it with data. If it is a tokenizer change, update your baseline. If it is an infrastructure issue, fix the root cause (load balancer config, DNS, caching layer).

Cost integrity monitoring


The cost of not doing this

Let us quantify the exposure with real numbers.

Example 1: Small team, single provider

  • 5 agents, 50,000 requests/month
  • Average 2,000 tokens per request (input + output)
  • GPT-4o pricing: $2.50/1M input, $10.00/1M output (blended ~$6.25/1M)
  • Monthly spend: ~$625

A 15% systematic overcount adds roughly $94/month. Over 12 months: $1,125 in undetected overcharging. For a startup watching every dollar, that is a meaningful chunk of infrastructure spend.

Example 2: Mid-size operation, multi-provider

  • 40 agents, 500,000 requests/month across OpenAI, Anthropic, and Google
  • Average 3,000 tokens per request
  • Blended pricing: ~$8.00/1M tokens
  • Monthly spend: ~$12,000

A 10% overcount on Anthropic (30% of traffic) plus a 5% overcount on OpenAI (50% of traffic): ($12,000 * 0.30 * 0.10) + ($12,000 * 0.50 * 0.05) = $360 + $300 = $660/month, $7,920/year.

Example 3: Enterprise scale

  • 200+ agents, 5 million requests/month
  • Monthly spend: $120,000
  • A streaming bug affecting 2% of requests that doubles the reported output tokens

2% of 5M = 100,000 affected requests. Average 1,500 output tokens doubled to 3,000. Extra 150M tokens at $10/1M = $1,500/month from a single bug. If the bug persists for 6 months before someone notices: $9,000.
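The arithmetic in these examples reduces to a one-line exposure formula. A sketch for plugging in your own numbers:

```typescript
// Annualized exposure from a systematic token overcount affecting a
// share of your traffic. All three inputs are estimates you supply.
function annualOvercountExposure(
  monthlySpendUsd: number,
  trafficShare: number,  // fraction of spend routed to the affected provider
  overcountRate: number, // fraction by which tokens are over-reported
): number {
  return monthlySpendUsd * trafficShare * overcountRate * 12;
}

// Example 2 above: $12,000/month, 30% of traffic, 10% overcount
// -> roughly $4,320/year from that provider alone.
```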

The cost of the monitoring system — a singleton encoder, a JSONB column, a GIN index, and a comparison function — is approximately zero in operational terms. The encoder uses ~4MB of memory. The JSONB column adds a few bytes per row (null for non-flagged requests). The GIN index adds marginal storage. The comparison function runs in ~1ms per request and does not block anything.


Before vs after: cost integrity posture

| Dimension                           | Without verification          | With verification                                  |
|-------------------------------------|-------------------------------|----------------------------------------------------|
| Token count source                  | Provider-reported only        | Provider + independent BPE estimate                |
| Discrepancy detection               | None — discovered on invoice  | Real-time, per-request flagging                    |
| Time to detect overcharging         | 30+ days (next billing cycle) | Minutes (next telemetry write)                     |
| Investigation data                  | None — “the bill went up”     | Per-request deviation with model, agent, timestamp |
| Cross-provider comparison           | Manual spreadsheet            | Automated, queryable, indexed                      |
| Latency impact                      | N/A                           | Zero (fire-and-forget)                             |
| Storage overhead                    | N/A                           | ~50 bytes per flagged request (JSONB metadata)     |
| Implementation cost                 | N/A                           | One function, one column, one index                |
| Annual savings potential (mid-size) | $0                            | $5,000-$10,000 in caught discrepancies             |

Key takeaways

  1. Provider-reported token counts are an input to your bill, not a verified fact. Treat them like any other external data: validate at the boundary.

  2. BPE tokenization is deterministic. If you have the rank table, you can reproduce the count. Use o200k_base as a universal baseline — it covers OpenAI exactly and approximates others within a known margin.

  3. The singleton encoder pattern is essential. Initializing a BPE encoder per request wastes 50-100ms each time. Initialize once, reuse forever.

  4. 50% relative deviation is the right threshold. It sits above worst-case cross-tokenizer variance (40%) and typical overhead tokens (5-15%), which rarely combine fully in one request, while catching real anomalies (typically 80-300% deviation).

  5. Never block on estimation. Fire-and-forget telemetry with .catch(() => {}) keeps the monitoring system out of the critical path. Detection is for investigation, not enforcement.

  6. JSONB with GIN index scales. The @> containment operator with jsonb_path_ops index class is fast on tables with millions of rows. One column, one index, infinite extensibility.

  7. Batch your cleanup. Deleting expired rows in 1,000-row batches with inter-batch pauses prevents table locks, WAL bloat, and replication lag. Run hourly.


FAQ

Does this work with Anthropic and Google models, not just OpenAI?

Yes. The o200k_base encoding produces an approximation for non-OpenAI models. The approximation is within 10-40% of the real count depending on the text type. This is why the threshold is set at 50% — it absorbs cross-tokenizer variance. You will not get exact matches for Claude or Gemini, but you will catch anomalies where the provider reports double or triple the expected count. For OpenAI models, the match is exact (minus overhead tokens).

Why not use each provider’s native tokenizer for exact counts?

Two reasons. First, most providers do not publish their tokenizer. Anthropic and Google do not provide a public encoding library. Second, running multiple tokenizers (one per provider) increases memory usage, code complexity, and maintenance burden when providers update their encoders. A single estimator with a wide threshold is simpler, cheaper, and catches the same class of anomalies.

Can this detect under-reporting (provider charging less than actual)?

Yes. The detection is bidirectional — it flags any relative deviation above 50%, regardless of direction. Under-reporting is rarer but does occur, typically due to caching layers that serve responses without forwarding usage data. Under-reporting is less likely to trigger a billing dispute, but it indicates a telemetry integrity problem that affects your cost analytics and budget control policies.

How much memory does the singleton encoder use?

The o200k_base rank table is approximately 4MB in memory. It is loaded once on first use and persists for the lifetime of the process. For a proxy server that is already holding open database connections, HTTP client pools, and request buffers, 4MB is negligible.

Should I alert on every discrepancy or aggregate them?

Aggregate. Individual discrepancies are noisy — a single request with unusual Unicode content or an unusually long tool schema can trigger a flag legitimately. Alert on the discrepancy rate (percentage of flagged requests in a time window). A rate above 5% sustained for more than an hour is worth investigating. Use the SQL queries in the operationalizing section as your starting point. Pair discrepancy monitoring with smart model routing and cost reduction strategies for comprehensive AI spend governance.


Govyn is an open-source API proxy for AI agent governance. Usage integrity monitoring ships in v1.2. MIT licensed. Self-host or cloud-hosted.
