AI Agent Cost Attribution: Per-User, Per-Workflow, Per-Tenant Patterns


AI cost dashboards usually show one line per provider. The real cost story sits underneath, in the long-tail distribution across users, workflows, and tenants. This post shows how to capture it.


The day finance asked which customer cost us $4,200

A weekly cost review. Anthropic invoice came in at $11,840 for the prior week, up from $7,200 the week before. Finance wanted to know what changed. We pulled up the Anthropic console. The console showed total spend, per-model spend, and one usage curve trending upward. That was it. No customer breakdown, no agent breakdown, no workflow breakdown. The provider has no idea who our users are. The bill has no labels.

We had a hunch and went hunting. Forty minutes of grepping application logs later, we found it. One enterprise customer’s research agent had hit a feedback loop on a multi-step planning task and had been retrying the same chain of reasoning steps every fifteen minutes for nine days. It had consumed $4,217 in Claude Opus tokens before anyone noticed. Their seat license was $499 a month.

Serving that single customer cost us more than eight times what they paid us. We had no way to know that without grepping logs, because our cost data lived in the provider invoice and our user data lived in the application database, and nothing tied them together.

The real number that mattered, the long-tail distribution of cost per customer, was invisible to the dashboard. We were running blind on the dimension that matters most for a business: which units of consumption produced which units of revenue. Once we wired up attribution, the picture clarified fast. The top 4% of users drove 51% of token spend. Two workflows out of seventeen accounted for 68% of the bill. One tenant was eating 22% of the entire AI budget.

This is the post about how to make those numbers visible.


What is AI agent cost attribution?

AI agent cost attribution is the practice of assigning every input and output token consumed by an agent system to the specific user, workflow, and tenant that triggered the consumption, with tags that travel from the agent through the proxy and into the billing or analytics warehouse. It converts AI spend from an aggregate provider invoice into a multi-dimensional cost dataset.

Three properties define real attribution.

Every request carries provenance. The user identifier, the workflow identifier, and the tenant identifier are attached to the request before it reaches the model API. The pair “request, cost” is never recorded without the pair “request, who”. An attribution row that says “we spent $0.043 on this call but we do not know who triggered it” is not useful data, it is noise.

Tags survive the provider boundary. The model API receives the request and returns token counts. The attribution metadata does not travel through the provider, it travels around the provider, captured at the proxy or the gateway and joined back to the response. This is what makes attribution provider-agnostic. If you switch from OpenAI to Anthropic next quarter, your attribution dataset stays continuous.

Costs are computable, not just countable. Token counts are not money. The attribution pipeline must apply the price-per-token table for the specific model used, including cache discounts, batch discounts, and any negotiated rates, to produce a dollar figure per request. Without that math, you have observability, not FinOps.
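As a minimal sketch of the "computable, not just countable" step, the function below turns token counts into dollars from a per-model price table. The Sonnet input and output rates and the 10% cache-read rate are the figures quoted elsewhere in this post; the cache-write rate, the `Price` class, the `PRICES` table, and the `cost_usd` helper are illustrative assumptions, not a real pricing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per million tokens, one field per billable token type."""
    input: float
    output: float
    cache_read: float
    cache_write: float


# Illustrative table. cache_write is an assumed placeholder premium;
# replace every value with the rates on your own contract.
PRICES = {
    ("anthropic", "claude-sonnet-4.6"): Price(
        input=3.00, output=15.00, cache_read=0.30, cache_write=3.75
    ),
}


def cost_usd(provider, model, input_tokens, output_tokens,
             cache_read_tokens=0, cache_write_tokens=0):
    """Dollar cost of one request.

    Assumes input_tokens counts only uncached input, so cached tokens
    are not billed twice.
    """
    p = PRICES[(provider, model)]
    micro_usd = (
        input_tokens * p.input
        + output_tokens * p.output
        + cache_read_tokens * p.cache_read
        + cache_write_tokens * p.cache_write
    )
    return micro_usd / 1_000_000


example = cost_usd("anthropic", "claude-sonnet-4.6",
                   input_tokens=2000, output_tokens=500,
                   cache_read_tokens=10_000)
print(round(example, 6))  # 0.0165
```

Keeping the price lookup keyed by (provider, model) is what lets the same function serve every provider the proxy routes to.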

In FinOps Foundation terms, this is the Allocation capability applied to AI. The 2026 State of FinOps report names AI cost allocation as the second-most-cited challenge after visibility. Both rest on the same prerequisite, which is attribution that exists in the first place.


Why does API-key-level cost tracking miss the real distribution?

API-key-level tracking shows you total spend per environment or per service. It hides the distribution of consumption inside that environment, and the distribution is where every interesting business question lives. The long tail of users, workflows, and tenants is the real cost story, and a single aggregate number compresses it into invisibility.

Consider a SaaS product running an AI feature for 12,000 active users. The provider bill shows $48,000 for the month. The dashboard shows the curve climbing 6% week over week. Both numbers are accurate. Both are useless for any decision more specific than “we are spending more on AI than last month”.

Now turn on per-user attribution and look at the actual histogram. The shape is almost never normal. It is almost always Pareto, sometimes more extreme.

[Figure: Cost per user, 12,000 active users, one month. Histogram of number of users by monthly token cost per user (log scale, $0 to $500+). The top 4% of users drive 51% of token spend; the bottom 60% drive 8%. Bulk users (88%) account for 38% of cost, power users (8%) for 11%, and the long tail (4%) for 51%.]

This shape is the rule, not the exception. Across the agent deployments we have instrumented, the consistent pattern is that 4 to 8% of users account for 45 to 60% of token spend, while 50 to 70% of users sit in the bottom 10% of the bill. Per-key tracking averages all of this away into a meaningless mean. The mean user does not exist.

The same shape repeats at the workflow dimension and the tenant dimension. A few workflows are responsible for the majority of cost. A few tenants. A few sessions. Pareto applies recursively. Without attribution you cannot see this, and without seeing it you cannot price, cap, or optimize against it.

There is also a class of problem you can only catch with attribution: the runaway. The 4,217-dollar customer in the opening was not visible at the per-key level because the per-key spend looked normal in aggregate. It was only visible when you sliced by tenant. Production teams have learned this the hard way: in Zylo’s 2026 AI cost report, 78% of IT leaders surveyed reported unexpected charges from consumption-based AI pricing, and 85% of organizations misestimate AI project costs by more than 10%. The shared cause is that aggregate visibility hides distribution risk.

For a deeper look at where the wasted spend in those long tails actually comes from, our previous post on cutting AI API costs without code changes walked through the routing and budget patterns that address the highest-cost slices once you can identify them.


What are the three attribution dimensions and when does each matter?

The three primary dimensions are user, workflow, and tenant. They are not interchangeable, and each one answers a different category of question. Production attribution captures all three on every request, because the questions you will need to answer in six months are not the questions you are asking today.

Per-user attribution

Per-user attribution assigns cost to the individual end user that triggered the request: the human in the seat, or, in agent-on-agent systems, the originating agent identity. This is what tells you the shape of the per-user histogram. It is the dimension you need for tier capping (free vs. pro vs. enterprise), abuse detection, and any kind of fair-use enforcement.

Per-user attribution matters most when your product has a usage component visible to the user. If your AI feature is bundled into a flat seat license, per-user data is still operationally critical, because the worst-case user defines your unit-economics ceiling. If your AI feature is metered, per-user data is the meter.

The Replit incident we covered in our policy engine post is the canonical case for why per-user attribution must be paired with per-user limits. The agent did not know it had spent the user’s token budget. The user did not know either. Nobody knew until the bill arrived. Per-user attribution converts that invisible state into a counter you can read and gate on.

Per-workflow attribution

Per-workflow attribution assigns cost to the named pipeline or task that the agent was running: customer support triage, code review, document summarization, daily report generation. This is the dimension that tells you which features are profitable and which are not.

A common pattern in production: the marketing-facing AI feature looks expensive, but per-workflow data shows that 80% of its cost comes from a long-context document analysis sub-task that 6% of users invoke. The “feature” is two features fused. One is cheap. One is not. Without workflow attribution, you would scale back the whole feature, when the right move is to throttle or restructure the sub-task.

Workflow attribution is also the dimension that survives feature refactors best. Users churn. Tenants churn. The “research workflow” persists across product redesigns, and a five-year cost dataset on a stable workflow name is more useful than five years on a versioned user base.

Per-tenant attribution

Per-tenant attribution assigns cost to the customer organization, the team, or the subscription account that the user belongs to. This is the dimension finance cares about. It is the dimension that lets you compute customer-level gross margin on AI-powered features.

In multi-tenant SaaS, per-tenant attribution is non-negotiable. The customer that pays you $499 a month and consumes $4,200 in tokens is a different category of problem than the customer that pays $499 and consumes $80. Aggregate cost data cannot tell those two apart. Per-tenant data can, and it is what makes it possible to convert AI spend into product COGS, which is the framing the 2026 FinOps Foundation guidance is now pushing.

The three dimensions compose into a cube, and most production questions are queries against that cube.

[Figure: The Attribution Cube. Every token gets three coordinates. Every business question is a slice. User axis: individual identity, tier capping, abuse detection. Workflow axis: named pipeline, feature P&L. Tenant axis: customer org, chargeback, gross margin. Slice by user: "Which user is the most expensive?" Slice by workflow: "Is this feature profitable?" Slice by tenant: "What is this customer's COGS?"]
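To make "every business question is a slice" concrete, here is a small sketch that sums cost over any combination of the three dimensions. The row shape and the `slice_cost` helper are assumptions for illustration, and the sample rows are hypothetical; in practice this would be a GROUP BY in the warehouse.

```python
from collections import defaultdict

# Hypothetical attribution rows, one per request.
rows = [
    {"tenant_id": "tenant_a3f8c0", "user_id": "user_18922",
     "workflow_id": "research.summarize", "cost_usd": 0.0231},
    {"tenant_id": "tenant_44b1ee", "user_id": "user_24506",
     "workflow_id": "support.triage", "cost_usd": 0.0011},
    {"tenant_id": "tenant_a3f8c0", "user_id": "user_18922",
     "workflow_id": "support.triage", "cost_usd": 0.0009},
]


def slice_cost(rows, *dims):
    """Total cost grouped by the given dimensions, most expensive first."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row["cost_usd"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


by_tenant = slice_cost(rows, "tenant_id")
by_tenant_and_workflow = slice_cost(rows, "tenant_id", "workflow_id")
```

Because the dimensions are plain column names, the tenant question and the tenant-by-workflow question are the same query with a different argument list.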

The cube is conceptually simple. The implementation is what teams get wrong. The next two sections are the implementation.


What tagging conventions survive provider switches?

Tagging conventions are the schema for attribution metadata. Get them right and your data model holds for years across multiple providers, multiple model versions, and multiple instrumentation rewrites. Get them wrong and you will be migrating attribution data every time you change a provider.

The good news is the schema is mostly solved. The OpenTelemetry GenAI semantic conventions define the standard attributes for AI calls and have stabilized enough through 2026 to be a credible production target. The CNCF community runs them. Datadog announced native support in OTel v1.37, Grafana is collecting LLM traces, and most observability vendors have shipped or are shipping integrations. The schema will outlast any individual provider.

The convention defines the right set of attributes. The pattern below is the minimal subset every production attribution layer should capture, plus three custom attributes that the spec leaves to the implementer.

Attribute | Source | Purpose
gen_ai.system | proxy | Provider name: “anthropic”, “openai”, “google”
gen_ai.request.model | request | Model the agent asked for
gen_ai.response.model | response | Model that actually served (after routing)
gen_ai.usage.input_tokens | response | Billed input tokens
gen_ai.usage.output_tokens | response | Billed output tokens
gen_ai.usage.cache_read_tokens | response | Cached input, billed at discount
gen_ai.usage.cache_creation_tokens | response | Cache writes, billed at premium
gen_ai.operation.name | request | “chat”, “embeddings”, “completion”
gen_ai.conversation.id | header | Stable session identifier
gen_ai.agent.id | header | Agent identity emitting the call
app.user.id | header | End-user identity (custom, OTel reserves enduser.id)
app.workflow.id | header | Named workflow or pipeline (custom)
app.tenant.id | header | Customer organization (custom)

The three custom attributes are the attribution dimensions. They are not in the OTel GenAI spec because the spec stops at the model boundary, where there is no concept of your application’s user model. They go in the application namespace, prefixed with app. per OTel conventions for non-standard attributes.

These attributes travel as HTTP headers from the agent to the proxy. The proxy reads them off the request, attaches them to the response record, and writes them to the attribution sink. The provider never sees them, which is the point. Tags that travel through the provider get lost when you change providers. Tags that travel around the provider are yours.

The header convention we use:

POST /v1/messages HTTP/1.1
Host: proxy.govyn.local
Authorization: Bearer <agent-proxy-token>
Content-Type: application/json
X-Govyn-Tenant-Id: tenant_a3f8c0
X-Govyn-User-Id: user_18922
X-Govyn-Workflow-Id: research_agent.long_context_summarize
X-Govyn-Conversation-Id: conv_4f7e2c
X-Govyn-Agent-Id: research_agent.v3.1

Five headers, all with the X-Govyn- prefix to avoid conflict with any provider-defined headers. The agent sets them at the entry to the proxy. The proxy validates them, rejects the request if a required tag is missing (tenant_id, user_id, workflow_id), and forwards to the provider with the headers stripped. The headers never leave your network.

The validation is the second-most-important property after schema stability. A request that does not carry attribution metadata is a bug, not a quirk. If you allow untagged requests to flow through, your dataset has gaps, and the gaps will fall on the requests that matter most because nothing focuses an engineer’s attention like a feature shipping under deadline. The proxy refuses untagged calls in production. This is the same default-deny posture that good policy engines apply to tool calls, applied to attribution metadata.

There is one more property worth naming. Tags must be stable across the lifetime of the entity they describe. A user_id that is reassigned when a user upgrades their plan breaks longitudinal cost analysis. A tenant_id that changes when a customer is acquired breaks year-over-year comparisons. The attribution metadata layer should source these IDs from the system of record (your auth service, your billing service, your tenancy service), never from anything that gets recycled.
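The validate-then-strip behavior described above can be sketched in a few lines. This is an illustration under stated assumptions, not Govyn's actual implementation: the function name `split_attribution`, the `MissingAttribution` exception, and the plain-dict header representation are all hypothetical. A real proxy would also swap the agent's token for the provider credential before forwarding, which this sketch leaves out.

```python
# Required tags, lowercased because HTTP header names are case-insensitive.
REQUIRED_HEADERS = ("x-govyn-tenant-id", "x-govyn-user-id", "x-govyn-workflow-id")
TAG_PREFIX = "x-govyn-"


class MissingAttribution(ValueError):
    """Raised when a request arrives without a required attribution tag."""


def split_attribution(headers):
    """Return (tags, forward_headers); raise MissingAttribution if untagged."""
    lowered = {name.lower(): value for name, value in headers.items()}
    missing = [h for h in REQUIRED_HEADERS if not lowered.get(h, "").strip()]
    if missing:
        raise MissingAttribution("missing required headers: " + ", ".join(missing))
    # "x-govyn-tenant-id" becomes the tag key "tenant_id".
    tags = {
        name[len(TAG_PREFIX):].replace("-", "_"): value
        for name, value in lowered.items()
        if name.startswith(TAG_PREFIX)
    }
    # Everything without the prefix is forwarded; the tags stay behind.
    forward = {
        name: value
        for name, value in headers.items()
        if not name.lower().startswith(TAG_PREFIX)
    }
    return tags, forward
```

Raising instead of defaulting to an "unknown" tenant is the default-deny choice: an untagged request fails loudly in testing rather than leaving a silent gap in the dataset.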


How do you implement attribution at the proxy layer vs in app code?

There are two places to implement attribution: inside the application, where the agent code runs, or at the proxy layer, where the network egress happens. They are not equivalent. The proxy is where attribution actually works in production; the application is where attribution looks like it is going to work and then quietly leaks.

The argument is the same architectural argument that runs through every other piece of agent infrastructure. The application is in the agent’s controllable surface area. The proxy is not. Attribution at the application layer is correct when the application is correct, and incorrect, missing, or stale when the application has bugs, is bypassed, or is restarted mid-run. Attribution at the proxy layer is correct because the proxy holds the connection.

[Figure: Attribution Tagging Flow. Tags travel around the provider, not through it. Cost is computed at the proxy. The agent sends its request to the proxy with X-Govyn headers (tenant, user, workflow). The Govyn proxy then (1) validates the headers and denies the request if tenant or user is missing, (2) strips the headers, attaches the provider key, and forwards a clean request to the provider (Anthropic, OpenAI, Google, etc.), which returns tokens used, (3) captures the response tokens (input, output, cache, model used), and (4) computes cost (price table x tokens) and emits a row to the attribution sink (warehouse, OTel collector, S3, Kafka). The result is one row per request, joined and ready to query:]

timestamp | tenant_id | user_id | workflow_id | model | input_tokens | output_tokens | cost_usd
2026-05-12T14:23:01Z | tenant_a3f8c0 | user_18922 | research.summarize | claude-sonnet-4.6 | 4180 | 712 | 0.0231
2026-05-12T14:23:08Z | tenant_44b1ee | user_24506 | support.triage | gpt-4o-mini | 412 | 88 | 0.0011

The proxy approach has six concrete advantages over in-application instrumentation, none of which are theoretical.

Attribution survives library upgrades. When a team upgrades the OpenAI SDK from v4 to v5, every in-application instrumentation hook breaks until somebody updates it. Proxy attribution sits below the SDK at the HTTP layer and is invariant to client library changes.

Attribution survives language switches. A team that runs agents in Python, Node, and Go would otherwise need three independent attribution implementations, three sets of tests, three sets of bugs. The proxy attribution is one implementation that all three languages send their requests through.

Attribution captures the actual model that served. If the proxy is doing model routing (and we covered the case for that in our routing post), the application thinks the request went to Claude Opus, but the proxy actually sent it to Claude Haiku because the routing rule fired. Application-layer attribution would record Opus pricing on a Haiku call. Proxy attribution sees both the request model and the response model and records the right one.

Attribution captures cache hits. Token counts in API responses are not the same as billable tokens. Anthropic’s prompt cache reads are billed at 10% of base. OpenAI’s cached input is half-price. Google’s context caching is 90% off. The math from raw token count to dollar cost depends on which fields the provider reports for that specific request. The proxy sees the canonical response payload and applies the right price table. The application gets a friendly summary that almost always elides the cache fields.

Attribution is bypass-proof. A developer testing in production who hits the provider directly with curl, an agent that imports the unwrapped client to “debug something,” a subagent that inherits the API key and forgets to set headers: all of these paths produce silent gaps in application-layer attribution. The proxy approach forecloses them by holding the only credentials that authenticate against the provider, which is the same architectural property covered in our proxy vs SDK governance post.

Attribution is centralized. A single proxy instance attributes calls from every agent, every team, every product. There is one place to query, one schema to maintain, one cost table to update when providers change pricing (and they change pricing).
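The cache-hit point is easier to see with the raw payloads side by side. The sketch below maps two provider usage objects onto the common token fields used in this post. The provider field names follow the public Anthropic Messages and OpenAI Chat Completions response formats as best understood at the time of writing; treat them as assumptions and verify against current provider documentation. `normalize_usage` itself is a hypothetical helper.

```python
def normalize_usage(provider, usage):
    """Map a provider usage payload (a dict) to common attribution fields."""
    if provider == "anthropic":
        # Assumption: input_tokens here already excludes cached tokens.
        return {
            "input_tokens": usage.get("input_tokens") or 0,
            "output_tokens": usage.get("output_tokens") or 0,
            "cache_read_tokens": usage.get("cache_read_input_tokens") or 0,
            "cache_creation_tokens": usage.get("cache_creation_input_tokens") or 0,
        }
    if provider == "openai":
        # Assumption: prompt_tokens includes cached tokens, so subtract
        # them to avoid billing the cached portion at the full input rate.
        details = usage.get("prompt_tokens_details") or {}
        cached = details.get("cached_tokens") or 0
        return {
            "input_tokens": (usage.get("prompt_tokens") or 0) - cached,
            "output_tokens": usage.get("completion_tokens") or 0,
            "cache_read_tokens": cached,
            "cache_creation_tokens": 0,
        }
    raise ValueError(f"no usage mapping for provider {provider!r}")
```

The two providers count cached input differently, which is exactly the kind of detail an application-side summary tends to lose and a proxy reading the full response can keep.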

The honest counterpoint: application-layer attribution is the right starting place when you do not have a proxy yet. The OpenTelemetry GenAI auto-instrumentation libraries exist, they are easy to drop into a Python app, and they will get you a 70% solution in an afternoon. They are also a stepping stone, not a destination. Once you cross the threshold of needing per-tenant cost data for finance, you will be wiring up a proxy, because the alternative is hand-correcting the gaps every billing cycle.

Layer | Effort | Survives library change | Captures routing | Captures cache pricing | Bypass-proof | Best for
OTel auto-instrumentation in app | 1 day | No | No | Partial | No | Single-language prototypes
Manual SDK wrapper | 3 days | No | No | Partial | No | Tightly controlled monoliths
Sidecar collector | 1 week | Yes | Limited | Limited | Partial | Single-cluster deployments
API proxy | 1 week | Yes | Yes | Yes | Yes | Production multi-team

What does a cost-distribution histogram actually look like?

Cost-distribution histograms in production are heavy-tailed, and the tail is where almost every interesting management decision lives. A representative shape, drawn from the agent deployments we have instrumented in 2025 and early 2026:

User percentile | Cost share | Implication
Top 1% | 22% to 35% | Power users or runaways. Investigate every one.
Top 5% | 45% to 60% | The “expensive cohort.” Tier 3 of any usage policy.
Top 20% | 75% to 85% | Pareto holds. The 80/20 rule is real here.
Bottom 50% | 5% to 12% | Most of your users barely move the bill.
Bottom 10% | 0% to 1% | Inactive or trial users. Nearly free to serve.

The pattern is structural, not coincidental. AI consumption is unbounded above (a user can always consume more by asking longer questions or running larger workflows) and bounded below (a user cannot consume less than zero). Distributions with these properties are heavy-tailed by construction. They look like power laws, log-normals, or stretched exponentials depending on the underlying user behavior, and the tail dominates regardless of which specific shape it is.

Two specific findings that recur often enough to be worth flagging.

Median cost is a useless statistic in isolation. The median user is so far from the mean cost that quoting either number to a non-engineer is misleading. Always quote the histogram, the percentiles, or the Gini coefficient. A team optimizing for “the average user” is optimizing for a phantom.

The expensive users are not always who you would guess. Frequent finding: enterprise customers are often less expensive per-seat than mid-market customers, because enterprise users batch their work into structured workflows while mid-market users explore freely. Free-tier abusers are predictable. The customers your sales team would name as “high value” are sometimes high-margin and sometimes the worst loss leaders, and you cannot tell which is which from the marketing dashboard.

The corollary: pricing decisions made without per-customer attribution data are guesses with confidence intervals wide enough to overlap zero margin. Every one of the SaaS pricing patterns in the Bessemer AI Pricing Playbook presupposes that you know what each customer costs to serve. You do not, until attribution is wired.
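For teams that want to quote percentiles or a Gini coefficient instead of a mean, both are a few lines over per-user monthly totals. The sketch below uses the standard sorted-rank formula for the Gini coefficient; `top_share`, `gini`, and the sample data are illustrative, and the data is synthetic rather than drawn from any deployment.

```python
def top_share(costs, fraction):
    """Share of total cost driven by the most expensive `fraction` of users."""
    ordered = sorted(costs, reverse=True)
    total = sum(ordered)
    if not ordered or total == 0:
        return 0.0
    k = max(1, round(len(ordered) * fraction))
    return sum(ordered[:k]) / total


def gini(costs):
    """Gini coefficient: 0 means perfectly even spend; values approach 1
    as spend concentrates in fewer users."""
    xs = sorted(costs)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over ascending values, ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


# Synthetic example: 96 light users at $1 and 4 heavy users at $100.
monthly_costs = [1.0] * 96 + [100.0] * 4
share_of_top_4_percent = top_share(monthly_costs, 0.04)
concentration = gini(monthly_costs)
```

On this synthetic data the top 4% of users carry roughly four fifths of the spend, which is the kind of statement a mean can never make.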


How do you act on attribution data?

Attribution data is only valuable if it changes a decision. Four patterns of action are worth implementing in roughly the order below, because each builds on the dataset the previous one establishes.

Tier capping and rate limiting

The first action attribution unlocks: enforce per-user, per-tenant, and per-workflow budget ceilings. The 4,217-dollar customer never happens again, because the proxy refuses calls that would push a tenant past its monthly cap, the same way the proxy refuses tool calls that violate policy.

Caps stack hierarchically. A free-tier user has a $0.50 daily cap. A pro user has a $20 daily cap. A research workflow on the pro tier inherits the pro cap but also has a $5 per-conversation cap to prevent any single session from eating the entire daily allowance. The hierarchy means that a runaway workflow inside a runaway session inside a high-traffic user still gets stopped at the lowest binding cap, because exhausting any layer of the hierarchy is sufficient to deny.

budget:
  defaults:
    per_user_daily_usd: 0.50
  tiers:
    pro:
      per_user_daily_usd: 20.00
      per_conversation_usd: 5.00
    enterprise:
      per_user_daily_usd: 100.00
      per_tenant_monthly_usd: 5000.00
  workflows:
    research_agent.long_context_summarize:
      per_call_usd: 1.50

The pattern matches the budget-control patterns in our policy engine post. Attribution and policy converge here: attribution provides the dimensions, the policy engine provides the enforcement.
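The "lowest binding cap" rule can be sketched as a single pass over every cap that applies to a request. The `Cap` class and `first_blocking_cap` function are hypothetical names, and the sketch assumes the proxy can estimate a call's cost before sending it; the limits in the example are the pro-tier values from the configuration above.

```python
from dataclasses import dataclass


@dataclass
class Cap:
    scope: str        # e.g. "per_conversation_usd"
    limit_usd: float  # configured ceiling for this scope
    spent_usd: float  # running total already attributed to this scope


def first_blocking_cap(caps, estimated_cost_usd):
    """Return the first cap the call would exceed, or None if it may proceed.

    Exhausting any single layer is enough to deny the call.
    """
    for cap in caps:
        if cap.spent_usd + estimated_cost_usd > cap.limit_usd:
            return cap
    return None


# A pro-tier user near the end of a long research conversation.
caps = [
    Cap("per_conversation_usd", limit_usd=5.00, spent_usd=4.90),
    Cap("per_user_daily_usd", limit_usd=20.00, spent_usd=12.00),
]
blocked_by = first_blocking_cap(caps, estimated_cost_usd=0.25)
```

Here the daily allowance still has room, but the conversation cap binds first, so the call is denied and the alert can name the exact scope that was exhausted.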

Charge-back and show-back

Charge-back means converting attribution data into internal billing rows that move money between teams or business units. Show-back is the lighter-weight version: report the costs to the consuming team, do not actually move the budget. Both depend on per-tenant or per-workflow attribution.

Charge-back becomes table stakes once total AI spend crosses the threshold where it is visible to the CFO. The 2026 State of FinOps report names allocation as the most-prioritized capability across all spend categories, and AI specifically as the spend category where allocation is hardest. Show-back gets you the political cover of “the team that consumes the most can see they consume the most” without requiring the financial system integration. Most production teams ship show-back first and migrate to charge-back when the unit economics are tight enough to require it.

Abuse and runaway detection

Attribution data feeds the alerting layer. Anomaly detection on per-user, per-workflow cost time-series catches the cases that pre-defined caps miss, including:

  • A workflow whose cost-per-call doubles overnight (model regression or routing failure)
  • A user whose daily cost rises 10x compared to their seven-day baseline (compromise, bot, or new use case)
  • A tenant whose total spend trajectory is pacing toward 3x their previous month (pricing review trigger)

The alerts route to operations the same way an agent runaway alert would, with the added context that the source of the anomaly is now identifiable. “Cost spike on tenant 44b1ee, workflow research_agent.long_context_summarize, started at 14:23 UTC, current pace $80/hour vs 7-day average $0.40/hour” is an actionable alert. “AI bill is up” is not.
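The 10x-over-baseline rule from the list above reduces to a comparison against a trailing mean. This is a deliberately simple sketch, not a full anomaly detector: `is_spike` is a hypothetical helper, and the `floor_usd` parameter is an added assumption so that a user moving from $0.001 to $0.02 does not page anyone.

```python
from statistics import mean


def is_spike(history_usd, today_usd, ratio=10.0, floor_usd=1.0):
    """True if today's cost is at least `ratio` times the trailing 7-day mean.

    history_usd holds prior daily totals, oldest first. Costs below
    floor_usd never alert, however large the ratio.
    """
    if not history_usd:
        return False  # no baseline yet, nothing to compare against
    baseline = mean(history_usd[-7:])
    return today_usd >= floor_usd and today_usd >= ratio * baseline


steady_week = [0.40] * 7
runaway = is_spike(steady_week, today_usd=80.0)   # far above 10x baseline
normal_day = is_spike(steady_week, today_usd=0.80)
```

The same function works per user, per workflow, or per tenant, because attribution rows already carry the key to group the daily totals by.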

Pricing and product changes

The strategic action, hardest to do well, is using attribution data to change product behavior. Three patterns recur.

Move expensive workflows to the higher tier. If 80% of the cost of a workflow comes from 20% of the users, gate it. Free-tier users get the cheap version (smaller model, shorter context, fewer turns); paid users get the unconstrained version. Per-workflow attribution is what makes this targeting possible.

Restructure or kill loss-leading features. If a feature ships with negative gross margin and the per-feature attribution data shows no path to positive, the feature is a leak. The decision to restructure it (with caching, smaller models, or output gating) or kill it depends on whether the feature drives retention or referrals that the cost data cannot see, and that is now an honest argument with finance instead of a guess.

Price for the cost shape. Flat-rate pricing on top of usage-driven cost is a slow bleed. Attribution data is the input to pricing models that match cost (volume tiers, output-token charges, fair-use ceilings). The transition is rarely fast, and it is impossible without the underlying data.

The order matters. Caps stop the bleeding. Show-back creates organizational visibility. Charge-back creates organizational accountability. Pricing changes happen last, because pricing changes only land if the data behind them is unimpeachable, and “unimpeachable” requires that the attribution dataset has been running, instrumented, and trusted for at least a quarter. Skip the early steps and finance will not believe the data when you bring it.


Key takeaways

  1. API-key-level cost tracking hides the long tail. The mean user does not exist. 4 to 8% of users typically drive 45 to 60% of cost; aggregate data cannot see this.

  2. The three dimensions of attribution are user, workflow, and tenant. Each answers a different category of question. Capture all three on every request.

  3. Tag with the OpenTelemetry GenAI semantic conventions plus three custom attributes. app.user.id, app.workflow.id, app.tenant.id. Headers travel from agent to proxy. Schema survives provider switches.

  4. Implement attribution at the proxy layer. Application-layer attribution is a starting point, not a destination. The proxy approach captures routing, cache pricing, multi-language traffic, and bypass attempts.

  5. Act in order. Caps first, show-back second, charge-back third, pricing changes last. Each step requires the dataset the previous step builds.


FAQ

How is AI cost attribution different from cloud cost allocation?

Cloud cost allocation works by tagging the resource at provisioning time. AWS instances carry tags. The bill arrives with the tags. The allocation is mechanical. AI cost allocation has no resource at the point of consumption: the LLM call is stateless, the API key is shared across users, and the provider invoice does not carry your tagging metadata. The work that the cloud allocation does at the resource layer has to be done at the request layer for AI, which is why the FinOps Foundation now treats AI as a distinct technology category rather than a sub-case of cloud. Same goal (every dollar tied to a consumer), different mechanics.

Do I need a proxy to do attribution?

Not technically. You can instrument every agent in the application layer with OpenTelemetry GenAI auto-instrumentation, or you can wrap the SDK with custom code that emits attribution events. Both work in narrow scenarios. Both leak under any of the failure modes covered earlier: routing changes, cache pricing, multi-language deployments, library upgrades, agents that import the unwrapped client. The proxy is the layer at which attribution becomes structural rather than behavioral, and any team running agents in production for revenue-bearing workloads will end up at a proxy regardless of where they start.

What about provider-side cost APIs like the OpenAI usage dashboard?

The provider-side dashboards show you total spend on that provider, often with a one-day lag, sometimes with a per-API-key breakdown. They cannot see your users, your workflows, or your tenants because that information was never sent to the provider. They cannot give you a unified view across providers because each provider’s dashboard is an island. They can be useful as a sanity check (compare the provider dashboard’s total to your attribution sink’s total at the end of the month), but they cannot be your primary attribution mechanism.
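The month-end sanity check is a one-line comparison worth automating. In this sketch, `reconcile` is a hypothetical helper and the 2% tolerance is an arbitrary assumption; the example figures reuse the invoice and runaway amounts from the opening story.

```python
def reconcile(provider_total_usd, sink_total_usd, tolerance=0.02):
    """Return (drift, ok): the relative gap between the provider's total
    and the attribution sink's total, and whether it is within tolerance."""
    if provider_total_usd <= 0:
        raise ValueError("provider total must be positive")
    drift = abs(provider_total_usd - sink_total_usd) / provider_total_usd
    return drift, drift <= tolerance


# Invoice of $11,840 against a sink that is missing $4,217 of traffic.
drift, ok = reconcile(11_840.00, 11_840.00 - 4_217.00)
```

A large drift usually means traffic is reaching the provider without passing through the attribution path, which is worth finding before finance does.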

How do I compute cost from token counts?

Maintain a price table per (provider, model, token type) and apply it at attribution-write time, not at query time. Token types include input, output, cached input read, cache creation (Anthropic charges a premium), batch input, and batch output. Prices change: Anthropic Sonnet 4.6 is currently $3 per million input and $15 per million output, OpenAI GPT-5 is $1.25 per million input and $10 per million output, Google Gemini 2.5 Pro is $1.25 per million input and $10 per million output for context under 200K. The price table needs to be a versioned object, indexed by date, so that historical attribution rows compute against the price the request actually paid, not the current price.
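One way to keep the price table versioned by date is a sorted list of (effective date, price) entries per key, looked up with a binary search. Everything here is a sketch: `PRICE_HISTORY` and `price_on` are hypothetical names, both effective dates are invented, and the second price entry is a made-up future change included only to show the lookup crossing a boundary.

```python
from bisect import bisect_right
from datetime import date

# USD per million tokens, keyed by (provider, model, token type).
# Entries are sorted by effective date. Dates and the 2.50 entry are
# hypothetical; only the 3.00 figure comes from the text above.
PRICE_HISTORY = {
    ("anthropic", "claude-sonnet-4.6", "input"): [
        (date(2026, 1, 1), 3.00),
        (date(2026, 7, 1), 2.50),
    ],
}


def price_on(provider, model, token_type, day):
    """Price in effect on `day`: the latest entry whose date is <= day."""
    history = PRICE_HISTORY[(provider, model, token_type)]
    starts = [start for start, _ in history]
    idx = bisect_right(starts, day) - 1
    if idx < 0:
        raise LookupError(f"no price recorded for {provider}/{model}/{token_type} on {day}")
    return history[idx][1]


rate = price_on("anthropic", "claude-sonnet-4.6", "input", date(2026, 5, 12))
```

Stamping each attribution row with the cost computed from the rate in effect at request time means a later price change never rewrites history.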

How granular should the workflow ID be?

Granular enough that two workflows with materially different cost profiles get different IDs, coarse enough that you do not generate one ID per micro-task. A useful test: if your workflow ID space changes more than once a quarter, it is too granular. If two technically-different operations are indistinguishable in the data, it is too coarse. The pattern that tends to work is <product_area>.<operation>, optionally with a <version> suffix when the operation has been materially redesigned. Keep the namespace under 200 distinct values for any reasonable product surface; if you are over that, you are probably mistaking sub-tasks for workflows.
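A naming pattern only holds if something enforces it. The sketch below checks workflow IDs against the <product_area>.<operation> shape with an optional version suffix. The exact character set and the `.vN` suffix format are assumptions, since the text above does not pin them down, and `is_valid_workflow_id` is a hypothetical helper.

```python
import re

# Two lowercase snake_case segments separated by a dot, optionally
# followed by a ".vN" version suffix (assumed format).
WORKFLOW_ID = re.compile(r"[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*(\.v[0-9]+)?")


def is_valid_workflow_id(value):
    """True if value matches <product_area>.<operation>[.v<version>]."""
    return WORKFLOW_ID.fullmatch(value) is not None


def namespace_size(workflow_ids):
    """Number of distinct workflow IDs, for comparison with the 200 guideline."""
    return len(set(workflow_ids))
```

Running the check at the proxy, alongside the required-header validation, keeps free-form or per-request IDs from inflating the namespace.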




Govyn is an open-source API proxy for AI agent governance. Cost attribution, model routing, semantic caching, and policy enforcement. MIT licensed. Self-host or cloud-hosted.
