AI Agent Observability: What to Trace, Measure, and Alert On


A fintech startup’s autonomous research agent started behaving strangely on a Tuesday afternoon. Their Datadog dashboard showed nothing. HTTP success rate: 99.4%. P95 latency: 310ms. No error traces. By Thursday evening, their Anthropic invoice had accumulated $4,300 in unexpected charges. The agent had entered a planning loop, producing reasoning chains that referenced subtasks it had already completed, re-queued them, and executed them again, six or seven times each, because the loop termination condition was a semantic judgment (“is the report complete?”) that the model kept evaluating differently depending on what appeared in the accumulating context window. From the perspective of every classical APM signal, the system was healthy.

This is the defining failure mode of AI agent observability: the infrastructure works, the application logic is wrong, and every metric your SRE team knows how to read says everything is fine. Classic APM was designed for systems where correctness is syntactic. Either the database query returned rows or it threw an exception. Either the HTTP endpoint returned 200 or it returned 500. Agents fail semantically, through outputs that are technically valid but operationally wrong, through flows that are syntactically correct but economically catastrophic, through behaviors that would be flagged instantly by a human reading the trace and are completely invisible to a threshold on a dashboard.

The good news is the signal model exists. It just requires rethinking the three pillars.


Why classic APM fails on agent workloads

OpenTelemetry, Datadog, New Relic, Grafana, Honeycomb: all excellent tools for the workloads they were built for. The problem is not the instrumentation platform. The problem is that agent architectures introduce failure modes that have no analogue in the latency and error-rate vocabulary classic APM speaks.

Four structural differences produce the mismatch.

Non-determinism breaks threshold alerting. A classical service holds an HTTP 500 rate of 0.01% for weeks and then spikes to 0.08% during an incident. The baseline is stable. Static thresholds work because the expected distribution is narrow. An agent’s behavior is a function of its model’s current context window, the state of the tools it has called, and the outputs of previous reasoning steps. The distribution of token consumption, latency, and cost-per-run is wide by construction. An agent run that produces 200 output tokens is not anomalous. An agent run that produces 8,000 output tokens is also not necessarily anomalous. A static threshold at 2,000 tokens will page you constantly on legitimate behavior. Set it at 15,000 and the $800 runaway slips through.

Multi-step tool chains hide partial failures. A REST API call either succeeds or fails. An agent invoking five tools in sequence can have two succeed, one silently return stale data, one succeed with a result the agent misinterprets, and one fail with a structured error the agent treats as content rather than an exception. The aggregate HTTP success rate is 80%, which Datadog will color yellow rather than red. But the behavioral outcome is a wrong answer produced with high apparent confidence and no error trace you can search for.

Prompt-driven flow control produces emergent pathways. In a conventional service, the execution graph is finite and known at deploy time. Branches are code branches. You can trace every path. In an agent system, the execution path through the tool set is determined at runtime by the model’s interpretation of the current conversation state. An agent can call the same tool three times in a row, call it zero times, call it in combination with tools it has never been paired with in testing, or decide to skip it entirely based on something that appeared in a tool’s response two steps ago. You cannot pre-enumerate the paths to instrument them. The trace has to describe what happened, not confirm that what happened was on the expected list.

Cost is a first-class failure mode. In a conventional service, cost is a function of load and can be right-sized once provisioning is correct. In an agent system, cost is a function of token consumption, which is a function of context length, which is a function of how many reasoning steps the model decided to take, which is non-deterministic for every run. A single misconfigured agent calling the wrong model, or an agent that accumulates conversation history without truncation, or an agent that enters a retry storm on a tool that keeps returning ambiguous results can produce a 100x cost event that no latency metric will surface.

The four differences share a root: APM tools measure computational signals. Agents produce semantic signals. The gap between those two vocabularies is where incidents live.


The three-pillar model for agents

The three-pillar model from classical observability applies. Traces, metrics, and logs are still the right abstractions. What changes is the entity model inside each pillar.

Figure: Three-Pillar Observability for AI Agents. Same pillars as classical APM. Different entity model inside each one.

TRACES
Span hierarchy: agent_run, step, tool_call, policy_decision, model_call.
Key span attributes: tokens in/out per step, tool verdict + reason, policy bundle version, cumulative cost USD, model routing decision, cache hit / miss.

METRICS
Time-series counters + gauges: cost per agent / day, token delta per step, policy deny rate / hour, tool error rate by tool name, decision latency p50/p95/p99, semantic cache hit rate, model fallback rate, retry storm depth.
Why not static thresholds: non-deterministic baselines require anomaly detection.
Dashboards: cost/agent/day, enforcement actions/hour.

LOGS
Structured event records: policy.verdict + rule name, semantic error classification, model routing decision reason, loop detection events, budget breach warnings, cache eviction churn, tool schema mismatch, agent identity + policy version.
Audit vs. ops logs: ops logs are streamed with a short TTL; audit logs are durable and tamper-evident. Same event source, two output channels. Share schema, split on retention + integrity.

Traces in an agent system describe a tree with a specific span hierarchy: agent run at the root, then steps, then tool calls as children of each step with a policy decision as a child of each tool call, then model calls as children of tool calls or as direct children of steps when the agent is querying the model directly. Each span in this hierarchy carries attributes that classical spans do not: token counts per step, cumulative cost at each level, policy bundle version at the time of evaluation, the model that actually served the request (which may differ from the model the agent requested if routing fired), and the cache hit/miss decision. The span tree encodes not just what happened and when, but what the agent spent doing it.

Metrics in an agent system are gauges and counters over the entities in that span tree. Cost per agent per day, token delta per step (the change in total context tokens from one step to the next, which is the leading indicator for context runaway), policy enforcement actions per hour, tool error rate broken out by tool name, decision latency at p50/p95/p99 (the time the policy engine spends evaluating each action), semantic cache hit rate, model fallback rate (how often a routing rule redirected a request to a cheaper model), and retry storm depth (the count of consecutive retries on a single tool call). These metrics can feed Datadog, Prometheus, Grafana, or any OTLP-compatible backend. The gap is not the tooling. The gap is that the metrics need to be emitted from the right instrumentation point.

Logs in an agent system are structured event records that explain individual decisions rather than aggregate behavior. Policy verdict with the specific rule name that produced it. Semantic error classification when the agent produced a valid response that was operationally wrong. Model routing decision with the rule that fired. Loop detection events when step count or token growth exceeds a heuristic threshold. Budget breach warnings before the hard stop. Cache eviction churn when semantic cache entries are being invalidated faster than they are being hit. These logs share schema with the compliance audit log described in the audit post, but they serve a different consumer. Ops logs are streamed to short-TTL storage and queried interactively during incidents. Audit logs are durable, tamper-evident, and queried by auditors who need to reconstruct what happened. The same event source feeds both channels. Split them on retention policy and tamper-evidence requirements, not on schema.


The span model: how to represent an agent run in OTel-compatible traces

OpenTelemetry’s GenAI semantic conventions cover the model call span and define operation names for agent invocation and tool execution. They do not cover the step span, the policy decision span, or the routing decision, because those are application-level constructs above the model API boundary. The missing levels need to be instrumented by the runtime layer that can see them, which is either the agent SDK, the orchestration framework, or the governance proxy that sits between the agent and everything it calls.

The span hierarchy to target:

agent_run [root]
  agent.id = "research_agent.v4.2"
  agent.run_id = "run_01HX..."
  agent.total_cost_usd = 0.437         (updated at close)
  agent.step_count = 12
  agent.policy_bundle = "v2.1.4-signed"

  step [child, one per reasoning iteration]
    step.index = 3
    step.token_delta = +1840            (context growth this step)
    step.cumulative_tokens = 14220
    step.cumulative_cost_usd = 0.218

    tool_call [child]
      gen_ai.operation.name = "execute_tool"
      tool.name = "web_search"
      tool.target = "api.search.internal"
      tool.latency_ms = 182
      tool.status = "success"

      policy_decision [child]
        policy.verdict = "allow"
        policy.rule = "internal_search_allowed"
        policy.latency_ms = 1.2

    model_call [child]
      gen_ai.system = "anthropic"
      gen_ai.request.model = "claude-sonnet-4-7"
      gen_ai.response.model = "claude-haiku-4-5"   (routing fired)
      gen_ai.usage.input_tokens = 14220
      gen_ai.usage.output_tokens = 612
      gen_ai.usage.cache_read_tokens = 11800
      gen_ai.routing.rule = "step_gt_3_fallback_haiku"
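
The hierarchy above can be modelled as a plain tree before committing to a tracing SDK. The following is a minimal sketch using only the standard library: the `Span` class is a hypothetical stand-in for a real tracing span, not an OpenTelemetry API, and the attribute values are copied from the listing for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator


@dataclass
class Span:
    """One node in the trace tree; a stand-in for a real tracing span."""
    name: str
    attributes: dict[str, Any] = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)

    def child(self, name: str, attributes: dict[str, Any]) -> "Span":
        """Create a child span, attach it to this span, and return it."""
        span = Span(name, dict(attributes))
        self.children.append(span)
        return span

    def walk(self) -> Iterator["Span"]:
        """Yield this span and every descendant, depth first."""
        yield self
        for c in self.children:
            yield from c.walk()


run = Span("agent_run", {"agent.id": "research_agent.v4.2"})
step = run.child("step", {"step.index": 3, "step.token_delta": 1840})
tool = step.child("tool_call", {"tool.name": "web_search", "tool.status": "success"})
tool.child("policy_decision", {"policy.verdict": "allow"})
step.child("model_call", {"gen_ai.usage.input_tokens": 14220,
                          "gen_ai.usage.output_tokens": 612})

# Attributes marked "updated at close" are filled in once the run finishes.
run.attributes["agent.step_count"] = sum(s.name == "step" for s in run.children)

span_names = [s.name for s in run.walk()]
```

Keeping the policy decision under the tool call, rather than beside it, is what lets a single trace query answer "which tool use was denied, and by which rule".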

The step.token_delta attribute is the single most important field for detecting context runaway at trace time. A step that adds 200 tokens is normal. A step that adds 3,000 tokens while the agent is ostensibly completing a previously-started subtask is a signal that something is accumulating in context that should not be. A series of steps each adding 1,500 tokens with no corresponding progress toward the stated goal is the signature of the loop failure described at the opening.
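
One way to turn that loop signature into a check is to look for a streak of consecutive steps that each add a large number of tokens. This is a sketch under stated assumptions: the 1,500-token figure comes from the paragraph above, while the streak length of four and both function names are illustrative choices. It also cannot see "progress toward the goal", so it only flags candidates for a human or a second check to confirm.

```python
def longest_growth_streak(token_deltas, min_delta=1500):
    """Length of the longest run of consecutive steps adding >= min_delta tokens."""
    longest = current = 0
    for delta in token_deltas:
        current = current + 1 if delta >= min_delta else 0
        longest = max(longest, current)
    return longest


def looks_like_context_runaway(token_deltas, min_delta=1500, min_streak=4):
    """Flag a run whose context keeps growing step after step.

    min_streak is an assumed tuning knob, not a value from the text.
    """
    return longest_growth_streak(token_deltas, min_delta) >= min_streak


looping_run = [200, 1840, 1600, 1550, 1700, 300]   # four large steps in a row
healthy_run = [200, 3000, 150, 1600]               # isolated large steps
```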

The policy_decision span as a child of tool_call means every tool use carries an observable policy verdict. When an agent is denied access to a tool, the span exists in the trace with policy.verdict = deny and policy.rule = <the rule that fired>. You can search for denial patterns, count them per agent per hour, and surface them in dashboards without needing a separate event stream. The policy engine and the trace are the same data source.
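
Counting denials per agent per hour is then a group-by over exported policy_decision spans. The sketch below assumes each span has been flattened to a dict that also carries the parent run's `agent.id` and a `time` field; both of those field placements are assumptions about the export format, not something the span listing specifies.

```python
from collections import Counter
from datetime import datetime, timezone


def denials_per_agent_hour(policy_spans):
    """Count deny verdicts keyed by (agent id, hour bucket)."""
    counts = Counter()
    for span in policy_spans:
        if span["policy.verdict"] != "deny":
            continue
        hour = span["time"].replace(minute=0, second=0, microsecond=0)
        counts[(span["agent.id"], hour)] += 1
    return counts


def at(hour, minute):
    """Helper for building example timestamps on one arbitrary day."""
    return datetime(2025, 1, 7, hour, minute, tzinfo=timezone.utc)


spans = [
    {"agent.id": "research_agent.v4.2", "policy.verdict": "deny", "time": at(14, 5)},
    {"agent.id": "research_agent.v4.2", "policy.verdict": "deny", "time": at(14, 40)},
    {"agent.id": "research_agent.v4.2", "policy.verdict": "allow", "time": at(14, 41)},
    {"agent.id": "research_agent.v4.2", "policy.verdict": "deny", "time": at(15, 2)},
]
denials = denials_per_agent_hour(spans)
```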

The discrepancy between gen_ai.request.model and gen_ai.response.model surfaces model routing decisions as a first-class observable event. Every time the proxy routes a request to a different model than the agent asked for, the span records both values. This is how you track the effectiveness of your cost routing rules and detect cases where a routing rule is firing unexpectedly on workloads that should not be routed. The proxy versus SDK architecture comparison is relevant here: proxy-layer instrumentation captures both fields because it sees the outbound request and the inbound response. SDK-layer instrumentation typically only sees what the SDK was told to call.
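
As a rough illustration, routing events can be pulled out of model_call spans by comparing the two fields. The function name and the example spans are hypothetical. One caveat worth stating as an assumption to verify against your provider: some APIs return a dated or versioned model identifier for an alias, in which case a strict inequality over-counts and the `gen_ai.routing.rule` attribute is the safer signal.

```python
from collections import Counter


def routing_pairs(model_call_spans):
    """Count (requested, served) model pairs for calls where the two differ."""
    pairs = Counter()
    for span in model_call_spans:
        requested = span["gen_ai.request.model"]
        served = span["gen_ai.response.model"]
        if requested != served:
            pairs[(requested, served)] += 1
    return pairs


calls = [
    {"gen_ai.request.model": "claude-sonnet-4-7", "gen_ai.response.model": "claude-sonnet-4-7"},
    {"gen_ai.request.model": "claude-sonnet-4-7", "gen_ai.response.model": "claude-haiku-4-5"},
    {"gen_ai.request.model": "claude-sonnet-4-7", "gen_ai.response.model": "claude-haiku-4-5"},
]
routed = routing_pairs(calls)
```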


What to alert on

Static thresholds fail on agent workloads because the distribution of normal behavior is too wide to set a useful threshold. The solution is not to abandon alerting. The solution is to alert on rate-of-change signals relative to a rolling baseline, and to alert on structural events that are always anomalous regardless of magnitude.

Rate-of-change alerts that work for agents:

Budget burn rate exceeding 3x the rolling 7-day average for a specific agent is always worth investigating. The absolute dollar amount matters less than the deviation from baseline. A research agent that normally costs $0.40/hour spending $4.20/hour is a signal. The anomaly detection layer computes this baseline automatically; doing it with a static threshold requires setting agent-specific thresholds manually for every deployed agent and updating them whenever usage patterns shift legitimately.
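
The burn-rate rule reduces to a comparison against a rolling mean. A minimal sketch, assuming hourly cost samples for the previous seven days are already available as a list; the function name and the "no baseline means no alert" behavior are illustrative choices rather than a prescribed implementation.

```python
from statistics import mean


def burn_rate_alert(hourly_cost_history, current_hourly_cost, multiplier=3.0):
    """True when current spend exceeds multiplier x the rolling baseline."""
    if not hourly_cost_history:
        return False  # assumption: stay quiet until a baseline exists
    baseline = mean(hourly_cost_history)
    return baseline > 0 and current_hourly_cost > multiplier * baseline


# Seven days of hourly samples for an agent that normally costs $0.40/hour.
history = [0.40] * (7 * 24)

runaway = burn_rate_alert(history, 4.20)   # about 10.5x baseline
busy_hour = burn_rate_alert(history, 1.00)  # 2.5x baseline, under the 3x line
```

Because the baseline is per agent, the same rule works unchanged for an agent that normally costs $40/hour.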

Policy denial spike: a 5x increase in policy denials per hour compared to the previous 24-hour baseline for a specific agent almost always indicates either a new prompt injection attempt, a changed tool description that is triggering a policy rule, or an agent caught in a loop that keeps trying to call a blocked tool. Any of these warrant immediate investigation.
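
Denial counts differ from cost in one practical way: the baseline is often zero, and five times zero is still zero. The sketch below adds a floor to the baseline so that a single stray denial on a quiet agent does not page anyone. The floor of one denial per hour is an assumed default, not a value from this post.

```python
def denial_spike(denials_last_24h, denials_this_hour, factor=5.0, min_baseline=1.0):
    """True when this hour's denials reach factor x the 24-hour hourly average.

    denials_last_24h: non-empty list of hourly denial counts.
    min_baseline: assumed floor that keeps a zero baseline from firing on noise.
    """
    average = sum(denials_last_24h) / len(denials_last_24h)
    baseline = max(average, min_baseline)
    return denials_this_hour >= factor * baseline


steady = [2] * 24   # an agent that normally sees two denials an hour
quiet = [0] * 24    # an agent that is normally never denied
```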

Token explosion: a step-level token delta greater than 5,000 tokens with no corresponding tool call result (meaning the agent is reasoning, not processing new information) is the structural signature of a reasoning loop. Alert on this per step, not per run, because the correct intervention is to stop the run before it exhausts the budget ceiling, not after.
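
Evaluated per step, the token explosion rule might look like the following sketch. It assumes each step record carries a `tool_results` count alongside `step.token_delta`; that field name is hypothetical, and it stands in for "did new information arrive during this step".

```python
TOKEN_DELTA_LIMIT = 5000  # threshold from the rule above


def is_token_explosion(step):
    """Large context growth with no tool result to explain it."""
    grew_too_fast = step["step.token_delta"] > TOKEN_DELTA_LIMIT
    no_new_information = step["tool_results"] == 0
    return grew_too_fast and no_new_information


def first_explosion(steps):
    """Index of the first offending step, or None; the run should stop there."""
    for step in steps:
        if is_token_explosion(step):
            return step["step.index"]
    return None


steps = [
    {"step.index": 0, "step.token_delta": 200, "tool_results": 1},
    {"step.index": 1, "step.token_delta": 6200, "tool_results": 2},  # big but explained
    {"step.index": 2, "step.token_delta": 5400, "tool_results": 0},  # pure reasoning
    {"step.index": 3, "step.token_delta": 5900, "tool_results": 0},
]
stop_at = first_explosion(steps)
```

Returning the first offending index, rather than a per-run boolean, reflects the point that the intervention belongs at the step where the pattern appears.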

Retry storm: a tool called more than three times consecutively within a single step with no successful result. This is the pattern that generates runaway cost on tool chains that depend on external APIs. The tool keeps failing, the agent keeps retrying, and the context window keeps growing with each round of retry reasoning.
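
Retry storm depth can be computed from the ordered tool calls inside one step. This sketch assumes each call is reduced to a `(tool_name, status)` pair with `"success"` as the only non-failing status; both are simplifications for illustration.

```python
def retry_storm_depth(tool_calls):
    """Longest run of consecutive failed calls to the same tool within a step."""
    deepest = current = 0
    last_tool = None
    for name, status in tool_calls:
        if status == "success":
            current, last_tool = 0, None  # a success ends any streak
            continue
        current = current + 1 if name == last_tool else 1
        last_tool = name
        deepest = max(deepest, current)
    return deepest


def is_retry_storm(tool_calls, max_consecutive=3):
    """More than three consecutive failures on one tool, per the rule above."""
    return retry_storm_depth(tool_calls) > max_consecutive


storm = [("fx_rates", "error")] * 4
recovered = [("fx_rates", "error"), ("fx_rates", "error"),
             ("fx_rates", "success"), ("fx_rates", "error")]
```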

Structural events that are always anomalous:

Model fallback churn: a routing rule that redirects more than 40% of requests in a given hour is probably misconfigured. The routing rules that work well in production fire on 5 to 15% of requests because they are designed for specific conditions, not as the default path. High fallback rates indicate either a routing condition that is too broad or an agent workload that has shifted away from its design parameters.
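
Expressed as code, the check is a ratio and two bands. The 40% and 15% boundaries come from the paragraph above; the intermediate "watch" label is an assumption added here so that the range between typical and clearly misconfigured has a name.

```python
def fallback_rate(routed_requests, total_requests):
    """Share of requests in the window that a routing rule redirected."""
    return routed_requests / total_requests if total_requests else 0.0


def classify_fallback(rate):
    """Map an hourly fallback rate to an action label."""
    if rate > 0.40:
        return "alert"   # probably a misconfigured or overly broad rule
    if rate > 0.15:
        return "watch"   # assumed band: above typical, below the alert line
    return "ok"


typical = classify_fallback(fallback_rate(120, 1000))
drifting = classify_fallback(fallback_rate(250, 1000))
broken = classify_fallback(fallback_rate(450, 1000))
```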

Semantic cache eviction churn: semantic cache entries being invalidated faster than they are being reused. High churn means the cache is not providing the cost reduction it was designed for and is adding overhead without benefit. This is covered in more depth in the semantic caching post, and the caching explainer compares prompt caching and semantic caching patterns.

Policy bundle version mismatch: an agent running under a policy bundle version different from the currently deployed version. This is always a configuration error or a deployment race condition. It is not a performance anomaly. It is a correctness concern, because the agent may be authorized or denied for actions based on outdated policy rules.
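
Because this is an equality check rather than a statistical one, it needs no baseline. A sketch, assuming the currently deployed bundle version is known to the checker and each agent_run span exposes the `agent.policy_bundle` attribute shown earlier; the function name and the second agent id are made up for the example.

```python
def stale_policy_agents(deployed_version, agent_runs):
    """Agent ids with at least one run evaluated under a non-current bundle."""
    return sorted({
        run["agent.id"]
        for run in agent_runs
        if run["agent.policy_bundle"] != deployed_version
    })


runs = [
    {"agent.id": "research_agent.v4.2", "agent.policy_bundle": "v2.1.4-signed"},
    {"agent.id": "billing_agent.v1.0", "agent.policy_bundle": "v2.1.3-signed"},
    {"agent.id": "billing_agent.v1.0", "agent.policy_bundle": "v2.1.3-signed"},
]
stale = stale_policy_agents("v2.1.4-signed", runs)
```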


Proxy-native vs. bolted-on OTel: why it matters for signal quality

There are two instrumentation architectures for agent observability: instruments embedded in the agent’s SDK or orchestration framework, and instruments at the proxy layer that intercepts all agent traffic. The difference in signal quality is structural, not incidental.

SDK-layer instrumentation produces spans that describe what the agent code did. The agent called a function named search. The function returned a result. The model was invoked with N tokens. These are the signals the agent code can observe about itself. What SDK-layer instrumentation cannot produce: the policy verdict on each action (because the policy engine runs outside the agent process), the actual model that served the request (because model routing happens at the proxy layer), the cache hit decision (because semantic caching happens at the proxy layer), and any signals about traffic that bypassed the SDK entirely. SDK instrumentation is correct when the SDK is the only path, and incomplete whenever something routes around it.

Proxy-layer instrumentation produces spans that describe every interaction that crossed the network boundary the proxy controls. The spans are emitted by infrastructure the agent does not govern and cannot disable. Policy verdicts, model routing decisions, cache hit rates, cost computations, budget enforcement events, all appear as first-class span attributes because the proxy produces the data at the point of enforcement. The agent does not need to be instrumented. Neither does the orchestration framework. The proxy emits the full span tree for every agent that connects through it.

This is the same architectural argument that explains why proxy-based governance is more reliable than prompt-level guardrails. The enforcement point and the observation point are the same infrastructure element. When an agent is denied a tool call, the policy decision span appears in the trace immediately. When an agent is routed to a cheaper model, the routing decision span appears in the trace immediately. When a semantic cache hit prevents a model call entirely, the trace shows the cache span with no downstream model call span. These are signals that cannot be produced by instrumentation that lives inside the agent, because the events they describe happen outside the agent’s process boundary.

The practical implication for teams that are instrumenting with OTel GenAI conventions today: the GenAI conventions are the right field schema, and starting with SDK-level auto-instrumentation is a valid first step. But the agent-run and step spans, the policy decision spans, and the routing decision spans require either a custom instrumentation layer in your orchestration framework or a proxy that emits them natively. Teams that have wired OTel to their LangChain or LangGraph orchestrator get the model call spans. Teams that have added a governance proxy like Govyn get the full hierarchy. The gap between the two instrumentation levels is where the loop detection, the routing observability, and the policy enforcement visibility live.


Dashboards that matter

The operative question when building an agent observability dashboard is not “what can I show?” but “what decision does this panel enable?” Two panels that look similar can have entirely different operational value depending on whether a person looking at them can take a specific action based on what they see.

Five panels with clear operational value:

Cost per agent per day, as a time-series with a 7-day rolling baseline band. The band shows what normal looks like. A value outside the band is an investigation trigger. This panel alone would have surfaced the Tuesday loop failure described in the opening. The alert condition on this panel should fire at 3x baseline, not at a fixed dollar threshold.

Decision latency at p95, broken out by policy rule. The policy engine should add under two milliseconds to every agent action. If a specific policy rule is adding 50ms at p95, it is probably doing blocking I/O it should not be doing, calling an external enrichment service that is experiencing latency, or evaluating a policy that has grown complex enough to need optimization. Policy enforcement latency is invisible to every other metric because it is a sub-component of tool call latency that classical APM aggregates away.
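
The per-rule breakdown behind this panel can be sketched as a group-by followed by a percentile. The example uses a nearest-rank p95 for simplicity, which is an assumption; a metrics backend will usually compute percentiles from histograms instead. The two-millisecond budget comes from the paragraph above, and the second rule name is invented for the example.

```python
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def p95_latency_by_rule(policy_spans):
    """Group policy.latency_ms by policy.rule and take p95 of each group."""
    by_rule = defaultdict(list)
    for span in policy_spans:
        by_rule[span["policy.rule"]].append(span["policy.latency_ms"])
    return {rule: percentile(samples, 95) for rule, samples in by_rule.items()}


def slow_rules(policy_spans, budget_ms=2.0):
    """Rules whose p95 evaluation time exceeds the latency budget."""
    p95 = p95_latency_by_rule(policy_spans)
    return sorted(rule for rule, value in p95.items() if value > budget_ms)


spans = (
    [{"policy.rule": "internal_search_allowed", "policy.latency_ms": 1.2}] * 20
    + [{"policy.rule": "vendor_enrichment_check", "policy.latency_ms": 1.0}] * 18
    + [{"policy.rule": "vendor_enrichment_check", "policy.latency_ms": 50.0}] * 2
)
offenders = slow_rules(spans)
```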

Tool call success rate by tool name. Not aggregate tool success rate, which hides individual tool failures behind successful calls to other tools, but per-tool success rate as a separate series for each tool your agents use. A specific tool degrading from 98% to 85% success while aggregate tool success stays at 94% is a signal. Per-tool breakdown surfaces it. Aggregate does not.
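
A small worked example makes the masking effect concrete. The tool names and call volumes below are invented to reproduce the 85% and 94% figures in the paragraph above; the function is a sketch, not a prescribed metric pipeline.

```python
from collections import defaultdict


def success_rates(tool_calls):
    """Return (per-tool success rate dict, aggregate success rate)."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for name, status in tool_calls:
        totals[name] += 1
        successes[name] += status == "success"
    per_tool = {name: successes[name] / totals[name] for name in totals}
    aggregate = sum(successes.values()) / sum(totals.values())
    return per_tool, aggregate


calls = (
    [("web_search", "success")] * 291 + [("web_search", "error")] * 9
    + [("fx_rates", "success")] * 85 + [("fx_rates", "error")] * 15
)
per_tool, aggregate = success_rates(calls)
# web_search: 291/300, fx_rates: 85/100, aggregate: 376/400
```

The high-volume healthy tool dominates the aggregate, which is why the degraded tool only shows up as its own series.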

Policy enforcement actions per hour, broken out by verdict: allow, deny, modify, escalate. The deny rate trending up without a corresponding change in traffic is a signal that something has changed, either a new attack pattern, a changed tool description, or an agent behaving differently than expected. The escalate rate trending up means agents are increasingly encountering decisions that require human review, which may indicate that the policy rules need updating to handle a common scenario without escalation.

Semantic cache hit rate over a 7-day window. Semantic cache hit rate should stabilize once an agent’s workload is predictable. A hit rate that is declining over time indicates query distribution drift: the agent is asking questions that do not match its cached answers, either because its tasks have changed or because something is generating high-entropy queries that resist caching. Cache hit rate is also a direct cost signal: every cache hit replaces a model call, so declining hit rate is declining cost efficiency.


The unified telemetry pipeline

Observability, the policy engine, and the audit log are not three separate systems. They are three consumers of the same event stream. Every agent action produces one event. That event carries the fields for the trace span, the metric counters, the operational log, and the compliance audit record simultaneously. Routing that event to three consumers at write time is dramatically cheaper and more coherent than maintaining three separate instrumentation paths that each capture a partial view of the same action.

The architecture looks like this: the proxy intercepts the agent action, enforces the policy, and emits a single structured event. The event router splits it to three sinks: the trace collector (OTel-compatible, short retention, optimized for search and visualization), the metrics aggregator (counter and gauge updates, used for dashboards and alerting), and the audit log (durable, tamper-evident, optimized for the query patterns auditors run and the retention windows compliance requires). The schema for the event is shared across all three outputs. Adding a field adds it everywhere.
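
The fan-out can be sketched in a few lines. Everything here is illustrative: the `EventRouter` class, the in-memory sinks, and the event fields are hypothetical, and the SHA-256 hash chain is one common way to make an append-only log tamper-evident, swapped in as an assumption rather than a description of any particular product's audit log.

```python
import hashlib
import json
from collections import Counter


class EventRouter:
    """Route one structured event to trace, metric, and audit sinks."""

    def __init__(self):
        self.trace_sink = []          # short retention, searchable
        self.metric_sink = Counter()  # counters for dashboards and alerts
        self.audit_sink = []          # append-only, hash-chained records

    def emit(self, event):
        payload = json.dumps(event, sort_keys=True)
        self.trace_sink.append(json.loads(payload))  # independent copy
        self.metric_sink["verdict." + event["policy.verdict"]] += 1
        prev = self.audit_sink[-1]["hash"] if self.audit_sink else ""
        digest = hashlib.sha256((prev + payload).encode()).hexdigest()
        self.audit_sink.append({"payload": payload, "hash": digest})

    def audit_intact(self):
        """Recompute the chain; any edited record breaks every later hash."""
        prev = ""
        for record in self.audit_sink:
            expected = hashlib.sha256((prev + record["payload"]).encode()).hexdigest()
            if record["hash"] != expected:
                return False
            prev = record["hash"]
        return True


router = EventRouter()
router.emit({"agent.id": "research_agent.v4.2", "tool.name": "web_search",
             "policy.verdict": "allow"})
router.emit({"agent.id": "research_agent.v4.2", "tool.name": "file_write",
             "policy.verdict": "deny"})
```

All three sinks are fed from the same serialized payload, so a field added to the event appears in every output without a second instrumentation path.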

This is the design choice that makes agent observability operationally tractable. Teams that separate their observability stack from their compliance stack end up maintaining two schemas, two ingestion paths, two sets of retention policies, and two sets of queries that describe the same events. When an incident occurs, they are joining across two systems to get a complete picture. When an audit occurs, they are exporting from one system and re-enriching it with data from another. The unified pipeline makes both paths straightforward: query the trace for the incident timeline, query the audit log for the compliance evidence, both pointing to the same underlying events.

The Govyn proxy emits this unified event stream natively. The alternative is to build the enrichment at the SDK layer, which requires custom instrumentation in every agent and every orchestration framework, produces incomplete policy and routing spans, and creates a maintenance burden every time a new agent is deployed. The proxy versus SDK architectural tradeoff in governance applies identically to observability, because governance and observability share the same infrastructure position.


Govyn is an open-source AI governance proxy that emits the trace spans, metrics, and event logs this post describes from the proxy layer, without requiring SDK instrumentation in individual agents. We build the infrastructure. The observability model here is based on production deployments and the OTel GenAI conventions, cited inline. We believe the architectural case for proxy-native observability stands on its own merits.

Govyn is open source, MIT licensed. Self-host or cloud-hosted.
