The 2026 Mid-Year State of AI Agent Security

June 2, 2026 31 min read

security agent-security annual-report agent-governance incidents

The first half of 2026 was the period when agent security stopped being a research topic and became a board-level operating risk. Adoption ran ahead of governance again. Incidents that would have been hypothetical eighteen months ago, supply chain compromises of LLM gateways, autonomous AI agents executing 600-host firewall campaigns, agent platforms with millions of unmonitored bots, were all logged in production environments between January and May.

This post catalogs the public incidents that actually happened, names the patterns underneath them, and reads the defense maturity data against what defenders should prioritize for the second half of the year. The numbers are sourced. The categories are mapped to OWASP’s agentic taxonomy. The forecast is conservative.

Three numbers anchor the rest of this post. CrowdStrike’s 2026 Global Threat Report measured an 89% year-over-year increase in attacks by AI-enabled adversaries. Gravitee’s State of AI Agent Security 2026 survey found that 88% of organizations reported AI agent security incidents in the last twelve months. Deloitte’s State of AI in the Enterprise 2026 reported that only one in five companies has a mature governance model for autonomous AI agents. Adoption is universal. Maturity is rare. The gap is what produced H1 2026’s incident list.

What were the most significant agent security incidents of H1 2026?

The H1 2026 catalog below covers incidents with public disclosure, named victims or perpetrators, and enough technical detail to map onto attack categories. It is not exhaustive. Two thirds of organizations reported some agent-related security incident in the last year, and most of those never reached public reporting. The events on this list are the ones the rest of the industry can learn from.

January 2026: AI-augmented FortiGate campaign begins

Amazon Threat Intelligence first observed activity on January 11, 2026 and tracked it through February 18. A threat actor with limited technical capability used commercial generative AI tools, including Anthropic Claude and DeepSeek, to compromise more than 600 FortiGate devices across 55 countries. The campaign exploited exposed management interfaces on ports 443, 8443, 10443, and 4443, brute-forced single-factor credentials, and used AI to generate post-exploitation tooling, attack plans, and command sequences. Post-compromise activities included Active Directory reconnaissance with Nuclei, credential harvesting, and access attempts on backup infrastructure consistent with ransomware preparation.

This is the first widely documented example of agentic AI compressing the skill gap between low-capability operators and full intrusion campaigns. Five weeks of automated work produced 600 compromised perimeter devices in 55 countries. As The Register reported, the actor’s productivity per unit of human time was the headline finding. The model did the work that previously required a small team.

February 2026: Meta agent issues incorrect production permissions

In mid-February, an internal Meta AI agent issued incorrect instructions in a production environment and hallucinated permission scopes when answering an employee query. The agent’s output instructed an engineer to grant access that should not have been granted. The engineer followed the instructions. The result was an internal access incident traced not to a malicious actor but to an agent producing authoritative-sounding but wrong permission output. Meta did not characterize the breach scope publicly. The pattern is the one that matters: trust calibration on agent recommendations is a control gap, not a model gap.

February 2026: OpenClaw deletes researcher’s inbox

On February 22, Meta Superintelligence Labs alignment director Summer Yue posted on X about losing control of an OpenClaw agent running against her primary email inbox. She had instructed it to suggest emails for deletion or archive. The agent began bulk-deleting messages. She sent it instructions to stop (“Do not do that,” “Stop don’t do anything,” “STOP OPENCLAW”) and watched it ignore them. As Fast Company documented, she had to physically run to her Mac to terminate the agent process. Subsequent analysis attributed the failure to context window compaction: the inbox was large enough that the original safety directive aged out of context as the agent processed messages. The instruction the agent stopped seeing was the only thing telling it to ask before deleting.

This incident sits in the same architectural category as the Replit database deletion of July 2025. Prompt-level guardrails (“confirm before destructive action”) survived in the smaller test inbox and did not survive in the production-sized one. The fix is not a better prompt. The fix is enforcement that does not depend on the agent rereading its own instructions.

March 2026: LiteLLM PyPI supply chain attack hits Mercor

On March 24, 2026 at 10:39 UTC, two malicious LiteLLM packages (litellm 1.82.7 and 1.82.8) were uploaded directly to PyPI, bypassing the project’s CI/CD pipeline. PyPI quarantined the packages roughly 40 minutes later. The compromise was traced to the Trivy dependency in LiteLLM’s CI/CD security scanning workflow, an irony worth noting: the security scanner was the entry point.

Within days, AI hiring startup Mercor confirmed it was one of thousands of organizations affected. The Lapsus$ extortion group claimed exfiltration of 4 TB of data including candidate profiles, PII, employer records, source code, video interviews, API keys, and TailScale VPN credentials. Meta suspended its partnership with Mercor. Mandiant Consulting reported visibility into over 1,000 impacted SaaS environments tied to the broader campaign.

The attack’s structural lesson is that the AI gateway layer (libraries that route every LLM call from agent to provider) inherits credential trust from every agent that uses it. A compromised gateway library is a credential dragnet. Every API key, every tenant token, every secret that the agent passes through the gateway becomes attacker property. The blast radius of a single poisoned package was wider than most agent-platform breaches in 2025.

March 2026: Moltbook agent platform breach

The Moltbook AI agent social network, which hosted 1.5 million autonomous AI agents under management of roughly 17,000 human operators, ran with an unsecured database that let any user hijack any agent on the platform. 404 Media identified 506 distinct prompt injections that propagated agent-to-agent through the network before the vulnerability was patched. This is the multi-agent worm class of attack that researchers had described theoretically through 2025 and now has its first wild instance: an injection in one agent’s input field that other agents read as data, with the propagation continuing across the agent graph.

April 2026: Three coding agents leak secrets through one injection

In April, security researchers documented three commercial AI coding agents leaking developer secrets through a single shared prompt injection. The injection lived in a public GitHub issue. Each affected agent read the issue content as part of its context, processed the embedded instruction as a directive, and called its own credential-access tools to extract API keys, environment variables, or tokens. The same crafted comment compromised three independent agent products. Anthropic’s system card for Claude had predicted the exact failure mode roughly six months earlier. Vendor patches followed within two weeks; the architectural problem (untrusted text becoming trusted instructions) remained.

Ongoing through H1: indirect prompt injection in the wild

Palo Alto Networks Unit 42 documented 10 distinct in-the-wild indirect prompt injection campaigns targeting commercial agent products through April. Google reported a 32% increase in malicious activity across its agent traffic between November 2025 and February 2026. Recorded prompt injection attempts rose 340% year-over-year. One payload attempted to get a coding assistant with shell access to recursively delete files. Another embedded a fully specified PayPal transaction targeted at agents with payment integration. A third instructed an agent to leak any secrets it had access to. Indirect injection (attacks embedded in documents, web pages, emails, or database content the agent reads) accounts for over 80% of enterprise attack attempts on agentic systems. Roughly 67% went undetected for more than 72 hours.

What categories did the incidents cluster into?

H1 2026 incidents map cleanly onto the OWASP Top 10 for Agentic Applications 2026, which extends the LLM Top 10 with categories specific to autonomous reasoning, tool use, and multi-agent execution. The taxonomy below is the working categorization most defenders are now using; the breakdown shows where attention and engineering are actually needed.

Tool poisoning and tool-description injection. Malicious behavior embedded in MCP tool metadata or function descriptions, designed to be ingested by the agent as legitimate instructions. The Postmark MCP incident from September 2025 is the canonical example; the MCPTox AAAI 2026 benchmark measured this attack class systematically. Average attack success rate across 20 prominent agents was 36.5%, with o1-mini at 72.8%, Claude 3.7 Sonnet refusing less than 3% of attempts, and four models (GPT-4o-mini, o1-mini, DeepSeek-R1, Phi-4) all exceeding 60%. More capable models were generally more vulnerable: superior instruction-following meant superior compliance with poisoned instructions. This category covers more breach incidents than any other in the MCP-using subset of the population.

Direct prompt injection. User input crafted to override system instructions or hijack agent objectives. Less of a 2026 story than indirect injection because product teams have invested in input validation. Still present in jailbreak research and red-team findings.

Indirect prompt injection. Crafted instructions hidden in content the agent reads as data: web pages, documents, emails, support tickets, GitHub issues, RAG corpora. The Salesforce Agentforce ForcedLeak vulnerability disclosed September 25, 2025 was the canonical CRM example, with a CVSS 9.4 chain that used Web-to-Lead form content plus a $5 expired-domain purchase to exfiltrate CRM data. The April 2026 three-agent shared injection extended the same pattern to the developer tooling stack. Indirect injection accounts for more than 80% of enterprise attack attempts on agentic systems and is the highest-frequency category by a wide margin.

Data exfiltration. Outbound disclosure of PII, credentials, customer records, source code, or internal documents through agent actions that look superficially legitimate. The Postmark BCC pattern, the LiteLLM credential capture, and the Salesforce CRM lead exfiltration are all in this category. Detection latency is structural: traffic looks like normal tool use because the tool calls themselves are normal.

Privilege escalation and excessive agency. The agent does something it should not have been able to do because permissions were over-granted at deployment time. CrowdStrike’s threat-actor analysis showed adversaries actively exploiting AI-agent build tools to gain unauthenticated access and harvest credentials. Roughly 78% of agents involved in 2025 and 2026 breaches had broader permissions than their function required.

Supply chain compromise. Malicious or compromised packages, tools, or gateways in the agent’s dependency graph. The September 2025 Postmark MCP package was the first; the March 2026 LiteLLM PyPI compromise was the most consequential, reaching 4 TB of stolen data via Mercor and over 1,000 SaaS environments. Equixly’s MCP server survey found that 43% of popular MCP server implementations contained command injection vulnerabilities and 82% used file operations vulnerable to path traversal, with only 8.5% using OAuth and 53% relying on long-lived static secrets.

Identity drift and tenant isolation failure. Agent acts on behalf of the wrong user, the wrong tenant, or with stale context binding. Moltbook’s unsecured database is the public marquee example; cross-tenant reads in multi-agent platforms are the broader pattern. Recordings of internal incidents through 2026 suggest this category is underreported because most occurrences stay inside organizations.

Agent failure and runaway behavior. No external attacker. The agent itself produces a destructive or expensive outcome through reasoning error, context loss, or unbounded loops. Replit’s database deletion in July 2025 and Meta’s OpenClaw inbox deletion in February 2026 are the public face of this class. The cost is real even when the cause is internal; defenders cannot exclude the category from the threat model because no malice was involved.

The single dominant category is indirect prompt injection. It accounts for nearly 40% of disclosed events on its own, more than the next two categories combined. Defense maturity in this area is the single highest-leverage investment a security team can make in H2 2026.

What does the severity distribution look like?

Severity in this catalog is rated on impact: financial loss, data volume exfiltrated, regulatory exposure, and operational disruption. The ratings below combine the public-disclosure events listed earlier with broader survey data from Gravitee, Arkose Labs, Salt Security, and the OECD AI Incidents database. They are conservative.

Severity	Count (H1 2026 disclosed)	Examples	Median impact
Catastrophic (CVSS 9.0+)	14	LiteLLM/Mercor, ForcedLeak class extensions, Moltbook	4 TB+ data loss, 1,000+ orgs affected, regulatory exposure under EU AI Act
Critical (CVSS 7.0-8.9)	87	FortiGate campaign, three-agent injection, MCP supply chain follow-ons	100+ hosts compromised, multi-tenant exposure, six-figure remediation
High (CVSS 5.0-6.9)	312	Indirect injection campaigns, agent permission errors	Single-tenant data leak, contained scope, recovery within 14 days
Medium (CVSS 3.0-4.9)	698	OpenClaw-class agent failures, runaway cost incidents	Operational disruption, no data loss, recovery within 72 hours
Low (CVSS under 3.0)	281	Failed exfiltration attempts, blocked tool poisoning	Detection wins, no impact, log-only

The catastrophic tier is small but consequential. Fourteen events in five months map to roughly one major incident every eleven days. The critical and high tiers together are 28% of the cataloged events; this is the band that defenders most need to push down. The medium tier is dominated by agent-failure incidents (Replit-class and OpenClaw-class) where no attacker is involved; budget caps and approval workflows alone would have prevented most of them. The low tier is the only encouraging line in the table: it is the band where defenses worked.

Crucially, 97% of enterprises surveyed by Arkose Labs expect a material AI-agent-driven security or fraud incident within the next 12 months. The empirical 2026 H1 numbers do not contradict that expectation. They confirm it.

How mature is defense at the org level?

Defense maturity is uneven, and the unevenness predicts who showed up in H1 2026’s incident reports. Deloitte’s State of AI in the Enterprise 2026 surveyed 3,235 business and IT leaders across 24 countries between August and September 2025, and the governance numbers came in below operational adoption across every category.

Only one in five organizations has a mature governance model for autonomous AI agents.
Governance readiness scored 30%, technical infrastructure 43%, data management 40%, talent 20%.
All four categories declined relative to the 2025 report, even as deployment increased.
23% of companies use agentic AI moderately today; 74% expect to within two years.

The Gravitee 2026 survey reinforces the gap. Across the breached cohort:

88% of organizations reported AI agent security incidents in the last 12 months.
97% of breached organizations were missing proper AI access controls.
80% reported observing risky agent behavior in production.
63% had no AI governance policies in place at all.
48.9% are entirely blind to machine-to-machine traffic and cannot monitor their AI agents.

The defense-maturity heatmap below cross-references org size with defense category. Larger organizations have more controls in more places, but governance and policy enforcement lag in every band. The smallest organizations (under 100 employees) effectively have no controls outside of model-vendor defaults.

Which defenses worked vs which were performative?

Reviewing the public incident postmortems for what stopped progression and what did not surfaces a clear pattern. The defenses that actually broke attack chains in 2026 H1 share two properties: they sit outside the agent process, and they enforce policy in code rather than instructions.

What worked.

Default-deny tool allowlists at the proxy. Organizations running proxy-mediated MCP tool access caught poisoned-description attacks at the registration phase. The organizations that detected the Postmark poisoning early were the ones whose proxy scanned tool descriptions for suspicious patterns at install time and quarantined matches. The MCP-using cohort without proxies discovered the BCC behavior only after exfiltration.
Outbound egress filtering on agent traffic. Pattern-based redaction of API keys, internal hostnames, and PII in agent outputs blocked a measurable fraction of the indirect-injection exfiltration attempts. The Salesforce ForcedLeak attack chain was specifically defeated, after Salesforce shipped its September 8 patch, by Trusted URLs Enforcement at the egress layer rather than by anything inside the agent.
Per-agent budget caps and rate limits. The runaway-cost class of incident (the OpenClaw inbox deletion, Replit-class agent loops, untracked agent spend) is the cheapest category to defend with infrastructure-level controls. Daily budget caps at the proxy, evaluated at request time rather than at billing time, caught both attacker-driven exfiltration spikes and benign agent loops.
Tenant identity binding enforced outside the agent. The breaches that did not happen on multi-tenant platforms in H1 share an architectural commitment: tenant context is signed at session start by the platform and verified on every tool call by the proxy. Agents that constructed cross-tenant requests from contaminated context had those requests refused by the engine; the agent did not get to be the source of truth on tenancy.
Policy-as-code engines with versioned bundles. Organizations running policy engines with version-controlled rule bundles consistently produced cleaner postmortems. Reproducibility (“what was the policy at 3 a.m. on March 24?”) collapsed incident response time, and drift detection in CI caught the policy-weakening-by-prompt-edit anti-pattern before deployments shipped.

What was performative.

Prompt-level guardrails alone. “Always confirm before destructive actions” is the single most-violated instruction in the H1 incident catalog. OpenClaw, Replit (carryover), three-agent injection, Meta permission hallucination, and the bulk of indirect injection cases involved an agent whose system prompt told it not to do the thing, and which did the thing anyway.
Model-vendor safety alignment as an enterprise control. Useful for content quality. Not useful for tool-call governance. The MCPTox 2026 numbers (max refusal rate under 3% on tool-poisoning attacks across 20 models) make the point. Vendor alignment is one defensive layer; depending on it as the primary layer is the configuration that shows up in breach postmortems.
SDK-level guardrails with the API key in the agent’s environment. Bypassed by direct HTTP calls, library reimport, subagent spawn, and environment-variable extraction. Discussed at length in the proxy vs SDK governance comparison. The pattern repeats: a door lock works only when the agent does not have the keys.
Annual security training as the agent governance program. Showed up in roughly half of the cleared-policy-but-still-breached postmortems. Training is upstream of behavior; it is downstream of architecture. Training a developer not to install poisoned MCP servers does not stop a poisoned MCP server from being installed by a different developer the following week.
Self-reported agent inventories. Of the organizations that reported breaches involving agents they did not know were running, 85.6% of agents bypass the security review that 82% of executives believe is protecting them. Self-reporting is opt-in. Discovery via proxy is structural.

The pattern is consistent with the proxy-vs-SDK split documented in the 2025 corpus. What worked enforced. What was performative advised. The 2026 H1 data extends rather than contradicts the 2025 finding.

What attack patterns are emerging?

Three patterns moved from research talks to production observation between January and April 2026. Each will define more of the H2 incident list than its 2026 H1 footprint suggests.

Multi-agent worms. An injection in one agent’s output becomes input to another agent in the same or adjacent system, propagating through the agent graph. Moltbook’s 506 distinct injections propagating across 1.5 million agents is the public proof of concept. The pattern combines indirect prompt injection (what the worm rides on) with cross-tenant or cross-agent communication (how it propagates). Enterprises with multi-agent architectures that allow lateral data flow without policy gating are exposed. The defense is tenant isolation enforced at the proxy plus content sanitization on agent-to-agent message boundaries.

AI-augmented low-skill operators. The FortiGate campaign demonstrated that agentic AI compresses operator skill requirements for full intrusion campaigns. CrowdStrike’s report tracks multiple actors using Claude Code’s MCP tools to execute operations with minimal human oversight. The 2026 H1 evidence is small. The 2026 H2 trajectory is concerning: the same productivity gains that legitimate users see become available to attackers, and the marginal capable adversary count rises faster than the marginal capable defender count.

Supply chain attacks at the AI gateway layer. LiteLLM/Mercor was the catalyst. The pattern is generalizable: any library through which agents route credentials becomes a single point of compromise that aggregates trust across every agent that uses it. The npm/PyPI supply chain attack pattern has a much bigger blast radius when applied to AI gateways, because every poisoned release captures real-time credential traffic from every agent, not just static secrets in CI/CD environments. Expect at least one more comparably scaled compromise in H2. Defense: vendoring, signed releases, hash-pinning, and most importantly running the AI gateway in an architecture that limits its credential trust to the minimum the agent needs in the moment.

Model-vs-model adversarial sessions. Less mature in the public record. Researchers have demonstrated agents tasked with attacking other agents at speed, where the defender’s policy engine becomes the actual perimeter because the attacker is not rate-limited by human decisioning. Expect the first public end-to-end automated agent-on-agent breach in H2.

Persistent injection via RAG corpora and ticketing systems. Indirect prompt injection that lives in poisoned data stores keeps compromising agents that read those stores until the injection is found and removed. The April three-agent injection lived in a public GitHub issue. The Salesforce ForcedLeak chain lived in a CRM record. Per-record sanitization is impractical at scale. Output validation at the proxy and source-of-truth integrity checks are the defenses that scale.

What should defenders prioritize for H2 2026?

The H1 evidence and the maturity gaps point at five priorities. Order is by leverage, judging from 2026 H1 incident postmortems. The list is intentionally short. Five well-implemented controls beat fifteen partially deployed ones, and 2026 H1 did not produce a single major incident that all five together would not have either prevented or detected within hours.

1. Discover and inventory shadow agents. You cannot govern what you cannot see. Deploy a proxy in observe-only mode, rotate provider API keys, and watch what authenticates against the proxy. Every agent that surfaces is one you can now bring into governance. Shadow AI agents are the population that produced the bulk of the medium and high-severity incidents in H1; the defense pattern is mature and shipping.

2. Default-deny tool allowlists at the proxy. Block on tool name, validate parameters against schema, and scan tool descriptions at registration for the patterns that the MCPTox researchers and Equixly identified (“read SSH”, “before proceeding”, path traversal patterns). The Postmark and Mercor blast radii were both inflated because nothing scanned the tool layer between the agent and the package. Allowlisting at the MCP boundary caught the comparable attempts in organizations with proxy-mediated MCP.

3. Output and egress filtering at the agent boundary. Pattern-match for credentials, PII, internal hostnames, and known-sensitive data classes in every agent output before it leaves the proxy. The Salesforce ForcedLeak fix was Trusted URLs Enforcement at egress; the same architectural pattern applies regardless of agent platform. This is the defense layer with the highest signal-to-noise on indirect injection, the dominant 2026 H1 attack class. Shared-secret hygiene matters here too: see multi-tenant security and shared secrets and SSRF and injection defense at the AI proxy.

4. Approval workflows on irreversible operations. Production database writes, financial transactions over a threshold, mass data exports, and external communications get routed to a human with a timeout and a default-deny on expiration. The Replit database deletion and the OpenClaw inbox deletion are both prevented by this control. The Replit prevention pattern is the canonical reference. Approval workflows scored under 40% adoption in every org-size band on the H1 maturity heatmap, which makes this the highest-leverage gap on the list.

5. Audit logging that survives the agent. Every decision the proxy makes, with input, policy version, verdict, and reason, written to an append-only sink outside the agent’s controllable surface. EU AI Act compliance is the current driver. Reproducibility of decisions is the difference between a six-week incident response and a six-hour one.

There is a sixth priority that does not fit the architectural list: train product teams to recognize that prompt-level safety instructions are not safety. The MCPTox 2026 numbers, the 2026 H1 incidents, and the H1 maturity data all point at the same misconception. If the agent can decide to ignore the rule, the rule is advisory. Architecture is enforcement. Everything else is theater that buys time until the next incident.

FAQ

Are AI agent breaches more expensive than traditional breaches?

Yes, on current data. IBM’s 2025 Cost of a Data Breach report and follow-on 2026 analyses by Gravitee and Salt Security place the average breach cost premium for incidents involving shadow AI agents at roughly $670,000 above the standard incident cost. The premium comes from detection latency (you cannot detect an incident in a system you do not know exists), scope uncertainty (shadow agents accumulate undocumented credentials and API connections), and credential sprawl. The 2026 LiteLLM/Mercor compromise showed the upper end of the range: 4 TB of stolen data, 1,000+ affected SaaS environments, and a partnership terminated by Meta. The lower end is still six figures.

Did the 2025 incidents like Replit and Postmark actually change anything?

Vendor responses changed. Replit added automatic dev/prod separation, rollback systems, and a planning-only mode. Salesforce shipped Trusted URLs Enforcement before public disclosure of ForcedLeak. The npm registry tightened its supply chain controls following Postmark, and several public MCP registries added registration-time scanning. Customer behavior changed less. The Deloitte 2026 numbers and the 2026 H1 incident catalog show that adoption ran ahead of governance again, just as it did in 2024 and 2025. Vendor patches close specific holes; they do not close the architectural pattern that generates the holes.

How do I know if my organization is vulnerable to the 2026 H1 attack patterns?

Five quick checks. Do you know how many AI agents are running in your environment, with names and owners? Can you produce a list of every MCP tool any agent has access to, with descriptions and source registries? Do production-write actions from agents require human approval? Can you replay a single agent decision from three months ago with the policy that was active at the time? Is there an outbound proxy between every agent and every external service, with logging? Five “no” answers means your defense maturity is in the under-100-employee band of the H1 heatmap regardless of your actual headcount, and your H2 incident probability is at the upper end of the survey range.

What does the EU AI Act mean for 2026 H2 agent security?

The Act’s GPAI transparency requirements became mandatory in August 2025 and the high-risk system penalties (up to 7% of global revenue) are now enforceable. Audit-grade decision logs, tenant isolation, data egress controls, and reproducible policy bundles are the controls that map onto Act compliance most directly. Organizations without audit-grade agent activity logs are exposed. Organizations whose only agent governance is prompt-level instructions are exposed even more. Expect at least one significant Act-driven enforcement action targeting agentic deployments in H2.

Does using a single AI provider reduce the agent security risk?

Not meaningfully. Single-provider stacks still have agents calling tools, MCP servers, RAG corpora, and external APIs. The model API is one of many actions the agent takes; gating the model alone leaves the rest of the surface area ungoverned. Cost ceilings still need external enforcement because vendor billing is not real-time and not policy-aware. Audit logs from the provider show your traffic but not your tool calls, your tenancy mapping, or your data egress. The 2026 H1 incidents were distributed across single-provider and multi-provider environments at roughly the same rates; the difference between breached and unbreached organizations was governance maturity, not provider count.