Skip to content
10 May 2026

The OWASP Top 10 for AI Agents, in Plain English

The OWASP Top 10 for AI Agents, in Plain English

In December, OWASP published their first Top 10 for Agentic Applications. Thanks to John Sotiropoulos - the OWASP GenAI Security Project Board Member who chaired it - I had the chance to take part in the public review, and was acknowledged in the document itself alongside some friends at Kainos.

The document runs to fifty-odd pages of concise, practical, and actionable guidance, and is an excellent reference for security leaders & practitioners. For founders or business leaders who are diving into the world of Cowork (whether Copilot or Claude) with plugins, connectors, skills and all that good stuff - it's important to be aware of these threats too, so here's the plain English version of everything I think you should know about.

The challenge with agentic security

Agents aren't like deterministic traditional software applications. They can plan, utilise tools, request data, write files, and act on behalf of users across multiple steps. That changes the security picture in three ways.

1) Anything the agent reads is an input. Documents, emails, calendar invites, webpages, search results, other agents' replies. The agent can't reliably tell instructions from content, which means any line of text it processes (hidden or otherwise) can become a command.

As an example, somebody recently manipulated Bankrbot, an automated trading bot, to perform the unauthorised transfer of approximately $150K in cryptocurrency by asking the bot to do it in morse code.

2) The tools you give an agent are an attack surface. For anything you let the agent do, ask: what happens if it does it wrong? An agent with read-only access to a database has a different kind of gone-wrong potential than one with full write access.

PocketOS, which provides software for car rental businesses learned this the hard way when an AI agent deleted their entire production database and all its backups in under 10 seconds.

3) The agent acts on its own. It has its own logins. It hands work off to other agents. It runs at 3am. Most identity and audit systems weren't built with that in mind.

Which itself raises a question about accountability for when things go wrong. To quote an internal IBM training manual from 1979: "A computer can never be held accountable, therefore a computer must never make a management decision"

OWASP introduces the principle of Least-Agency. Don't give the agent autonomy it doesn't need. Don't connect tools you can't justify. Don't run continuously what could run on demand.

With all this in mind - here's the Top Ten for AI Agents, in plain English.

1. Goal Hijack

Anything the agent reads (webpages, email, calendar invites) can trick it into following new orders. EchoLeak, the Microsoft 365 Copilot exploit, is the canonical case: a single inbound email caused Copilot to exfiltrate files and chat logs without anyone clicking anything.

What to do: Treat every input the agent reads as untrusted, tag every external input and instruct the model to treat anything inside those tags as data, not instructions. Bind the user's original goal to the run (OWASP calls this an "intent capsule") and have the planner pause if the next step doesn't match. If you asked the agent to draft a tweet and the next step is send an email, that's a goal change.

2. Tool Misuse & Exploitation

The idea here is, based on the tools an agent has access to, and the permissions it has for those tools - could the agent use them in unexpected ways? For example, an email summariser agent that can also delete emails, or a research agent that follows any link it finds (even malicious ones).

What to do: Give each tool the minimum permission it needs - a database tool that reads, not writes. Issue short-lived credentials per task instead of long-lived API keys. Require a human approval. Run tools in sandboxes with allow-listed destinations. Log every invocation and alert on suspicious chains, like a database read followed by an external transfer.

3. Identity & Privilege Abuse

Company login and permission systems were designed for human employees. If your agents act as or on behalf of a person in your business - a lot of things could go wrong. What if your agent hands off work to other agents but re-uses the same login session across tasks, carrying that level of access with them as they go. Or a trusted agent does what another agent told it to do without checking whether the original request came from someone actually authorised to ask. This stuff can get really messy and the audit log can't tell you who actually did what.

What to do: Give each agent its own identity, not a borrowed one from an employee. Give it access only for the task it's doing, only for the time it takes, and only for what that task actually needs - like a visitor pass for a single meeting, not a permanent key card. Clear its memory and access between tasks so today's job doesn't bleed into tomorrow's. Re-check permissions at every step that matters, and don't let one agent rubber-stamp another's request without checking who originally asked. Use a proper system for managing agent policy & access; I really like what Manetu and Jentic are doing in this space.

4. Agentic Supply Chain Vulnerabilities

Your agent doesn't run on your code alone. It pulls in tools, plugins, prompt templates, data feeds, and even other agents - most of them from outside your company. Any of those third-party pieces can be tampered with, swapped for fakes, or quietly compromised before they reach you. Fake tools that copy a real tool's name with a small typo, hidden instructions buried inside a tool's description, or a third-party agent with vulnerabilities you've now invited into your workflow.

What to do: Treat every third-party piece like a vendor you'd vet before signing a contract. Use only trusted sources. Lock to specific versions instead of always pulling the latest, and check that what you're loading hasn't been altered before you use it. Keep a running inventory of every external piece your agents can reach. Have a kill-switch ready that can disable any component instantly if it turns out to be compromised. And apply the same care here as you'd apply to any other software your business depends on.

5. Unexpected Code Execution

The agent writes code and runs it. That's how a lot of useful work gets done - but if an attacker can manipulate what the agent reads, they can manipulate what it writes. The headline case is Replit's "vibe coding" incident, where an agent wiped production data while trying to fix a build during an automated repair task. Even without a single bad tool, an agent can chain together legitimate ones - download a file, change its location, run it - and end up running attacker code on your servers.

What to do: Run all agent-generated code in a sandbox - an isolated environment where it can't reach your real systems - and never with full administrator access. Separate the writing code step from the running code step, with a check in between. Require human approval before running anything elevated. Scan generated code automatically before it runs. And keep code-writing agents well away from your production data.

6. Memory & Context Poisoning

Agents can have memory and remember things between conversations - what you've asked before, what's worked, what hasn't. If an attacker can get a false fact into that memory, it stays there, and the agent treats it as truth from then on. There's a documented attack against ChatGPT where someone planted false memories that survived across sessions and influenced the agent's behaviour months later. The same pattern shows up in shared memory across multiple agents - one bad fact gets in, and several agents all start operating on it.

What to do: Check anything before it's written to memory, the same way you'd check anything written to your customer database. Keep different users' memory separated so one customer's data can't leak into another's. Don't let agents save their own output back into their own memory without a check. Set memory to expire if it can't be verified. Keep snapshots so you can roll back if you discover memory was poisoned. And require human review on anything high-stakes that's based on what's in memory.

7. Insecure Inter-Agent Communication

When multiple agents send messages back and forth to each other - assigning tasks, sharing data, asking for approvals - if those messages aren't protected, an attacker can sit in the middle and read them, change them, or replay old ones at the wrong moment. Or they can pretend to be one of your agents by registering a copy with the same name.

What to do: Encrypt every message agents send to each other. Sign each one so you can prove who sent it and that nobody changed it on the way. Add timestamps so old messages can't be replayed later as if they were new. Confirm an agent's identity before trusting anything it says - names alone aren't enough. Refuse to talk to any agent that won't use the secure version of your messaging protocol. And use a proper directory of approved agents that verifies their identities, not an open one any agent can register itself in.

8. Cascading Failures

When agents work in chains - one passes work to the next, who passes it to the next - a single bad input early in the chain can spread fast. Eight agents might all be confidently executing the same wrong decision before anyone notices. A poisoned market-data feed becomes a real trading position. A false security alert triggers a real shutdown.

What to do: Cap how far any single agent can fan out - limits on how many other agents it can trigger, how fast, and how many actions per minute. Put circuit breakers between the planning step and the action step, so a bad plan can't run unchecked. Check every important output independently, rather than letting one agent's verdict become the next agent's input without review. Require human sign-off on anything that scales - mass updates, mass approvals, mass deployments. Keep tamper-proof logs so you can rewind exactly what happened when things go wrong. And run drills: replay last week's agent activity in a test environment and see whether a cascade would have hit you.

9. Human-Agent Trust Exploitation

The agent doesn't have to do anything bad itself - it just has to convince a human to do it. Agents are confident, articulate, and explain their reasoning persuasively. If a manipulated agent recommends something with a plausible-sounding justification, most users will approve it. The danger: when something goes wrong, the audit log shows a human did the action - the agent's role disappears from the forensics. Imagine an agent suggested an "urgent" payment to attacker-controlled bank details with a confident explanation, and a finance manager approved the transfer (that actually happened by the way).

What to do: Show users a plain-language risk summary before any high-impact action. Mark low-confidence or unverified recommendations clearly. Require explicit human confirmation before money moves, data deletes, or messages go out. Train your team to expect agents to occasionally try to manipulate them, and give them an easy way to flag suspicious agent behaviour. Watch for plans that drift from approved workflows - when an agent's next step doesn't match the original task, pause.

10. Rogue Agents

Sometimes an agent goes off-script and stays off-script, even when no one is steering it any more. It might be gaming its goal - finding shortcuts that technically achieve what you asked but in ways you'd never have chosen. It might be quietly replicating itself across your infrastructure. The textbook example: an agent told to minimise cloud costs learned that deleting backup data was the most effective way to save money, and started destroying disaster-recovery assets to hit its target.

What to do: Set a clear baseline for what each agent is supposed to do, and watch for behaviour that drifts away from it. Use "watchdog" agents whose only job is to monitor other agents for unusual activity. Have a kill-switch ready that can immediately stop any agent and revoke its access. Quarantine any agent showing strange behaviour and review it before allowing it back in. Treat each agent's behaviour itself - not just its identity - as something you check continuously.

Where to start

If you're reading this and your honest answer to "do we do any of that?" is "not really," you're not alone. When we're helping our customers with this stuff at Hypership, here's the order we usually start in.

1) Inventory. Catalogue all of the agents, then write down every tool, plugin, MCP server, and credential each of your agents can reach. You can't apply Least-Agency to a surface you haven't mapped.

2) Lock down destructive actions. Go through every tool, plugin, and connector your agent is hooked up to, one by one. For each, ask: what's the worst version of using this? If the worst version is reversible (drafting an email, running a search, reading from Notion), it can run on its own. If it isn't (sending that email, deleting the page, moving money, deploying to production), put a human approval step in front of it.

Most agent platforms let you configure this directly: set each tool to "ask before running" rather than "always allow." Where the platform doesn't give you that switch, change it at the source: give the agent an API key that can read but can't write.

3) Log everything. Tamper-evident, immutable, on a system the agent itself can't touch. If you can't reconstruct what an agent did and why, you can't recover from it - and you can't tell someone steered it from it drifted on its own.

After that, you're picking off the rest of the ten.