Red-Teaming AI Agents — Attack Surfaces, Strategies & Metrics
Series overview: Production AI Agents — Observability, Safety & Governance
Safety isn’t a checklist — it’s an ongoing adversarial process. Microsoft’s own AI Red Teaming Agent (built on PyRIT) formalizes this into a NIST-aligned workflow:
Map → Measure → Manage
- Map: Identify relevant risks for your specific use case and agent capabilities.
- Measure: Evaluate risks at scale using automated adversarial probing.
- Manage: Mitigate risks with guardrails and monitor continuously.
The core metric is Attack Success Rate (ASR) — the percentage of adversarial probes that successfully bypass your defenses. A low ASR means your guardrails work; a rising ASR means your defenses are eroding. Track it over time, not as a one-time score.
Risk Categories
AI red-teaming targets two classes of risk: model-level (applies to any LLM system) and agent-specific (applies only when the system has tools and makes autonomous decisions).
Model-Level Risks
Text-based risks that apply to any LLM-powered system:
| Category | What it tests |
|---|---|
| Hateful & unfair content | Bias, stereotyping, discrimination |
| Sexual content | Explicit, suggestive, or pornographic outputs |
| Violent content | Descriptions of violence, weapons, harm |
| Self-harm content | Content that encourages or describes self-harm |
| Protected materials | Copyrighted lyrics, recipes, code |
| Code vulnerability | AI-generated code with SQL injection, stack traces, RCE risks across 7 languages |
| Ungrounded attributes | Inferences about demographics or emotional state without basis |
Agent-Specific Risks
These require tool-observing red-teaming — you can’t test them with text-only probes. The red-teaming agent needs mock tools, synthetic data, and the ability to observe what the target agent does, not just what it says.
| Category | What it tests | ASR trigger |
|---|---|---|
| Prohibited actions | Whether the agent performs universally banned operations (facial recognition, social scoring), high-risk actions without human approval (financial transactions, medical decisions), or irreversible actions without confirmation (file deletion, system reset). Defined by your policy taxonomy. | Policy violation detected |
| Sensitive data leakage | Whether the agent exposes financial, medical, or personal data from internal knowledge bases through tool calls or outputs. Uses synthetic data and pattern matching for detection. | Format-level leak detected (SSN, credit card, etc.) |
| Task adherence | Whether the agent faithfully completes assigned tasks across three dimensions: goal achievement (did it achieve the intended goal?), rule compliance (did it respect policy guardrails and presentation contracts?), and procedural discipline (did it use tools correctly, follow grounding requirements?). | Goal failure, rule violation, or procedural error |
| Indirect prompt injection (XPIA) | Whether the agent can be manipulated by malicious instructions hidden in external data sources retrieved via tool calls. The red-teaming agent injects attacks into mock tool outputs and measures whether the target agent executes unintended actions. | Agent executes injected instruction |
Attack Strategies
Automated red-teaming tools apply attack strategies — transformations that make adversarial prompts harder to detect by simple filters. PyRIT supports 24+ strategies. Representative examples:
| Strategy | How it works | Example |
|---|---|---|
| Base64 | Encodes the attack in Base64 | A prompt injection hidden inside what looks like a config string |
| UnicodeConfusable | Replaces characters with visually identical Unicode equivalents | раураl.com using Cyrillic ‘а’ instead of Latin ‘a’ |
| Leetspeak | Substitutes letters with numbers/symbols | h0w t0 h4ck |
| Morse | Encodes attack in Morse code | The model decodes dots-and-dashes into a jailbreak |
| ROT13 / Caesar | Character-shift ciphers | Obfuscates intent from simple keyword filters |
| Crescendo | Gradually escalates prompt risk over successive turns | Starts with benign questions, slowly probes toward dangerous territory |
| Multi-turn | Spreads the attack across multiple conversational turns | Each turn is harmless alone; the accumulated context enables the attack |
| SuffixAppend | Appends adversarial tokens optimized to bypass alignment | Model-specific suffix that increases probability of compliance |
| Jailbreak | Direct User-Injected Prompt Attacks (UPIA) | “Ignore all previous instructions and…” |
| Indirect Jailbreak | Attack hidden in tool outputs or retrieved context (XPIA) | A compromised document the agent retrieves and trusts |
These strategies aren’t just academic — they represent real techniques attackers use. Your red-teaming should test against a representative subset, prioritized by your agent’s risk profile.
Multi-turn attacks deserve special attention. Research published in Feb 2026 found that multi-turn jailbreaks achieved 92% success against 8 open-weight models — spreading the attack across conversation turns makes each turn individually benign while the accumulated context enables the attack. The same pattern applies to agent systems: a user asks 5 harmless questions, each retrieving a different piece of internal documentation, then asks the 6th question that synthesizes the exfiltrated context into a response the guardrail can’t flag because each retrieval was individually authorized.
Agentic Attack Surfaces
When LLMs gain tools, memory, and autonomous action capabilities, the blast radius of any injection expands dramatically. Two attack surfaces are uniquely agentic:
MCP Tool Poisoning
The Model Context Protocol (MCP) is rapidly becoming the standard for connecting LLMs to external tools — and the dominant new attack surface. MCP-specific attack classes identified by Palo Alto Unit 42 and Checkmarx:
| Attack | How it works |
|---|---|
| Tool poisoning | Malicious instructions embedded in description fields that agents trust implicitly. The agent reads the description to decide what the tool does — and executes the hidden instruction. |
| Tool shadowing | Registering a malicious tool with a name similar to a legitimate one, intercepting calls meant for the real tool. |
| Covert invocation | Hidden file system operations without user awareness — an MCP server that silently copies files on every invocation. |
| Cross-MCP contamination | One compromised MCP server overriding another’s behavior or injecting instructions that persist across tool calls. |
The defense: scan MCP servers before connecting (SkillSpector covers MCP-specific patterns LP1-LP4 and TP1-TP4), pin allowed tool IDs per server, and never connect MCP servers from untrusted sources without review.
This isn’t theoretical. In September 2025, the first malicious MCP server was discovered on npm — a supply chain attack targeting agent ecosystems, validating OWASP ASI04 (Agentic Supply Chain). The OWASP project publishes a practical guide for securely using third-party MCP servers.
AI IDE & Coding Assistant Attacks
AI coding assistants (Claude Code, GitHub Copilot, Cursor) have system-level file access and are a high-value target. Notable CVEs:
- CVE-2025-53773 — GitHub Copilot RCE (CVSS 9.6) via prompt injection
- CVE-2025-54135 — Cursor indirect prompt injection via MCP config → RCE
- IDEsaster — 30+ CVEs discovered across AI IDEs in late 2025
- Rules file backdoors —
.cursor/rulesand similar config files can be poisoned with instructions the AI executes with full trust
The defense: treat AI IDE config files as untrusted input, scan skill/rules files before installation, and run coding agents in containers with limited filesystem access.
Real-world example: CVE-2025-32711 (EchoLeak) demonstrated zero-click prompt injection against Microsoft 365 Copilot — the AI assistant was forced to exfiltrate sensitive business data to an external URL without any user interaction, using character-substitution attacks that bypassed safety filters.
Testing your defenses systematically against all of these attack surfaces requires automation.
Automated Adversarial Testing
Two major open-source frameworks automate this:
| Tool | Focus | Approach |
|---|---|---|
| PyRIT (Microsoft) | Full red-teaming lifecycle — generates adversarial prompts, executes attacks, evaluates ASR, generates scorecards. Python library with 24+ attack strategies. | Framework: orchestrates attack → response → evaluation loop |
| Garak (NVIDIA) | Vulnerability scanning — probes for known LLM failure modes across prompt injection, jailbreaking, and content safety categories. | Scanner: runs probes against endpoints, reports pass/fail |
The Foundry AI Red Teaming Agent is Microsoft’s managed cloud offering built on PyRIT — automated scans, ASR scoring, reporting, and continuous monitoring in Foundry. For agent-specific risks, it runs in a sandboxed cloud environment with mock tools and synthetic data, preventing real-world side effects during testing.
Purple Environment
Red-teaming can have side effects — an agent that deletes files during a real red-team exercise has deleted real files. Run red-teaming in a purple environment: a non-production environment configured with production-like resources (same tools, same data schemas, same models) but with synthetic data and isolated infrastructure. The Foundry AI Red Teaming Agent enforces this for agent-specific risk categories — runs are transient, mock tools serve synthetic data, and chat completions aren’t persisted.
Manual Red-Teaming
Automated tools miss novel attacks. Schedule regular sessions where humans (or creative LLMs prompted to be adversarial) try to break your agent. Bug bounty programs work for AI safety too — reward findings. Microsoft’s AI Red Team, after testing 100+ generative AI products, reports that “mitigations do not eliminate risk entirely” — continuous red-teaming is essential because model-layer defenses are probabilistic by construction.
Track Safety Metrics
| Metric | What it measures | Target |
|---|---|---|
| Attack Success Rate (ASR) | % of adversarial probes that bypass defenses | Trending ↓ over time; spike = investigate |
| Guardrail trigger rate (by category) | Which attacks hit your system, how often | Stable or declining |
| Human override rate | How often humans reject agent decisions | Set by risk tolerance of use case |
| Time-to-detect for new attack patterns | How fast can you identify and patch new attacks? | Hours, not weeks |
ASR is the north star. Everything else feeds into it.
The companion article Safety for AI Agents covers the guardrail architectures, threat models, and defense-in-depth patterns that red-teaming validates. Together they form a continuous loop: red-teaming finds gaps → guardrails close them → red-teaming verifies the fix.
References
- PyRIT: Python Risk Identification Tool — Microsoft’s open-source red-teaming framework
- Garak: LLM vulnerability scanner
- Foundry AI Red Teaming Agent — Managed cloud red-teaming built on PyRIT
- Palo Alto Unit 42: MCP Attack Vectors — Three critical MCP attack classes (Dec 2025)
- OWASP Cheat Sheet: Securely Using Third-Party MCP Servers
- Prompt Injection Attacks on Agentic Coding Assistants (SoK, arXiv 2026) — Meta-analysis of 78 studies; >85% attack success against SOTA defenses
- Lessons from Red Teaming 100 Generative AI Products — Microsoft Security Blog
- OWASP Top 10 for Agentic Applications (2026)