Series overview: Production AI Agents — Observability, Safety & Governance

Safety isn’t a checklist — it’s an ongoing adversarial process. Microsoft’s own AI Red Teaming Agent (built on PyRIT) formalizes this into a NIST-aligned workflow:

Map → Measure → Manage
  • Map: Identify relevant risks for your specific use case and agent capabilities.
  • Measure: Evaluate risks at scale using automated adversarial probing.
  • Manage: Mitigate risks with guardrails and monitor continuously.

The core metric is Attack Success Rate (ASR) — the percentage of adversarial probes that successfully bypass your defenses. A low ASR means your guardrails work; a rising ASR means your defenses are eroding. Track it over time, not as a one-time score.

Risk Categories

AI red-teaming targets two classes of risk: model-level (applies to any LLM system) and agent-specific (applies only when the system has tools and makes autonomous decisions).

Model-Level Risks

Text-based risks that apply to any LLM-powered system:

Category What it tests
Hateful & unfair content Bias, stereotyping, discrimination
Sexual content Explicit, suggestive, or pornographic outputs
Violent content Descriptions of violence, weapons, harm
Self-harm content Content that encourages or describes self-harm
Protected materials Copyrighted lyrics, recipes, code
Code vulnerability AI-generated code with SQL injection, stack traces, RCE risks across 7 languages
Ungrounded attributes Inferences about demographics or emotional state without basis

Agent-Specific Risks

These require tool-observing red-teaming — you can’t test them with text-only probes. The red-teaming agent needs mock tools, synthetic data, and the ability to observe what the target agent does, not just what it says.

Category What it tests ASR trigger
Prohibited actions Whether the agent performs universally banned operations (facial recognition, social scoring), high-risk actions without human approval (financial transactions, medical decisions), or irreversible actions without confirmation (file deletion, system reset). Defined by your policy taxonomy. Policy violation detected
Sensitive data leakage Whether the agent exposes financial, medical, or personal data from internal knowledge bases through tool calls or outputs. Uses synthetic data and pattern matching for detection. Format-level leak detected (SSN, credit card, etc.)
Task adherence Whether the agent faithfully completes assigned tasks across three dimensions: goal achievement (did it achieve the intended goal?), rule compliance (did it respect policy guardrails and presentation contracts?), and procedural discipline (did it use tools correctly, follow grounding requirements?). Goal failure, rule violation, or procedural error
Indirect prompt injection (XPIA) Whether the agent can be manipulated by malicious instructions hidden in external data sources retrieved via tool calls. The red-teaming agent injects attacks into mock tool outputs and measures whether the target agent executes unintended actions. Agent executes injected instruction

Attack Strategies

Automated red-teaming tools apply attack strategies — transformations that make adversarial prompts harder to detect by simple filters. PyRIT supports 24+ strategies. Representative examples:

Strategy How it works Example
Base64 Encodes the attack in Base64 A prompt injection hidden inside what looks like a config string
UnicodeConfusable Replaces characters with visually identical Unicode equivalents раураl.com using Cyrillic ‘а’ instead of Latin ‘a’
Leetspeak Substitutes letters with numbers/symbols h0w t0 h4ck
Morse Encodes attack in Morse code The model decodes dots-and-dashes into a jailbreak
ROT13 / Caesar Character-shift ciphers Obfuscates intent from simple keyword filters
Crescendo Gradually escalates prompt risk over successive turns Starts with benign questions, slowly probes toward dangerous territory
Multi-turn Spreads the attack across multiple conversational turns Each turn is harmless alone; the accumulated context enables the attack
SuffixAppend Appends adversarial tokens optimized to bypass alignment Model-specific suffix that increases probability of compliance
Jailbreak Direct User-Injected Prompt Attacks (UPIA) “Ignore all previous instructions and…”
Indirect Jailbreak Attack hidden in tool outputs or retrieved context (XPIA) A compromised document the agent retrieves and trusts

These strategies aren’t just academic — they represent real techniques attackers use. Your red-teaming should test against a representative subset, prioritized by your agent’s risk profile.

Multi-turn attacks deserve special attention. Research published in Feb 2026 found that multi-turn jailbreaks achieved 92% success against 8 open-weight models — spreading the attack across conversation turns makes each turn individually benign while the accumulated context enables the attack. The same pattern applies to agent systems: a user asks 5 harmless questions, each retrieving a different piece of internal documentation, then asks the 6th question that synthesizes the exfiltrated context into a response the guardrail can’t flag because each retrieval was individually authorized.

Agentic Attack Surfaces

When LLMs gain tools, memory, and autonomous action capabilities, the blast radius of any injection expands dramatically. Two attack surfaces are uniquely agentic:

MCP Tool Poisoning

The Model Context Protocol (MCP) is rapidly becoming the standard for connecting LLMs to external tools — and the dominant new attack surface. MCP-specific attack classes identified by Palo Alto Unit 42 and Checkmarx:

Attack How it works
Tool poisoning Malicious instructions embedded in description fields that agents trust implicitly. The agent reads the description to decide what the tool does — and executes the hidden instruction.
Tool shadowing Registering a malicious tool with a name similar to a legitimate one, intercepting calls meant for the real tool.
Covert invocation Hidden file system operations without user awareness — an MCP server that silently copies files on every invocation.
Cross-MCP contamination One compromised MCP server overriding another’s behavior or injecting instructions that persist across tool calls.

The defense: scan MCP servers before connecting (SkillSpector covers MCP-specific patterns LP1-LP4 and TP1-TP4), pin allowed tool IDs per server, and never connect MCP servers from untrusted sources without review.

This isn’t theoretical. In September 2025, the first malicious MCP server was discovered on npm — a supply chain attack targeting agent ecosystems, validating OWASP ASI04 (Agentic Supply Chain). The OWASP project publishes a practical guide for securely using third-party MCP servers.

AI IDE & Coding Assistant Attacks

AI coding assistants (Claude Code, GitHub Copilot, Cursor) have system-level file access and are a high-value target. Notable CVEs:

  • CVE-2025-53773 — GitHub Copilot RCE (CVSS 9.6) via prompt injection
  • CVE-2025-54135 — Cursor indirect prompt injection via MCP config → RCE
  • IDEsaster — 30+ CVEs discovered across AI IDEs in late 2025
  • Rules file backdoors.cursor/rules and similar config files can be poisoned with instructions the AI executes with full trust

The defense: treat AI IDE config files as untrusted input, scan skill/rules files before installation, and run coding agents in containers with limited filesystem access.

Real-world example: CVE-2025-32711 (EchoLeak) demonstrated zero-click prompt injection against Microsoft 365 Copilot — the AI assistant was forced to exfiltrate sensitive business data to an external URL without any user interaction, using character-substitution attacks that bypassed safety filters.

Testing your defenses systematically against all of these attack surfaces requires automation.

Automated Adversarial Testing

Two major open-source frameworks automate this:

Tool Focus Approach
PyRIT (Microsoft) Full red-teaming lifecycle — generates adversarial prompts, executes attacks, evaluates ASR, generates scorecards. Python library with 24+ attack strategies. Framework: orchestrates attack → response → evaluation loop
Garak (NVIDIA) Vulnerability scanning — probes for known LLM failure modes across prompt injection, jailbreaking, and content safety categories. Scanner: runs probes against endpoints, reports pass/fail

The Foundry AI Red Teaming Agent is Microsoft’s managed cloud offering built on PyRIT — automated scans, ASR scoring, reporting, and continuous monitoring in Foundry. For agent-specific risks, it runs in a sandboxed cloud environment with mock tools and synthetic data, preventing real-world side effects during testing.

Purple Environment

Red-teaming can have side effects — an agent that deletes files during a real red-team exercise has deleted real files. Run red-teaming in a purple environment: a non-production environment configured with production-like resources (same tools, same data schemas, same models) but with synthetic data and isolated infrastructure. The Foundry AI Red Teaming Agent enforces this for agent-specific risk categories — runs are transient, mock tools serve synthetic data, and chat completions aren’t persisted.

Manual Red-Teaming

Automated tools miss novel attacks. Schedule regular sessions where humans (or creative LLMs prompted to be adversarial) try to break your agent. Bug bounty programs work for AI safety too — reward findings. Microsoft’s AI Red Team, after testing 100+ generative AI products, reports that “mitigations do not eliminate risk entirely” — continuous red-teaming is essential because model-layer defenses are probabilistic by construction.

Track Safety Metrics

Metric What it measures Target
Attack Success Rate (ASR) % of adversarial probes that bypass defenses Trending ↓ over time; spike = investigate
Guardrail trigger rate (by category) Which attacks hit your system, how often Stable or declining
Human override rate How often humans reject agent decisions Set by risk tolerance of use case
Time-to-detect for new attack patterns How fast can you identify and patch new attacks? Hours, not weeks

ASR is the north star. Everything else feeds into it.

The companion article Safety for AI Agents covers the guardrail architectures, threat models, and defense-in-depth patterns that red-teaming validates. Together they form a continuous loop: red-teaming finds gaps → guardrails close them → red-teaming verifies the fix.

References