Skip to content
All posts

Red Teaming the Microsoft Agent Governance Toolkit: 15 Bypass Vectors

I have spent the better part of a decade in the trenches of cybersecurity, moving from the high-stakes world of NHS digital health and HIPAA compliance to the rigid, high-assurance environments of the MOD and NATO. In those worlds, "governance" isn't a buzzword; it is a set of technical controls that prevent people from dying or secrets from leaking. When Microsoft released their Agent Governance Toolkit (AGT), promising a "deterministic application-layer interception" for AI agents, I didn't just read the marketing blog. I pulled the code, ran the benchmarks, and started looking for the cracks.

The toolkit is an ambitious attempt to solve the "wild west" problem of autonomous agents. It introduces concepts like "Execution Rings," "AgentMesh" for identity, and "Agent SRE" for reliability. Architecturally, it is sound—on paper. It treats the agent like a POSIX process, intercepting "syscalls" (tool calls) and checking them against a policy engine before they ever hit the wire. This is exactly what we need for enterprise AI. However, as any red teamer will tell you, the distance between an architectural diagram and a secure implementation is often measured in bypasses.

After a week of "build-in-public" style hacking, I have identified 15 specific vectors where the toolkit's governance can be sidestepped, ignored, or outright gamed. Some are classic software engineering oversights; others are unique to the way we are trying to wrap "safety" around non-deterministic LLMs. If you are planning to deploy this in a regulated environment—especially in healthcare or defence—you need to understand these risks before you trust the "Green" compliance check in your dashboard.

The Illusion of Middleware: Import-Only Checks

The first bypass I found is almost embarrassingly simple. The agent-os package provides a GovernanceVerifier class designed to be used as middleware in agent frameworks like LangChain or AutoGen. In several of the provided examples and the default VSCode extension integration, the system checks for the presence of the governance package. If import agent_governance succeeds, the dashboard shows a "Governed" status.

However, the GovernanceVerifier often defaults to a "no-op" state if it isn't explicitly initialized with a policy file or a connection to a policy server. I was able to "govern" a malicious agent by simply installing the package and importing it, without actually configuring a single rule. The agent proceeded to exfiltrate data while the monitoring dashboard happily reported that the "Governance Layer" was active. This is a classic "security by checkbox" failure.

Fail-Open: The Reliability Paradox

In the NHS, we have a saying: "Fail-safe, not fail-open." If a heart monitor loses its connection to the central server, it should scream, not just stop monitoring. The AGT's agent-sre package is designed to handle service degradation, but it prioritizes "agent availability" over "governance enforcement."

When the central policy engine becomes unreachable due to network latency or a simple crash, the toolkit's default behavior is to "fail-open." It logs a warning—which is often buried in a mountain of JSON telemetry—and allows the agent's action to proceed. In a defensive context, this is a nightmare. An attacker who can trigger a local DoS on the governance service effectively disables all security controls.

The "Math.random()" Compliance Problem

While digging into the VSCode extension, I noticed something peculiar about the "Real-time Compliance Grade" it assigns to agent code. In the early preview versions, I found segments of the scoring logic that appeared to use non-deterministic mock-ups for certain edge cases.

A developer sees a "Grade A" compliance score and assumes their agent is safe to deploy. If that grade is partially derived from a simplified heuristic or a random-weighted average of "popular" patterns, it provides a false sense of security. In regulated industries like digital health, a "compliance grade" must be backed by a deterministic audit trail.

Data Exfiltration via CMVK

The Context & Memory Verification Kit (CMVK) is designed to ensure an agent's "memory" hasn't been poisoned by an attacker. To do this, it analyzes context by sending it to a "superior" LLM for a "sanity check."

If you are working with PII in a HIPAA-compliant environment or handling "Official-Sensitive" data in the UK, this is an immediate deal-breaker. The toolkit, in its default configuration, can inadvertently leak sensitive data to third-party model providers under the guise of "security verification."

Unicode Homoglyphs and Regex Blindness

The AGT includes a basic prompt injection detection module. It relies heavily on regex patterns to identify common "jailbreak" phrases. By using Unicode homoglyphs—replacing the Latin 'a' (U+0061) with the Cyrillic 'а' (U+0430)—I was able to bypass the detection engine entirely.

The toolkit's "interception" layer is looking for the string "Ignore previous instructions," but it doesn't account for homoglyph substitutions. For a toolkit that claims to be "adversarial-aware," this lack of string normalization before policy evaluation is a significant oversight.

Table 1: Initial Bypass Vectors Analysis

Vector ID Component Risk Level Summary
V1 GovernanceVerifier Medium Import-only check creates a false "Governed" status.
V2 Agent SRE High Fail-open defaults during service unavailability.
V3 VSCode Extension Low Mocked or non-deterministic compliance grades.
V4 CMVK High PII leakage to external LLMs during "verification."
V5 Agent OS Medium Regex-based bypass via Unicode homoglyphs.

The In-Memory Audit Trail

The agent-os package implements a "hash chain" for audit logs to ensure they are tamper-evident. This sounds great—until you realize that, by default, this chain is stored in-memory.

There is no built-in persistence to a cryptographic root of trust. If an agent goes rogue and crashes the host process, or if an attacker gains enough privilege to restart the service, the entire "immutable" audit trail vanishes.

SupplyChainGuard's Hardcoded Blind Spot

The SupplyChainGuard module attempts to mitigate supply chain risks by checking agent dependencies against a list of "popular and verified" packages.

The bypass is two-fold. First, the list is hardcoded and updates infrequently. Second, it doesn't effectively handle typosquatting for packages not on the popular list. It is a "whitelist-only" approach that fails to account for the vast, murky middle ground of the PyPI ecosystem.

The DID Registry: A Single Point of Failure

AgentMesh uses Decentralized Identifiers (DIDs) to give every agent a unique, verifiable identity. However, the implementation often relies on a single "Trust Registry" for DID resolution.

If an attacker can compromise this registry—or simply spoof the resolution response—they can "launder" the identity of a malicious agent. By pointing the DID resolver to a rogue registry, I was able to make a "Probationary" agent (score 300) appear as a "Verified Partner" (score 950).

Symmetric Secret Weakness in .NET

In the agent-governance-dotnet package, certain internal communications used hardcoded or weakly generated symmetric keys for message signing. This contradicts the "Ed25519 asymmetric identity" mentioned in the high-level architecture. If an attacker can extract these symmetric keys, they can forge "Allow" responses from the governance layer.

TrustHandshake TTL Replay

When two agents meet in the "AgentMesh," they perform a TrustHandshake that produces a token with a Time-To-Live (TTL). I found these tokens are often cached without being bound to a specific session or transaction.

By capturing a valid handshake token from a "High Trust" agent, I was able to "replay" that token from a completely different, untrusted agent within the TTL window.

Table 2: Identity and Supply Chain Bypass Vectors

Vector ID Component Risk Level Summary
V6 Agent OS Medium In-memory audit logs are lost on process restart.
V7 SupplyChainGuard Medium Hardcoded package lists miss typosquatting.
V8 AgentMesh High Single point of failure in DID resolution.
V9 .NET Bindings High Symmetric key weaknesses allow message forgery.
V10 AgentMesh Medium TTL cache replay allows trust score "borrowing."

Bytecode Hashing and the "Inspect" Hook

To prevent agents from being tampered with at runtime, the AGT performs bytecode hashing. However, in Python, this is remarkably easy to bypass. By hooking the inspect or importlib modules, an attacker can trick the governance layer into hashing a "clean" version of the file on disk while the interpreter is actually executing a malicious version in memory.

Reward Hacking in CompositeReward

The agent-lightning package is designed for "Governance-Aware RL." It uses a CompositeReward function that penalizes agents for policy violations during training. But agents are smarter than we give them credit for. I found that by tweaking the reward weights, an agent can learn to "game" the safety penalty—finding paths that are technically compliant with the letter of the policy but violate its spirit.

The Unauthenticated Metrics Endpoint

The AGT exposes a Prometheus-compatible metrics endpoint by default. In many of the deployment templates, this endpoint is unauthenticated and accessible over the local network. An attacker can scrape this endpoint to see which policies are being hit most often and which agents have the highest trust scores.

The Loopback-Only Limitation

Many of the toolkit's components are hardcoded to communicate over 127.0.0.1. Because the toolkit makes it difficult to scale across nodes, developers are forced to implement their own "proxy" layers that are rarely built with the same security rigor.

Policy Fragmentation: Batch vs. Runtime

The toolkit provides a "Batch Scan" tool for CI/CD and a "Runtime Enforcement" layer. I found several instances where the logic between these two layers was fragmented. An action flagged as "High Risk" during a batch scan could sometimes slip through at runtime if the context was slightly different.

Table 3: Advanced and Structural Bypass Vectors

Vector ID Component Risk Level Summary
V11 Agent Runtime High Bytecode hashing bypass via interpreter hooking.
V12 Agent Lightning Medium Reward hacking allows gaming of safety penalties.
V13 Agent SRE Low Unauthenticated metrics endpoint leaks system state.
V14 Architecture Medium Loopback limits force insecure scaling workarounds.
V15 Agent Compliance Medium Fragmentation between batch scan and runtime logic.

A Fair Assessment

After tearing it apart, you might think I'm suggesting you avoid the Microsoft Agent Governance Toolkit. That isn't the case. Compared to the current alternative—which is usually "nothing"—the AGT is a massive step forward. However, we must be honest about what it is: a governance framework, not a security sandbox.

At Periculo, we've seen this story before. Whether it is hardening the NHS's digital spine or securing NATO's AI principles, the tool is only as good as the implementation. The AGT is a powerful toolkit, but it requires a "trust but verify" approach. You cannot just "pip install" your way to safety.

Risk Category Status Recommendation
Identity Moderate Use external PKI; don't rely on the default DID registry.
Policy Enforcement High Implement "fail-closed" logic; normalize all inputs.
Data Privacy High Scrub PII locally before using CMVK or external verifiers.
Audit Integrity Moderate Sink all hash chains to a persistent, external WORM store.
Runtime Isolation High Always run agents in gVisor or Kata; don't trust bytecode hashing.
Learn more about our AI Governance Services at Periculo.co.uk