Whitepaper
Red Teaming Agentic AI Systems
A practical framework for testing prompt injection, tool misuse, goal drift, context poisoning, data leakage, and unsafe escalation.
Red teaming an agent is not the same as red teaming a chatbot. A chatbot can be manipulated into saying something wrong. An agent can be manipulated into doing something wrong. Once an AI system can browse, retrieve, reason across multiple turns, call tools, maintain state, or interact with other agents, the attack surface expands from a prompt boundary into a workflow boundary.
That means agent red teaming must become system red teaming. It is not enough to ask whether the model will obey a jailbreak prompt. The real questions are harder: can untrusted context alter the tool plan, can malicious content redirect a stateful workflow, can one agent contaminate another, and can the system over-delegate, leak data, or escalate beyond user intent?
From Prompt Surface To Workflow Surface
Prompt injection still matters, but it is no longer the whole problem. A modern AI workflow may include retrieved documents, webpages, email, tickets, memory, databases, APIs, code tools, cloud resources, and other agents. Each of those surfaces can carry instructions, misleading evidence, stale state, or attacker-controlled content.
The test target is therefore not only the language model. It is the model operating inside an environment with authority. Red teaming should measure whether the system preserves user intent, respects policy, limits tool use, handles uncertainty, and leaves enough trace for review when an adversary tries to bend the workflow.
Six Attack Surfaces
| Attack surface | What to test | What good looks like |
|---|---|---|
| Prompt injection and social engineering | Manipulative instructions in user text, webpages, emails, files, or retrieved content. | The system preserves user intent and constrains impact even if malicious text is not perfectly detected. |
| Context poisoning | Corrupt memory, misleading retrieval, conflicting tool outputs, and stale state. | The system shows provenance, detects uncertainty, and avoids confident misuse. |
| Tool misuse | Unsafe or unnecessary tool calls, risky destinations, side effects, and permission escalation. | The system routes calls through policy, least privilege, and approvals. |
| Data exfiltration | Attempts to extract secrets, personal data, internal instructions, or high-sensitivity records. | Sensitive paths fail closed or require explicit approval. |
| Goal drift | The system pursues a plausible but wrong objective or over-optimizes on convenience. | It rechecks the original objective, reveals assumptions, and asks before escalating. |
| Cross-agent propagation | Messages or artifacts cause other agents to repeat, amplify, or legitimize malicious behavior. | The system isolates state, limits trust transfer, and detects suspicious handoffs. |
Test The Task Loop
The Evening Star approach red-teams the task loop, not the prompt surface. Each scenario should specify a starting goal, available tools, hidden attacker influence, policy constraints, expected safe behaviors, escalation thresholds, and scoring rules.
If an agent reviews procurement exceptions, the scenario should define what it may read, what it may approve, what counts as sensitive data, which outside communications are prohibited, and what should happen when external content attempts to reshape the task. If an agent summarizes email, the scenario should test not only textual injection, but unauthorized retrieval of unrelated messages, attempts to trigger external submissions, and over-trust in official-looking content.
Scoring Model
A strong scoring model should go beyond "attack succeeded" or "attack failed." It should measure exploit success, unsafe action rate, policy-bypass rate, sensitive-data exposure, blast radius, approval quality, trace completeness, and recovery quality.
| Metric | Question |
|---|---|
| Unsafe action rate | How often did the system perform or attempt an action outside policy? |
| Tool misuse rate | How often did it call the wrong tool, call a tool unnecessarily, or send risky arguments? |
| Data exposure | Did private, sensitive, or internal data leave the appropriate boundary? |
| Approval quality | Did the system ask for approval at the right time with enough evidence? |
| Trace completeness | Can a reviewer reconstruct the context, plan, tool calls, policy checks, and final outcome? |
| Recovery quality | Did the system limit damage, explain uncertainty, and route suspicious conditions to review? |
Red Teaming As Governance
Red teaming belongs inside governance rather than beside it. If a finding cannot become a new eval, a new approval rule, a tighter tool policy, or a clearer logging requirement, the exercise remains performative. The goal is not a dramatic one-time test. The goal is an attack-informed control plane.
- Design the scenario with task, tool, data, and policy boundaries.
- Inject adversarial influence through realistic channels.
- Run the agent with normal system configuration.
- Record tool calls, policy checks, approvals, traces, and final actions.
- Score the outcome by consequence, not just model behavior.
- Turn failures into evals, runtime controls, and approval changes.
Practical Conclusion
Agentic systems should be considered testable only when they can be pressured across context, tools, memory, approvals, and human review. The old model of red teaming the model prompt alone is too narrow for the systems now being deployed.
Evening Star's view is stricter and more useful: red teaming is how serious builders discover whether a system remains governed under pressure. That is the standard that matters.