Whitepaper 03

Purple Firefish

A governed AI security gateway for high-consequence LLM systems.


Executive Summary

LLM systems are moving from passive assistants to active operators. They summarize documents, retrieve private knowledge, inspect code, reason over vulnerability data, call tools, write reports, route tickets, and increasingly act on behalf of users.

That is useful. It is also dangerous. The deeper risk is that an LLM may treat untrusted text as an instruction, act outside the user's intent, expose sensitive information, or allow an attacker to influence a workflow through prompts, documents, webpages, tool outputs, or agent memory.

Core idea: Purple Firefish sits between untrusted input and trusted action.

The Problem

Most AI tools are still built like chat interfaces. They answer, summarize, generate, and explain. That is helpful, but it is not enough for serious environments. In a real LLM application, the better security question is: what can this input cause the system to do?

A malicious instruction inside a PDF is just text until a RAG pipeline retrieves it. A webpage is just content until an agent treats it as a command. A tool output is just data until the model uses it to justify an action. A generated response is just language until it leaks a credential, reveals internal instructions, or sends private information somewhere it should not go.

What Purple Firefish Does

Purple Firefish can inspect user prompts, RAG chunks, uploaded files, webpages, emails, tool outputs, agent memory, proposed tool calls, and final model responses. It applies layered analysis to determine whether content is benign, suspicious, malicious, sensitive, or unsafe to act on.

Example decision output
Decision: Require approval
Risk score: 0.82
Attack type: Indirect prompt injection
Reason codes:
  - External content contains instruction
  - Attempts tool redirection
  - Requests sensitive data transmission
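
In code, a decision like the one above could travel as a small structured record. The Python sketch below is illustrative only; the field names and enum values are assumptions, not a published Purple Firefish schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GatewayDecision:
    """Illustrative decision record; names and values are assumptions, not a published schema."""
    action: str                     # allow | flag | redact | sandbox | require_approval | block
    risk_score: float               # 0.0 (benign) to 1.0 (confidently malicious)
    attack_type: str
    reason_codes: List[str] = field(default_factory=list)
    source: str = "user"            # provenance: user, document, web, email, tool_output, memory

decision = GatewayDecision(
    action="require_approval",
    risk_score=0.82,
    attack_type="indirect_prompt_injection",
    reason_codes=[
        "external_content_contains_instruction",
        "attempts_tool_redirection",
        "requests_sensitive_data_transmission",
    ],
    source="document",
)
```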

Architecture

Purple Firefish does not only scan the first user message. It scans every untrusted ingress point. Source matters. The same phrase has different risk depending on whether it came from a user, a PDF, a webpage, an email, a tool response, or agent memory.

Stage | Purpose
Canonicalization and provenance | Normalize obfuscation and preserve source: user, document, web, email, tool output, or memory.
Layer 1: rules | Fast deterministic checks for prompt injection, jailbreaks, system prompt extraction, exfiltration, and encoding tricks.
Layer 2: semantic similarity | Embedding-based comparison against known attack patterns to catch paraphrases and novel wording.
Layer 3: LLM-as-judge | Structured classification for ambiguous or high-risk content, with confidence, attack family, and reason codes.
Policy engine | Maps risk into allow, flag, redact, sandbox, require approval, or block.
Tool-call validator | Checks proposed actions against user intent, data sensitivity, destination trust, reversibility, and required approval.
Output scanner | Scans final responses for prompt leakage, secrets, sensitive data, unsafe tool-derived content, and policy violations.
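
To make the pipeline concrete, here is a minimal Python sketch of how the canonicalization stage, the first two layers, and the policy engine might compose. Everything in it is an assumption for illustration: the rule patterns, the embed() function, the similarity threshold, and the risk-to-action cutoffs are placeholders, and the Layer 3 LLM-as-judge escalation is omitted.

```python
import re
import unicodedata
from typing import Callable, List, Tuple

# Layer 1: a handful of deterministic rules. A real rule set would be far larger.
RULES: List[Tuple[str, str]] = [
    (r"ignore (all|previous|prior) instructions", "override_attempt"),
    (r"reveal .{0,40}system prompt", "system_prompt_extraction"),
    (r"(send|post|upload) .{0,80}https?://", "exfiltration_attempt"),
]

def canonicalize(text: str) -> str:
    """Normalize common obfuscation: Unicode compatibility forms, zero-width characters, case."""
    text = unicodedata.normalize("NFKC", text)
    for zw in ("\u200b", "\u200c", "\u200d"):
        text = text.replace(zw, "")
    return text.lower()

def layer1_rules(text: str) -> List[str]:
    """Fast deterministic checks; returns matched reason codes."""
    return [code for pattern, code in RULES if re.search(pattern, text)]

def layer2_similarity(text: str,
                      embed: Callable[[str], List[float]],
                      known_attacks: List[Tuple[List[float], str]],
                      threshold: float = 0.85) -> List[str]:
    """Embedding comparison against known attack phrasings; embed() is an assumed function."""
    v = embed(text)

    def cosine(a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    return [code for vec, code in known_attacks if cosine(v, vec) >= threshold]

def decide(risk_score: float) -> str:
    """Policy engine: map risk onto the graduated responses named above.
    ('redact' would apply when data sensitivity, not risk level, drives the decision.)"""
    if risk_score >= 0.90:
        return "block"
    if risk_score >= 0.70:
        return "require_approval"
    if risk_score >= 0.50:
        return "sandbox"
    if risk_score >= 0.30:
        return "flag"
    return "allow"
```

In a real deployment, provenance from the canonicalization stage would also weight the score: as noted above, the same phrase deserves a higher risk when it arrives from a webpage or tool output than when the user typed it.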

The Real Breakthrough

Prompt injection becomes much more serious when an AI system has tools. A chatbot can be wrong. An agent can act. Purple Firefish validates proposed tool calls against the user's original intent and the current trust context, asking questions like the following (a sketch of such a check follows the list):

  1. Did the user actually request this action?
  2. Is this tool necessary for the task?
  3. Is the destination trusted?
  4. Is private data being transmitted?
  5. Is the action reversible?
  6. Is the instruction coming from the user or from untrusted content?
  7. Should the operator approve this first?
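
A minimal sketch of how that checklist could run in code, assuming the gateway receives a structured description of each proposed call. All class, field, and function names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ProposedToolCall:
    tool: str
    destination: str          # URL, address, or system the action would touch
    sends_private_data: bool
    reversible: bool
    instruction_source: str   # "user" or "untrusted_content"

def validate_tool_call(call: ProposedToolCall,
                       tools_implied_by_user_request: set,
                       trusted_destinations: set) -> str:
    """Walk the checklist above and return a policy action (names are illustrative)."""
    # Q6: instructions that originate in untrusted content never auto-execute.
    if call.instruction_source != "user":
        return "block"
    # Q1-Q2: the user's request must actually call for this tool.
    if call.tool not in tools_implied_by_user_request:
        return "require_approval"
    # Q3-Q4: private data bound for an untrusted destination is the exfiltration pattern.
    if call.sends_private_data and call.destination not in trusted_destinations:
        return "block"
    # Q5, Q7: irreversible actions go to a human even when everything else passes.
    if not call.reversible:
        return "require_approval"
    return "allow"
```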

Why This Matters

Security teams already deal with too much signal: alerts, tickets, vulnerability feeds, asset inventories, incident notes, threat intelligence, compliance requirements, and leadership questions. Purple Firefish is built to reduce that burden.

For builders, it provides a drop-in gateway that can be placed in front of an LLM app. For security teams, it provides visibility into adversarial input, policy decisions, and unsafe AI actions. For leaders, it creates a more governable AI system: one where risky actions are logged, explainable, reviewable, and bounded.
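
As an integration shape, the drop-in placement might look like the sketch below. The gateway object and its scan() and scan_output() methods are assumptions for illustration, not a documented SDK.

```python
def guarded_completion(gateway, llm, user_prompt: str, retrieved_chunks: list) -> str:
    """Hypothetical wrapper: every ingress is scanned with its provenance,
    and the final response is scanned before release."""
    if gateway.scan(user_prompt, source="user").action == "block":
        return "Request blocked by policy."
    # Drop retrieved chunks the gateway classifies as unsafe to act on.
    safe_chunks = [c for c in retrieved_chunks
                   if gateway.scan(c, source="document").action != "block"]
    response = llm.complete(prompt=user_prompt, context=safe_chunks)
    # Output scanner: catch prompt leakage, secrets, and policy violations on the way out.
    verdict = gateway.scan_output(response)
    if verdict.action == "block":
        return "Response withheld by policy."
    return verdict.redacted_text if verdict.action == "redact" else response
```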

What Makes It Different

Operating idea | How Purple Firefish applies it
Treat untrusted text as possible instruction | Prompts, documents, webpages, emails, and tool outputs all become part of model context, so each is scanned as a possible instruction.
Make uncertainty visible | Expose confidence, reason codes, triggered layers, source type, and recommended action.
Connect detection to action | Map risk to allow, flag, redact, sandbox, approval, or block.
Govern agents, not just prompts | Validate actions before execution and scan outputs before release.
Preserve the operator | Keep humans in control of high-consequence decisions while reducing noise and manual process.

Conclusion

Purple Firefish is the AI security layer that becomes necessary when LLM systems stop answering questions and start participating in workflows. It detects prompt injection, jailbreaks, data exfiltration attempts, indirect prompt injection, adversarial obfuscation, risky tool calls, and unsafe outputs.

Its deeper purpose is not detection alone. Its purpose is governed action: a system that helps secure the path between untrusted input and trusted action.