Whitepaper
Practical Evals for Agentic Systems
Evaluation discipline for AI systems that retrieve, reason, call tools, and act over time.
Abstract
Agentic systems cannot be evaluated solely by whether the final answer sounds correct. Useful evaluations measure task completion, side effects, tool selection, policy violations, recovery behavior, approval behavior, cost, latency, and whether a human can reconstruct what happened.
Evaluation Stack
This paper defines an eval stack for consequential AI workflows: component evals, scenario evals, adversarial evals, and production evals. Each layer answers a different question about whether the system is useful, safe, observable, and governable.
- Component evals for retrieval, classification, tool selection, summarization, and policy judgment.
- Scenario evals for realistic multi-step work under constraints.
- Adversarial evals for injection, malformed state, deceptive content, and broken tools.
- Production evals for drift, near misses, approval patterns, and trace quality.
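The four layers above can be sketched as a minimal harness that groups cases by layer and reports pass/fail per layer. This is an illustrative assumption of how such a stack might be wired, not an implementation from the paper; all case names and checks are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class EvalLayer(Enum):
    COMPONENT = "component"      # retrieval, classification, tool selection, summarization
    SCENARIO = "scenario"        # realistic multi-step work under constraints
    ADVERSARIAL = "adversarial"  # injection, malformed state, deceptive content, broken tools
    PRODUCTION = "production"    # drift, near misses, approval patterns, trace quality


@dataclass
class EvalCase:
    name: str
    layer: EvalLayer
    check: Callable[[], bool]  # returns True when the system passes this case


@dataclass
class EvalReport:
    passed: list = field(default_factory=list)
    failed: list = field(default_factory=list)


def run_stack(cases: list[EvalCase]) -> dict[EvalLayer, EvalReport]:
    """Run every case, grouped by layer, so each layer's question
    (useful? safe? observable? governable?) gets its own report."""
    reports = {layer: EvalReport() for layer in EvalLayer}
    for case in cases:
        bucket = reports[case.layer]
        (bucket.passed if case.check() else bucket.failed).append(case.name)
    return reports


# Hypothetical cases, one per layer (the lambdas stand in for real checks):
cases = [
    EvalCase("retrieval_recall", EvalLayer.COMPONENT, lambda: True),
    EvalCase("refund_workflow_completes", EvalLayer.SCENARIO, lambda: True),
    EvalCase("resists_prompt_injection", EvalLayer.ADVERSARIAL, lambda: False),
    EvalCase("trace_is_reconstructable", EvalLayer.PRODUCTION, lambda: True),
]
reports = run_stack(cases)
```

Keeping the layers separate, rather than pooling all cases into one score, preserves the point that each layer answers a different question; a failed adversarial case should not be averaged away by strong component results.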