Whitepaper

Practical Evals for Agentic Systems

Evaluation discipline for AI systems that retrieve, reason, call tools, and act over time.

Abstract

Agentic systems cannot be evaluated only by whether their final answer sounds correct. Useful evaluations need to measure task completion, side effects, tool selection, policy violations, recovery, approval behavior, cost, latency, and whether a human can reconstruct what happened.
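The measurement dimensions above can be sketched as a single per-episode record. This is a minimal illustration, not a prescribed schema; all field names and the pass gate are assumptions made for this sketch.

```python
from dataclasses import dataclass

@dataclass
class AgentEvalResult:
    """Hypothetical record for one evaluated episode; field names are illustrative."""
    task_completed: bool          # did the agent finish the task it was given
    side_effects: list[str]       # unintended state changes (writes, sends, deletes)
    tools_called: list[str]       # tools invoked, in order
    policy_violations: list[str]  # rules the episode broke, if any
    recovered_from_errors: bool   # did the agent recover after a failed step
    approvals_requested: int      # how often it escalated to a human
    cost_usd: float               # total spend for the episode
    latency_s: float              # wall-clock time for the episode
    trace_reconstructable: bool   # can a reviewer replay what happened from logs

    def passed(self, max_cost_usd: float = 1.0) -> bool:
        """One example gate: complete, clean, auditable, and within budget."""
        return (
            self.task_completed
            and not self.side_effects
            and not self.policy_violations
            and self.trace_reconstructable
            and self.cost_usd <= max_cost_usd
        )
```

Keeping side effects and policy violations as lists, rather than booleans, preserves what happened so a reviewer can reconstruct the episode rather than just score it.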

Evaluation Stack

This paper defines an eval stack for consequential AI workflows: component evals, scenario evals, adversarial evals, and production evals. Each layer answers a different question about whether the system is useful, safe, observable, and governable.

  1. Component evals for retrieval, classification, tool selection, summarization, and policy judgment.
  2. Scenario evals for realistic multi-step work under constraints.
  3. Adversarial evals for injection, malformed state, deceptive content, and broken tools.
  4. Production evals for drift, near misses, approval patterns, and trace quality.
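The four layers above can be organized as a tagged registry, so each check declares which layer's question it answers and a harness can run one layer at a time. This is a sketch under assumed names (`EvalLayer`, `register`, `run_layer`); the example check is a placeholder, not a real eval.

```python
from enum import Enum
from typing import Callable

class EvalLayer(Enum):
    COMPONENT = "component"      # retrieval, classification, tool selection, summarization, policy judgment
    SCENARIO = "scenario"        # realistic multi-step work under constraints
    ADVERSARIAL = "adversarial"  # injection, malformed state, deceptive content, broken tools
    PRODUCTION = "production"    # drift, near misses, approval patterns, trace quality

# Hypothetical registry: each eval is a named check tagged with its layer.
REGISTRY: dict[str, tuple[EvalLayer, Callable[[], bool]]] = {}

def register(name: str, layer: EvalLayer):
    """Decorator that files a zero-argument check under its layer."""
    def wrap(fn: Callable[[], bool]) -> Callable[[], bool]:
        REGISTRY[name] = (layer, fn)
        return fn
    return wrap

@register("tool_selection_accuracy", EvalLayer.COMPONENT)
def tool_selection_accuracy() -> bool:
    # Placeholder: compare chosen tools against a labeled fixture set.
    return True

def run_layer(layer: EvalLayer) -> dict[str, bool]:
    """Run every registered eval in one layer; returns name -> pass/fail."""
    return {name: fn() for name, (tag, fn) in REGISTRY.items() if tag is layer}
```

Tagging evals by layer keeps the layers' questions separate: a green component layer with a red adversarial layer tells you the parts work but the system is not yet safe to deploy.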