Agent Evaluation: How to Test AI Agents That Act, Not Just Answer

Q: What's the difference between pass@k and pass^k?

`pass@k` is success on *at least one* of k attempts (capability); `pass^k` is success on *all* k attempts (consistency). Agents often look strong on `pass@k` and weak on `pass^k` — e.g. ~26% `pass^4` on τ²-bench telecom — and consistency is what determines deployability.

Q: My agent scores poorly — is the model bad?

Not necessarily. You're evaluating the model *and* the scaffold (prompts, tools, context management, structure). Hold a strong scaffold fixed across models, read the transcripts, and use selection/invocation/trajectory metrics to localize the failure before blaming the model.

Q: Q: What's the difference between pass@k and pass^k?

`pass@k` is success on *at least one* of k attempts (capability); `pass^k` is success on *all* k attempts (consistency). Agents often look strong on `pass@k` and weak on `pass^k` — e.g. ~26% `pass^4` on τ²-bench telecom — and consistency is what determines deployability.

Evaluating an AI agent is not the same problem as evaluating a chatbot, and treating them the same is how teams ship agents that demo beautifully and fail in production. A chatbot produces one answer you can grade against a reference. An agent observes, reasons, calls tools, and acts over many turns until it reaches a terminal state — so "did it get the right answer?" is replaced by "did it reach the right final state of the world, and did it get there through a sound process?" As agents move into high-stakes domains like coding and medicine, building this evaluation capability is now table stakes.

This guide is the practical companion to LLM evaluation infrastructure (how to evaluate a single model honestly) and benchmark hacking (how outcome-only scoring breaks once agents can game it). It draws on Cameron R. Wolfe's deep-dive on agent evals and the benchmark families that define the 2026 state of the art. If you want the zoomed-out view of why agent tasks are the hard frontier, see measuring AI progress — agents live in the slow-verification regime.

Key takeaways
Why agent eval is different
The anatomy of an agent eval
Outcome vs. process grading
The three grading methods
Metrics: pass@k, pass^k, and the consistency gap
Tool-use and trajectory metrics
The benchmark landscape: τ-bench and Terminal-Bench
The scaffold-decoupling problem
A 7-step roadmap for building agent evals
Pitfalls and recommendations
FAQ
References

Key takeaways

Agents are evaluated on states and trajectories, not answers. You grade the final state of the environment (outcome) and the sequence of reasoning and tool calls that got there (process).
Consistency is the real test. pass@k (success on at least one of k tries) flatters agents; pass^k (success on all k tries) exposes brittleness. One model scored only ~26% pass^4 on telecom tasks despite looking strong single-shot.
Use three grader types together — code-based (deterministic, reproducible, blunt), model-based (LLM-as-judge: flexible, non-deterministic, biased), and human (the north star, but expensive). Calibrate the LLM judge against humans.
You're evaluating the model and the scaffold together. A bad score can mean a weak model, a weak scaffold (prompts, tools, context management), or both. Don't attribute it to the model by default.
The frontier benchmarks are dynamic and multi-turn. τ-bench (and τ²/τ³) simulate users and shared environments with policy docs; Terminal-Bench tests real terminal tasks. Both are saturating, forcing continuous evolution.
Build small and iterate. Start with 10–20 hand-curated tasks, simple code graders first, and treat the suite as a living artifact — adding new failure cases as you find them.
Single-agent first. Single agents are easier to evaluate and maintain; add multi-agent structure only when one agent's instructions bloat or its tools overlap.

Quick comparison: grading approaches

Grader	What it's good at	Determinism	Cost	Watch out for
Code-based	Objective checks: final state, string/test match	High	Low	No nuance; can't judge subjective quality
Model-based (LLM-as-judge)	Subjective quality, long-form, trajectories	Low	Medium	Non-deterministic; known biases; needs calibration
Human	Hard-to-specify quality, final sign-off	Medium	Very high	Inter-rater agreement; slow; effort to stay calibrated

Why agent eval is different

A standard LLM eval is largely a function: prompt in, completion out, grade against a reference. Agent eval breaks every part of that:

Autonomy over time. The agent runs an agentic loop — observe → reason → act → repeat — until it decides it's done or hits a limit. There's no single output; there's a whole trajectory.
Environment interaction. The agent changes state through tool calls (filesystem, APIs, a database, a browser). Success is often "the environment is now in the right state," not "the text is correct."
Multi-turn, dynamic inputs. Realistic tasks involve a back-and-forth with a (often LLM-simulated) user whose responses depend on what the agent does. You can't pre-script the whole interaction.
Cost and latency are part of the score. Two agents that both succeed are not equal if one burns 5× the tokens or takes 10× as long. Token-usage limits and execution-time constraints belong in the eval.

The upshot: agent eval needs realistic, multi-turn, environment-grounded test cases and graders that can look at both the destination and the path.

The anatomy of an agent eval

It helps to name the moving parts. An agent eval is built from:

Tasks — test cases with defined initial conditions and success criteria.
Trials — individual attempts at a task (you run several; see pass^k).
Transcripts — the full record of a trial: reasoning steps, tool calls, intermediate outputs, messages.
Outcomes — the final state of the environment after a trial.
Graders — the checks that turn a transcript/outcome into a score.

And the agent under test is itself three things — the underlying LLM/reasoning model, the tools it can call, and the instructions that specify expected behavior — wrapped in a scaffold (environment interface, prompting strategy, tool docs, single- vs. multi-agent structure, and context-management strategy). Hold this distinction; it's the root of the scaffold-decoupling problem below.

Outcome vs. process grading

Two scopes, and you generally want both:

Outcome-oriented (state-based). Did the environment end up correct? Did the refund get issued, the file get written, the ticket get closed? This is what users care about and what's hardest to fake — but it tells you nothing about how, so a lucky or reckless success scores the same as a careful one.
Process-oriented (transcript-based). Was the trajectory sound — right tools, right order, no policy violations, no destructive detours? Process grading catches the agent that reached the goal by, say, deleting and recreating a database when it should have run a migration. It's also how you catch reward hacking: an agent that mined the answer from git history reaches the right outcome through an illegitimate process.

Outcome-only scoring is dangerous for exactly the reason it's dangerous in coding evals — once the agent can act in the environment, "right final state" stops implying "did the work correctly."

The three grading methods

Code-based (automatic) graders. Deterministic checks: does the final DB state match, does a test suite pass, does an output string match a reference. Reproducible and cheap — the backbone of any agent eval. Weakness: no nuance; can't judge whether an explanation was good, only whether a value is equal.

Model-based graders (LLM-as-judge). An LLM scores the transcript or outcome against criteria you specify in the prompt. Flexible enough for subjective quality and trajectory soundness. Three scoring modes:

Reference-guided — judge compares output to a known good answer.
Pairwise (preference) — judge picks the better of two outputs.
Direct assessment (pointwise) — judge rates a single output, often on a Likert scale.

Prompting judges with itemized rubrics — decomposing "is this good?" into specific, individually-scored criteria — has become standard practice and materially improves consistency. But model judges are non-deterministic and carry well-known biases (position, verbosity, self-preference), so they must be calibrated against human judgment and monitored for agreement. (Same cautions as in eval infrastructure — they're amplified for agents because there's more to judge.)

Human evaluation. Manual inspection, vibe checks, and calibrated rubrics. It's the north star — but it demands real investment in rubric design, inter-rater agreement, and ongoing calibration, so you reserve it for final sign-off and for calibrating your automated graders.

The mature posture is a composite: simple code graders for the objective parts, model graders (rubric-prompted, human-calibrated) for the subjective parts, combined with predefined weights — and humans auditing the whole thing.

Metrics: pass@k, pass^k, and the consistency gap

This is where agent eval surprises people.

pass@k — probability the agent succeeds on at least one of k independent attempts. Rewards a model that can do the task sometimes.
avg@k — average success rate across k trials.
pass^k ("pass-hat-k") — probability the agent succeeds on all k attempts. Measures consistency.

The gap between pass@k and pass^k is the whole story. An agent can post a strong pass@1 and a respectable pass@k, then collapse on pass^k — meaning it can solve the task but won't do so reliably. The cited example: a model achieving only ~26% pass^4 on the telecom domain of τ²-bench. For anything you'd actually deploy, consistency is the metric that matters — a customer-service agent that's right 1-in-4 runs is unshippable no matter how good its best run looks. Always report a consistency metric, not just pass@1. And consistency is an economic metric too: the resolution rate pass^k measures is the exact denominator of Cost Per Resolution (CPR), so an agent that fails 3-in-4 runs doesn't just frustrate users — it quadruples what each resolved task actually costs you.

Tool-use and trajectory metrics

When you grade the process, tool use decomposes into measurable sub-skills:

Selection accuracy — did the agent choose the right tool for the step?
Invocation accuracy — did it call the tool correctly (right arguments)?
Structural accuracy — was the call well-formed (schema, types)?
Trajectory accuracy — was the overall sequence of calls correct?

These let you localize failures: a high selection but low invocation score points at argument/formatting problems (often a prompting or tool-doc issue), not a planning deficit. This is the eval-side mirror of the runtime concerns in agent serving infrastructure and the interop questions in agent protocols.

The benchmark landscape: τ-bench and Terminal-Bench

Two families anchor 2026 agent evaluation.

The τ-bench family — dynamic, multi-turn, policy-grounded conversations:

τ-bench — an LLM-simulated user converses with an agent in retail and airline domains, each with a policy document and a database the agent must respect and manipulate.
τ²-bench — a dual-control environment where both the user and the agent can change a shared environment (adds a Telecom domain). This is where the pass^k brittleness shows up sharply.
τ²-bench-verified — human verification pass that surfaced and fixed numerous quality issues: policy-compliance ambiguities, conflicting data, and unclear instructions.
τ³-bench — extends to a τ-banking domain requiring autonomous knowledge-base search.

Task curation here is model-in-the-loop with human oversight: schemas auto-populate databases, and designers iteratively refine tasks by reading agent transcripts.

Terminal-Bench — real terminal-based tasks:

Terminal-Bench 2.0 — 89 crowdsourced tasks spanning software engineering, ML, and system administration, run via the Harbor task format and harness.
Terminal-Bench 3.0 — in development.
Quality assurance is unusually rigorous: automated workflow checks, LLM-based quality checks, a manual checklist, three human reviewers, an adversarial exploit agent probing for shortcuts, and a final human sign-off — roughly 3 reviewer-hours per task.

Others worth knowing: GAIA / GAIA-2 (reasoning, web browsing, multimodal), TheAgentCompany (simulated software-company work), WorkArena (enterprise workflows), OSWorld (desktop tasks), MLE-Bench (Kaggle ML problems), PaperBench (reproducing AI research), SpreadsheetBench (Excel), HIL-Bench (human-in-the-loop decisions), and GDPval (economically-valuable tasks).

A recurring theme: frontier models are saturating the earlier benchmarks (several τ-bench domains, Terminal-Bench 2.0), which is exactly why each family keeps shipping harder successors. Treat any single benchmark number as perishable.

The scaffold-decoupling problem

The single most important caveat in agent eval: when you evaluate an agent, you are evaluating the model and the scaffold working together. A disappointing score can come from:

a weak model (poor reasoning/tool use),
a weak scaffold (bad prompts, missing tool docs, poor context management, wrong single/multi-agent structure), or
both.

Naive "Model A vs. Model B" agent comparisons are often really "Scaffold A vs. Scaffold B," and swapping the model into a scaffold tuned for a different one understates it. To attribute a result, hold the scaffold fixed and strong across models, read the transcripts, and use the tool/trajectory metrics above to localize where it broke. Context management is a frequent culprit — context rot (degradation as the conversation grows) is mitigated with summarization, tool-result clearing, note-taking/external stores, and progressive disclosure of context, rather than dumping everything in via static RAG.

A 7-step roadmap for building agent evals

A practical sequence for standing up your own suite:

Define success criteria — both outcome goals (final state) and process goals (acceptable trajectories, policy constraints).
Collect a small initial task set — 10–20 manually curated, realistic tasks. Resist the urge to scale first.
Make tasks high-quality and unambiguous — vague tasks produce uninterpretable scores; τ²-bench-verified exists because ambiguity is the default failure.
Provide ground truth / reference solutions — what the correct final state and (where possible) an acceptable trajectory look like.
Configure graders — start with simple code-based checks; add model-based graders (rubric-prompted) for subjective criteria.
Build an evaluation harness — automated execution of tasks × trials, transcript capture, metric aggregation (including pass^k and cost/latency).
Inspect, iterate, maintain — read transcripts, distinguish capability gaps from task-quality bugs, and keep adding new failure cases. The suite is a living artifact.

Keep both regression tests (legacy tasks you must not break) and new challenging tasks (to fight saturation) in the suite.

Pitfalls and recommendations

Don't trust pass@1. Report a consistency metric (pass^k). The brittleness it reveals is the difference between a demo and a product.
Don't rely solely on LLM judges. They're effective but biased and non-deterministic — calibrate against humans and monitor judge↔human agreement over time.
Don't blame the model by default. Decouple scaffold from model before drawing conclusions; read transcripts.
Don't start multi-agent. Single agents are easier to evaluate and maintain; add structure only when instructions bloat or tools overlap.
Don't treat benchmarks as static. Inspect tasks for ambiguity, policy violations, conflicting data, and exploitable flaws — and assume saturation is coming.
Do invest in data quality continuously. Human eval is the north star, but only if rubrics and inter-rater agreement are maintained.
Do layer your defenses. A "Swiss-cheese" strategy — automated evals + production monitoring + A/B tests + user feedback + cost metrics — catches what any single layer misses.

The take. The headline number on an agent benchmark tells you almost nothing on its own. The signal is in the consistency (pass^k), the transcripts (process, not just outcome), and whether you've actually isolated the model from its scaffold. Build a small living suite, grade the path as well as the destination, and calibrate your judges against humans — then the numbers start to mean something.

FAQ

Q: How is agent evaluation different from normal LLM evaluation? A standard LLM eval grades a single output against a reference. An agent runs a multi-turn loop, calls tools, and changes an environment — so you grade the final state (outcome) and the trajectory (process) across multiple trials, and you fold in cost and latency. See eval infrastructure for the single-model foundations this builds on.

Q: What's the difference between pass@k and pass^k? pass@k is success on at least one of k attempts (capability); pass^k is success on all k attempts (consistency). Agents often look strong on pass@k and weak on pass^k — e.g. ~26% pass^4 on τ²-bench telecom — and consistency is what determines deployability.

Q: Should I use code-based graders or LLM-as-judge? Both. Code graders for objective checks (final state, tests, string match) — reproducible and cheap; LLM judges (with itemized rubrics, calibrated against humans) for subjective quality and trajectory soundness. Combine them into a weighted composite, with humans auditing.

Q: What are τ-bench and Terminal-Bench? τ-bench is a family of dynamic, policy-grounded, multi-turn agent benchmarks (retail, airline, telecom, banking domains; τ², τ³, and a human-verified variant). Terminal-Bench tests real terminal tasks (software eng, ML, sysadmin) via the Harbor harness, with very rigorous per-task QA. Both are saturating, so newer/harder versions keep shipping.

Q: My agent scores poorly — is the model bad? Not necessarily. You're evaluating the model and the scaffold (prompts, tools, context management, structure). Hold a strong scaffold fixed across models, read the transcripts, and use selection/invocation/trajectory metrics to localize the failure before blaming the model.

Q: Should I build a multi-agent system? Start single-agent — they're easier to evaluate and maintain. Move to multi-agent (manager/orchestrator or decentralized peer hand-off) only when a single agent's instructions become bloated or its tools have overlapping purposes.

References

Cameron R. Wolfe — "Agent Evals" — detailed practitioner guide to evaluating AI agents. cameronrwolfe.substack.com/p/agent-evals.
τ-bench — Yao et al., 2024. "τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains." arXiv:2406.12045.
Terminal-Bench — Stanford / Laude Institute. Terminal-based agent benchmark and Harbor harness. tbench.ai.
GAIA — Mialon et al., 2023. "GAIA: A Benchmark for General AI Assistants." arXiv:2311.12983.
SWE-bench — Jimenez et al., 2023. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" arXiv:2310.06770.
LLM-as-Judge — Zheng et al., 2023. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." arXiv:2306.05685.
ReAct — Yao et al., 2022. "ReAct: Synergizing Reasoning and Acting in Language Models." arXiv:2210.03629.
Model Context Protocol (MCP) — Anthropic's open standard for tool/context interfaces. See AI agent protocols.

Table of contents