LLM Evaluation Infrastructure: The Complete Guide
The definitive guide to evaluating LLMs honestly: why aggregate benchmarks lie, how contamination distorts scores, the protocol sensitivities most papers don't report, agentic evals, and what credible workload-specific evaluation looks like.
A benchmark number is a summary statistic over a fixed dataset evaluated with a fixed protocol. Each of those three words — summary, fixed, protocol — hides assumptions that turn out to matter enormously when you try to compare models or predict how one will behave on your workload.
The field is awash in benchmark numbers. Press releases tout single-percentage improvements. Leaderboards reshuffle weekly. The signal-to-noise ratio of public benchmarks is the worst it's ever been, even as serious evaluation has become more important than ever.
The take: public benchmarks are marketing; workload evals are engineering. Treat them differently. Aggregate scores on contaminated public benchmarks (any benchmark public long enough to matter is contaminated, per the contamination literature cited below) are coarse signal for tracking the field. They are not a basis for production decisions. The only number that reliably predicts how a model will behave on your workload is a measurement of how the model behaves on your workload. If you don't have that, your decisions are based on the wrong evidence.
This is an engineering playbook for defending a model-selection decision in a room where the answer matters: what public benchmarks (HELM, MMLU-Pro, GPQA, SWE-bench, LiveCodeBench, Chatbot Arena, FrontierMath) actually measure and where they break; static leaderboards vs live arenas; contamination as a quantifiable phenomenon; the protocol sensitivities that explain why two papers report different numbers for the same benchmark; statistical practice that survives peer review; and the discipline of building internal eval harnesses that predict deployment behavior. Pair with LLM serving, agent serving, reasoning model serving, and synthetic data and distillation to close the loop from eval signal to training and serving decisions.
Table of contents
- Key takeaways
- Mental model: LLM evaluation in one minute
- The eval landscape in 2026
- What a benchmark actually measures
- Static benchmarks vs live arenas
- Pass@k vs single-shot scoring
- Contamination and how vendors handle it
- Protocol sensitivity
- Goodhart's law and metric targeting
- Public benchmarks: what they're good for
- Building an internal eval harness
- Building workload-specific evals
- Holistic vs narrow evals
- Evaluating long-form output
- Evaluating agentic behavior
- Evaluating safety and alignment
- Statistical practice
- Continuous evaluation in production
- Open problems
- LLM-as-judge: when it works, when it breaks, how to calibrate
- Cost-and-throughput math for an eval harness
- Eval CI/CD: gating model releases on the harness
- Eval stack tour: lm-eval-harness, HELM, OpenAI Evals, Inspect, DeepEval, Promptfoo
- Eval platforms: LangSmith, Braintrust, Confident AI, Galileo, Patronus
- Benchmark deep dive: MMLU-Pro through MIXEVAL
- Trace-replay infrastructure for production debug
- Regression eval CI/CD: gating policies, threshold setting
- Domain-specific evals: medical, legal, finance, coding, support
- The eval leaderboard meta and Goodhart in practice
- LLM-as-judge calibration: position bias, length bias, judge upgrade
- Eval set construction methodology
- Private internal evals: golden sets and A/B preference data
- Benchmark taxonomy: reference, judge, programmatic, human
- Open evaluation problems in 2026
- Benchmark contamination: detection and remediation
- Statistical power and confidence intervals
- Evaluation cost economics
- Safety and red-team evals: HarmBench, AILuminate, WMDP, XSTest
- Multi-modal eval: vision, audio, video
- A/B testing in production: routing, interleaving, holdouts
- Reasoning-model eval challenges
- RAG evaluation: RAGAS, FaithfulnessQA, retrieval metrics
- Agent evaluation: GAIA, BrowseComp, OSWorld, tau-bench
- The production eval feedback loop
- Running an eval team: roles and responsibilities
- Eval data governance and labeling pipelines
- Eval observability: dashboards, alerts, regression detection
- Cross-model eval portability and the multi-provider future
- The bottom line
- FAQ
- Glossary
- References
Key takeaways
- Public benchmarks are increasingly contaminated. Any benchmark that's been public long enough to be tested rigorously is in some model's training data.
- A benchmark number depends heavily on the protocol (prompt template, decoding params, parsing). Two papers can report different scores on "the same" benchmark.
- Aggregate scores hide tail behavior. Models with identical headline numbers can behave very differently on hard items.
- Goodhart's law: once a benchmark becomes a target, optimization erodes its correlation with capability.
- Workload-specific evals built from your actual traffic are what tells you something useful about deployment performance.
- Agentic and long-form evaluation are the hardest current problems. Both are still immature.
- Recommendation: trust public benchmarks for coarse comparison, your own evals for production decisions, and statistical rigor for both.
Quick comparison: eval approaches
| Approach | What it measures | Cost per run | Determinism | Best for |
|---|---|---|---|---|
| Public benchmark (MMLU, etc.) | Coarse capability | Low (cached) | High (greedy) | Marketing, coarse model selection |
| Held-out private set | Generalization on a domain | Low | High | Tracking regressions on a known slice |
| Workload replay (traces) | Production behavior | Medium | Medium (sampled) | Pre-deploy gates, regression detection |
| LLM-as-judge | Long-form quality, style | Medium-high | Low | Open-ended generation, agent outputs |
| Human review | Hard-to-specify quality | Very high | Medium | Final sign-off on safety-critical tasks |
| Agent rollout eval | Multi-turn task success | High | Low | Tool-using and reasoning agents |
| Reward-model scoring | Preference proxy | Medium | Medium | Post-training feedback loops |
This guide sits next to the rest of the serving and training stack: LLM serving for the inference path you're testing, agent serving infrastructure for trace-based evals on tool-using systems, reasoning model serving for evaluating long-CoT outputs, post-training for closing the loop from eval signal into model updates, and synthetic data and distillation for using eval failures as training data.
Mental model: LLM evaluation in one minute
The problem has a name: the offline/online gap. Your benchmark says model A wins; production says model B. The gap is what separates "we ran an eval" from "we ran an eval that predicts deployment behavior." Almost everything else in this guide — contamination, protocol sensitivity, judge calibration, agent rollouts — is a tactic for shrinking that gap.
Think of evaluation as signal separation, not score reporting. A benchmark number is a mixture: true capability + protocol artifact + dataset contamination + sampling noise + judge bias. Aggregate scores collapse those terms; honest eval keeps them separate. The job is to design a harness where the residual after subtracting the noise terms is small enough to make decisions on.
| Aspect | Public benchmark (offline) | Workload eval (online proxy) |
|---|---|---|
| Items | Frozen public dataset | Sampled from your traffic |
| Contamination risk | High after 6–12 months | Effectively zero |
| Protocol stability | Set by vendor, often undocumented | Pinned by you |
| Decision relevance | Coarse field-tracking | Production gate |
| Cost per run | Low (cached) | Medium (your eval pipeline) |
| Goodhart risk | Severe once it's a target | Limited to your own optimization |
The production one-liner that ties the loop together looks like this:
# pin the protocol, separate the signals
results = harness.run(
    model=candidate,
    suite=workload_suite,  # your traffic, stratified
    decoding={"temperature": 0.0, "max_tokens": 1024},
    judge=judge_model, judge_seed=42,
    n_samples_per_item=3,  # variance estimate
)
gate = ci_lower_bound(results) > baseline_metric  # ship/no-ship
The sticky number: on MT-Bench, GPT-4-class judges agree with human raters a little over 80% of the time, about the same rate at which human raters agree with each other (Zheng et al., 2023). That is the practical ceiling for LLM-as-judge as a substitute for humans on chat-style tasks: high enough to be useful, low enough that a small MT-Bench delta is noise, not signal. Any eval claim that ignores this number is over-reading its own data.
The eval landscape in 2026
By 2026 the eval ecosystem has split into four mostly-independent layers, and confusion between them is the single biggest source of bad arguments.
Layer 1 — static academic benchmarks. The lineage of HELM (Liang et al., 2022), BIG-bench (Srivastava et al., 2022), and MMLU (Hendrycks et al., 2020). These are large fixed datasets with frozen items. They are heavily contaminated for any modern frontier model, but they are the only artifacts that allow comparison to historical numbers. MMLU-Pro (Wang et al., 2024) is the canonical "harder MMLU" successor. GPQA-Diamond (Rein et al., 2023) is the canonical graduate-level science benchmark. FrontierMath (Glazer et al., 2024) is the current "we still can't solve this" math benchmark, with items written by professional mathematicians and held out from public release.
Layer 2 — live human-preference arenas. Chatbot Arena (Chiang et al., 2024), run by LMSYS, is the dominant entry. Users blind-vote between two model responses; an Elo system aggregates. AlpacaEval, MT-Bench (LLM-as-judge), and vendor-hosted equivalents like Vellum, Scale's SEAL leaderboard, and Artificial Analysis sit alongside. Live arenas resist contamination by construction (prompts are not published), but they're biased toward chat-style preferences and reward verbosity and confident tone.
Layer 3 — code and agent benchmarks with execution feedback. HumanEval (Chen et al., 2021) was the original; it is now thoroughly contaminated. LiveCodeBench (Jain et al., 2024) addresses contamination by rolling its problem window monthly. SWE-bench (Jimenez et al., 2023) and SWE-bench Verified are the canonical agent-coding benchmarks: real GitHub issues, real test suites, real patches. Aider's polyglot benchmark and TerminalBench sit alongside.
Layer 4 — internal eval harnesses. Every serious deployment runs its own. These are not benchmarks in the academic sense; they are workload-conditioned regression suites. They are the only numbers that matter for production decisions.
The harness ecosystem. EleutherAI's lm-evaluation-harness is the de-facto standard for reproducing static benchmarks. HELM is the most rigorous protocol-document framework. OpenAI's evals, the UK AI Safety Institute's inspect_ai, and frameworks like Promptfoo, Braintrust, LangSmith, and Weights & Biases Weave handle the workload-eval layer. For agent eval specifically: AgentBench, OSWorld, WebArena, and the SWE-bench harness.
Quick comparison: harness frameworks
| Framework | Primary use | Async support | Trace integration | Cost |
|---|---|---|---|---|
| lm-evaluation-harness | Reproducing public benchmarks | Limited | Minimal | OSS |
| HELM | Protocol-rigorous comparisons | Limited | Strong methodology | OSS |
| inspect_ai | Workload + agent evals | First-class | Built-in | OSS |
| OpenAI evals | Workload evals | Yes | OK | OSS |
| Promptfoo | Prompt engineering iteration | Yes | OK | OSS / hosted |
| Braintrust | Hosted workload evals | Yes | Strong | Hosted (paid) |
| LangSmith | LangChain-integrated evals | Yes | Strong | Hosted (paid) |
| W&B Weave | Integrated obs+eval | Yes | Strong | Hosted (paid) |
| Vellum | Enterprise evals | Yes | Strong | Hosted (paid) |
Default pick for new harness work in 2026: inspect_ai as the framework, with a custom item store and scoring scripts. If you already use LangSmith or Braintrust for observability, the integrated eval features are often good enough to avoid running two systems. Public-benchmark reproduction is lm-evaluation-harness.
Who runs what. Frontier labs (Anthropic, OpenAI, Google DeepMind, Meta, xAI, DeepSeek, Alibaba Qwen) publish on all four layers. Application teams should mostly ignore Layer 1, watch Layer 2 weekly, gate releases on Layer 3 if relevant, and build Layer 4. The rest of this guide is mostly about Layer 4, with the static benchmarks treated as context.
What a benchmark actually measures
A benchmark has three components, each carrying assumptions:
The items. A list of inputs the model must respond to. Sampled from some distribution — typically whatever the benchmark's authors thought was important.
The scoring rule. A function from (model output, item) to a score. Exact-match, multiple-choice accuracy, model-graded similarity, etc.
The aggregation. A way to combine per-item scores into one number, usually a mean.
A benchmark score is a function of all three. Change any of them and the number changes.
What this means in practice
- A model that's great on the benchmark's specific distribution may not be on yours.
- A model that gets full credit for matching a canonical answer may produce other correct answers that get zero.
- A model that aces easy items and fails hard ones gets the same score as one with the opposite profile.
Two models with the same benchmark score on the same dataset can still diverge sharply on real workloads. The benchmark is a lossy summary.
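To make the "lossy summary" point concrete, here is a minimal sketch with two made-up models whose per-item results are invented for illustration. It only shows that the aggregation step (a mean) can hide opposite difficulty profiles; the data and function names are assumptions, not taken from any real benchmark.

```python
from statistics import mean

# Hypothetical per-item results: (difficulty stratum, score) for two models.
model_a = [("easy", 1), ("easy", 1), ("hard", 0), ("hard", 0)]
model_b = [("easy", 0), ("easy", 0), ("hard", 1), ("hard", 1)]

def aggregate(results):
    """Aggregation: collapse per-item scores into a single mean."""
    return mean(score for _, score in results)

def by_stratum(results):
    """Same scores, grouped by stratum instead of collapsed."""
    groups = {}
    for stratum, score in results:
        groups.setdefault(stratum, []).append(score)
    return {stratum: mean(scores) for stratum, scores in groups.items()}

print(aggregate(model_a), aggregate(model_b))  # identical headline numbers
print(by_stratum(model_a))                     # strong on easy, weak on hard
print(by_stratum(model_b))                     # the opposite profile
```

The headline number is the same for both models; only the per-stratum view shows which one fits a workload dominated by hard items.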
Static benchmarks vs live arenas
The two dominant evaluation paradigms in 2026 are static benchmarks and live arenas, and they answer different questions.
Static benchmarks
A static benchmark is a fixed dataset evaluated with a fixed protocol. Examples: MMLU-Pro, GPQA-Diamond, SWE-bench, HumanEval, MATH, GSM8K, LiveCodeBench (rolling-window static), FrontierMath. In the usual case the benchmark publishes items, gold answers, and a scoring script; FrontierMath, which keeps its items private, is the exception. Anyone can run the same public eval and get a comparable number, provided they use the same protocol.
Strengths:
- Reproducible. A claimed score can be verified.
- Comparable across labs and across time.
- Cheap to run after the first time.
Weaknesses:
- Contamination accumulates. Any benchmark public for more than a year is partially in the training set of any well-resourced model.
- Goodhart pressure: labs optimize for the benchmark.
- Fixed distribution: doesn't track new use cases.
Live arenas (Chatbot Arena, LMSYS, AlpacaEval, Vellum)
A live arena solicits judgments on novel inputs. Chatbot Arena, run by LMSYS, is the canonical example: a user types a prompt, two anonymized models respond, the user picks the better one. Aggregating millions of these votes via the Bradley-Terry model produces Elo-style ratings. The full methodology is in Chiang et al., 2024.
AlpacaEval automates this with LLM-as-judge on a fixed prompt set, calibrated against human preferences. Vellum, Artificial Analysis, and Scale SEAL operate proprietary equivalents focused on enterprise tasks. Internal A/B tests at scale (e.g., what providers run during a model rollout) are private live arenas.
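The Bradley-Terry aggregation mentioned above can be sketched in a few lines. This is an illustrative fit using the classic minorize-maximize update, not the arena's production code: it ignores ties, assumes every model has at least one win, and the 400-point log scale with a base of 1000 is a conventional choice made here for readability.

```python
import math
from collections import Counter

def fit_bradley_terry(votes, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) votes.

    Assumes every model has at least one win; ties are not handled.
    """
    wins = Counter(winner for winner, _ in votes)
    games = Counter(frozenset(pair) for pair in votes)
    models = sorted({m for pair in votes for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            # Games against each opponent, weighted by current strengths.
            denom = sum(
                games[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models if j != i
            )
            updated[i] = wins[i] / denom
        # Strengths are only identified up to scale, so renormalize.
        scale = len(models) / sum(updated.values())
        strength = {m: s * scale for m, s in updated.items()}
    return strength

def to_elo_scale(strength, base=1000.0):
    """Map strengths to an Elo-style scale (400 points per 10x odds)."""
    return {m: base + 400.0 * math.log10(s) for m, s in strength.items()}

# Hypothetical vote log between three anonymized models.
votes = (
    [("A", "B")] * 7 + [("B", "A")] * 3
    + [("B", "C")] * 6 + [("C", "B")] * 4
    + [("A", "C")] * 8 + [("C", "A")] * 2
)
print(to_elo_scale(fit_bradley_terry(votes)))
```

Under this model the predicted probability that model i beats model j is strength[i] / (strength[i] + strength[j]), which is why only rating differences, not absolute ratings, carry meaning.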
Strengths:
- Reflects what users actually want from a chat model.
- Contamination-resistant: prompts aren't published in bulk.
- Updates continuously.
Weaknesses:
- Reward verbosity, confident style, formatting tricks that don't reflect underlying capability.
- Bias toward chat use cases over agentic, long-form, or code.
- User population isn't representative of any single deployment's users.
- Cannot be reproduced by a third party.
Which to trust for what
| Question | Use |
|---|---|
| Has model capability improved meaningfully? | Static benchmarks (MMLU-Pro, GPQA, FrontierMath) |
| Does this model "feel" better in chat? | Chatbot Arena |
| Will it ship a working code patch? | SWE-bench / LiveCodeBench |
| Will it work on my customer-support prompts? | Internal eval |
| Is the headline arena ranking real or stylistic? | Control for length and style |
Healthy practice is to look at all three layers (static, arena, internal) and treat disagreement between them as informative. A model that's #1 on Arena but middling on GPQA is a chat-tuning win, not a capability win. A model that crushes GPQA but ranks low on Arena is competent but stylistically off-putting.
Pass@k vs single-shot scoring
For tasks with verifiable answers (code, math, structured outputs), there is more than one way to measure "did the model solve it." The choice changes leaderboard order.
Pass@1 (single-shot)
The model generates one attempt per problem. Score is the fraction solved.
- Cheapest. Most leaderboards default here for the headline.
- High variance on sample sets of a few hundred items.
- Sensitive to temperature and other decoding parameters.
Pass@k
The model generates k attempts per problem. The problem counts as solved if any attempt passes. HumanEval's original paper (Chen et al., 2021) defined an unbiased estimator for pass@k from a larger sample of n attempts.
- pass@10 and pass@100 measure the model's coverage — the breadth of solutions it can produce.
- A model with high pass@1 but low coverage is brittle. A model with low pass@1 but high pass@10 has the ideas but can't pick them.
- Real production deployments rarely sample 10 times per query, so pass@1 is the operationally relevant number. But pass@k informs how much best-of-N or self-consistency will help.
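The unbiased estimator from Chen et al. (2021) is short enough to write out. Given n sampled attempts of which c pass, it computes the probability that a random size-k subset contains at least one pass. The function name and the example counts below are illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: attempts sampled, c: attempts that passed, k: budget being estimated.
    Equals 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 20 samples drawn, 3 passed the tests.
for k in (1, 5, 10):
    print(k, round(pass_at_k(n=20, c=3, k=k), 3))
```

The benchmark-level score is the mean of this quantity over problems. Sampling n larger than k and using the estimator gives lower variance than literally drawing k attempts once.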
Maj@k (self-consistency)
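Generate k attempts, take the most common answer. Unlike pass@k, this needs no oracle to pick the winning attempt, so it is a number you can actually deploy. On a given item it can never exceed pass@k, but for math and multiple-choice it typically closes part of the gap between pass@1 and pass@k at the same sampling cost.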
Best-of-N with a verifier
Generate N, use a verifier (test suite, reward model, judge) to pick. Distinct from pass@k because the verifier may pick a wrong answer. The ceiling is pass@N (a perfect verifier); a verifier that picks no better than chance lands at pass@1.
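The three selection rules differ only in how one answer is chosen from the same set of samples. A small sketch, with invented answers and verifier scores, makes the distinction explicit; the function names are illustrative.

```python
from collections import Counter

def oracle_pass(answers, gold):
    """pass@k view: solved if any sampled answer is correct."""
    return int(gold in answers)

def majority_vote(answers, gold):
    """maj@k view: solved if the most common answer is correct.

    Ties resolve to whichever tied answer was sampled first.
    """
    top_answer, _ = Counter(answers).most_common(1)[0]
    return int(top_answer == gold)

def best_of_n(answers, verifier_scores, gold):
    """Best-of-N view: solved if the verifier's top pick is correct."""
    best = max(range(len(answers)), key=lambda i: verifier_scores[i])
    return int(answers[best] == gold)

# Hypothetical samples for one math item whose gold answer is "9".
answers = ["7", "7", "9"]
print(oracle_pass(answers, "9"))                 # correct answer was sampled
print(majority_vote(answers, "9"))               # but the majority is wrong
print(best_of_n(answers, [0.2, 0.1, 0.9], "9"))  # a good verifier finds it
print(best_of_n(answers, [0.9, 0.1, 0.2], "9"))  # a bad verifier does not
```

The gap between the oracle result and the other two is the headroom a better selection mechanism could still capture.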
What to report
For a credible eval write-up: pass@1 with a stated temperature, plus pass@k for at least one larger k if compute allows, plus the standard error on each. A pass@1 at one stated temperature with a confidence interval is the minimum threshold for taking a number seriously; without the interval, it falls short.
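As a rough guide to how wide those error bars are, the sketch below computes the binomial standard error and a Wilson score interval for a pass@1 result. It treats items as independent draws, which is an assumption; the 200-item example is invented.

```python
import math

def standard_error(successes: int, n: int) -> float:
    """Binomial standard error of an accuracy estimate."""
    p = successes / n
    return math.sqrt(p * (1 - p) / n)

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval; z = 1.96 gives roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical result: 140 of 200 items solved at temperature 0.
low, high = wilson_interval(140, 200)
print(f"pass@1 = 0.700, SE = {standard_error(140, 200):.3f}, "
      f"95% CI = [{low:.3f}, {high:.3f}]")
```

On a set of this size the interval spans roughly 12 points, so two models a few points apart cannot be separated by this eval alone.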
Contamination and how vendors handle it
Models are trained on web-scraped corpora. Benchmarks are published on the web. The intersection grows over time.
A model that has seen the benchmark's items during training will score higher on them than its actual capability warrants. The effect is real, measurable, and rarely accounted for in headline numbers.
How big is the effect
Estimates vary. Documented contamination effects range from negligible (some benchmarks, some models) to dramatic — 10+ point inflation on aggregate scores.
The problem is that you don't know which case you're in without careful analysis. The benchmark's authors can release contamination reports; not all do.
Mitigations
Held-out items. Some items kept private. Works only as long as they stay private — eventually they leak.
Recent benchmarks. Created after a model's training cutoff. Works briefly, but as the model is retrained, the freshness window shrinks.
Decontamination. Filter training data to remove known benchmark items. Catches exact matches; misses paraphrases.
Behavioral checks. Compare model behavior on original items vs perturbed versions. A model that memorized will perform much better on the original. Large gaps suggest contamination.
Decision-relevant deltas. If two models score within 2 points on a contamination-suspect benchmark, that delta is probably below the contamination noise floor. Don't make decisions on it.
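The behavioral check described above reduces to a paired comparison. This sketch assumes you already have per-item correctness on each original item and on a perturbed twin (renamed entities, changed numbers, reordered options); how the twins are produced is outside its scope, and the data is invented.

```python
def perturbation_gap(original_correct, perturbed_correct):
    """Compare accuracy on original items with accuracy on perturbed twins.

    Both arguments are equal-length lists of 0/1, aligned by item.
    """
    if len(original_correct) != len(perturbed_correct):
        raise ValueError("lists must be aligned item by item")
    n = len(original_correct)
    pairs = list(zip(original_correct, perturbed_correct))
    return {
        "gap": sum(original_correct) / n - sum(perturbed_correct) / n,
        # Items solved only in their published form: the memorization signal.
        "only_original": sum(1 for o, p in pairs if o and not p),
        # Items solved only in perturbed form: expected from noise alone.
        "only_perturbed": sum(1 for o, p in pairs if p and not o),
    }

# Hypothetical results on eight items and their perturbed twins.
original = [1, 1, 1, 1, 0, 1, 1, 1]
perturbed = [1, 0, 1, 0, 0, 1, 0, 1]
print(perturbation_gap(original, perturbed))
```

A large gap with lopsided discordant counts is suggestive rather than conclusive: perturbations can also make items genuinely harder, so it helps to check the twins on a model known to be uncontaminated.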
The honest position
Any benchmark that's been public for more than a year, in a domain that's been widely scraped, is partially contaminated for any well-resourced model trained on web data. Some are more contaminated than others.
Treat benchmark numbers as upper bounds when contamination is plausible.
How specific vendors handle contamination
The major labs publish contamination protocols that range from rigorous to gestural.
- Anthropic publishes contamination analyses in some model cards: substring matches between benchmark items and training data, with reported decontamination passes before training.
- OpenAI has discussed contamination protocols in GPT-4 and later technical reports, generally via 50-character substring matching against benchmark items.
- DeepSeek publishes contamination analyses in technical reports, including on R1's training data (arXiv:2501.12948).
- Meta publishes contamination scores per benchmark in Llama model cards, distinguishing exact matches from n-gram overlaps.
- Google DeepMind runs the most thorough public protocol on Gemini technical reports, including held-out replicas of public benchmarks.
What contamination scores don't capture: paraphrase contamination, contamination via discussion of benchmark items in tutorials and blog posts, and contamination via synthetic data generated from teacher models that themselves saw the items. These are real effects that no published methodology fully addresses.
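A substring-matching check of the kind described above can be sketched as follows. The 50-character probe length follows the text; the number of probes, the normalization, and all names are illustrative assumptions, and a real pipeline would use an index rather than a linear scan over documents.

```python
import random
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't hide a match."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def sample_probes(item_text, length=50, count=3, seed=0):
    """Draw fixed-length substrings from a benchmark item."""
    text = normalize(item_text)
    if not text:
        return []
    if len(text) <= length:
        return [text]
    rng = random.Random(seed)
    starts = [rng.randrange(len(text) - length + 1) for _ in range(count)]
    return [text[s:s + length] for s in starts]

def is_contaminated(item_text, corpus_docs, **probe_kwargs):
    """Flag the item if any probe appears verbatim in any document."""
    probes = sample_probes(item_text, **probe_kwargs)
    docs = [normalize(doc) for doc in corpus_docs]
    return any(probe in doc for doc in docs for probe in probes)

item = ("A train leaves Station A at 9:00 travelling at 60 km/h toward "
        "Station B, which is 180 km away. When does it arrive?")
verbatim = "Homework dump:\n" + item + "\nAnswer: 12:00."
paraphrase = ("Suppose a locomotive departs at nine in the morning going sixty "
              "kilometres per hour to a town one hundred eighty kilometres off.")
print(is_contaminated(item, [verbatim]))    # verbatim copy is caught
print(is_contaminated(item, [paraphrase]))  # paraphrase slips through
```

The second call is the limitation named above in miniature: the problem is the same, but no 50-character window survives the rewording.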
The freshness arms race
Benchmarks designed to resist contamination have a built-in expiration. LiveCodeBench rolls its problem window. FrontierMath holds items private. Chatbot Arena uses unpublished user prompts. Humanity's Last Exam is built from expert-written items and keeps a private held-out set to check for overfitting to the public questions.
The cost of freshness is comparability. A benchmark whose items change over time can't be compared cleanly across model generations. The field is gradually accepting this trade.
Protocol sensitivity
Two papers can run "the same" benchmark and report different numbers. The protocol matters.
Where protocols differ
Prompt template. Few-shot examples or zero-shot? Which examples? In what format?
Decoding parameters. Temperature 0 (greedy) or sampling? Top-p, top-k? Beam search?
Output parsing. How is a free-form completion reduced to a label? What if the model declines to answer?
System prompt. Yes or no? What content?
Re-tries. Does the harness re-prompt if the output is malformed?
For a well-tuned model, swapping the prompt template can move scores by several points on a standard benchmark. Different harnesses (lm-eval-harness vs HELM vs custom) often produce systematically different numbers.
What this means
A benchmark number without a documented protocol is approximate at best, misleading at worst.
When comparing models from different papers / press releases, check that the protocol is the same. If they don't say, assume incomparability.
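One lightweight way to make protocols comparable is to record every knob listed above in a single object and attach a fingerprint of it to each reported number. The field names here are illustrative, not a standard schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace

@dataclass(frozen=True)
class EvalProtocol:
    """Everything that can move a score besides the model itself."""
    prompt_template: str
    num_few_shot: int
    temperature: float
    top_p: float
    max_tokens: int
    system_prompt: str
    answer_parser: str  # name and version of the parsing rule
    max_retries: int

    def fingerprint(self) -> str:
        """Short stable hash; two scores are comparable only if these match."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

base = EvalProtocol(
    prompt_template="Question: {question}\nAnswer:",
    num_few_shot=5,
    temperature=0.0,
    top_p=1.0,
    max_tokens=256,
    system_prompt="",
    answer_parser="last_letter_v1",
    max_retries=0,
)
zero_shot = replace(base, num_few_shot=0)
print(base.fingerprint(), zero_shot.fingerprint())
```

Storing the fingerprint next to every score turns "are these two numbers comparable?" into a string comparison instead of an archaeology project.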
Common protocol gotchas
- Multiple-choice benchmarks: extracting the answer letter is non-trivial when the model writes paragraphs.
- Math benchmarks: equivalence (e.g., "0.5" vs "1/2") is hard to detect.
- Code benchmarks: which test cases? In what environment? With what compile flags?
The benchmark's published protocol should specify these. In practice many don't.
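Two of these gotchas are easy to illustrate. The sketch below shows one possible answer-letter extractor and one possible numeric-equivalence check; the patterns are assumptions chosen for the example, not the rules of any particular harness, and they deliberately return None when nothing parses so that unparsed outputs can be counted separately from wrong ones.

```python
import re
from fractions import Fraction

_CHOICE_PATTERNS = [
    # "The answer is (C)" or "Answer: C"; the letter itself must be uppercase.
    re.compile(r"(?i:answer\s*(?:is|:))\s*\(?([A-D])\b"),
    # A line consisting only of a letter, e.g. "B", "(B)" or "B.".
    re.compile(r"^\s*\(?([A-D])\)?[.)]?\s*$", re.MULTILINE),
]

def extract_choice(completion: str):
    """Return the last answer letter found, or None if nothing parses."""
    for pattern in _CHOICE_PATTERNS:
        matches = pattern.findall(completion)
        if matches:
            return matches[-1]
    return None

def parse_number(text: str):
    """Parse decimals and simple fractions exactly; None if not numeric."""
    cleaned = text.strip().rstrip(".").replace(",", "").replace(" ", "")
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        return None

def answers_equivalent(predicted: str, gold: str) -> bool:
    """Numeric comparison when both sides parse, string match otherwise."""
    p, g = parse_number(predicted), parse_number(gold)
    if p is not None and g is not None:
        return p == g
    return predicted.strip() == gold.strip()

print(extract_choice("B looks tempting, but the answer is (C)."))
print(extract_choice("I cannot determine this from the passage."))
print(answers_equivalent("0.5", "1/2"))
print(answers_equivalent("50%", "1/2"))
```

The last call returns False because percentages are not handled here; every such gap in the parser is a place where two harnesses can disagree on the same model output.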
Goodhart's law and metric targeting
Goodhart's law: when a measure becomes a target, it ceases to be a good measure.
Applied to LLM benchmarks: once a benchmark is widely cited, optimization pressure flows toward it. Training mixes shift, fine-tuning data targets the benchmark's style, evaluation feedback tunes the model to its format. The benchmark's score rises faster than underlying capability.
Concrete manifestations
Format optimization. A model trained to answer multiple-choice in a specific way scores higher than its underlying knowledge warrants.
Training-data overweighting. The model sees disproportionate amounts of benchmark-like data, becoming a specialist in the benchmark's distribution.
Off-target degradation. Heavy optimization toward a narrow benchmark can degrade performance on adjacent tasks the benchmark doesn't cover.
Defenses
Rotating benchmarks. New ones replace old. Buys time before Goodhart sets in.
Held-out adversarial items. Items designed to defeat memorization or shallow pattern matching.
Composite metrics. Many benchmarks together; harder to game all simultaneously.
Process metrics. Not just final accuracy but reasoning quality, calibration, refusal behavior.
None of these fully solves it. Goodhart's law is structural: any public number eventually gets gamed.
Public benchmarks: what they're good for
Public benchmarks are not useless. They're useful for specific purposes.
Good uses
- Coarse comparison. A model 10+ points higher on a serious benchmark is probably actually more capable. Even with contamination noise, that's signal.
- Tracking the field's trajectory. Aggregate scores across benchmarks over years tell a real story about progress.
- Sanity-checking custom evals. If your custom benchmark gives a counterintuitive result, comparing on a public one can catch evaluation bugs.
- Common vocabulary. When discussing models with peers, "scores 78 on MMLU-Pro" is a more useful shorthand than long descriptions.
Bad uses
- Fine-grained ranking among similar models. Differences of <3 points are within protocol and contamination noise.
- Predicting workload-specific performance. A benchmark's distribution is probably not your distribution.
- Justifying production decisions. "Model X scores higher on benchmark Y, so we should ship it" is rarely sound on its own.
Worth knowing in 2026
- MMLU-Pro: harder successor to MMLU, partly addresses Goodhart on the original.
- GPQA: graduate-level Q&A. Less prone to memorization than older benchmarks.
- LiveCodeBench: rolling-window coding evaluation that turns over to avoid contamination.
- HELM: comprehensive multi-task framework with explicit protocols.
- BIG-Bench Hard: challenging subset of BIG-Bench.
- AIME / MATH / GSM8K: math benchmarks (heavily contaminated by now).
- HumanEval / MBPP: code benchmarks (also contaminated; LiveCodeBench is the freshness response).
- MT-Bench / Chatbot Arena: human preference-based evaluation.
- SWE-bench: agent-style real-world coding tasks.
This list rotates quickly. The benchmarks worth caring about in 2027 will partly differ.
Building an internal eval harness
The internal eval harness is the highest-leverage piece of evaluation infrastructure most teams own. Done well, it answers "should we ship this model?" in hours. Done badly, it answers nothing reliably and the team falls back on vibes.
What an internal harness is
A reproducible system that runs a set of evaluations against a set of model endpoints and emits a structured report. It has four components:
- An item store: prompts and gold answers (or rubrics), versioned, with metadata (difficulty, segment, source).
- A runner: connects to model endpoints, executes items with controlled protocol (temperature, system prompt, retry policy, tool stubs), records raw outputs.
- A scorer: applies the right scoring rule per item type (exact match, judge, test execution, rubric).
- A reporter: aggregates and presents results — by stratum, with confidence intervals, with diffs against baseline.
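The four components map onto very little code at their core. The following is a toy skeleton under stated assumptions: the model is any callable from prompt to text, the item schema and scorer names are invented, and confidence intervals and baseline diffs are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Item:
    """Item store entry: prompt, gold answer, and metadata."""
    item_id: str
    prompt: str
    gold: str
    scorer: str  # key into SCORERS
    stratum: str = "default"

# Scorer registry: one scoring rule per item type.
SCORERS: dict[str, Callable[[str, str], float]] = {
    "exact_match": lambda output, gold: float(output.strip() == gold.strip()),
    "contains": lambda output, gold: float(gold.lower() in output.lower()),
}

def run(items, model):
    """Runner: call the model on every item and keep the raw output."""
    return [{"item": item, "output": model(item.prompt)} for item in items]

def score(records):
    """Scorer: apply each item's own scoring rule."""
    for record in records:
        item = record["item"]
        record["score"] = SCORERS[item.scorer](record["output"], item.gold)
    return records

def report(records):
    """Reporter: aggregate per stratum rather than into one number."""
    groups = {}
    for record in records:
        groups.setdefault(record["item"].stratum, []).append(record["score"])
    return {s: {"n": len(v), "mean": sum(v) / len(v)} for s, v in groups.items()}

items = [
    Item("r1", "Capital of France? One word.", "Paris", "exact_match", "recall"),
    Item("r2", "Capital of Peru? One word.", "Lima", "exact_match", "recall"),
    Item("s1", "Summarize: the invoice total is 40 EUR.", "40 EUR", "contains", "summary"),
]

def stub_model(prompt: str) -> str:
    """Stand-in for a real endpoint call."""
    return "Paris" if "France" in prompt else "The total is 40 EUR."

print(report(score(run(items, stub_model))))
```

Keeping raw outputs on every record is the design choice that matters most: it lets a reviewer reopen any individual verdict without rerunning the model.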
Build vs buy in 2026
Open-source frameworks worth knowing:
- inspect_ai (UK AISI): well-designed Python framework, strong async support, used by AISI evaluators. Good default for new harness work.
- lm-evaluation-harness (EleutherAI): the reference for reproducing static benchmarks.
- OpenAI evals: the original public evals framework; usable but heavier.
- promptfoo: config-driven evals, good for prompt-engineering iteration.
- Braintrust, LangSmith, Weights & Biases Weave, Helicone: hosted SaaS, integrate observability with eval. Pay for trace storage and UI.
Default recommendation: inspect_ai for the framework, with your own item store and your own scoring scripts. Most production harnesses end up as a wrapper around one of the open frameworks with a custom item loader and a custom report.
Items the harness must run
A working harness for a chat-style product covers:
- Regression items: 100-500 hand-curated prompts that previously revealed bugs. Most valuable single category. Every release runs these.
- Workload sample: a stratified sample from production traces. Bigger (500-5000 items), refreshed monthly.
- Capability slices: small benchmark-style sets specific to your product (entity extraction, summarization, format adherence).
- Safety battery: jailbreak attempts, PII probes, refusal triggers. Gated.
- Performance items: long contexts, structured outputs, tool calls — items that stress the serving path as much as the model.
Reporting that survives skepticism
A harness report that wins arguments includes: per-stratum results with confidence intervals, diffs against a frozen baseline (last shipped model), explicit notes on which items changed verdict, raw outputs for any item that regressed, and links to the trace store so anyone can spot-check.
Without that, leadership reads two numbers and picks one. With it, the conversation moves to specific items, which is where the right decisions get made.
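The "which items changed verdict" part of such a report is a small set operation. A sketch, assuming each run is stored as a mapping from item ID to pass/fail; the IDs and results are invented.

```python
def verdict_diff(baseline, candidate):
    """List items whose pass/fail verdict changed between two runs.

    baseline, candidate: dict mapping item_id -> bool (passed).
    Only items present in both runs are compared.
    """
    shared = baseline.keys() & candidate.keys()
    regressed = sorted(i for i in shared if baseline[i] and not candidate[i])
    fixed = sorted(i for i in shared if candidate[i] and not baseline[i])
    return {
        "regressed": regressed,
        "fixed": fixed,
        "unchanged": len(shared) - len(regressed) - len(fixed),
    }

# Hypothetical runs: last shipped model vs. release candidate.
baseline = {"q1": True, "q2": True, "q3": False, "q4": True}
candidate = {"q1": True, "q2": False, "q3": True, "q4": True}
print(verdict_diff(baseline, candidate))
```

Two models with the same aggregate score can still have non-empty regressed and fixed lists, which is the situation where reading individual items matters most.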
Cost
A serious harness costs $50k-$500k/year to operate at a frontier-adjacent application company: engineering time, API calls, human review for rubric items, judge-model calls. The first model-selection decision it informs typically pays for it many times over.
Building workload-specific evals
If you want to know how a model will perform on your workload, build evals from your workload.
Steps
1. Define the task. What does the model need to do for your users? Be specific. "Answer questions" is too vague; "summarize a contract clause for a legal-ops user" is workable.
2. Sample items. Pull representative inputs from production traffic (with appropriate privacy considerations). Don't curate to "interesting" examples; capture the distribution.
3. Stratify. Group by difficulty, by user segment, by length, by domain. The aggregate score should be informative, but per-stratum scores are where decisions get made.
4. Define scoring. What counts as success? Decide before generating model outputs. Different options:
- Exact-match: works for narrow tasks.
- Rubric-based: human or model judge with explicit criteria.
- Comparison-based: pairwise vs reference output.
- Downstream task success: did the user's downstream action succeed?
5. Hold out items. Never share items with model-vendor APIs you don't trust, never put them in public docs. Privacy of your eval set is itself a useful property.
6. Maintain. Refresh items periodically; distributions drift.
What "representative" means
The temptation is to pick "good" examples. Resist it. A workload eval should reflect the real difficulty distribution of your traffic, including the boring 80% and the painful 5% tails.
A common workflow:
- Random sample 200-500 items.
- Hand-label difficulty / category.
- Stratify by labels.
- Evaluate per-stratum.
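The first and third steps of that workflow can be sketched as below. The sampling is uniform and seeded so the draw is reproducible; the 30-item minimum per stratum is an illustrative threshold rather than a rule, included to show how thin the tail strata become in a sample of this size.

```python
import random
from collections import Counter

def sample_traffic(trace_ids, n, seed=0):
    """Uniform random sample of production traces, reproducible by seed."""
    rng = random.Random(seed)
    return rng.sample(trace_ids, min(n, len(trace_ids)))

def stratum_sizes(labels, min_items=30):
    """Count hand-labeled items per stratum and flag strata too small to read.

    labels: dict mapping trace_id -> stratum label.
    """
    counts = Counter(labels.values())
    return {s: {"n": c, "enough": c >= min_items} for s, c in counts.items()}

# Hypothetical traffic log and hand labels for a 300-item sample.
trace_ids = [f"trace-{i}" for i in range(10_000)]
sample = sample_traffic(trace_ids, n=300, seed=7)
labels = {
    tid: ("routine" if i < 240 else "multi_step" if i < 285 else "tail")
    for i, tid in enumerate(sample)
}
print(stratum_sizes(labels))
```

When a tail stratum comes back flagged, the usual remedy is to oversample that stratum deliberately and report it separately, rather than letting it dilute into the aggregate.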
Cost
Workload-specific eval is more expensive than reading leaderboards. Plan for:
- Engineering time to build and maintain the eval harness.
- Possibly human-labeled scoring if rubrics require it.
- Model-API costs to evaluate candidate models.
For a production deployment serving meaningful traffic, this cost is rounding error compared to the cost of shipping a worse model.
Working examples
The eval that actually moves decisions usually looks more like the following. A document-Q&A product runs a workload eval with these strata:
- Single-paragraph factual recall (n=120): exact-match on a known span. Catches retrieval regressions.
- Multi-document synthesis (n=80): LLM-as-judge with rubric. Calibrated quarterly against human ratings on a 20-item sample.
- Structured output (n=60): JSON-schema validity plus field-level accuracy. Catches format drift.
- Long-context (32k+) (n=40): needle-in-haystack plus harder multi-hop. Catches long-context regressions.
- Refusal / safety (n=50): graded by rule. Hard gate.
Each release runs all of these, with confidence intervals. The reporter highlights any stratum where the new model's confidence interval lies entirely below the baseline's. Conversations move to specific failing items, not aggregate scores.
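A per-stratum interval comparison like the one described can be sketched with a percentile bootstrap. The resample count, the seed, and the non-overlap rule are assumptions for illustration; non-overlapping intervals is a conservative test, and a paired comparison on the same items would be more sensitive.

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-item scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

def flag_regressions(baseline, candidate):
    """Return strata where the candidate's interval is entirely below the baseline's.

    baseline, candidate: dict mapping stratum -> list of per-item scores.
    """
    flagged = []
    for stratum in baseline:
        base_low, _ = bootstrap_ci(baseline[stratum])
        _, cand_high = bootstrap_ci(candidate[stratum])
        if cand_high < base_low:
            flagged.append(stratum)
    return flagged

# Hypothetical per-item scores using two of the strata sizes above.
baseline = {
    "factual_recall": [1] * 100 + [0] * 20,
    "structured_output": [1] * 55 + [0] * 5,
}
candidate = {
    "factual_recall": [1] * 100 + [0] * 20,
    "structured_output": [1] * 30 + [0] * 30,
}
print(flag_regressions(baseline, candidate))
```

With strata of 40 to 120 items the intervals are wide, so this rule only fires on large drops; smaller regressions need either more items or the item-level diff.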
A code-assistant product's workload eval typically has SWE-bench-style execution items, format-adherence items (does the model emit a usable diff), and tool-use items (does the model use the search tool correctly). The execution items dominate, since they're the only verifiable layer.
Holistic vs narrow evals
Two complementary evaluation styles:
Holistic evals
Aggregate scores across many tasks. MMLU-Pro, HELM, BIG-Bench.
- Tell you about general capability.
- Useful for marketing and model-generation comparisons.
- Less useful for product decisions.
Narrow evals
Specific tasks evaluated thoroughly. "Can the model reliably produce valid JSON for our schema?" "Does the model refuse to leak user PII?"
- Tell you about deployment readiness.
- Less useful for tracking capability over time.
- Essential for product decisions.
Most production teams converge on a portfolio: a small number of holistic evals as context, a larger number of narrow evals as gates, and a small number of red-team / safety evals as critical gates.
Evaluating long-form output
Most benchmarks score short, structured outputs. Evaluating long-form generation — essays, code, plans, reports — is much harder.
Approaches
Model-graded scoring. Another model evaluates outputs against criteria. Cheaper than human evaluation, but introduces biases (judge models prefer their own style; some judges are more lenient).
Pairwise comparison. Judge picks which of two outputs is better. Lower bias than absolute scoring. Used by Chatbot Arena and similar.
Rubric-based. Detailed criteria the judge checks. Reduces variance vs free-form judgment.
Human evaluation. Most reliable, most expensive. Often the gold standard for new evaluation methods.
Biases in model-graded scoring
- Length bias: longer responses often rated higher even when not better.
- Position bias: the first option in a pairwise is often preferred.
- Self-preference: a judge model may prefer outputs that look like its own.
- Verbosity bias: padded, confident-sounding answers rated higher even when they are wrong.
Good model-graded evaluation accounts for these (randomize positions, control for length, use multiple judges, sanity-check vs human ratings).
Evaluating agentic behavior
Evaluating a model's ability to use tools, take multi-step actions, and recover from errors is harder than evaluating Q&A.
What's different
Agentic evaluation requires:
- An environment, not a dataset.
- Multi-turn evaluation, with each turn's correctness affecting later turns.
- Tools the agent can call.
- A way to measure ultimate task success, not just per-step quality.
Benchmarks
- SWE-bench: real GitHub issues; agent must produce a patch that passes tests.
- WebArena: browser-based agent tasks.
- AgentBench: collection of agent evaluation environments.
- OSWorld: operating-system-level agent tasks.
These are the early generation of agentic evals. Quality varies. Reproducibility is a real challenge (the environment matters; small changes affect scores).
Open issues
- Stability over time. Software changes break agent evaluations. Maintenance cost is high.
- Cost. Multi-turn agentic eval is expensive — many API calls per task.
- Coverage. Existing benchmarks cover narrow slices of "agentic capability." Real production agent behavior is harder to capture.
For production agent systems, workload-specific evaluation built from your own task distribution is even more valuable than for chat systems.
What "good" looks like for an agent eval in 2026
A production-grade agent eval suite has these properties:
- Real environments, not simulations. A SWE-bench-style harness that actually runs the tests, not a model judging whether a patch "looks right."
- Multi-turn rollouts with full traces. Every tool call, every observation, every reasoning step captured.
- Per-step and end-task metrics. End-task success for the headline; per-step diagnostics for debugging.
- Replay against new models. Same trace can be re-run with a candidate model to estimate impact without re-doing the human-graded items.
- Cost tracking. Each item logs $ cost. Optimizing the cost-quality frontier requires knowing the cost.
- Reproducibility. Same eval should produce comparable results when re-run; not perfectly deterministic but tightly bounded.
Most public agent benchmarks fail at least three of these. The internal harnesses at frontier labs hit all six. The gap is engineering investment, not novel research.
Evaluating safety and alignment
Distinct evaluation track focused on undesirable behavior.
Categories
- Harmful content generation: jailbreaks, bypasses.
- Hallucination: fabricated facts presented confidently.
- Bias: differential treatment across demographic groups.
- Persuasion / manipulation: undue influence on user beliefs.
- Capability disclosure: revealing things the model shouldn't (sometimes called "dual-use eval").
- Sandbagging: deliberately underperforming on certain tasks.
Approaches
- Static red-team datasets: known-bad prompts; check refusal rate.
- Adversarial generation: another model attempts to elicit bad behavior.
- Behavioral consistency checks: model's behavior across rephrasings or jailbreak attempts.
- Calibration evaluation: does the model's confidence track its accuracy?
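For the calibration check, one common summary is expected calibration error (ECE). The sketch below uses equal-width bins, which is one of several reasonable binning choices, and assumes you can obtain a confidence in [0, 1] for each answer.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(o for _, o in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - avg_conf)
    return ece
```

A model that says 0.9 and is right a quarter of the time lands at an ECE of 0.65 on those items; a model whose stated confidence matches its hit rate lands near zero.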
Pre-deployment gates
Many production deployments treat safety evaluations as hard gates: a new model can't ship without meeting thresholds. This is more disciplined than aggregate capability evals.
Public safety benchmarks worth tracking
- HarmBench (Mazeika et al., 2024): standardized red-teaming benchmark with multiple attack types.
- AILuminate (MLCommons): industry-consortium benchmark covering hazardous categories.
- AdvBench: adversarial prompt set.
- WMDP (Li et al., 2024): weapons-of-mass-destruction proxy benchmark for capability disclosure.
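- PromptInject (Perez and Ribeiro, 2022): prompt-injection attack framework. Meta's Prompt Guard, often listed alongside it, is an injection classifier rather than a benchmark.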
For production gating, these are necessary but not sufficient. Internal red-team panels supplement them with attacks that haven't yet leaked into training data. The discipline of treating safety as a hard gate (not a "nice to have") is what separates serious deployments from optimistic ones. See production safety guardrails for the runtime defenses that complement eval-time safety gates.
Statistical practice
A benchmark score has uncertainty. Treating point estimates as exact is a common error.
Things to do
Run multiple seeds. Sample many times, report distributions, not points.
Report confidence intervals. A 78.2% accuracy with 95% CI of [76.4, 79.9] is more informative than "78.2%."
Bootstrap or permutation tests for comparisons. Is the difference between two models statistically significant given the sample size?
Power analysis. How large does your eval set need to be to detect the smallest difference you care about?
Things to avoid
- Reporting "Model A is better by 0.5 points" when the run-to-run standard deviation of a single model is 1.5 points.
- Comparing models tested with different protocols.
- Choosing the seed that makes your favorite model look best.
The number that matters
For decisions, what matters is whether the difference is meaningful, not whether it exists. A statistically-significant 0.2-point gain is irrelevant; a 10-point gain on an adequately sized eval set is decisive even without formal analysis.
Sample-size math worth memorizing
For a binary score on N items, the standard error of the mean is roughly sqrt(p(1-p)/N). At p=0.5 and N=100, SE is 0.05, so a single eval point is ±5 points wide at one sigma. At N=400, ±2.5 points. At N=1600, ±1.25 points. Detecting a 2-point regression with statistical confidence therefore takes thousands of items in an unpaired comparison, not hundreds; at N=400 a 2-point move is still inside one standard error.
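The arithmetic is easy to keep in a helper. The sample-size function is a rough normal-approximation sketch for comparing against a known baseline rate; the z-values (1.96 for a two-sided 5% test, 0.84 for about 80% power) are conventional assumptions, not something the harness dictates.

```python
import math

def standard_error(p, n):
    """Standard error of a mean of n binary scores with success rate p."""
    return math.sqrt(p * (1 - p) / n)

def items_needed(delta, p=0.5, z_alpha=1.96, z_power=0.84):
    """Approximate N to detect a shift of `delta` from a known baseline rate p
    (one-sample, two-sided 5% test, ~80% power, normal approximation)."""
    return math.ceil(((z_alpha + z_power) ** 2) * p * (1 - p) / delta ** 2)

se_100 = standard_error(0.5, 100)    # 5 points
se_1600 = standard_error(0.5, 1600)  # 1.25 points
n_for_2_points = items_needed(0.02)  # roughly 4,900 items
```

Under these assumptions an unpaired 2-point detection needs on the order of thousands of items. Comparing two independently sampled runs roughly doubles that, while a paired design on the same items usually needs far fewer.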
For paired comparisons (the same items run on two models), the McNemar test or paired bootstrap gives tighter intervals because the per-item difficulty cancels. Paired evaluation is the right default; running unpaired evaluations and comparing them is a common error.
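A minimal exact McNemar test needs nothing beyond the standard library. It assumes both models were scored 0/1 on the same items in the same order.

```python
from math import comb

def mcnemar_exact(baseline_correct, candidate_correct):
    """Two-sided exact McNemar test on paired 0/1 scores.
    Only discordant items (one model right, the other wrong) carry information."""
    pairs = list(zip(baseline_correct, candidate_correct))
    b = sum(1 for x, y in pairs if x and not y)  # baseline right, candidate wrong
    c = sum(1 for x, y in pairs if y and not x)  # candidate right, baseline wrong
    n = b + c
    if n == 0:
        return 1.0  # the models agree on every item
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Items both models get right, or both get wrong, drop out entirely, which is the sense in which per-item difficulty cancels.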
Bootstrap in three lines
For a list of per-item scores xs, the bootstrap confidence interval is: sample N items with replacement, compute the mean, repeat 10,000 times, take the 2.5th and 97.5th percentiles. Cheap, generic, no distributional assumptions. Any harness that doesn't emit a bootstrap CI is undersupplying signal.
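Written out, the procedure looks like the sketch below. It is a plain percentile bootstrap using only the standard library; the fixed seed is an assumption made for reproducibility.

```python
import random

def bootstrap_ci(xs, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-item scores."""
    rng = random.Random(seed)
    n = len(xs)
    # Resample n items with replacement, record the mean, repeat.
    means = sorted(sum(rng.choices(xs, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

For a paired comparison, pass the per-item score differences (candidate minus baseline) as `xs`; an interval that excludes zero indicates a difference beyond resampling noise.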
Multiple comparisons
Running 50 stratified evals on a new model will produce a few that "look worse" by chance even if the model is identical to baseline. Apply Bonferroni or Benjamini-Hochberg corrections before flagging regressions in a multi-stratum dashboard. Most internal harnesses don't, and most "regressions" reported in that context are noise.
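A sketch of the Benjamini-Hochberg step-up procedure, assuming each stratum already has a p-value from a paired test against baseline:

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return booleans, True where the stratum is flagged at the given false-discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank whose p-value sits under its step-up threshold.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    flagged = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            flagged[idx] = True
    return flagged
```

Bonferroni is the stricter alternative: flag only strata with p at or below `fdr / m`. Benjamini-Hochberg gives up some of that strictness in exchange for more power when many strata are monitored.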
Trace replay: the workflow that scales
The single highest-leverage eval workflow that mature production teams have converged on is trace replay. Worth its own section because the abstract sounds simple ("re-run captured traces against a new model") and the implementation has subtleties that matter.
What trace replay is
Capture every request → response from production (or a representative slice of it) into a trace store. Each trace records: input prompt, system prompt, tool list, model used, exact decoding parameters, all turns, all tool calls, all tool responses, final output. When you have a candidate new model, replay each trace through the new model with the same protocol and compare outputs.
What it tells you
Replay lets you answer the most important pre-deployment question: "would this candidate model behave better, worse, or differently on what our users actually do?" — without needing a workload eval that's already curated. The trace store is the workload eval.
Where it gets subtle
- Tool outputs are non-deterministic. Search returns change, APIs return different responses. Replay can either re-execute tools (which gives different results) or replay the captured tool outputs (which doesn't reflect what the new model would actually do). Both options have failure modes.
- Time-dependent prompts. If the system prompt includes a date or current state, replaying months later changes behavior in ways unrelated to the model.
- Counterfactual model trajectories. A new model might take a different tool path than the captured trace. Forcing it down the original path misses what the new model would actually do.
The practical compromise: replay tool outputs (for reproducibility) on the first run, then do a smaller sample of free-running replays (re-executing tools) on representative traces. The first answers "would the new model produce a better answer given the same context?"; the second answers "would the new model's overall behavior be better?"
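The two replay modes can be sketched as one function with a switch. Everything here is illustrative: the `Trace` fields, the action dictionary format, and the stub model are hypothetical stand-ins for whatever your trace store and model client provide.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    prompt: str
    tool_outputs: dict  # captured in production: tool-call string -> tool output
    final_output: str

def replay(trace, model, tools=None, max_steps=8):
    """Replay one trace. tools=None serves captured tool outputs (frozen mode);
    passing a callable re-executes tools (free-running mode)."""
    context = [trace.prompt]
    for _ in range(max_steps):
        action = model(context)
        if action["type"] == "final":
            return {"status": "ok", "output": action["text"]}
        call = action["call"]
        if tools is not None:
            result = tools(call)
        elif call in trace.tool_outputs:
            result = trace.tool_outputs[call]
        else:
            # The new model took a tool path the capture never saw.
            return {"status": "diverged", "output": None}
        context.append(result)
    return {"status": "step_limit", "output": None}

def stub_model(context):
    """Hypothetical model: search once, then answer from the last observation."""
    if len(context) == 1:
        return {"type": "tool", "call": "search('refund policy')"}
    return {"type": "final", "text": f"Answer based on: {context[-1]}"}

trace = Trace("What is the refund policy?",
              {"search('refund policy')": "30 days"},
              "Refunds within 30 days")
frozen = replay(trace, stub_model)
free = replay(trace, stub_model, tools=lambda call: "45 days")
```

Counting `diverged` results in frozen mode is itself useful: a high divergence rate suggests the free-running sample should be larger.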
Diffing replay outputs
Diff-style comparison between old-model and new-model outputs is more useful than aggregate scoring. The eval interface should let a reviewer see, item-by-item: input, old output, new output, diff, score-delta, with the ability to flag regressions and improvements. Most engineering effort goes into making this interface actually usable; without it, replay results pile up and don't inform decisions.
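The per-item row behind such an interface can be built with Python's `difflib`. The row fields below are hypothetical; a real reviewer UI would add identifiers and links back to the trace store.

```python
import difflib

def review_row(item_id, prompt, old_output, new_output, old_score, new_score):
    """Build one reviewer row: unified diff of the outputs plus the score delta."""
    diff = "\n".join(difflib.unified_diff(
        old_output.splitlines(), new_output.splitlines(),
        fromfile="old_model", tofile="new_model", lineterm=""))
    delta = new_score - old_score
    if delta < 0:
        flag = "regression"
    elif delta > 0:
        flag = "improvement"
    else:
        flag = "unchanged"
    return {"id": item_id, "prompt": prompt, "diff": diff,
            "score_delta": delta, "flag": flag}
```

Sorting rows by `score_delta` puts the worst regressions at the top of the review queue.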
Trace privacy and sampling
Production traces contain user data. Replay infrastructure must handle PII, access controls, retention. Practical pattern: sample 1-5% of production traffic with explicit user consent (or after careful legal review of your terms), aggressively redact PII before storage, and limit replay access to the eval team. See agent serving infrastructure for the trace-storage side of this.
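A minimal sketch of deterministic sampling plus redaction follows. The email pattern is deliberately naive and only illustrates the shape of the step; production redaction needs a dedicated PII detector and legal review, and the 2% default rate is an arbitrary value inside the range above.

```python
import hashlib
import re

# Naive email pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def sampled(trace_id, rate=0.02):
    """Deterministic sampling: the same trace ID always gets the same decision."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform in [0, 1)
    return bucket < rate

def redact(text):
    """Replace email addresses before the trace is written to storage."""
    return EMAIL.sub("[EMAIL]", text)
```

Hash-based sampling keeps all turns of a conversation together if you key it on the conversation ID rather than the request ID.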
Continuous evaluation in production
A model that worked yesterday can subtly drift today, especially for agent systems and hosted-model deployments where the model itself changes underneath you.
What to monitor
- Quality regression: per-task scores on your workload eval.
- Latency regression: TTFT, ITL, end-to-end task latency.
- Refusal rate: rate at which the model declines to answer.
- Error rate: structured-output failures, tool-call failures.
- User-feedback signals: thumbs, retries, abandoned conversations.
Cadence
- Continuous (on every request, sampled): latency, error rates.
- Periodic (daily / weekly): quality evals on workload sample.
- On model-version change: full re-evaluation.
Alerting
Set thresholds. A 5-point regression on a key task should page someone, not silently accumulate.
Open problems
Contamination at scale. As models train on more of the web, contamination becomes harder to avoid. Held-out datasets are temporary solutions.
Long-form evaluation. Beyond model-graded with all its biases, what's a robust approach to evaluating extended writing or reasoning? Open.
Agentic evaluation reproducibility. Environments drift, software updates break evals. Maintenance is unsolved.
Evaluation of new capabilities. When models gain genuinely new skills, what's the eval? By definition there's no benchmark. Reactive: new benchmarks emerge, get gamed, get replaced.
Predicting deployment performance from benchmarks. Currently weak. The correlation between aggregate benchmark scores and user satisfaction is real but loose.
Eval cost. Comprehensive evaluation of a new frontier model can cost $100k+ in API calls and human evaluation time. Reducing this without losing rigor is open.
Eval of multi-turn / agent behavior at scale. A single agent run is expensive to evaluate; a benchmark of them is multiplied. Trace replay against reference outcomes is the dominant approach, but reference outcomes drift as the underlying systems change. See agent serving infrastructure.
Eval of reasoning quality vs answer quality. As reasoning models become standard, evaluating the trace matters as much as the answer. Outcome-only scoring misses wrong-reasoning-right-answer; process scoring is expensive and noisy. The right cost-quality tradeoff is open.
Cross-modal evaluation. Eval methodology for image, audio, video, and embodied tasks is far less mature than for text. Most published benchmarks in these modalities are early-generation, with the protocol-sensitivity and contamination issues text benchmarks had in 2022.
LLM-as-judge: when it works, when it breaks, how to calibrate
LLM-as-judge is the most cost-effective scoring method for open-ended outputs and the source of the most subtle bugs in modern eval harnesses. Worth a deep treatment because nearly every internal harness uses it for at least some strata.
What "calibration" actually means here
A judge model is calibrated for your harness if its scores correlate well with human scores on a held-out sample. The right test: take 100 items, have both the judge and a panel of humans rate them, and compute the Spearman correlation. Rough bands: below 0.5, the judge is largely noise; between 0.5 and 0.7, useful but treat scores as approximate; ρ ≥ 0.7 is acceptable for production; above 0.8, reliable for fine-grained comparison. Most teams who deploy LLM-as-judge never run this test and discover too late that their judge is barely better than random on their domain.
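The calibration test described above comes down to one statistic. A dependency-free sketch, using average ranks for ties (it will divide by zero if either list is constant):

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Call it as `spearman(judge_scores, human_scores)` on the calibration sample. If SciPy is already a dependency, `scipy.stats.spearmanr` computes the same statistic.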
Known biases and how to defeat each
| Bias | What it looks like | Fix |
|---|---|---|
| Position bias | Judge prefers option A in pairwise comparisons | Randomize order; run each pair twice with swapped positions; average |
| Length bias | Longer answers rated higher | Length-controlled scoring; explicit rubric criterion penalizing padding |
| Self-preference | Judge prefers outputs in its own style | Use a different model family as judge; ensemble multiple judges |
| Verbosity bias | Confident-sounding wrong answers rated higher | Rubric criterion for hedging; cross-check vs ground truth where possible |
| Format bias | Markdown / bullet-pointed answers preferred | Strip formatting before judging, or score format separately |
| Recency bias (in long judge prompts) | Last option rated higher | Same as position bias; randomize |
Length-controlled AlpacaEval (Dubois et al., 2024) makes the length-bias correction explicit; the same idea applies to internal harnesses.
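The position-swap fix from the table can be sketched as follows. The `judge` callable is a hypothetical wrapper around your judge model that returns "A" or "B"; this variant records an order-dependent verdict as a tie, which is one way to combine the two runs.

```python
def debiased_preference(judge, prompt, answer_1, answer_2):
    """Judge the pair in both orders. Returns 1 or 2 for a consistent winner,
    0 when the verdict flips with position."""
    first = judge(prompt, answer_1, answer_2)   # answer_1 shown in position A
    second = judge(prompt, answer_2, answer_1)  # positions swapped
    wins_1 = (first == "A") + (second == "B")
    wins_2 = (first == "B") + (second == "A")
    if wins_1 == 2:
        return 1
    if wins_2 == 2:
        return 2
    return 0

# Hypothetical stub judges used to show the behavior.
always_first = lambda prompt, a, b: "A"
longer_wins = lambda prompt, a, b: "A" if len(a) >= len(b) else "B"
```

A judge that always picks the first position produces only ties under this scheme, so its bias stops moving win rates. A length-biased judge still picks the longer answer in both orders, which is why length needs its own control.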
Multi-judge ensembles
A single judge can be biased; an ensemble of judges from different model families reduces this. Three-judge ensembles (e.g., Claude + GPT + Gemini) can approach human-level inter-rater agreement on many rubric-based tasks, though this should be verified against human ratings on your own data. The cost is 3× the judging cost, which is usually still cheaper than human evaluation. For high-stakes evaluations (model selection, safety gating), the ensemble is worth it.
When LLM-as-judge fails
Cases where LLM-as-judge is unreliable even after calibration:
- Domain expertise. Judging medical or legal correctness requires expertise the judge doesn't have. Use human SMEs.
- Novel domains. If the task is something no public model has seen much of, judge accuracy degrades sharply.
- Adversarial outputs. Outputs designed to fool the judge (long, confident, well-formatted nonsense) reliably succeed. Adversarial calibration helps but doesn't fully fix this.
- Subtle factual errors. Hallucinated facts presented confidently are exactly what judges are bad at catching.
The honest rule: LLM-as-judge for style and rubric adherence, ground truth / execution / human review for correctness. Mixing the two roles is where harness bugs come from.
Cost-and-throughput math for an eval harness
Eval costs are predictable if you do the math; surprising if you don't. The dimensions:
Cost per item
For a typical workload eval item:
- Model inference: one generation call, ~$0.001-0.01 depending on model size and tokens.
- Judge call (if using LLM-as-judge): another inference, similar cost.
- Trace storage: <$0.001 per item.
Per-item all-in: $0.002-0.02 for a typical setup.
Cost per release gate
A serious release gate covers 1000-5000 items across strata. At $0.01 per item with judge: $10-50 per release. Add multi-seed re-runs: 3-5× that (bootstrap intervals are resampled from already-scored items and add no inference cost). Per release: $30-250.
Annualized
Releases at ~weekly cadence: 50 releases/year × $200 = $10k/year on the gate alone. Add nightly regression runs (5000 items × 365 nights × $0.01) = ~$18k/year. Plus the engineering time to maintain it: $200k+ for a senior infrastructure engineer at 50% allocation.
Total annualized eval infrastructure cost for a serious deployment: $250k-$500k. The first model-selection decision it informs (picking the right base model, catching a bad fine-tune, gating a regression) typically saves 10-100× that.
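The annualized arithmetic is worth keeping as a small function so the inputs can be changed. The defaults below are the figures from this section, with the engineering line at its stated lower bound.

```python
def annual_eval_cost(releases_per_year=50, cost_per_release=200,
                     nightly_items=5000, cost_per_item=0.01,
                     engineering_cost=200_000):
    """Annual eval spend in dollars, split by line item."""
    gate = releases_per_year * cost_per_release
    nightly = nightly_items * 365 * cost_per_item
    return {"gate": gate, "nightly": nightly, "engineering": engineering_cost,
            "total": gate + nightly + engineering_cost}

costs = annual_eval_cost()
```

With these inputs the total is about $228k, and engineering time accounts for nearly all of it. API spend is the small term, so the overall range is driven mainly by staffing.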
Where the costs explode
- Agent evals. Each item is a full agent run with multi-turn tool use; per-item cost is $0.10-$1.00. Agent eval suites with 1000 items cost $100-$1000 per run.
- Long-context evals. On per-token pricing, a 128k-token input costs roughly 128× a 1k-token input. Long-context strata blow the budget if not sized carefully.
- Pass@k with k=10+. k× the inference cost. Common to sample at k=10 for code evals.
Plan budgets per-stratum, not in aggregate.
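A per-stratum budget can be estimated before anything runs. The prices below are placeholders, not any vendor's rates, and the sketch assumes each pass@k sample is billed as an independent call including its input tokens.

```python
def stratum_cost(n_items, input_tokens, output_tokens, samples=1,
                 price_in_per_mtok=3.0, price_out_per_mtok=15.0):
    """Estimated dollars for one run of a stratum under per-token pricing."""
    per_call = (input_tokens * price_in_per_mtok
                + output_tokens * price_out_per_mtok) / 1e6
    return n_items * samples * per_call

short_context = stratum_cost(40, 1_000, 0)
long_context = stratum_cost(40, 128_000, 0)
pass_at_10 = stratum_cost(200, 2_000, 500, samples=10)
```

Summing `stratum_cost` over the strata table gives the per-run budget and shows immediately which stratum dominates it.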
Eval CI/CD: gating model releases on the harness
The point of an eval harness is to make ship/no-ship decisions reliable. The CI/CD integration is what turns a research artifact into a release gate.
What a working gate looks like
- Trigger: every candidate model (new fine-tune, new base model, new prompt change) runs the full eval harness automatically.
- Strata-aware thresholds: each stratum has a defined limit (e.g., "refusal rate must not exceed 5%", "structured output validity must be ≥98%"). Crossing the limit is a hard stop.
- Confidence-interval-aware comparisons: the harness compares each stratum against the currently-shipped baseline using paired bootstrap CIs. Differences that don't reach significance don't trigger alarms.
- Human review for borderline cases: when a stratum regresses within CI noise, escalate to a human reviewer rather than failing the build. Saves false-positive rejections.
- Audit trail: every gate decision logs the exact harness version, model version, item set, and per-item outputs. Six months later, "why did we ship this version?" must be answerable.
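The decision logic of such a gate can be sketched compactly. The result fields are hypothetical, `delta_ci` is assumed to be a paired-bootstrap interval for candidate minus baseline, and blocking on a significant regression above the floor is a policy assumption rather than a requirement stated above.

```python
def gate(stratum_results, floors):
    """stratum_results: name -> {"score", "delta", "delta_ci": (lo, hi)}.
    Returns (overall decision, per-stratum decisions)."""
    decisions = {}
    for name, res in stratum_results.items():
        lo, hi = res["delta_ci"]
        if res["score"] < floors.get(name, float("-inf")):
            decisions[name] = "block"    # below the hard floor
        elif hi < 0:
            decisions[name] = "block"    # whole interval below zero
        elif res["delta"] < 0:
            decisions[name] = "review"   # regression within CI noise: human review
        else:
            decisions[name] = "pass"
    if "block" in decisions.values():
        overall = "block"
    elif "review" in decisions.values():
        overall = "review"
    else:
        overall = "pass"
    return overall, decisions

results = {
    "structured_output": {"score": 0.97, "delta": -0.01, "delta_ci": (-0.03, 0.01)},
    "factual_recall": {"score": 0.91, "delta": 0.02, "delta_ci": (0.00, 0.04)},
}
strict = gate(results, {"structured_output": 0.98})
lenient = gate(results, {"structured_output": 0.95})
```

Logging the returned per-stratum dictionary alongside harness and model versions covers most of what the audit trail needs.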
Common anti-patterns
- Aggregate-score-only gates. Ship-blocking based on a single mean score misses stratum regressions that matter.
- Gates without CIs. Releasing on a 0.3-point improvement when the SE is 1.5 points is noise-driven.
- Gates that always pass. If the harness never blocks a release, it's not gating anything — investigate whether the thresholds are too loose or the harness too coarse.
- Gates without rollback. A failed gate must trigger a clean rollback path, not just an alert.
The relationship to post-training
Eval results that feed back into training are the highest-value harness output. A workload-eval item that the candidate model fails becomes a training data point for the next round of RLHF/DPO/SFT. Closing this loop is what separates "we have evals" from "our evals make our model better."
Eval stack tour: lm-eval-harness, HELM, OpenAI Evals, Inspect, DeepEval, Promptfoo
A practitioner's tour of the eval libraries you'll encounter, what each one optimises for, and when to use which.
lm-evaluation-harness (EleutherAI)
github.com/EleutherAI/lm-evaluation-harness. The de facto standard for academic and open-weight model eval. Implements 60+ tasks including MMLU, HellaSwag, ARC, TriviaQA, BoolQ, GSM8K, HumanEval, and more. Supports HuggingFace transformers, vLLM, OpenAI API, Anthropic API, Cohere, and others. Used to publish numbers for nearly every open-weight model release.
Strengths: comprehensive task coverage, mature, easy to add new tasks, reproducible. Weaknesses: not designed for agentic or tool-use eval; LLM-as-judge support is basic; UI is CLI-only.
Use when: comparing open-weight models on standard academic benchmarks; reporting numbers in papers; gating model releases against a reproducible baseline.
HELM (Stanford CRFM)
crfm.stanford.edu/helm. Holistic Evaluation of Language Models. 42+ scenarios × 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency). Each model is reported as a profile across all metrics, not a single score.
Strengths: holistic, principled multi-metric reporting, public leaderboard. Weaknesses: dated benchmark mix (some saturated), expensive to run full HELM, less commonly cited in 2025–2026.
Use when: comparing models on multiple dimensions, particularly safety/fairness.
OpenAI Evals
github.com/openai/evals. Eval framework released by OpenAI. Supports custom evals via Python or YAML; built-in mechanisms for matching, choice, includes, model-graded. Hundreds of community-contributed evals.
Strengths: extensible, OpenAI-stack native, model-graded eval support, large community library. Weaknesses: less popular than lm-eval-harness for academic work; some abandonment risk as OpenAI shifts focus.
Use when: building custom evals in the OpenAI ecosystem; tapping the community eval library.
Inspect (UK AISI)
inspect.ai-safety-institute.org.uk. The UK AI Safety Institute's eval framework. Designed specifically for capability and safety evaluation including agentic tasks. Supports complex multi-turn evaluations, tool use, sandboxed agent execution.
Strengths: serious agentic eval support, sandboxed tool execution, AISI-grade reproducibility, growing ecosystem. Weaknesses: newer (2024+); smaller community than lm-eval-harness.
Use when: agentic evaluation; safety/capability evaluation; tool-use eval.
DeepEval
github.com/confident-ai/deepeval. Python framework for unit-testing LLM applications. Pytest-compatible; metrics for hallucination, faithfulness, contextual relevance, RAG-specific evaluations, custom metrics.
Strengths: pytest integration, RAG-focused metrics, dev-loop friendly. Weaknesses: smaller than the LangChain or LlamaIndex ecosystems; metric quality depends on the LLM-as-judge it uses underneath.
Use when: integrating eval into a Python application's test suite; RAG-specific evaluation.
Promptfoo
promptfoo.dev. YAML-config CLI for comparing prompts and models. Run a test suite across prompt variants and model backends; produce side-by-side comparison reports.
Strengths: fast feedback loop, prompt-engineering ergonomic, clean comparison UI, OSS. Weaknesses: less suited to large-scale eval campaigns; metric library smaller than DeepEval.
Use when: A/B testing prompts; comparing model variants on a small eval set during development.
Comparison
| Tool | Sweet spot | Tasks supported | Agentic | LLM-as-judge | OSS |
|---|---|---|---|---|---|
| lm-eval-harness | Academic / open-weight comparison | 60+ standard | Limited | Basic | Yes |
| HELM | Holistic multi-metric | 42 scenarios | No | Limited | Yes |
| OpenAI Evals | Custom evals in OpenAI ecosystem | Custom | Limited | Yes | Yes |
| Inspect (AISI) | Agentic / safety eval | Custom + library | Yes (strong) | Yes | Yes |
| DeepEval | App-level + RAG metrics | RAG / app-level | No | Yes | Yes |
| Promptfoo | Prompt A/B testing | Custom | No | Yes | Yes |
Eval platforms: LangSmith, Braintrust, Confident AI, Galileo, Patronus, Vellum
Commercial / managed eval platforms with broader product features (trace storage, dashboards, dataset management, monitoring).
LangSmith (LangChain)
LangSmith (langchain.com/langsmith) is LangChain's commercial trace + eval platform. Strengths: tight integration with LangChain apps, trace storage, dataset versioning, eval dashboards, A/B test routing. Pricing tiered; free tier for hobbyists, $39/user/mo developer tier, enterprise custom.
Braintrust
braintrust.dev. Eval-first platform — datasets, experiments, LLM-as-judge with scorer composition, side-by-side diff. Strong UX for prompt-engineering iteration. Pricing: free tier; team plans starting $200/mo.
Confident AI
confident-ai.com. Built around DeepEval; adds dataset management, regression dashboards, online eval. Use when you want DeepEval + a hosted UI.
Galileo
galileo.ai. Hallucination detection, RAG-specific evaluation, ML observability. Enterprise-focused; positions as ML observability + LLM eval combined.
Patronus AI
patronus.ai. Specialist LLM evaluator + safety platform. Lynx (hallucination), Polyguard (safety), eval-as-a-service for regulated industries. Enterprise pricing.
Vellum
vellum.ai. Prompt management, A/B testing, eval. Mid-market dev tooling.
When to buy vs build
- Buy when you want fast time-to-prod, lack ML-platform expertise, value managed dashboards, need vendor compliance docs (SOC 2, HIPAA).
- Build (with OSS lm-eval-harness + Inspect + DeepEval + your own datastore) when you have the engineering bandwidth, want full control over eval logic, or need fine-grained access to traces and data not exposed by managed platforms.
Most production teams in 2026 run a hybrid: OSS libraries for the eval logic + a commercial platform for storage/dashboards.
Benchmark deep dive: MMLU-Pro through MIXEVAL
The benchmark landscape in 2026, organised by what each actually measures.
General knowledge / multiple choice
- MMLU (Hendrycks, 2020) — 57-subject 4-choice. Saturated. May 2026 SOTA ~92%.
- MMLU-Pro (TIGER-Lab, 2024) — harder MMLU successor; 10-choice; saturating. SOTA ~85%.
- MMLU-Redux (2024) — relabeled MMLU subset removing ambiguous questions.
- BIG-Bench, BBH (Google, 2022) — diverse hard tasks. Saturating but still useful for relative ranking.
- TruthfulQA (Lin, 2021) — adversarial questions designed to elicit confident wrong answers. Still informative.
Math
- GSM8K (Cobbe, 2021) — grade-school math word problems. Saturated.
- MATH-500 (Hendrycks) — competition math. Saturated by reasoning models.
- AIME 2024 / 2025 — high-school competition. The 2024 split is partially saturated; 2025 still useful.
- FrontierMath (Epoch, 2024) — research-mathematician-level problems. Frontier 2026 benchmark.
- MathArena (2025) — live leaderboard.
Science
- GPQA / GPQA Diamond (Rein, 2023) — PhD-level questions. Diamond split is the hardest; saturating.
- WMDP (Li, 2024) — proxy for dangerous bio/chem/cyber knowledge. Safety-focused, not capability ranking.
- OlympiadBench (2024) — physics, chem, math olympiad problems.
Code
- HumanEval (Chen, 2021) — function-completion. Saturated.
- HumanEval+ (EvalPlus, 2023) — adds test cases. Better; saturating.
- MBPP / MBPP+ — basic Python programs. Saturating.
- LiveCodeBench (UCB, 2024) — recent LeetCode-style problems posted after model cutoffs. Contamination-resistant. Refreshed quarterly. Frontier benchmark in 2026.
- SWE-Bench / SWE-Bench Verified (Princeton, 2023) — real GitHub issues. Verified split is the curated 500. Frontier agentic-coding benchmark.
- SWE-Bench Multi (2025) — multi-turn / multi-step variants.
- Aider Benchmark — refactoring tasks; programming assistant eval.
Agent / reasoning
- GAIA (Meta, 2023) — 466 general-assistant tasks requiring web search, file processing, tool use. Agentic benchmark.
- WebArena, VisualWebArena — browser agent eval.
- BrowseComp (OpenAI, 2025) — web-browsing tasks that require locating hard-to-find information across many pages.
- AgentBench — multi-domain agent eval.
- ARC-AGI v1 / v2 (Chollet) — abstract reasoning, visual patterns. v2 is the active frontier.
RAG / faithfulness
- RAGAS (Es, 2023) — RAG eval framework with faithfulness, answer relevance, context relevance metrics.
- HaluBench (Patronus) — hallucination detection eval.
- NaturalQuestions — retrieval QA.
- HotpotQA — multi-hop reasoning QA.
Chat / human preference
- MT-Bench (Zheng, 2023) — multi-turn benchmark with GPT-4 judge. Becoming dated.
- AlpacaEval 2 (Dubois, 2024) — pairwise preference with length-controlled win rate.
- Arena-Hard (LMSYS, 2024) — Chatbot Arena's hard split with auto-eval.
- Chatbot Arena (LMSYS) — live human-preference leaderboard. Most cited single chat benchmark.
Instruction following
- IFEval (Zhou, 2023) — instruction-following metrics on structured constraints.
- MIXEVAL / MIXEVAL-Hard (2024) — mix of public benchmarks weighted by Chatbot Arena correlation.
Multimodal
- MMMU — multi-domain multimodal benchmark.
- MathVista — visual math reasoning.
- MM-SafetyBench — multimodal safety.
State-of-the-art summary, May 2026
| Benchmark | Saturated | Best score (frontier) | Status |
|---|---|---|---|
| MMLU | Yes | 92% | Reference only |
| MMLU-Pro | Approaching | 85% | Still useful |
| GSM8K | Yes | 97% | Reference only |
| MATH-500 | Yes | 97% | Reference only |
| AIME 2024 | Saturating | 97% | Variance high |
| AIME 2025 | No | 97% | Frontier (reasoning) |
| GPQA Diamond | Approaching | 88% | Above human expert |
| FrontierMath | No | 36% | Active frontier |
| LiveCodeBench Q4 2025 | No | 70% | Active frontier |
| SWE-Bench Verified | No | 75% | Active frontier (agentic) |
| ARC-AGI v2 | No | 42% | Active frontier |
| BrowseComp | No | 60% | Active frontier (agent) |
| Chatbot Arena | n/a | 1450 Elo | Live |
Trace-replay infrastructure for production debug
A trace is the full record of a single LLM invocation: inputs, retrieved context, model parameters, full output (including thinking tokens if applicable), tool calls and results, latency breakdown, cost. Production AI systems generate millions of traces; the infrastructure to capture, store, search, and replay them is critical.
What a useful trace contains
- Request inputs. Full prompt (system, user, history). Model identifier and version. Sampling parameters (temperature, top-p, max_tokens, reasoning_effort). Auth context (user ID, tenant).
- Retrieved context. For RAG, the retrieval query, the retrieved documents with similarity scores, the chunking parameters.
- Outputs. Full response text. Thinking tokens (if visible). Logprobs (if available). Stop reason. Refusal flag.
- Tool calls. Tool name, arguments, results. Per-call latency.
- Metadata. Latency breakdown (queue, prefill, decode, tools). Token counts. Estimated cost. Cache hit/miss.
- User feedback. Thumbs up/down, edit, regenerate flags.
Storage architecture
- Hot. Last 7–30 days of full traces in object storage (S3, GCS), indexed by trace ID, user ID, conversation ID. Queryable via OpenSearch / Elasticsearch.
- Warm. 90–365 days compressed traces.
- Cold. 1+ year for compliance.
Volume: a moderate-traffic product (100k traces/day, ~10 KB/trace) generates ~1 GB/day, ~365 GB/year. Cheap.
Search and querying
UI must support:
- Filter by user, tenant, model, date range, refusal flag, error flag.
- Free-text search of prompts and responses.
- Token usage / cost analysis.
- Latency outliers.
- Quality flags (thumbs down, regenerate, abandoned).
Replay
A "replay" runs a trace through a different model, prompt variant, or guardrail configuration and compares outputs. Used for:
- Migration testing: when switching from GPT-4o to Claude Sonnet, replay 1000 production traces through both; compare outputs; identify regressions.
- Prompt iteration: replay traces with new system prompt; verify no regression.
- Eval set construction: identify traces with thumbs-down feedback; add to eval set.
Tooling: LangSmith, Braintrust, and Helicone support managed replay. Self-hosted approaches combine OSS trace storage (e.g., Phoenix from Arize) with custom replay scripts.
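A custom replay script can be very small. The sketch below assumes each stored trace is a dict with hypothetical `trace_id` and `prompt` keys, and that `baseline_fn` and `candidate_fn` are callables you supply that wrap the two model or prompt configurations. The default comparison is exact match after stripping whitespace; in practice you would likely pass a judge or a task-specific comparator instead.

```python
def replay(traces, baseline_fn, candidate_fn,
           same=lambda a, b: a.strip() == b.strip()):
    """Run each stored prompt through both configurations; return disagreements."""
    diffs = []
    for trace in traces:
        base_out = baseline_fn(trace["prompt"])
        cand_out = candidate_fn(trace["prompt"])
        if not same(base_out, cand_out):
            diffs.append({
                "trace_id": trace["trace_id"],
                "baseline": base_out,
                "candidate": cand_out,
            })
    return diffs
```

The returned list is what a reviewer reads: only the traces where behavior changed, with both outputs side by side.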
Pitfalls
- PII in traces. Treat trace storage as production data; encrypt; access-control; redact PII per GDPR/HIPAA requirements.
- Trace volume. At 1M+ traces/day, full storage gets expensive. Sample or aggregate older traces.
- Schema evolution. Trace formats change as model APIs evolve; design for forward/backward compatibility.
Regression eval CI/CD: gating policies, threshold setting
Eval becomes infrastructure when it gates model and prompt changes from reaching production.
The CI/CD pattern
- Developer changes a prompt, swaps a model, updates a tool. Opens PR.
- CI runs the eval suite against the change.
- Eval reports pass/fail per category against thresholds defined in policy.
- PR blocked if regressions exceed tolerance.
- Manual override available with sign-off.
Threshold policy
Per-metric thresholds determined by:
- Baseline. Current production performance.
- Tolerance. How much regression you'll accept (typically 1–3%).
- Statistical significance. Don't gate on noise — require N samples where the change is statistically meaningful.
Example policy:
- MMLU-Pro accuracy: must be ≥ baseline − 1.5pp.
- Latency p99: must be ≤ baseline × 1.10.
- Safety refusal rate: must be ≥ baseline (no decrease).
- Over-refusal rate: must be ≤ baseline × 1.20.
- Custom domain eval: must be ≥ baseline − 2pp.
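The example policy translates directly into a table of rules and a check function. This is a sketch under assumptions: the metric keys and the `Rule` structure are invented for illustration, accuracies are in percentage points, and rates are fractions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    metric: str
    kind: str     # "min_abs": candidate >= baseline - slack
                  # "max_rel": candidate <= baseline * slack
    slack: float

# Hypothetical metric names mirroring the example policy.
POLICY = [
    Rule("mmlu_pro_acc", "min_abs", 1.5),
    Rule("latency_p99_ms", "max_rel", 1.10),
    Rule("safety_refusal_rate", "min_abs", 0.0),
    Rule("over_refusal_rate", "max_rel", 1.20),
    Rule("domain_eval_acc", "min_abs", 2.0),
]

def gate(baseline, candidate, policy=POLICY):
    """Return the list of metrics that violate the policy (empty means pass)."""
    failures = []
    for rule in policy:
        b, c = baseline[rule.metric], candidate[rule.metric]
        if rule.kind == "min_abs":
            ok = c >= b - rule.slack
        else:
            ok = c <= b * rule.slack
        if not ok:
            failures.append(rule.metric)
    return failures
```

A CI job would call `gate(...)` and fail the build when the returned list is non-empty, printing the offending metrics for the reviewer.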
Flaky test isolation
Eval often has stochasticity: LLM-as-judge variability, sampling randomness, network flake. Track per-test pass rate over time; quarantine flaky tests until stabilised. Run flaky tests with N=10 and require majority pass.
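A minimal sketch of the majority-pass rule and a flakiness flag follows. `run_test` is a hypothetical zero-argument callable returning True on pass, and the 5%/95% flakiness band is an assumption rather than a standard cutoff.

```python
def majority_pass(run_test, n=10):
    """Run a stochastic test n times; pass only if a strict majority pass."""
    passes = sum(1 for _ in range(n) if run_test())
    return passes > n / 2, passes / n

def is_flaky(pass_rate, low=0.05, high=0.95):
    """Flag tests whose historical pass rate is neither reliably 0 nor reliably 1."""
    return low < pass_rate < high
```

Tracking the returned pass rate per test over time gives the quarantine list: anything `is_flaky` flags is pulled from the blocking gate until it stabilises.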
Cost discipline
Running the full eval suite on every PR can cost $50–$500 per run. Budget management:
- Tiered eval. Smoke tests on every PR; full eval on merge to main; comprehensive eval pre-release.
- Stratified sampling. Subsample the eval set for PR runs; full set for release runs.
- Cache eval results. If only the prompt for category X changed, only re-run category X.
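Stratified subsampling for PR runs can be sketched as below. The function name, the `key` field, and the `min_per_stratum` floor are illustrative assumptions; the floor exists so that rare categories are never dropped from the smoke set.

```python
import random
from collections import defaultdict

def stratified_sample(examples, key, fraction, seed=0, min_per_stratum=1):
    """Sample `fraction` of each stratum, keeping at least min_per_stratum items."""
    rng = random.Random(seed)  # fixed seed so PR runs are comparable
    strata = defaultdict(list)
    for ex in examples:
        strata[ex[key]].append(ex)
    sample = []
    for name in sorted(strata):  # sorted for deterministic ordering
        items = strata[name]
        k = min(len(items), max(min_per_stratum, round(len(items) * fraction)))
        sample.extend(rng.sample(items, k))
    return sample
```

Fixing the seed means every PR is scored on the same subset, so run-to-run differences reflect the change under test rather than the sample.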
Tools
- GitHub Actions / GitLab CI / Jenkins as the orchestration layer.
- lm-eval-harness or OpenAI Evals as the eval runtime.
- LangSmith / Braintrust / custom for result storage and threshold checks.
Pitfalls
- Eval drift. The eval suite ages; production traffic shifts; thresholds calcify around old conditions. Refresh quarterly.
- Goodhart on the eval set. Engineers optimise for what's measured; eval becomes the target rather than a proxy. Add new categories regularly.
- Single-metric thinking. A single composite score hides regression in subdomains. Always evaluate per-category; alert on per-category regression.
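The single-metric pitfall is easy to show in code. In this sketch the category names and the 2-point tolerance are assumptions; the point is that the check runs per category and never on the average.

```python
def category_regressions(baseline, candidate, tolerance_pp=2.0):
    """Return {category: delta} for categories that dropped more than tolerance_pp."""
    return {
        cat: candidate[cat] - baseline[cat]
        for cat in baseline
        if candidate[cat] < baseline[cat] - tolerance_pp
    }
```

For example, with baseline scores of 90 and 80 on two categories and candidate scores of 95 and 74, the mean moves by only half a point while the second category has lost six.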
Domain-specific evals: medical, legal, finance, coding, support
Public benchmarks don't cover most real applications. Domain evals are where production quality is actually measured.
Medical
- MedQA (USMLE-style questions) — saturating.
- MedMCQA — Indian medical exam questions.
- PubMedQA — biomedical literature QA.
- EquityMedQA (2024) — health equity / bias eval.
Custom medical evals: physician-rated responses on real clinical questions; HIPAA-compliant trace replay against medical guidelines; HCP-specific Q&A from your tenant.
Legal
- LegalBench (Stanford, 2023) — 162 legal reasoning tasks.
- CUAD — contract understanding.
- CaseHOLD — legal holding prediction.
Custom legal evals: jurisdiction-specific case Q&A; contract clause classification; statute application.
Finance
- FinanceBench — financial Q&A from 10-K filings.
- FinQA, TAT-QA — numerical reasoning on financial tables.
- DocFinQA — long-document financial QA.
Custom finance evals: KYC/AML procedure adherence; risk classification; investment-research factuality.
Coding (beyond HumanEval/SWE-Bench)
- CodeContests — competitive programming.
- APPS — diverse coding problems with difficulty levels.
- DevQA — developer-style Q&A.
- Custom: your codebase-specific completion accuracy; PR review correctness; bug-prediction accuracy.
Customer support
- DialogSum — dialog summarisation.
- MultiWOZ — multi-domain dialog.
- Custom: ticket resolution rate, accuracy of suggested responses, escalation appropriateness, tone evaluation.
General principle
The 80/20 of domain eval: 200 hand-curated golden examples from your real production traffic, scored by your domain experts, refreshed quarterly. This beats any public benchmark for predicting your production quality.
The eval leaderboard meta and Goodhart in practice
Public eval leaderboards have become a marketing surface. Awareness of the dynamics protects against being misled.
Where bias creeps in
- Train-on-test leakage. Models trained on internet text have likely seen public benchmark questions. Contamination is documented at material levels for MMLU, GSM8K, HumanEval.
- Cherry-picked benchmarks. Vendors report on benchmarks where they lead; omit ones where they don't.
- Protocol mismatches. Same benchmark, different prompt template, different N-shot count, different decoding parameters — different numbers.
- Refresh asymmetry. New benchmarks have less training data exposure; established benchmarks are more contaminated. Comparing scores across vintages is misleading.
- Compute asymmetry. Reasoning models report scores at high effort costing $100+ per question; comparing to standard models at standard effort isn't apples-to-apples.
Live arenas: Chatbot Arena
LMSYS Chatbot Arena pairwise-compares models with anonymous human preferences. Strengths: contamination-resistant (live questions), human-judged, large N. Weaknesses: users skew toward certain demographics; questions skew toward chat (not coding, math, reasoning); style preferences contaminate quality preferences.
The May 2026 Chatbot Arena top 5 (Elo): GPT-5 Pro (1469), Claude Opus 4.5 (1453), Gemini 2.5 Pro Deep Think (1448), o4 (1442), Grok 3.5 (1429). The differences (~40 points) are smaller than the noise in single-task benchmarks; Arena reflects "which model do users prefer in chat" which correlates with but isn't identical to "which model is best at task X."
Goodhart in action
Once a benchmark becomes the metric, it stops being a useful measure. Examples:
- HumanEval — saturated; everyone reports 95%+; new models train on similar patterns; differentiation lost.
- MMLU — same trajectory; 90%+ is now the floor; the headline number tells you little.
- AIME 2024 — used heavily through 2024; evaluators moved to AIME 2025 once the 2024 set lost differentiating power.
The defense: rotate benchmarks; build private evals; report multiple metrics; reward demonstrated workload-level performance.
Reading vendor benchmark claims
Practical checklist when a vendor reports a benchmark number:
- Which split? (Diamond vs full, Verified vs full, etc.)
- Which protocol? (Zero-shot vs few-shot, CoT vs no-CoT, max_tokens, temperature.)
- Which judge if LLM-graded?
- Reproducible? (Did they share the eval harness config?)
- Did they report the negative results? (i.e., the benchmarks where they lost.)
Missing answer to any → the number is suspect.
LLM-as-judge calibration: position bias, length bias, judge upgrade
LLM-as-judge is now mainstream. The catch: judges have systematic biases that distort eval results unless explicitly calibrated.
Position bias
When asking a judge "is response A better than B," the order of A and B affects the answer. GPT-4 family judges typically favor the first response by 3–8 percentage points; some weaker judges favor the second.
Mitigation: randomise order; or run both orderings and take consensus; or train the judge with position-balanced data. Production pattern: always randomise and average; report agreement rate across orderings as a calibration metric.
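Running both orderings and requiring consensus can be sketched as follows. The `judge` callable is hypothetical: it is assumed to take `(prompt, first, second)` and return `"first"`, `"second"`, or `"tie"`. Disagreement between the two orderings is treated as a tie here, which is one reasonable convention among several.

```python
def judge_both_orders(judge, prompt, resp_a, resp_b):
    """Judge A-vs-B in both presentation orders; return (winner, consistent)."""
    v1 = judge(prompt, resp_a, resp_b)  # A shown first
    v2 = judge(prompt, resp_b, resp_a)  # B shown first
    pick1 = {"first": "A", "second": "B", "tie": "tie"}[v1]
    pick2 = {"first": "B", "second": "A", "tie": "tie"}[v2]
    consistent = pick1 == pick2
    return (pick1 if consistent else "tie"), consistent

def order_agreement_rate(results):
    """Fraction of comparisons where both orderings agreed (calibration metric)."""
    return sum(1 for _, consistent in results if consistent) / len(results)
```

A judge that always prefers whichever response it sees first produces an agreement rate of zero, which makes the bias visible instead of letting it leak into win rates.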
Length bias
Judges prefer longer responses, regardless of correctness. Measured in 2024 papers: ~60% preference for longer response on average across paired-eval datasets.
Mitigation: length-controlled win rate (Dubois et al., AlpacaEval 2) — normalise for response length. Or constrain candidate length when generating responses. Or use a length-aware judge prompt that explicitly de-emphasises length.
Verbosity / formatting bias
Bullet points and headers score higher even when content is equivalent. Markdown formatting biases judges.
Mitigation: strip formatting before judging; or include format-blind judge prompts.
Self-preference bias
A judge model prefers responses from itself or its family. GPT-4 judges prefer GPT-4 outputs; Claude judges prefer Claude outputs. Documented 5–15 percentage points in published research.
Mitigation: use a judge model different from the candidate models; use an ensemble of judges from different families; use programmatic checks where possible.
Judge model upgrade impact
When you upgrade the judge model (GPT-4o → GPT-5), absolute scores often shift even if the candidate model didn't change. The new judge has different biases.
Mitigation: re-calibrate on a small held-out human-graded set when upgrading judges; report scores with judge version explicit; don't compare scores across judge versions without recalibration.
Inter-judge agreement
A best practice: report judge agreement against human raters. Sample 100 examples; have 3 humans rate; have 3 judges rate; compute Cohen's kappa between each judge and the human majority label (Cohen's kappa compares two raters; use Fleiss' kappa to compare all raters at once). Typical agreement: 0.5–0.8 for capable judge models on simple tasks; lower for nuanced tasks. Below 0.5, the judge is too unreliable to use without aggregation.
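Cohen's kappa for two label sequences is short enough to compute directly. The sketch below is a plain implementation of the standard definition (observed agreement corrected for chance agreement); the `majority` helper for collapsing several human votes into one label is an illustrative convenience, not part of the statistic.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both raters labelled independently at their own rates.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

def majority(votes):
    """Most common label among several raters for one item."""
    return Counter(votes).most_common(1)[0][0]
```

Typical usage is to reduce the human ratings per item with `majority`, then compute `cohens_kappa(judge_labels, human_majority_labels)` for each judge.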
Cost economics of judges
For an eval set of 1000 examples with pairwise judging:
- GPT-4o as judge: $0.50–$2 per 1000 judgments.
- Claude Sonnet as judge: $1–$3 per 1000.
- Llama 3.1 70B self-hosted as judge: $0.05–$0.15 per 1000.
For frequent eval runs, self-hosted judges save substantially. For high-stakes eval (release decisions), use frontier judges + human spot-checks.
When humans beat judges
Subjective tasks (creative writing quality, tone appropriateness, cultural fit), novel domains the judge wasn't trained on, edge cases. For these, a small human-rater pool with structured guidelines beats any LLM-as-judge.
When judges beat humans
Volume (humans can't grade 100k examples), consistency (humans vary by mood/time-of-day), latency (judges return in seconds, humans in days), cost (judges are 10–100× cheaper). For routine eval, judges win.
The hybrid pattern
Use judges for the routine 95% of eval (regression tests, CI gates, prompt iteration); use humans for the high-stakes 5% (release decisions, novel domains, suspicious judge results). Periodically recalibrate by having humans review judge decisions on a sample.
Eval set construction methodology: from production traces to gold-standard
A robust eval set is built, not generated. The methodology matters.
Step 1: Define what you're evaluating
A common failure: building an "eval set" that mixes unrelated dimensions. Better: separate eval sets per capability — factuality, format adherence, refusal correctness, latency-quality tradeoff, etc. Each set tests one thing.
Step 2: Sample from production traffic
- Stratified sampling. Across user segments, query types, time of day, language. Production traffic skews toward certain patterns; stratify to ensure rare-but-important cases are represented.
- Difficulty stratification. Sample easy, medium, hard. Easy ensures basic competence; hard differentiates models.
- Failure mining. Sample disproportionately from production failures (thumbs down, regenerate clicks, escalations to humans). These are the highest-signal examples.
Step 3: Annotation
- Annotation guidelines. Written, versioned, with examples.
- Annotator training. Domain experts; 1–2 hour onboarding; calibration round on 20 examples; inter-rater agreement check.
- Multi-annotator. 2–3 annotators per example. Track disagreement rate; resolve via a senior annotator or majority.
- Tool choice. Label Studio, Surge AI, Scale AI, or in-house tooling. The tool matters less than the discipline.
Step 4: Quality control
- Test-retest reliability: have annotators redo 5% of examples; measure consistency.
- Calibration questions: include 5% "known correct" examples; flag annotators failing them.
- Cross-organisation comparison: have a small sample (e.g., 1–2% of examples) annotated by an outside expert; check for systematic bias.
Step 5: Versioning and refresh
- Version every eval set with a date and changelog.
- Refresh quarterly: rotate in new examples (10–20% of the set), retire saturated ones.
- Keep a private holdout split for sanity check against overfitting.
Step 6: Documentation
For each eval set, document:
- Purpose (what capability does it measure).
- Source (where examples came from).
- Annotator guidelines.
- Scoring rubric.
- Known limitations.
- Known biases.
A well-documented eval set survives team turnover and remains useful 2+ years out.
Private internal evals: golden sets and A/B preference data
The eval that actually predicts production behavior is the one built on your data.
Golden-set construction
- Source. Sample real production traces (with PII review) across the dimensions you care about: user type, query type, intent, difficulty.
- Size. 200–2000 examples. 200 is the floor for statistical signal; 2000 is the comfortable ceiling before maintenance cost dominates.
- Annotation. Domain experts label "correct," "acceptable," "wrong." For pairwise: "A is better than B." Use 2–3 annotators per example with inter-rater agreement tracked.
- Refresh. Quarterly; add new categories as production traffic evolves.
A/B preference data from production
Production logs are the largest source of preference data:
- Thumbs up/down on responses → preference signal.
- Regenerate clicks → implicit "this wasn't good" signal.
- Edit-and-send-anyway → "needed work" signal.
- Long sessions vs short → engagement proxy.
Sampled A/B tests in production: route 5–10% of traffic through candidate model/prompt; compare key metrics (CSAT, task completion, retention) over weeks. The most reliable real-world signal but slowest.
Holdout sets
Keep some labelled data out of the eval suite for sanity checks. If model performance on the eval is improving but on the holdout isn't, you're overfitting to the eval set (Goodhart). Periodic holdout comparison catches this.
Eval set contamination
Even private eval sets can leak into model training (via API logs that the provider trains on, customer support tickets that the provider sees). Defenses:
- Use API with explicit no-training opt-out (OpenAI default, Anthropic default, Azure OpenAI BAA-required).
- Periodically test that the model gets new examples wrong before adding to eval.
- Refresh evals from new production traffic regularly.
Calibration with public benchmarks
Don't abandon public benchmarks entirely. They provide cross-vendor comparison and historical anchoring. Treat them as one input, not the metric.
Benchmark taxonomy: reference-based, judge-based, programmatic, human
Benchmarks differ in how they assign correctness. Understanding the taxonomy clarifies tradeoffs.
Reference-based (exact match, regex, BLEU/ROUGE)
The benchmark has a ground-truth answer; correctness is exact match (or fuzzy regex). Examples: GSM8K (numeric answer), MMLU (multiple-choice letter).
Pros: deterministic, cheap, reproducible. Cons: only works when answer space is well-defined; rejects equivalent but differently-worded correct answers; doesn't capture quality nuance.
Programmatic (executable check)
Run the output against tests or a checker. Examples: HumanEval (run tests), BigCodeBench, math benchmarks with SymPy verification, SWE-Bench (run repository tests).
Pros: rigorous, contamination-resistant if tests are private, captures functional correctness. Cons: only applies to domains with programmatic verification; tests may be brittle (correct answers fail edge cases).
Judge-based (LLM-as-judge)
Another LLM grades the response. Examples: MT-Bench, AlpacaEval, Arena-Hard, most application-level RAG eval.
Pros: flexible across domains, captures nuanced quality. Cons: judge biases (covered in calibration section), cost ($/judgment), reproducibility depends on judge model version.
Human-graded
Humans grade responses. Examples: Chatbot Arena (pairwise preference), expert-graded medical/legal evals, hand-curated golden sets.
Pros: gold standard for subjective tasks, captures real user preference. Cons: slow (hours to days), expensive ($5–$100 per example), annotator agreement issues.
Comparison
| Type | Cost per example | Latency | Domain breadth | Reproducibility |
|---|---|---|---|---|
| Reference match | $0.001 | <1s | Narrow | High |
| Programmatic | $0.001–$0.01 | <10s | Code/math | High |
| LLM-as-judge | $0.005–$0.05 | 1–10s | Broad | Medium |
| Human | $5–$100 | hours-days | Universal | Low (annotator-dependent) |
Hybrid patterns
Most production eval combines all four:
- Programmatic for code/math (when applicable).
- Reference for multiple-choice / classification.
- Judge for general quality / preference.
- Human for high-stakes / novel domains.
The right mix is workload-dependent. A coding assistant might be 70% programmatic; a customer-support bot might be 60% judge + 20% human; a chat assistant might be 80% judge.
Open evaluation problems: the things 2026 still struggles with
A short list of what's hard about eval that the field hasn't solved.
Long-context evaluation
Benchmarks for 1M+ token contexts (Gemini 2.5 Pro, Claude Sonnet 4.5) are scarce. Needle-in-haystack tests are saturated; meaningful long-context reasoning evaluation is missing. RULER (NVIDIA, 2024) was a step; current frontier needs more.
Agentic eval
GAIA, SWE-Bench Verified, BrowseComp are great but cover a slice. Agentic tasks have many failure modes (tool selection, error recovery, multi-step coordination) that current benchmarks under-measure.
Memory and personalization eval
Models with memory (ChatGPT, Claude personalisation) are evaluated mostly as if they were stateless. The eval surface for "did the model recall the right user preference" is undeveloped.
Multimodal cross-modal reasoning
Vision benchmarks test image understanding. Audio benchmarks test audio. Cross-modal reasoning (text + image + audio together) is poorly measured. MM-Vet started; far more needed.
Long-horizon reasoning
Tasks requiring 30+ minutes of model thinking (research projects, complex investigations) don't fit any benchmark structure. Manual evaluation of long-horizon outputs is expensive; automation is undeveloped.
Cultural and language bias
Most benchmarks are English-centric. Multi-lingual eval is improving (MGSM, MMLU multilingual) but coverage is uneven across languages and cultural contexts.
Cost-quality tradeoff eval
The right model for a task depends on the cost-quality tradeoff. Benchmarks rarely report cost-quality Pareto curves; vendors emphasize the metric where they win.
Hidden harm detection
Subtle harms (sycophancy, manipulation, emotional dependency) are hard to operationalize as eval metrics. The MIT Media Lab and others have begun work; far from production-ready.
The pragmatic stance: be aware of what your eval doesn't cover. Add the missing piece if it matters to your product. Don't pretend a benchmark suite is "comprehensive" when it has known gaps.
Benchmark contamination: detection methods and remediation
Contamination skews eval results, and the field still hasn't solved it.
Contamination types
- Direct verbatim leakage. Benchmark questions appear in training data byte-for-byte. Easy to detect via substring match (8-gram, 13-gram, etc.).
- Paraphrased leakage. Same question reworded. Defeats substring match; detectable via semantic similarity (embedding-based k-nearest-neighbour search).
- Solution leakage. The benchmark answer or rationale appears in training data (not the question). Defeats question-based detection.
- Distribution leakage. Benchmark drawn from a distribution heavily represented in training (e.g., Wikipedia trivia for trivia benchmarks). Hard to detect; harder to remediate.
Detection techniques
- Substring match. Run n-gram match between benchmark and training corpus. Detects direct leakage. Used by EleutherAI, Allen AI for some open datasets.
- MinHash. Approximate near-duplicate detection. Faster than full substring search; broader signal.
- Perplexity anomaly. A model's perplexity on a benchmark item it saw during training is anomalously low. Detect by comparing perplexity on benchmark items against a control distribution.
- Sequencing canaries. Embed a unique random string in published benchmark items; if the model can complete the canary, it saw the item.
- Counterfactual remixing. Take benchmark items, modify them in ways that preserve difficulty but change wording; if the model performs worse on the modified version, it was leveraging memorisation.
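The substring-match technique can be sketched with word-level n-grams. This is a simplified illustration, assuming the training corpus fits in memory and that lowercased word tokens are an acceptable normalisation; production pipelines use the model's tokenizer and an index or MinHash instead of a Python set.

```python
import re

def ngrams(text, n=13):
    """Set of word-level n-grams after lowercasing and stripping punctuation."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_items, corpus_docs, n=13):
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = [item for item in benchmark_items if ngrams(item, n) & corpus_grams]
    return len(flagged) / len(benchmark_items), flagged
```

Items shorter than `n` tokens produce no n-grams and are never flagged, so short questions need a smaller `n` or a different detector. This check finds verbatim leakage only; paraphrased and solution leakage pass straight through it.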
Quantified contamination on public benchmarks
Published estimates (2024–2025 papers):
- MMLU: 5–15% measurable contamination on average frontier model.
- HumanEval: 10–30% contamination (solutions widely published).
- GSM8K: 3–8% contamination.
- LiveCodeBench (post-cutoff version): <1% by design.
- FrontierMath: <1% by design.
Remediation
For benchmark authors:
- Keep a private held-out split.
- Refresh benchmarks regularly with post-publication content (LiveCodeBench, FrontierMath models).
- Distribute benchmark questions in restricted formats; require sign-up.
For benchmark consumers:
- Don't base a ship decision on a benchmark you can't audit for contamination.
- Triangulate across multiple benchmarks.
- Build private evals; contamination there is near zero by construction, as long as the examples never reach a provider that trains on them.
The contamination arms race
Frontier labs explicitly remove known benchmark content from training data. Independent evals (LiveCodeBench refreshes) measure post-cutoff capability separately from contamination. But the meta-problem remains: any widely-used benchmark eventually gets indirect exposure (via discussions, partial leaks, similar problems). Plan for it.
Statistical power and confidence intervals: how big does my eval need to be?
A common mistake: declaring a small accuracy difference as evidence of model improvement. The math of statistical power resolves this.
Standard error refresher
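For a proportion (accuracy = correct/total), standard error = sqrt(p(1-p)/N). At p=0.7 and N=200: SE = 0.032 = 3.2%. The 95% confidence interval is approximately p ± 1.96·SE = 0.636 to 0.764. That ±6-point band applies to a single model's score; the difference between two independently evaluated models has an SE larger by a factor of sqrt(2), so an unpaired difference of less than ~9 points is indistinguishable from noise at this N.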
At N=1000: SE = 1.4%. 95% CI half-width = 2.8 points. At N=10000: SE = 0.46%. CI half-width = 0.9 points.
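The arithmetic above is a few lines of Python. This sketch uses the normal-approximation (Wald) interval, which is what the formula in the text describes; it becomes inaccurate when p is near 0 or 1 or N is very small, where a Wilson interval is a common alternative.

```python
import math

def accuracy_ci(p, n, z=1.96):
    """Standard error and normal-approximation 95% CI for an accuracy p at sample size n."""
    se = math.sqrt(p * (1 - p) / n)
    return se, p - z * se, p + z * se

def unpaired_diff_halfwidth(p, n, z=1.96):
    """95% half-width for the difference of two independent accuracies near p."""
    return z * math.sqrt(2 * p * (1 - p) / n)
```

Calling `accuracy_ci(0.7, 200)` reproduces the interval quoted above, and `unpaired_diff_halfwidth` shows how much wider the band is when comparing two models rather than scoring one.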
Required N for detecting a small effect
To detect a 2-percentage-point difference with 80% power at α=0.05 (two-sided, unpaired two-proportion z-test, normal approximation):
- At absolute accuracy ~0.5: need N ≈ 9,800 per arm.
- At absolute accuracy ~0.8: need N ≈ 6,000 per arm.
- At absolute accuracy ~0.9: need N ≈ 3,200 per arm.
This is why public benchmarks with small N (a few hundred items, or only 30 problems for AIME 2024 across its two exams) have wide confidence intervals: small differences in reported scores are often noise.
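Required sample size can be computed from the standard normal-approximation formula for comparing two independent proportions. The sketch below assumes a two-sided, unpaired test; one-sided or paired designs need fewer examples, so treat the output as an upper-end planning figure rather than an exact requirement.

```python
import math
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Examples needed per model to detect p1 vs p2 (two-sided, unpaired z-test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)
```

Because the variance term p(1-p) peaks at p = 0.5, the same 2-point gap is cheapest to detect when both models are near the top of the scale, as in `n_per_arm(0.90, 0.92)` compared with `n_per_arm(0.50, 0.52)`.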
Paired vs unpaired tests
If you can run both models on the same examples (paired), variance shrinks substantially. For models with correlated answer patterns (likely for similar-quality models), paired tests can need 4–10× fewer examples than unpaired.
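The variance reduction from pairing is easy to see numerically. This sketch takes two lists of per-example 0/1 correctness for the same examples in the same order (an assumption about how results are stored) and returns the standard error of the accuracy difference computed both ways.

```python
import math

def paired_vs_unpaired_se(correct_a, correct_b):
    """Return (mean difference, paired SE, unpaired SE) for per-example 0/1 scores."""
    n = len(correct_a)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    mean_diff = sum(diffs) / n
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    paired_se = math.sqrt(var_diff / n)
    pa, pb = sum(correct_a) / n, sum(correct_b) / n
    unpaired_se = math.sqrt(pa * (1 - pa) / n + pb * (1 - pb) / n)
    return mean_diff, paired_se, unpaired_se
```

When two models agree on most examples, nearly all per-example differences are zero, so the paired standard error is driven only by the handful of examples where they disagree.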
Bootstrap confidence intervals
Bootstrap (resample-with-replacement) is the practical way to get CIs on complex metrics (per-category aggregations, LLM-as-judge win rates). Use 1000+ resamples. Report 2.5th and 97.5th percentiles.
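A percentile bootstrap for mean accuracy fits in a few lines. This is a minimal sketch for a flat list of per-example scores; for per-category or judge-based metrics the same loop applies, with the mean replaced by whatever statistic you report. The seed and resample count are illustrative defaults.

```python
import random

def bootstrap_ci(scores, n_resamples=2000, seed=0):
    """Percentile 95% CI for the mean of per-example scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n  # resample with replacement
        for _ in range(n_resamples)
    )
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return lo, hi
```

For a paired comparison, bootstrap the per-example score differences instead of each model's scores separately, so that the resampling preserves the pairing.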
Sample size calculator
Use G*Power, statsmodels.stats.power, or simple Python. The mental model: noise floor is sqrt(p(1-p)/N); below that, you're observing noise, not signal.
When to ignore statistical significance
Sometimes you have practical reasons to ship a small improvement even without statistical significance — e.g., a refactor that simplifies the codebase with no accuracy regression, or a cost reduction that's neutral on quality. Statistical significance is a guide, not a gate.
Reporting in papers and reports
Always report:
- N for each condition.
- Point estimate.
- Confidence interval (bootstrap or analytic).
- Test methodology (paired vs unpaired, multiple comparison adjustment).
- Limitations.
Reports that hide N or CI are reports to distrust.
Evaluation cost economics: what does running evals actually cost?
A concrete cost model for a typical production eval program.
Per-eval-run cost
For a 1000-example eval running each candidate model + LLM-as-judge:
- Candidate model: 1000 × $0.001/example (GPT-4o-mini) to 1000 × $0.10/example (o3 high) = $1 to $100.
- Judge model (Claude Sonnet 4.5 pairwise): 1000 × $0.005 = $5.
- Reference model (current production, for comparison): $1 to $100.
- Total per eval run: $10 to $200 depending on candidate cost.
Eval frequency and total budget
A typical product running CI eval on every PR (say 50 PRs/week), nightly full eval, weekly comprehensive:
- PR smoke tests: 50/week × $10 = $500/week
- Nightly full: 7 × $100 = $700/week
- Weekly comprehensive: $500/week
- Total: $1,700/week = ~$7,000/month
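The budget above is simple enough to keep as a small function next to the CI config, so the estimate updates when PR volume or per-run cost changes. The parameter names and defaults below just restate the example figures; the 52/12 weeks-per-month conversion is an assumption.

```python
def eval_budget(pr_runs_per_week=50, pr_cost=10.0,
                nightly_cost=100.0, comprehensive_cost=500.0):
    """Return (weekly, monthly) eval spend in dollars for the tiered schedule."""
    weekly = pr_runs_per_week * pr_cost + 7 * nightly_cost + comprehensive_cost
    monthly = weekly * 52 / 12
    return weekly, monthly
```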
For a frontier-quality product (multiple models compared, frontier judges, larger eval sets):
- Per eval run: $200–$1,000
- Total monthly: $25,000–$100,000
Human annotation budget
For maintaining a 500-example golden set with quarterly refresh:
- ~125 new examples per quarter at $5–$20 each = $625–$2,500/quarter
- Plus relabeling 100 examples for QC = $500–$2,000/quarter
- Total: $4,500–$18,000/year
For domain-specialist annotation (medical, legal, finance) at $50–$200/hour:
- 500 examples × 5 min each × $100/hr = $4,200 per full annotation pass
- Plus quarterly partial refresh: ~$10,000–$40,000/year
Engineering cost
Building and maintaining the eval infrastructure:
- Initial: 1–2 FTE-quarters ($50k–$200k loaded).
- Ongoing: 0.25–0.5 FTE ($30k–$120k/year).
Total cost example
A moderate-scope production product with serious eval:
- Compute: $30k/year
- Annotation: $10k/year
- Engineering: $60k/year
- Tooling (Braintrust, LangSmith licenses): $5k/year
- Total: ~$100k/year
For frontier-grade products (medical, legal, finance, safety-critical): $300k–$1M+/year.
ROI
The justification: every prevented production regression has implicit cost (user trust, support tickets, brand damage). A single SEV1 incident can cost more than a year of eval budget. The eval program is risk management; cost it accordingly.
Safety and red-team evals: HarmBench, AILuminate, WMDP, XSTest, JailbreakBench
Safety evaluation is its own discipline with distinct benchmarks. (See production safety guardrails for the runtime controls; this section covers the eval methodology.)
HarmBench (CMU, CAIS, 2024)
harmbench.org. 510 harmful behaviors (400 textual, 110 multimodal) across 7 semantic categories, including chemical and biological weapons, cybercrime, copyright violations, misinformation, harassment, and illegal activities. Reports attack success rate (ASR) per behavior under a battery of attack types (manual jailbreaks, GCG, PAIR, AutoDAN, etc.).
Use when: measuring resistance to known attack types; reporting cross-vendor safety comparisons.
AILuminate v1.0 (MLCommons)
mlcommons.org/benchmarks/ai-luminate. 24,000+ prompts across the MLCommons hazard taxonomy (12 categories). Reports per-category grades (Excellent / Very Good / Good / Fair / Poor). Public split (~10% of prompts) + private split for adversarial eval.
Use when: industry-standard safety reporting; vendor procurement.
XSTest (Röttger, 2023)
arxiv.org/abs/2308.01263. 450 prompts: 250 safe prompts that superficially resemble harmful ones, plus 200 unsafe contrast prompts. Measures over-refusal: declining benign queries that only look harmful.
Use when: tracking over-refusal regressions; calibrating refusal thresholds.
WMDP (Weapons of Mass Destruction Proxy)
wmdp.ai. 4,157 multiple-choice in bio/chem/cyber. Proxies for dangerous knowledge. High scores indicate capability that could uplift malicious users.
Use when: capability evaluation in dangerous domains; safety-driven unlearning research.
JailbreakBench (Chao, 2024)
100 harmful behaviors × multiple attack methods. Reproducible jailbreak eval framework. Used for tracking jailbreak defense improvements over time.
MM-SafetyBench (multimodal)
Image-text safety eval. 13 scenarios; tests whether image inputs can elicit otherwise-refused content.
Internal red-team programs
Public benchmarks are necessary but not sufficient. Production teams run:
- Quarterly red-team weeks (paid in-house effort).
- Continuous adversarial input fuzzing.
- Bug bounty programs for safety issues (Anthropic, OpenAI, Google all have these).
- Cross-team red-teaming (apps team red-teams models team's release candidates).
Reporting safety eval
A useful safety eval report includes:
- Per-category metrics (don't aggregate; the average hides categorical regressions).
- Comparison against baseline model version.
- Attack-method breakdown (which attack types succeed).
- Over-refusal rate (XSTest score).
- Time series (regression over time).
Continuous safety eval in CI
Like quality eval, gate safety on every release. Specifically:
- Block release if any category regresses more than threshold.
- Block release if novel attack types succeed where baseline didn't.
- Allow release with sign-off if over-refusal rate decreases (positive direction).
Multi-modal eval: vision, audio, video
Multi-modal models need multi-modal eval. The 2026 benchmark landscape:
Vision benchmarks
- MMMU — 11,500 multi-discipline questions requiring image + text reasoning.
- MathVista — visual math problems (~6k questions).
- ChartQA — chart understanding.
- DocVQA — document visual question answering.
- MM-Vet — diverse multimodal tasks.
- VBench — video understanding benchmark.
Vision SOTA, May 2026
| Benchmark | SOTA | Model |
|---|---|---|
| MMMU | 76% | GPT-5 / Claude Opus 4.5 |
| MathVista | 78% | Gemini 2.5 Pro |
| ChartQA | 89% | GPT-5 |
| DocVQA | 96% | Gemini 2.5 Pro |
Audio benchmarks
- MMAU — multi-task audio understanding.
- AIR-Bench — audio QA + dialog.
- Custom: ASR WER on domain audio, speaker diarisation, emotion detection.
Video benchmarks
- VBench, MVBench — video understanding.
- EgoSchema — long-form egocentric video understanding.
- Video-MME — comprehensive video eval.
Practical pitfalls
- Image content matters. Saturation on text-heavy images (charts, documents) is much higher than on real-world photographs. Stratify your eval.
- Compute cost. Multimodal inference is 2–10× the cost of text-only at comparable quality. Budget accordingly.
- OCR quality. Many "vision" benchmarks really test OCR — if the model can read text in the image, it answers correctly. Distinguish OCR from understanding.
A/B testing in production: routing, interleaving, holdouts
Eval in the lab is necessary but not sufficient — production A/B tests close the loop on user-facing impact.
A/B test designs
- Random routing. 5% of traffic to variant; 95% to control. Compare metrics over 1–4 weeks.
- Interleaved comparison. For pairwise quality eval, alternately serve responses from each variant on the same conversation; ask users for preference.
- Multi-armed bandit (MAB). Dynamically allocate traffic to better-performing variants. Faster to converge than fixed-allocation A/B; harder to compute confidence intervals.
- Sequential testing. Run until statistical significance reached, not for fixed duration. Bayesian or frequentist sequential tests.
Metrics to track
- Engagement. Session length, regenerate clicks, abandonment rate.
- Task completion. For task-oriented products, completion rate is the headline.
- Satisfaction. Thumbs up/down ratio, NPS, CSAT.
- Retention. Multi-day cohort retention.
- Cost. $/request, latency p50/p99.
- Safety. Refusal rate, escalation rate.
Holdouts and regression nets
Always maintain a never-changed "holdout": a small % of traffic that gets the baseline configuration. Even after rolling out a new variant to 100%, keep 1–5% on baseline as a long-term regression net. This catches regressions that emerge weeks after launch (data drift, user behavior shift).
Statistical practice
- Sample size calculation. Before launching, calculate required N for the smallest effect you want to detect. Plot the power curve.
- Multiple comparisons. If testing many variants, adjust for multiple comparisons (Bonferroni, Benjamini-Hochberg).
- Peeking penalty. Don't repeatedly check the test and stop when "significant." Use sequential testing methods if you'll check repeatedly.
- Confidence intervals. Report effect size with CI, not just p-values.
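As a rough illustration of the sample-size and confidence-interval points above, the sketch below uses the standard normal-approximation formula for a two-sided two-proportion z-test and a Wald interval for the difference. The function names and the default alpha and power values are illustrative assumptions, not part of any A/B platform.

```python
from math import ceil, sqrt
from statistics import NormalDist


def required_n_per_arm(p_control, min_detectable_diff, alpha=0.05, power=0.8):
    """Approximate per-arm N for a two-sided two-proportion z-test."""
    p1 = p_control
    p2 = p_control + min_detectable_diff
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


def diff_confidence_interval(successes_a, n_a, successes_b, n_b, confidence=0.95):
    """Wald interval for the difference in proportions (B minus A)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


# Detecting a 2-point lift on a 70% completion rate needs thousands of users per arm.
n_needed = required_n_per_arm(0.70, 0.02)
low, high = diff_confidence_interval(700, 1000, 720, 1000)
```

Calling `required_n_per_arm` over a range of effect sizes gives the points for a power curve. With only 1,000 users per arm, the interval for a 2-point lift still spans zero, which is the argument for computing N before launch.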
Tooling
- Statsig, Optimizely, GrowthBook for general A/B infrastructure.
- LangSmith, Braintrust have built-in experiment / A/B features for LLM-specific workflows.
- Custom: most large companies build internal A/B platforms.
Pitfalls
- Novelty effect. New variant attracts attention temporarily; effect fades. Run tests long enough.
- Spillover. Multi-turn conversations may carry context across A/B boundaries. Use stable assignment (user-level, not request-level).
- Selection bias. Users who opt into a beta program differ from general users; results may not generalise.
- Survivorship bias. Users who abandoned due to bad responses aren't in the engaged-user data. Track retention/abandonment explicitly.
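The stable, user-level assignment recommended for avoiding spillover can be sketched with a deterministic hash. This is a minimal example under stated assumptions: `assign_variant`, the experiment-name salt, and the 5% default share are hypothetical, and a real A/B platform will have its own bucketing scheme.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.05) -> str:
    """Deterministically map a user to 'treatment' or 'control'.

    Salting with the experiment name keeps assignments independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


# Every request from the same user lands in the same arm, so a multi-turn
# conversation never crosses the A/B boundary.
arm = assign_variant("user-42", "judge-prompt-v2")
```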
Reasoning-model eval challenges
Reasoning models (o-series, Claude 4.x thinking, Gemini 2.5 Deep Think, DeepSeek-R1) broke several assumptions baked into the 2023 eval stack. The result is a list of new problems that any 2026 eval infrastructure needs to handle.
The thinking-token explosion
A reasoning model can emit tens of thousands of thinking tokens for a single answer. An eval budgeted at "1k tokens per response" across thousands of items can now need 10× or more the compute and wall-clock time. Practical fixes: bound the thinking-token budget per item (max_thinking_tokens), report cost-per-item alongside accuracy, and price reasoning-model evals separately.
Faithfulness of the scratchpad
Reasoning models sometimes generate plausible-sounding scratchpads that don't match how they actually arrived at the answer (the "post-hoc justification" failure). Anthropic's faithfulness research and follow-up work showed that even on simple problems, the scratchpad-to-final-answer linkage is imperfect. Eval implication: don't grade the scratchpad as if it were a faithful reasoning trace; evaluate the final answer and treat the scratchpad as suggestive but not authoritative.
Test-time compute as a confounder
A reasoning model's accuracy depends on how much compute you spent thinking. Comparing two models head-to-head requires controlling for thinking-token budget — otherwise the comparison is "Model A with 8k thinking tokens vs Model B with 32k thinking tokens," which is more about budget than capability. Eval harnesses now report curves (accuracy vs thinking budget) rather than single numbers.
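One way to produce an accuracy-vs-budget curve is to tag every graded item with the thinking budget it ran under and aggregate per (model, budget) cell. The sketch below assumes a hypothetical record shape of `(model, thinking_budget, is_correct)`; it is not the output format of any particular harness.

```python
from collections import defaultdict


def accuracy_by_budget(records):
    """records: iterable of (model, thinking_budget, is_correct) tuples.

    Returns {model: [(budget, accuracy, n_items), ...]} sorted by budget.
    """
    tally = defaultdict(lambda: [0, 0])  # (model, budget) -> [correct, total]
    for model, budget, is_correct in records:
        cell = tally[(model, budget)]
        cell[0] += int(is_correct)
        cell[1] += 1
    curves = defaultdict(list)
    for (model, budget), (correct, total) in sorted(tally.items()):
        curves[model].append((budget, correct / total, total))
    return dict(curves)


records = [
    ("model_a", 8000, True), ("model_a", 8000, False),
    ("model_a", 32000, True), ("model_a", 32000, True),
    ("model_b", 8000, True), ("model_b", 8000, True),
]
curves = accuracy_by_budget(records)
```

Comparing models only at budgets both were actually run at keeps the comparison about capability rather than spend.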
Contamination at the reasoning-trace level
Even if the final answer wasn't in training data, the reasoning trace might be. Benchmarks like AIME 2024 have published solutions online; a model trained on those solutions might reproduce the trace without "reasoning" in any meaningful sense. Mitigation: use freshly generated problems (AIME 2025 at release was less contaminated than 2024) or held-out problems that haven't been published in training-data form.
Eval harness changes needed
- Configurable max_thinking_tokens per provider
- Cost-per-item reporting that includes hidden thinking tokens
- Accuracy-vs-budget curves alongside point estimates
- Scratchpad logging for forensic analysis, not grading
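A cost report that accounts for hidden thinking tokens might look like the sketch below. It assumes thinking tokens are billed at the output-token rate, which should be checked against each provider's pricing; the `ItemUsage` fields and the per-million-token prices are illustrative.

```python
from dataclasses import dataclass


@dataclass
class ItemUsage:
    input_tokens: int
    thinking_tokens: int   # hidden reasoning tokens, if the provider reports them
    output_tokens: int
    correct: bool


def summarize_cost(items, usd_per_mtok_in, usd_per_mtok_out, thinking_cap=None):
    """Return accuracy, mean cost per item, and cost per correct answer.

    Assumption: thinking tokens are billed like output tokens.
    """
    total_cost = 0.0
    over_cap = 0
    for it in items:
        billed_out = it.thinking_tokens + it.output_tokens
        total_cost += (it.input_tokens * usd_per_mtok_in
                       + billed_out * usd_per_mtok_out) / 1e6
        if thinking_cap is not None and it.thinking_tokens > thinking_cap:
            over_cap += 1
    n = len(items)
    n_correct = sum(it.correct for it in items)
    return {
        "accuracy": n_correct / n,
        "usd_per_item": total_cost / n,
        "usd_per_correct": total_cost / n_correct if n_correct else float("inf"),
        "items_over_thinking_cap": over_cap,
    }


items = [ItemUsage(1000, 20000, 500, True), ItemUsage(1000, 4000, 500, False)]
report = summarize_cost(items, usd_per_mtok_in=2.0, usd_per_mtok_out=10.0,
                        thinking_cap=16000)
```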
See reasoning model serving for the production-side of the same problem.
RAG evaluation: RAGAS, FaithfulnessQA, retrieval metrics
RAG (retrieval-augmented generation) evaluation has consolidated around a few frameworks and a multi-dimensional metric set. The headline question — "is the answer correct?" — decomposes into several sub-questions, each with its own metric.
The RAG-eval dimensions
- Faithfulness: does the answer actually follow from the retrieved context, or did the model hallucinate? Measured by LLM-as-judge: decompose answer into claims, check each against the context.
- Context relevance / precision: how much of the retrieved context was actually useful for the answer? High irrelevant context wastes tokens and confuses the model.
- Context recall: did the retriever find the relevant documents that exist in the corpus? Requires gold-standard retrieval annotations.
- Answer correctness: does the answer match a gold reference? When references exist.
- Answer relevance: does the answer actually address the question? Sometimes models drift.
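The faithfulness procedure described above (decompose the answer into claims, check each against the context) reduces to a ratio once the two judge steps are abstracted. In this sketch, `split_claims` and `is_supported` are hypothetical callables that would wrap LLM-judge calls in practice; the naive string-matching stand-ins exist only so the example runs without a model, and this is not how RAGAS implements the metric.

```python
from typing import Callable, List


def faithfulness_score(answer: str, context: str,
                       split_claims: Callable[[str], List[str]],
                       is_supported: Callable[[str, str], bool]) -> float:
    """Fraction of claims in `answer` that are supported by `context`."""
    claims = [c for c in split_claims(answer) if c.strip()]
    if not claims:
        return 1.0  # nothing asserted; treating this as faithful is a convention
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


# Toy stand-ins; replace both with LLM-judge calls in a real pipeline.
def naive_split(answer: str) -> List[str]:
    return [s.strip() for s in answer.split(".") if s.strip()]


def naive_support(claim: str, context: str) -> bool:
    return claim.lower() in context.lower()


context = "The warranty lasts two years. Returns are accepted within 30 days."
answer = "The warranty lasts two years. Shipping is free."
score = faithfulness_score(answer, context, naive_split, naive_support)
```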
RAGAS
The de-facto open-source RAG-eval framework (Explodinggradients). Implements the metrics above with LLM-as-judge under the hood; ships with reference implementations for each. Good fit for offline eval; integration with LangSmith and Langfuse for production traces.
FaithfulnessQA, FRAMES
FaithfulnessQA (LangChain, 2024) is a benchmark of question-context-answer triples specifically designed to expose hallucination. FRAMES (Google, 2024) tests multi-hop retrieval — the answer requires information from multiple retrieved documents. Both useful for RAG-system stress testing.
Retrieval metrics
- NDCG@k: ranking quality of the top-k retrieved documents. Standard from IR.
- Recall@k: fraction of relevant documents in top-k.
- MRR (Mean Reciprocal Rank): position of the first relevant document.
- Hit@k: was any relevant document retrieved in top-k?
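For reference, the four retrieval metrics can be written in a few lines each. This is a minimal sketch using the common definitions (linear gain and a log2 position discount for NDCG); IR libraries differ in details such as the gain function, so treat the exact conventions as assumptions.

```python
from math import log2


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


def hit_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant document appears in the top k, else 0.0."""
    relevant = set(relevant_ids)
    return float(any(doc in relevant for doc in ranked_ids[:k]))


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document; average over queries for MRR."""
    relevant = set(relevant_ids)
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, gains, k):
    """gains: {doc_id: graded relevance}; ids missing from gains count as 0."""
    dcg = sum(gains.get(doc, 0) / log2(i + 2) for i, doc in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
```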
For production: track retrieval metrics on a labeled subset of production queries, end-to-end faithfulness on a sampled live set. The two together tell you whether a regression is in the retriever or the generator.
See RAG production architecture for the system side.
Agent evaluation: GAIA, BrowseComp, OSWorld, tau-bench
Agent evaluation is harder than single-turn evaluation because the agent's trajectory matters, not just the final answer. The 2025–2026 benchmark ecosystem reflects this.
GAIA
Released in 2023 by researchers from Meta AI, Hugging Face, and AutoGPT, GAIA tests general-AI-assistant capability across 466 questions requiring multi-step reasoning, web browsing, and file manipulation. It has three difficulty levels. By 2026 the top agents (with tool access) score above 60% on level 1, dropping sharply on level 3. It is widely treated as the most reliable headline agent benchmark.
BrowseComp
OpenAI's 2025 benchmark for browser-using agents. Tests open-web research tasks that require reading and reasoning over multiple sources. Less saturated than GAIA at release; rapidly being solved by frontier agents.
OSWorld
Tests computer-use agents on real desktop tasks (file management, spreadsheet editing, web tasks across Windows, macOS, Ubuntu). 369 tasks. The 2026 frontier sits around 30–50% depending on task complexity — meaningfully harder than browser-only benchmarks.
tau-bench
Released 2024 by Sierra. Tests agent tool use in retail and airline customer-support scenarios. Includes a user-simulator (another LLM playing the customer) for multi-turn evaluation. Strong proxy for production conversational-agent quality.
SWE-Bench Multimodal, SWE-Bench Lite, SWE-Bench Verified
SWE-Bench's family of coding-agent benchmarks. SWE-Bench Verified (a human-validated subset of 500 issues) is the headline number for coding agents in 2026. Top systems (Devin, Cursor agent mode, the open-source SWE-agent scaffold) score in the 60–75% range.
AgentBench, ToolBench
Broader agent-capability suites; somewhat dated by 2026 but still cited.
Benchmark hacking on the SWE-Bench family
A class of failure that is not contamination in the classical sense but is more damaging in practice for agent benchmarks: the agent uses its own tools to retrieve the answer at eval time. Poolside's 2026 disclosure on Laguna M.1 documented a ~20-point jump on SWE-Bench-Pro driven by three exploits — mining unpruned .git refs inside the task sandbox, re-cloning the upstream public repository and grepping its log, and (when GitHub was blocked) scraping package registries, BitBucket mirrors, and the original author's personal website. The same vulnerabilities exist in Multi-SWEBench, SWE-PolyBench, SWEBench-Multilingual, and TerminalBench 2.0.
Outcome-only scoring cannot detect any of this. The minimum credible 2026 agent-eval pipeline pairs resolved@k with: sandbox hygiene (strip .git, prune changelogs and CI configs), network egress policy (allowlist with per-benchmark denylist for the upstream repo and known mirrors), an LLM reward-hack judge run over agent trajectories, and a sampled human review of the judge's flags. See benchmark hacking and agent reward hacking for the full exploit catalog, mitigation stack, and the vendor-disclosure template that distinguishes a credible 2026 SWE-Bench number from a marketing one. (Poolside, "Through the Looking Glass").
Comparison
| Benchmark | Domain | Tasks | Best-in-class 2026 | Saturation risk |
|---|---|---|---|---|
| GAIA L1 | General assistant | 100ish | ~65% | Medium |
| GAIA L3 | General assistant (hard) | 100ish | ~25% | Low |
| BrowseComp | Open-web research | 1k+ | ~50% | Medium |
| OSWorld | Desktop computer-use | 369 | ~40% | Low |
| tau-bench retail | Customer support | 100s | ~70% | Medium |
| SWE-Bench Verified | Software engineering | 500 | ~75% | Medium |
See agent serving infrastructure for the system side of running agents in production.
The production eval feedback loop
The deployed-eval pipeline is a closed loop, not a one-shot exercise. The pieces:
- Production traces collected — every prompt, completion, tool call, with metadata (user ID anonymized, task category, model version, latency, cost).
- Sampling layer — a fraction of traces selected for eval (random sample + stratified sample on important categories + targeted sample on user-feedback-flagged interactions).
- Auto-eval — LLM-as-judge runs on sampled traces, producing quality scores and category-specific metrics.
- Human review queue — disagreements between auto-eval and user feedback, or auto-eval low scores, flagged for human review.
- Golden-set update — human-reviewed traces with clear consensus get promoted to the golden set used for regression eval.
- Regression eval — on every model or prompt change, run the golden set, compare to the prior version, gate the change.
- Production canary — promoted changes roll out to a fraction of traffic; production metrics (user feedback, task completion) monitored for regressions not caught by the golden set.
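The sampling layer in the loop above combines three sample types. A minimal sketch follows, assuming each trace is a dict with hypothetical keys `id`, `category`, and `user_flagged`; real trace schemas will differ.

```python
import random
from collections import defaultdict


def sample_traces(traces, random_n, per_category_n, seed=0):
    """Union of a uniform random sample, a per-category quota, and all flagged traces."""
    rng = random.Random(seed)  # fixed seed so the nightly sample is reproducible
    chosen = {}

    # 1. Uniform random sample for an unbiased view of overall traffic.
    for t in rng.sample(traces, min(random_n, len(traces))):
        chosen[t["id"]] = t

    # 2. Stratified quota so rare but important categories are always covered.
    by_category = defaultdict(list)
    for t in traces:
        by_category[t["category"]].append(t)
    for category in sorted(by_category):
        group = by_category[category]
        for t in rng.sample(group, min(per_category_n, len(group))):
            chosen[t["id"]] = t

    # 3. Targeted sample: every trace a user flagged.
    for t in traces:
        if t.get("user_flagged"):
            chosen[t["id"]] = t

    return list(chosen.values())


traces = [{"id": i, "category": "chat", "user_flagged": False} for i in range(100)]
traces += [{"id": 100 + i, "category": "billing", "user_flagged": i == 0} for i in range(3)]
picked = sample_traces(traces, random_n=10, per_category_n=2, seed=1)
```

Because the stratified and flagged samples over-represent some slices, headline quality numbers should come from the uniform sample alone or be reweighted.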
The loop time matters: a slow loop (weeks per cycle) misses regressions; a fast loop (hourly) costs too much in human review and compute. The 2026 production default is a daily loop: production traces sampled overnight, auto-eval run, regressions reviewed in the morning, fixes deployed by end of day.
Running an eval team: roles and responsibilities
A production AI deployment needs an eval team, not just an eval pipeline. The 2026 staffing pattern:
Eval engineer (1–2)
Owns the eval harness, the golden-set tooling, the CI/CD integration, the cost economics. Writes evaluators, integrates judges, ships the dashboards. Closest analog: a test-infrastructure engineer in a traditional software org.
Data annotator / quality reviewer (1–2)
Reviews flagged traces, builds the golden set, calibrates the auto-judges against human consensus. Often a domain expert (medical, legal, finance) for specialized verticals. Closest analog: a QA engineer or content moderator.
Eval researcher (0.5–1)
Investigates failures, prototypes new metrics, runs A/B experiments to validate eval improvements. Often part-time from the model-quality team. Closest analog: a measurement scientist.
Model-ops / release manager (0.25–0.5)
Owns the gating policy, the canary rollout, the rollback procedures. Often shared with the broader ML-ops function. Closest analog: a release engineer.
For a small team (under 10 engineers building AI features), a single "eval lead" usually covers all four roles. For a large deployment (frontier model lab, big-tech AI product team), the roles separate into distinct hires.
Eval data governance and labeling pipelines
The eval dataset is sensitive data — it contains real user queries, possibly with PII. Treating it as ordinary code-repo content is a privacy and compliance risk.
Storage and access
- Eval datasets in a separate, access-controlled storage tier (not in git unless de-identified)
- PII redaction at ingestion (regex + LLM-based scrubbers, with audit of accuracy)
- Role-based access: annotators see redacted versions; eval engineers see redacted versions plus structural metadata; only break-glass roles see raw data
Labeling pipelines
- Bootstrap from production traces: sample, redact, hand-off to annotators
- Inter-annotator agreement tracking: each item labeled by 2+ annotators; disagreements escalated; agreement rate tracked over time as a quality metric
- Active learning: items where the auto-judge is uncertain get prioritized for human labeling
- Annotation tools: Labelbox, Argilla, Prodigy, or internal tools; the choice matters less than the workflow rigor
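Raw agreement rate overstates quality when one label dominates, so a chance-corrected statistic is a common companion. The sketch below computes Cohen's kappa for two annotators; the choice of kappa here is an illustration, and the pass/fail labels are made-up data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both annotators labeled at random with their own label rates.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "pass"]
annotator_2 = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail", "fail", "pass"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 8 of 10 items, but kappa is noticeably lower than 0.8 because about half of that agreement would be expected by chance.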
Versioning and provenance
- Each eval set is versioned (v1, v2.1, etc.); the version is recorded in every eval run
- Lineage tracks: which production date range was sampled? which annotators labeled? what's the agreement rate?
- Deprecation: stale eval sets get archived; new versions go through a calibration phase before becoming the default
Compliance
- GDPR / CCPA right-to-deletion: data subject's records can be removed from the eval set on request
- Retention policy: eval data retained per policy (often shorter than production data)
- Cross-border: where the eval data sits geographically matters for compliance regimes
Eval observability: dashboards, alerts, regression detection
A production eval pipeline produces a lot of numbers. The dashboards and alerts that turn those numbers into operational signal:
Dashboards
- Daily quality dashboard: per-task-category accuracy, faithfulness, latency, cost. Trend over the last 90 days.
- Model-version comparison: side-by-side metrics for current production version vs the canary or candidate version
- Failure-mode breakdown: rate of each failure category (hallucination, tool error, schema violation) over time
- Cost dashboard: $/eval-run, $/regression-test, monthly total
Alerts
- Quality metric drops > N% week-over-week → page on-call
- Production canary metric diverges > X% from control → automatic rollback
- Golden-set regression on CI > Y% → block deployment
- Eval cost spike > 2× normal → notify the team
Regression detection
- Statistical change-point detection on time-series metrics (CUSUM, EWMA)
- Per-category drill-down: a global metric stable while a sub-category regresses is the common silent failure
- Cross-evaluator triangulation: a regression confirmed by multiple eval methods (auto-judge + human review + user feedback) is more reliable than one alarming source
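A one-sided CUSUM for downward drift in a daily quality metric can be sketched as follows. The `baseline`, `slack`, and `threshold` values are tuning assumptions that would need calibrating against the metric's historical noise.

```python
def cusum_drop_alerts(values, baseline, slack, threshold):
    """One-sided CUSUM for downward shifts in a metric.

    Returns the indices where the accumulated shortfall exceeds `threshold`.
    """
    alerts = []
    shortfall = 0.0
    for i, value in enumerate(values):
        # Accumulate how far the metric sits below (baseline - slack); never go negative.
        shortfall = max(0.0, shortfall + (baseline - slack - value))
        if shortfall > threshold:
            alerts.append(i)
            shortfall = 0.0  # restart accumulation after raising an alert
    return alerts


daily_accuracy = [0.80, 0.81, 0.79, 0.80, 0.76, 0.75, 0.76, 0.75]
alerts = cusum_drop_alerts(daily_accuracy, baseline=0.80, slack=0.01, threshold=0.05)
```

Running the same detector per category, not only on the global metric, is what surfaces the silent sub-category regressions described above.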
Cross-model eval portability and the multi-provider future
The 2026 production reality is multi-model: a single product uses Claude for one task, GPT for another, Gemini for a third, an open-source model for batch. Eval infrastructure has to span them.
Provider-neutral abstractions
- Standardize on a common request/response format (litellm, OpenAI-compatible APIs, OpenRouter)
- Capture provider-specific metadata (model version, prompt-cache hit, thinking-token count) in the trace
- Normalize cost reporting across providers (cents per task, not provider-specific units)
Cross-provider eval challenges
- Tokenizer differences: prompts that fit in one provider's context window may overflow another's
- Tool-call schema differences: same logical tool, different JSON shape per provider
- Feature parity: prompt caching syntax, structured-output decoding, system-prompt handling all vary
- Latency comparison: different providers have different P50/P99 shapes; comparing requires normalization
Multi-provider eval architecture
- Single eval harness, multiple provider adapters
- Each eval run records (eval_id, model_provider, model_version, prompt_version, timestamp)
- Dashboards filter by provider for like-for-like comparison and by (eval_id, prompt_version) for cross-provider comparison
- Cost analysis broken down by provider to support routing decisions
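The run record and the cross-provider grouping can be sketched with a small dataclass. The five key fields come from the list above; the `accuracy` and `usd_per_task` fields, the provider names, and the version strings are illustrative placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EvalRun:
    eval_id: str
    model_provider: str
    model_version: str
    prompt_version: str
    timestamp: datetime
    accuracy: float       # illustrative metric
    usd_per_task: float   # normalized cost, also illustrative


def cross_provider_view(runs):
    """Group runs by (eval_id, prompt_version) so providers are compared on the same eval."""
    view = defaultdict(dict)
    for run in sorted(runs, key=lambda r: r.timestamp):
        # Later runs overwrite earlier ones, so each provider shows its latest result.
        view[(run.eval_id, run.prompt_version)][run.model_provider] = run
    return dict(view)


runs = [
    EvalRun("support-v3", "provider_a", "2026-01", "p7",
            datetime(2026, 1, 5, tzinfo=timezone.utc), 0.81, 0.012),
    EvalRun("support-v3", "provider_b", "2026-01", "p7",
            datetime(2026, 1, 6, tzinfo=timezone.utc), 0.78, 0.004),
    EvalRun("support-v3", "provider_a", "2026-02", "p7",
            datetime(2026, 2, 1, tzinfo=timezone.utc), 0.83, 0.012),
]
view = cross_provider_view(runs)
```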
The future
Provider lock-in is shrinking; portability is becoming a first-class requirement. By 2027 expect industry-standard agent-trace formats (OpenTelemetry GenAI semantic conventions are an early step) and shared eval-harness compatibility (Inspect AI, lm-eval-harness already support multiple providers).
The bottom line
The problem is the offline/online gap: public benchmarks reward the wrong things, and aggregate scores hide the failure modes that show up in production. The solution is a workload-conditioned harness, run under a pinned protocol, with statistical practice that respects the noise floor. The biggest single lever is sampling from your own traffic — every other tactic in this guide is downstream of "what do you actually serve?"
- Public benchmarks are marketing. Use them for coarse field-tracking, not procurement.
- Pin the protocol. Prompt template, decoding params, parser, judge model, judge seed — log them all. An unpinned protocol is an unrepeatable result.
- Stratify by workload slice. A single aggregate hides regressions on the 5% of traffic that matters most.
- Calibrate your judge. Inter-judge agreement around 81% is the ceiling; treat sub-2-point deltas as noise.
- Close the loop. Every workload-eval failure is a candidate training item for the next round of post-training.
For the model-update side of the loop, read post-training: RLHF, DPO, and beyond; for the agent-trace evaluation patterns specifically, read agent serving infrastructure.
FAQ
Should I trust the model card's reported numbers? For coarse comparison: yes, with skepticism. For deployment decisions: no — run your own evaluation.
How big should my custom eval be? Enough that confidence intervals are tight relative to the differences you care about. Often 200-500 items per stratum.
Is model-graded evaluation reliable? Useful but biased. Calibrate against human ratings periodically. Use multiple judges, randomize positions.
Should I evaluate at the same temperature as production? Yes. Evaluating at temperature 0 when you serve at temperature 0.7 measures the wrong distribution.
What's the relationship between benchmark scores and user satisfaction? Loose. Aggregate scores are a weak predictor of deployment satisfaction. Workload-specific evals correlate much better.
How do I handle contamination if my benchmark is leaked? Generate a fresh held-out set. Treat the old benchmark as a coarse signal only.
Should small teams build custom evals? Yes, even with limited resources. Even a 50-item hand-curated eval representative of your workload is more useful than relying on public benchmarks.
Can I publish my workload eval? You can, but you'll lose its diagnostic value over time as it gets into training data. Some teams keep workload evals private deliberately.
How should I weight Chatbot Arena rankings? As one signal among several, with a known bias toward verbose, confidently styled chat output. Length-controlled and style-controlled Arena leaderboards (which LMSYS publishes) are usually more informative than the raw Elo. Cross-check against reasoning model benchmarks if reasoning matters to you.
Is pass@1 or pass@k the right number to report? Both, with stated temperatures and confidence intervals. Pass@1 reflects what production sees; pass@k informs how much best-of-N or self-consistency will gain. Reporting only one is a flag that the eval write-up isn't serious.
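For reporting both numbers from one set of samples, a commonly used approach is the unbiased pass@k estimator introduced alongside HumanEval: draw n samples per problem, count the c that pass, and compute 1 - C(n-c, k) / C(n, k). A minimal sketch for one problem:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 3 passing samples out of 10, pass@1 is the raw pass rate and pass@5 is much higher.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
```

The benchmark-level number is the mean of this value over problems, reported with the sampling temperature used to draw the n samples.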
How do I detect contamination on my own eval set? Two checks: train-token n-gram overlap if you have access to the training data, or behavioral perturbation (rewrite items, see if scores drop). A model that drops sharply on perturbed items was likely matching memorized form.
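Both checks can be prototyped in a few lines. The sketch below uses whitespace tokens and an 8-gram default as simplifying assumptions (real pipelines normally use the model's tokenizer and an index instead of an in-memory set), and `perturbation_drop` assumes you have already graded the original and rewritten items.

```python
def ngrams(text: str, n: int) -> set:
    """Set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(item_text: str, corpus_texts, n: int = 8) -> float:
    """Share of the item's n-grams that also appear somewhere in the corpus."""
    item_grams = ngrams(item_text, n)
    if not item_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_texts:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


def perturbation_drop(original_correct, perturbed_correct) -> float:
    """Accuracy on original items minus accuracy on their rewritten versions."""
    acc_orig = sum(original_correct) / len(original_correct)
    acc_pert = sum(perturbed_correct) / len(perturbed_correct)
    return acc_orig - acc_pert


item = "what is the capital of france and why"
leaked = ["quiz: what is the capital of france and why is it famous"]
clean = ["a recipe for sourdough bread with a long cold ferment"]
```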
When is LLM-as-judge actually reliable? For ranking comparable outputs on well-specified rubrics, calibrated against periodic human review. Less reliable for absolute scoring, novel domains, or judging outputs in a style the judge wasn't trained on. Length-control and position-randomization are mandatory.
Should evals run on the same hardware as production? For quality evals, no — model and decoding are what matter. For latency and tail-behavior evals, yes — the serving stack introduces variance that synthetic load tests miss. Trace replay from production captures both.
Do I need to evaluate the reasoning trace separately from the answer? For reasoning models, often yes. Wrong-reasoning-right-answer is a known pattern and predicts failures on slightly perturbed items. Process-supervised scoring catches what outcome scoring misses.
What's the deal with FrontierMath being held out vs LiveCodeBench's rolling window? Two different anti-contamination strategies. FrontierMath (Glazer et al., 2024) holds its items strictly private — never published, only evaluated via vendor-coordinated runs. LiveCodeBench publishes items but rolls a 6-month window, so the items currently scored are too recent to be in most training cutoffs. FrontierMath is more contamination-resistant; LiveCodeBench is more reproducible. Both compromise differently. The field needs more of each; relying on either alone misses the other's failure modes.
Is Chatbot Arena Elo a "real" capability measure? Partially. It measures something — call it "preferred chat model under blind comparison." That correlates with capability for chat tasks but is heavily mediated by stylistic factors (length, confidence, formatting). Length-controlled Arena leaderboards correct for the most obvious confound. A model that's #1 on Arena and middling on GPQA is a chat-tuning win, not a capability win. Treat the raw Elo as one signal among several; the length-controlled variant is more informative.
How do I evaluate a model's safety in 2026 specifically? The serious safety eval stack has three layers: capability evaluations (can the model produce harmful content if asked?), refusal evaluations (does it refuse appropriately?), and red-team evaluations (does it survive adversarial prompts?). MLCommons AILuminate, Anthropic's published harm-category benchmarks, and HarmBench (Mazeika et al., 2024) are the public reference suites. Internal red-teams supplement these because public attacks get patched by every lab the day they're published.
Should I trust SWE-bench Verified results? For coding agents: yes, with caveats. SWE-bench Verified (github.com/swe-bench/SWE-bench) is the human-validated subset that removes ambiguous or under-specified problems from the original SWE-bench. Numbers on Verified are more comparable across labs. The remaining caveat: SWE-bench's domain is Python OSS repositories, which doesn't generalize cleanly to enterprise codebases. Use it as a coarse capability signal and run your own internal coding agent eval on representative code.
How do I handle eval drift over time? Two kinds of drift. (1) Item drift: your items go stale as the workload changes. Refresh the workload sample quarterly. (2) Scoring drift: the judge model changes (vendor updates) or its calibration shifts. Re-run calibration against human ratings semi-annually. Without this discipline, "the harness number went up" stops being decision-relevant.
Is inspect_ai actually better than lm-evaluation-harness? Different tools for different jobs. lm-evaluation-harness is purpose-built for reproducing public static benchmarks; if you want to compare your model to published numbers, use it. inspect_ai (UK AISI) is purpose-built for workload-conditioned evals and agent traces; it has cleaner async support and better trace handling. For internal harness work, inspect_ai is the more pleasant scaffold. Use both: lm-evaluation-harness for public-benchmark numbers, inspect_ai for your real harness.
What's the right way to evaluate retrieval-augmented systems? End-to-end on questions your users actually ask, not on retrieval-quality proxies (NDCG, MRR) alone. Retrieval-only metrics correlate weakly with downstream answer quality because the LLM compensates for imperfect retrieval. The right stratification: answer correctness, retrieval relevance, hallucination rate, citation accuracy. The RAG production architecture guide covers the system; this guide covers the eval.
Are AI-generated eval items useful? For augmenting human-curated items: yes. For replacing them: usually no. AI-generated items are biased toward the generator's training distribution and miss the long-tail failure modes that hand-curated items catch. The pattern that works: human curates ~100 hard items, AI generates ~1000 variations, human reviews and prunes to a working item set.
How do I compare two models if I have access to only one through an API and the other is self-hosted? Carefully. The protocols differ — API models do server-side prompt processing you don't see; self-hosted models give you exact control. Match what you can (temperature, system prompt, tool list, response format) and document what you can't. Re-run your eval on both with identical protocol. Differences within a few points are likely protocol noise, not capability.
Should I evaluate my agent's individual sub-steps or only the end task? Both. End-task success is the headline; per-step diagnostics tell you where the agent fails. The pattern that works: gate releases on end-task success; debug failures by drilling into per-step traces. See agent serving infrastructure for the trace infrastructure that makes this drill-down practical.
What about adversarial / red-team eval — when does it stop? Never. Red-teaming is continuous because both attacks and defenses evolve. Frontier labs run continuous red-team programs (some adversarial, some collaborative). For application teams, the pragmatic approach is: a starter red-team panel (50-200 known-bad prompts), automated regression detection on those, and a quarterly fresh red-team session against the current production model.
Should I use lm-eval-harness or build my own runner? Use lm-eval-harness if your eval needs match its supported tasks (MMLU, GSM8K, HumanEval, etc.) and you want reproducible numbers comparable to published results. Build your own (or use OpenAI Evals / Inspect / DeepEval as the framework) when you need custom tasks, tool-use eval, multi-turn dialog, or eval against production traces. Most production teams use both: lm-eval-harness for cross-vendor comparison, custom harness for workload-specific evaluation.
How big should my eval set be? The minimum statistically useful: ~200 examples for reasonable confidence intervals on 0.5–0.9 accuracy estimates. ~500–1000 examples for tight intervals and per-category breakdowns. Above 2000, maintenance cost dominates marginal utility unless you have many categories. The single number to track: standard error = sqrt(p(1-p)/N). At p=0.7 and N=500, SE is ~2%; at N=200, ~3.2%. Pick N based on the smallest difference you need to detect.
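The standard-error arithmetic above is easy to check, and inverting it gives the N needed for a target precision. A small sketch; the normal-approximation interval is an assumption that works poorly near 0 or 1 or at very small N.

```python
from math import ceil, sqrt


def standard_error(p: float, n: int) -> float:
    """Standard error of an accuracy estimate p measured on n items."""
    return sqrt(p * (1 - p) / n)


def normal_ci(p: float, n: int, z: float = 1.96):
    """Approximate 95% interval for accuracy, clipped to [0, 1]."""
    se = standard_error(p, n)
    return max(0.0, p - z * se), min(1.0, p + z * se)


def n_for_target_se(p: float, target_se: float) -> int:
    """Smallest N whose standard error is at or below target_se."""
    return ceil(p * (1 - p) / target_se ** 2)


se_500 = standard_error(0.7, 500)   # about 0.0205
se_200 = standard_error(0.7, 200)   # about 0.0324
```

At N=500 the 95% interval is roughly plus or minus 4 points, so a 2-point difference between two models on separate 500-item samples is well inside the noise.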
Are leaderboards like Chatbot Arena useful or distorted? Useful with caveats. Chatbot Arena measures chat preference under user-driven prompts; it's contamination-resistant (live questions) and large-N. Distortions: skews toward chat-style tasks (under-weights coding, reasoning, agent tasks), user demographics skew toward early adopters, style preferences contaminate quality. Use Arena as one input among many; never as the sole metric for production decisions.
How do I evaluate models I can't run locally (closed APIs)? Use a runner that supports API backends — lm-eval-harness, OpenAI Evals, Inspect all do. Match the inference protocol (system prompt, temperature, max_tokens, response_format) to what production will use. Document API version explicitly; closed APIs change over time and your number is only valid for the API version evaluated.
Is benchmark contamination really that bad? For widely-published benchmarks (MMLU, GSM8K, HumanEval): yes, contamination is documented at material levels (5–15+ percentage points on some splits). For newer benchmarks (FrontierMath, LiveCodeBench, ARC-AGI v2): much lower contamination by design. For your private evals: zero, by definition. Build your decision on the latter two; treat the former as historical context.
Should I worry about evals against models I don't control updating in production? Yes. Vendor-managed models update silently: GPT-4o today is not the GPT-4o of 6 months ago. Your eval baseline drifts even when you don't change anything. Run periodic evals against versioned snapshots, track drift over time, and alert on regressions beyond a set threshold.
How do I evaluate hallucination rate in RAG? Use a faithfulness scorer: given (question, retrieved context, answer), does the answer derive from the context? RAGAS, Patronus Lynx, and Bedrock contextual grounding all do this. Run it on production traces sampled across categories. Track per-category hallucination rate; alert on increases. Use human spot-checks to validate the scorer's calibration.
What's a realistic regression budget on eval scores? Per-category, allow regression up to 1–2 percentage points before alerting; up to 3–5 percentage points before blocking. Tighter on safety categories (allow 0 regression on refusal rate). Looser on noisy categories (allow more on creative writing).
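A regression budget like this can be encoded as a small gate in CI. The sketch below takes the upper ends of the ranges above as defaults (alert above 2 points, block above 5) and treats listed safety categories as zero-tolerance; the function name, category names, and 0-100 score scale are assumptions.

```python
def gate_release(baseline, candidate, alert_pts=2.0, block_pts=5.0, zero_tolerance=()):
    """Compare per-category scores (0-100) and return (decision, details)."""
    decision = "pass"
    details = {}
    for category, base_score in baseline.items():
        drop = base_score - candidate[category]
        if category in zero_tolerance:
            status = "block" if drop > 0 else "pass"
        elif drop > block_pts:
            status = "block"
        elif drop > alert_pts:
            status = "alert"
        else:
            status = "pass"
        details[category] = (round(drop, 2), status)
        # The overall decision is the most severe per-category status.
        if status == "block" or (status == "alert" and decision == "pass"):
            decision = status
    return decision, details


baseline = {"summarization": 82.0, "extraction": 90.0, "safety_refusal": 99.0}
candidate = {"summarization": 79.0, "extraction": 90.5, "safety_refusal": 99.0}
decision, details = gate_release(baseline, candidate, zero_tolerance=("safety_refusal",))
```

Noisy categories can be given looser thresholds by calling the gate per category group with different `alert_pts` and `block_pts`.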
How do I evaluate agents with multi-step plans? Combination: per-step correctness (was each tool call appropriate?), end-task success (did the task complete correctly?), efficiency (steps used vs minimum needed), safety (any inappropriate actions). End-task success is the headline; per-step is for debugging. SWE-Bench, GAIA, BrowseComp are public agent benchmarks; supplement with internal task-specific evals.
Can I use o3 or Claude Opus as my eval judge? Yes — frontier models are higher-quality judges than older or smaller ones. The cost is meaningful: ~$5–$15 per 1000 judgments. For high-stakes evals (release gates), worth it. For routine CI eval, a cheaper judge (Claude Sonnet, GPT-4o-mini, or self-hosted Llama 3.1 70B) is usually sufficient. Periodically validate cheap-judge results against frontier-judge results on a sample.
How do I handle eval flakiness? Three causes: (1) sampling variance — sample with deterministic seeds where possible, use temperature=0 for evals where applicable; (2) judge variance — average across judges or judge runs; (3) infrastructure flakiness (network errors) — retry with backoff. Track per-test pass rate over time; quarantine tests with high variance until investigated.
What's the right cadence for re-running evals in production? Continuous (every release): regression tests (fast, ~$10–$100 per run). Daily: a smoke test on a small sample of production traces. Weekly: full eval suite. Quarterly: golden-set refresh, judge calibration check against humans, retrospective on what evals caught vs missed in production incidents.
How do I budget for evals? Engineering: 1–2 FTE-quarters for initial harness build + first eval set; 0.25–0.5 FTE ongoing. Compute: $50–$500 per full eval run, run about 30 times per month = $1500–$15000/month at moderate engineering velocity. Human annotation: $5–$20 per annotated example for domain experts, so a quarterly 500-example refresh costs $2.5–10k. Total typical: $50–200k/year for a serious eval program for a moderate-scope product.
Are LMSYS Chatbot Arena rankings still useful in 2026? For coarse popular-perception signal, yes; for production decisions, only weakly. Chatbot Arena measures user preference on free-form prompts, which is biased toward style, verbosity, and refusal posture in ways that don't correlate cleanly with task performance. Treat it as one signal among many; don't gate releases on it.
What's the right way to compare reasoning models on a benchmark? Report accuracy across multiple thinking-token budgets, not a single point. A "thinking-budget curve" reveals whether one model uses tokens more efficiently than another. Comparing a reasoning model at default budget to a non-reasoning model at zero budget is misleading; it is better to report cost-normalized accuracy.
How do I evaluate prompt-injection resistance? Specialized benchmarks: PromptBench, the Anthropic prompt-injection eval set (where available), TensorTrust. Pattern: feed adversarial inputs designed to override the system prompt and measure the rate of successful override. Re-run the measurement regularly, because new injection techniques keep emerging and a score from last quarter says little about this quarter's attacks.
What's the difference between Inspect AI and lm-evaluation-harness? Inspect AI (UK AISI) is purpose-built for safety and agent evaluation; lm-eval-harness (EleutherAI) is the broad-spectrum benchmarking workhorse. Inspect AI has stronger support for agentic tasks (tool use, multi-step), better trace inspection, and tighter integration with safety frameworks. lm-eval-harness has more breadth (hundreds of benchmarks) and better support for academic comparison. Use both for different things.
Should I trust vendor-reported benchmark numbers? With caveats. Vendor numbers are typically run under conditions that maximize the score (best prompt template, cherry-picked seed, generous parsing). Independent reproduction often comes in 1–3 percentage points lower. For coarse comparison, vendor numbers are useful; for production decisions, run the eval yourself with your protocol.
What's the "evaluator drift" problem? LLM-as-judge models update over time; today's judge is not the same as last quarter's judge. A quality regression detected by the judge might be a real regression, a judge regression, or a calibration shift. Mitigation: pin the judge model version for any given eval run; periodically re-calibrate against human consensus; report judge version in eval metadata.
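A minimal sketch of recording the judge version in eval metadata and refusing to compare runs scored by different judges. The record layout, the field names, and the model identifiers are all hypothetical; the only point is that the judge identifier is a pinned snapshot string stored next to the score.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalRun:
    run_id: str
    judge_model: str  # pinned snapshot identifier, not a floating alias
    score: float

def score_delta(baseline, candidate):
    """Return candidate minus baseline, but only if both used the same judge."""
    if baseline.judge_model != candidate.judge_model:
        raise ValueError(
            f"judge mismatch: {baseline.judge_model} vs {candidate.judge_model}; "
            "re-score one run before comparing"
        )
    return candidate.score - baseline.score

# Hypothetical runs; the judge identifiers are made-up snapshot names.
last_quarter = EvalRun("2026-01-rc3", "judge-model-2025-11-01", 0.81)
this_week = EvalRun("2026-04-rc1", "judge-model-2025-11-01", 0.84)
delta = score_delta(last_quarter, this_week)
```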
How do I evaluate a multimodal model's image understanding? MMMU (Massive Multi-discipline Multimodal Understanding), MM-Vet, MathVista, ChartQA, and DocVQA are the headline benchmarks in 2026. Each tests different aspects: MMMU is broad and academic; MathVista tests visual math reasoning; ChartQA tests chart understanding. Multi-benchmark evaluation is more reliable than any single number.
Can I use synthetic data for eval sets? For training data, yes, frequently. For eval sets, with caveats: synthetic eval items can have systematic biases that don't appear in real data; they're useful for stress-testing specific failure modes but should not replace real-data eval sets entirely. The 2026 best practice is a hybrid: real production data for the headline eval, synthetic data for targeted stress tests on hard or rare categories.
What's the "eval distribution shift" problem? Production data drifts; eval datasets don't. After 6 months, your eval set may no longer reflect what users are actually doing. Detection: periodically compare eval-set query distribution to production query distribution (e.g., topic mix, length, complexity). Mitigation: refresh eval sets quarterly from current production traces.
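The detection step can be sketched with a simple distance between topic mixes. This is a minimal illustration, assuming every eval item and every sampled production query already carries a topic label; total variation distance is one reasonable choice of metric, and the function name and labels are hypothetical.

```python
from collections import Counter

def total_variation(eval_labels, prod_labels):
    """Total variation distance between two categorical distributions.

    0.0 means identical topic mix; 1.0 means no overlap at all.
    """
    if not eval_labels or not prod_labels:
        raise ValueError("both label lists must be non-empty")
    p, q = Counter(eval_labels), Counter(prod_labels)
    n_p, n_q = len(eval_labels), len(prod_labels)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in categories)

# Hypothetical topic labels: the eval set has no "api" items at all.
eval_topics = ["billing"] * 5 + ["bugs"] * 5
prod_topics = ["billing"] * 2 + ["bugs"] * 3 + ["api"] * 5
drift = total_variation(eval_topics, prod_topics)
```

The same function works for any bucketed feature, for example length bins or complexity tiers; the alert threshold is something to calibrate against your own history.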
How do I evaluate the cost-quality trade-off across model tiers? Build a Pareto frontier: x-axis cost-per-task, y-axis quality metric. Plot each model variant. The frontier identifies the cost-quality sweet spots; every point below and to the right of it is dominated, meaning some other model is at least as good and no more expensive. Useful for routing decisions: "for this category, route to the cheapest model on the frontier above quality threshold X."
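A minimal sketch of the frontier computation and the routing rule, assuming each model variant is summarized as a (name, cost per task, quality) tuple. The function names, model names, and numbers are hypothetical.

```python
def pareto_frontier(models):
    """Keep the models that no other model beats on both cost and quality."""
    frontier = []
    for name, cost, quality in models:
        dominated = any(
            c <= cost and q >= quality and (c < cost or q > quality)
            for _, c, q in models
        )
        if not dominated:
            frontier.append((name, cost, quality))
    return sorted(frontier, key=lambda m: m[1])  # cheapest first

def cheapest_above(models, threshold):
    """Routing rule: cheapest frontier model meeting the quality threshold."""
    for candidate in pareto_frontier(models):
        if candidate[2] >= threshold:
            return candidate
    return None  # no model is good enough for this category

# Hypothetical measurements: (name, dollars per task, quality score).
models = [
    ("small", 1.0, 0.70),
    ("medium", 3.0, 0.82),
    ("bloated", 4.0, 0.80),  # costs more than "medium" and scores lower
    ("large", 10.0, 0.90),
]
route = cheapest_above(models, threshold=0.80)
```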
What's the role of "rubric-based" eval? Useful for open-ended generation where reference-based scoring doesn't apply. Define a rubric (e.g., 5 criteria, each 1–5 scale); the judge scores each criterion separately; aggregate into a quality score. Rubrics expose what the judge is weighing, which makes calibration easier than holistic 1–10 scores.
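The aggregation step can be sketched as a weighted mean of per-criterion scores, rescaled to 0–1. This is a minimal illustration: the criterion names, the optional weights, and the rescaling are assumptions, and equal weighting is only a default.

```python
def rubric_score(scores, weights=None, scale=(1, 5)):
    """Aggregate per-criterion judge scores into one quality score in [0, 1]."""
    if not scores:
        raise ValueError("at least one criterion is required")
    lo, hi = scale
    for criterion, value in scores.items():
        if not lo <= value <= hi:
            raise ValueError(f"{criterion}={value} is outside the {lo}-{hi} scale")
    if weights is None:
        weights = {criterion: 1.0 for criterion in scores}
    total_weight = sum(weights[c] for c in scores)
    weighted_mean = sum(weights[c] * scores[c] for c in scores) / total_weight
    return (weighted_mean - lo) / (hi - lo)

# Hypothetical judge output for one response, scored per criterion.
judge_scores = {"accuracy": 5, "completeness": 3, "tone": 4}
equal = rubric_score(judge_scores)
accuracy_heavy = rubric_score(
    judge_scores, weights={"accuracy": 2.0, "completeness": 1.0, "tone": 1.0}
)
```

Keeping the per-criterion scores alongside the aggregate preserves the calibration benefit: a drop in the aggregate can be traced to the criterion that moved.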
How do I evaluate fairness and bias? Along several dimensions: performance parity across groups (does quality vary by demographic group?), counterfactual fairness (does swapping group membership in the prompt change the answer?), and refusal-rate consistency (does the model refuse different groups at different rates?). Benchmarks: BBQ, StereoSet, the Anthropic bias eval set. Specialized tooling: Galileo, Patronus.
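Refusal-rate consistency is the easiest of these to compute directly. The sketch below assumes each eval record has already been reduced to a (group, refused) pair; the function names and the sample records are hypothetical, and detecting a refusal in the first place is out of scope here.

```python
from collections import defaultdict

def refusal_rates(records):
    """Per-group refusal rate from (group, refused) pairs."""
    totals = defaultdict(int)
    refused = defaultdict(int)
    for group, was_refused in records:
        totals[group] += 1
        refused[group] += int(was_refused)
    return {group: refused[group] / totals[group] for group in totals}

def max_refusal_gap(records):
    """Largest difference in refusal rate between any two groups."""
    rates = refusal_rates(records)
    if not rates:
        return 0.0
    return max(rates.values()) - min(rates.values())

# Hypothetical results for the same prompts rewritten for two groups.
records = [
    ("group_a", True), ("group_a", False), ("group_a", False), ("group_a", False),
    ("group_b", True), ("group_b", True), ("group_b", False), ("group_b", False),
]
gap = max_refusal_gap(records)
```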
What's a "behavioral test suite" in eval? Test items targeting specific behaviors rather than general capability: "model must refuse requests for medical diagnosis," "model must include source URLs when citing facts," "model must not generate JSON with trailing commas." Behavioral tests catch the rare-but-important failures that aggregate metrics hide.
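Two of the behaviors named above are simple to check mechanically. The sketch below is a minimal illustration using only the Python standard library; the function names are hypothetical, and the URL check is a crude pattern match, not a verification that the source supports the claim.

```python
import json
import re

URL_PATTERN = re.compile(r"https?://\S+")

def is_strict_json(output):
    """True if the output parses as strict JSON (trailing commas are rejected)."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def cites_a_url(output):
    """True if the output contains at least one http(s) URL."""
    return bool(URL_PATTERN.search(output))

def pass_rate(check, outputs):
    """Fraction of outputs that satisfy one behavioral check."""
    return sum(check(o) for o in outputs) / len(outputs)

# Hypothetical model outputs for a JSON-formatting behavior test.
json_outputs = ['{"status": "ok"}', '{"status": "ok",}', '[1, 2, 3]']
json_pass_rate = pass_rate(is_strict_json, json_outputs)
```

Behavioral tests like these are usually gated at or near a 100% pass rate, since a single failure is the event they exist to catch.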
How do evals interact with continuous fine-tuning? Tightly. Each fine-tune run produces a checkpoint that must pass the eval suite before promotion. Eval becomes the gate of the training pipeline. Pattern: train → eval → if regression, investigate → if pass, deploy to canary → if canary metrics pass, full rollout. The eval suite is the contract between training and production.
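The train → eval → canary → rollout pattern can be sketched as a small decision function. This is a minimal illustration: the metric names, the 0.01 regression tolerance, and the step labels are hypothetical, and sending a failed canary back to investigation is an assumption, since the pattern above only describes the passing path.

```python
def regressed_metrics(candidate, baseline, max_regression=0.01):
    """Metrics where the candidate checkpoint is worse than baseline beyond tolerance."""
    regressed = []
    for metric, base_value in baseline.items():
        # A metric missing from the candidate run counts as a regression.
        cand_value = candidate.get(metric, float("-inf"))
        if base_value - cand_value > max_regression:
            regressed.append(metric)
    return regressed

def next_step(candidate, baseline, canary_passed=None):
    """Decide what happens to a checkpoint after eval and (optionally) canary."""
    if regressed_metrics(candidate, baseline):
        return "investigate"
    if canary_passed is None:
        return "deploy_to_canary"
    # Assumption: a failed canary goes back to investigation.
    return "full_rollout" if canary_passed else "investigate"

# Hypothetical eval results for the production model and a new checkpoint.
baseline = {"accuracy": 0.90, "format_validity": 0.99}
candidate = {"accuracy": 0.91, "format_validity": 0.95}
decision = next_step(candidate, baseline)
```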
Can I rely on Chatbot Arena-style A/B preference data for production decisions? Partially. User-preference data is great for catching style and refusal regressions; it's poor for catching factual correctness regressions (users often can't verify accuracy in the moment). Pattern: use preference data as one signal in a multi-signal gate, not as the sole gate.
What's the future of eval beyond 2026? Three trends: (1) agentic eval taking over from single-turn eval as the headline; (2) safety/red-team eval becoming regulatory requirements (EU AI Act, US executive orders); (3) eval-as-a-service vendors consolidating around shared standards. The 2030 eval landscape will look very different from 2026; the core principles (workload-specific evals, contamination resistance, statistical rigor) will not.
Glossary
- Aggregate score — single number summarizing performance across many items.
- Bootstrap — statistical resampling method for computing confidence intervals.
- Calibration — alignment between predicted confidence and actual accuracy.
- Contamination — benchmark items appearing in model training data.
- Few-shot — providing example prompts before the test question.
- Goodhart's law — when a measure becomes a target, it ceases to be a good measure.
- Held-out — data not released publicly, used for clean evaluation.
- Model-graded — evaluation where another model scores the output.
- Pairwise comparison — judging which of two outputs is better.
- Protocol — the procedure used to run a benchmark.
- Rubric — explicit criteria for scoring an output.
- Zero-shot — no example prompts; just the test question.
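The bootstrap entry above can be made concrete with a short sketch: a percentile bootstrap confidence interval for accuracy over per-item pass/fail outcomes. The function name, the 2,000 resamples, and the fixed seed are illustrative assumptions, and the percentile method is only one of several bootstrap variants.

```python
import random

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean accuracy.

    outcomes is a list of 0/1 per-item results from one eval run.
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes)
    # Resample items with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper

# Hypothetical run: 80 of 100 items correct.
outcomes = [1] * 80 + [0] * 20
lower, upper = bootstrap_ci(outcomes)
```

On a 100-item eval the interval is several percentage points wide, which is why single-point differences between models on small sets rarely mean much.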
References
- HELM — Liang et al., 2022. "Holistic Evaluation of Language Models." arXiv:2211.09110. Comprehensive framework with explicit protocols.
- BIG-bench — Srivastava et al., 2022. "Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models." arXiv:2206.04615.
- MMLU — Hendrycks et al., 2020. "Measuring Massive Multitask Language Understanding." arXiv:2009.03300.
- MMLU-Pro — Wang et al., 2024. "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark." arXiv:2406.01574.
- HumanEval — Chen et al., 2021. "Evaluating Large Language Models Trained on Code." arXiv:2107.03374. Original pass@k formulation.
- FrontierMath — Glazer et al., 2024. "FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI." arXiv:2411.04872.
- GPQA — Rein et al., 2023. "GPQA: A Graduate-Level Google-Proof Q&A Benchmark." arXiv:2311.12022.
- LiveCodeBench — Jain et al., 2024. "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code." arXiv:2403.07974.
- Chatbot Arena — Chiang et al., 2024. "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference." arXiv:2403.04132.
- SWE-bench — Jimenez et al., 2023. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" arXiv:2310.06770.
- LLM-as-Judge — Zheng et al., 2023. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." arXiv:2306.05685. Biases in model-graded evaluation.
- Goodhart's law — Strathern, 1997. "'Improving Ratings': Audit in the British University System." European Review 5(3). The form of the law commonly cited today.
- Data contamination — Roberts et al., 2023. "Data Contamination Through the Lens of Time." arXiv:2310.10628.
- Lessons from the Trenches — Biderman et al., 2024. "Lessons from the Trenches on Reproducible Evaluation of Language Models." arXiv:2405.14782.
- lm-evaluation-harness — EleutherAI's widely-used eval framework. github.com/EleutherAI/lm-evaluation-harness.