
Measuring AI Progress: Why AGI Is the Wrong Scoreboard

A 2026 field guide to how AI progress is actually measured: Greg Kamradt's 7-level verification framework, OpenAI's 5 levels, DeepMind's Levels of AGI, and METR's task-horizon curve. Why 'AGI' is a moving, personal goalpost — and why verifiability, not generality, is the metric that predicts what AI can actually do for you.

By Prompt20 Editorial · 30 min read

"Did we get AGI yet?" is the wrong question, and not because the answer is hard. It's the wrong question because AGI is not a line you cross — it's a word people use to mean whatever capability they personally are waiting for. A mathematician's AGI arrives when a model proves a theorem they couldn't. A startup founder's AGI arrives when an agent ships a product that makes money. A radiologist's arrives when a system can be trusted to read a scan unsupervised. These don't happen on the same day, and no single benchmark score announces any of them.

A more useful lens, popularized by Greg Kamradt (President of the ARC Prize Foundation) and unpacked in The Neuron's explainer, is to stop asking how smart a system is and start asking how fast reality can tell us whether it was right. Intelligence only matters when it can be verified. This guide walks through Kamradt's 7-level verification framework, places it next to the other progress frameworks you'll see cited in 2026 (OpenAI's 5 levels, DeepMind's Levels of AGI, METR's task-horizon curve), and translates all of it into something you can actually use to reason about where AI is and isn't ready to be trusted.

This pairs with LLM evaluation infrastructure (how you measure a single model honestly) and benchmark hacking (how outcome-only scoring breaks once agents can game it). This post zooms out from a single benchmark to the question those guides serve: what does "progress" even mean?

Table of contents

  1. Key takeaways
  2. Mental model: progress is a verification problem
  3. Why "AGI" is the wrong scoreboard
  4. Kamradt's 7-level verification framework
  5. The other frameworks: OpenAI, DeepMind, METR
  6. Putting them together
  7. Where frontier models actually are in 2026
  8. How to use this
  9. FAQ
  10. References

Key takeaways

  • "AGI" is a personal, moving goalpost. Different people experience it at different times because each is waiting for a different capability. A single threshold can't capture that.
  • Verifiability, not generality, is the real metric. A capability matters when reality can confirm the work succeeded — and the speed and reliability of that confirmation is what's been improving.
  • Kamradt's framework ranks work by verification difficulty, from L1 (instant, objective — math, code, chess) to L7 (civilization-scale — governance, geopolitics, judged over generations).
  • AI conquers the levels roughly in order. Fast, cheap, objective verification (L1–L2) falls first; slow, expensive, contested verification (L5–L7) holds out longest — not because the reasoning is harder, but because the feedback loop is.
  • Other frameworks measure different axes. OpenAI's 5 levels track autonomy/role; DeepMind's grid crosses competence × generality; METR measures the time-horizon of tasks AI can complete. They're complementary, not competing.
  • In 2026, frontier models saturate L1–L2 and are climbing L3. L4+ (market, scientific, institutional verification) is where the real frontier sits, gated by feedback latency more than by raw model capability.
  • Practical upshot: to predict whether AI is ready for your task, don't ask "is it AGI yet?" — ask "how fast and how cheaply can I verify its output, and can I trust the verifier?"

Quick comparison: the frameworks in one table

Framework | Author / origin | Axis it measures | Levels | Best for
--- | --- | --- | --- | ---
7-level verification | Greg Kamradt / ARC Prize | How fast reality can verify the work | 7 (instant → civilizational) | Reasoning about trust and deployment readiness
5 levels of AI | OpenAI (2024) | Autonomy / organizational role | 5 (chatbot → organization) | Product/agent roadmaps
Levels of AGI | DeepMind (Morris et al., 2023) | Competence × generality | 6 × 2 grid | Academic capability classification
Task-horizon | METR (2025) | Length of task AI completes reliably | Continuous (minutes → hours) | Forecasting agent capability over time

Mental model: progress is a verification problem

Here's the one-minute version. Take any kind of work and ask: when the work is done, how do we know if it was good?

  • For a chess move, you know in seconds — the engine evaluates the position.
  • For a unit test, you know in seconds — it passes or it doesn't.
  • For an ad campaign, you know in days — click-through rates come back noisy.
  • For a startup, you know in years — the market either pays or it doesn't.
  • For a new drug, you know after expensive trials measured in years.
  • For an education policy, you know after a generation — and even then the counterfactual is unknowable.

This is the whole insight. The reason AI crushed chess and coding first isn't that those are "easy" — it's that they offer instant, objective, cheap verification, which means you can generate enormous amounts of training signal and optimize against it hard. The reason AI hasn't "solved" governance isn't that governance requires more IQ — it's that the feedback loop is decades long, the ground truth is contested, and you can't run the counterfactual. The bottleneck on progress is increasingly the verifier, not the model.

Verification gets harder ↓ (slower, costlier, more contested)

L1  math, code, chess          seconds      objective, cheap
L2  software eng, security     minutes      fast but incomplete
L3  copywriting, design        hours/days   human-judged, noisy
L4  startups, investing        months/years market-judged
L5  biology, medicine, robots  years        experiment-judged, expensive
L6  education, law, planning   years+       institution-judged, weak counterfactuals
L7  governance, culture        generations  civilization-judged, unknowable counterfactuals

Everything below is an elaboration of this picture.

Why "AGI" is the wrong scoreboard

The term "artificial general intelligence" carries three hidden problems that make it useless as a progress metric:

1. It's personal. What counts as "general enough" depends on what you need the system to do. Kamradt frames this as Human Special Intelligence (HSI) — every person has a unique skill profile, so everyone is effectively waiting for their own AGI. The day a model can do your specific bundle of work is the day it feels like AGI to you, and that day differs for everyone. There is no shared finish line to cross.

2. It's a threshold, and progress isn't. "AGI" implies a binary — before and after. But capability arrives unevenly across domains. A 2026 model can write production code (a hard, high-skill task) but cannot be trusted to run a city's transit policy (a task many humans do adequately). Calling either state "AGI / not-AGI" throws away all the structure that actually matters.

3. It's unfalsifiable in practice. Because the definition floats, "we have AGI" and "we don't have AGI" can both be defended indefinitely. A metric you can't lose is a metric you can't learn from. Compare this to a benchmark with a clear protocol — see eval infrastructure for why protocol-pinning is what makes a number mean anything.

The fix is not a better definition of AGI. It's to replace the single question with a structured one: not "how smart is it?" but "across which kinds of verifiable work can it now be trusted, and how quickly can that trust be established?"

Kamradt's 7-level verification framework

The framework ranks kinds of work by how reality verifies them. Crucially, the levels are ordered by verification difficulty, not task difficulty — the speed, cost, completeness, and contestability of the feedback that tells you whether the work succeeded.

Level | Domain examples | Feedback latency | Verification character
--- | --- | --- | ---
L1: Instant, objective | Math, code, formal proofs, chess/Go | Seconds–minutes | Objective, complete, cheap. A checker says yes/no.
L2: Fast but incomplete | Software engineering, data analysis, security testing | Minutes–hours | Tests pass, but coverage is partial; green tests ≠ correct system.
L3: Human-evaluable creative work | Copywriting, design, sales decks | Hours–days | A human judges it; feedback is real but noisy and subjective.
L4: Market-verifiable | Startups, investing, hiring | Weeks–years | The market is the judge. Slow, high-variance, confounded by luck.
L5: Experimentally verifiable science | Biology, medicine, robotics, materials | Months–years | Ground truth exists but experiments are expensive and slow.
L6: Institutionally verifiable | Education, law, urban planning | Years | Long cycles, weak counterfactuals, institutional mediation.
L7: Civilization-scale | Governance, culture, geopolitics | Decades–generations | Judged by history; the counterfactual is unknowable.

Three things make this framework useful rather than just tidy:

AI advances down the list roughly in order. The levels predict the sequence of automation. We got superhuman chess (L1) decades ago, superhuman competitive programming recently (L1–L2), and useful coding agents now (L2). Creative work (L3) is mid-disruption: models produce passable copy and design, gated by the noisiness of human judgment. L4 through L7 remain frontier precisely because you cannot generate fast training signal when the verifier takes years.

The bottleneck is the feedback loop, not the IQ. This reframes a lot of debates. "Why can't AI run a company?" isn't mainly about reasoning horsepower — it's that you get one noisy data point every few years, so neither the model nor the humans evaluating it can learn fast. Domains with slow verification resist optimization regardless of model quality.
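To make the feedback-loop argument concrete, here is a minimal arithmetic sketch of how many sequential verify-and-learn cycles fit into one year at different latencies. The latency values are illustrative assumptions chosen for this example, not measurements from Kamradt's framework.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

# Hypothetical, order-of-magnitude latencies (assumptions for illustration only).
ASSUMED_LATENCY_SECONDS = {
    "L1 unit test": 10,                          # ten seconds
    "L3 human review": 24 * 3600,                # one day
    "L4 market outcome": 2 * SECONDS_PER_YEAR,   # two years
}


def feedback_cycles_per_year(latency_seconds: float) -> float:
    """Sequential verify-and-learn cycles that fit into one year."""
    return SECONDS_PER_YEAR / latency_seconds


for name, latency in ASSUMED_LATENCY_SECONDS.items():
    print(f"{name}: {feedback_cycles_per_year(latency):,.1f} cycles/year")
```

Under these assumed latencies the gap between an L1 check and an L4 outcome is more than six orders of magnitude, which is the sense in which slow domains resist optimization regardless of model quality.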

It tells you where to trust AI today. The higher the level, the more a human must stay in the loop — not because the model is dumber there, but because nothing can quickly confirm it was right. At L1 you can let a verifier gate the output automatically; at L7 you're making an unverifiable bet.
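As a rough sketch of what "let a verifier gate the output automatically" can look like at L1, the snippet below accepts the first candidate that passes an objective checker. The candidate list stands in for model samples, and both function names are hypothetical, invented for this example.

```python
from typing import Callable, Iterable, Optional


def first_verified(candidates: Iterable[str],
                   verifier: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate the verifier accepts, or None if all fail."""
    for candidate in candidates:
        if verifier(candidate):
            return candidate
    return None


def is_root_of_quadratic(answer: str) -> bool:
    """Objective L1-style check: does x satisfy x^2 - 5x + 6 = 0?"""
    try:
        x = float(answer)
    except ValueError:
        return False  # unparseable output is rejected, never trusted
    return x * x - 5 * x + 6 == 0


# Stand-in for sampled model outputs (assumption: no real model is called here).
model_outputs = ["4", "not sure", "3"]
accepted = first_verified(model_outputs, is_root_of_quadratic)
print(accepted)
```

The gate works only because the checker is complete and cheap. No equivalent function can be written for an L7 question, which is why that case stays an unverifiable bet.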

The take. The frontier of AI isn't moving "toward AGI." It's moving along the verification ladder from L1 toward L7, converting more and more kinds of work from "a human has to judge this slowly" into "a fast, cheap, trustworthy check can gate it." Every kind of work that gains such a check is one where AI suddenly becomes deployable at scale. That, not a generality threshold, is the thing to watch.

The other frameworks: OpenAI, DeepMind, METR

Kamradt's is one of several lenses. The others measure different axes, and the confusion in most "are we at AGI" arguments comes from people using different ones without saying so.

OpenAI's 5 levels (autonomy / role)

OpenAI's internal framework (made public in 2024) ranks AI by the role it can play:

  1. Chatbots — conversational language ability.
  2. Reasoners — human-level problem-solving across domains.
  3. Agents — systems that take autonomous action over extended tasks.
  4. Innovators — AI that can aid in invention and discovery.
  5. Organizations — AI that can run the work of an entire organization.

This is an autonomy/role axis, useful for product and agent roadmaps. Note how it implicitly tracks Kamradt's ladder: "Innovators" maps to L4–L5 verification (markets, science) and "Organizations" to L6–L7, which are exactly the slow-verification regimes.

DeepMind's Levels of AGI (competence × generality)

Morris et al. (2023) propose a two-dimensional grid: a competence axis (Level 0 No AI → 1 Emerging → 2 Competent → 3 Expert → 4 Virtuoso → 5 Superhuman) crossed with a narrow vs. general axis. So "Superhuman Narrow AI" (AlphaFold, Stockfish) is distinct from "Competent General AI." The key contribution is decoupling how good from how broad — and insisting that capability be measured on benchmarks of real-world tasks, not vibes. See eval infrastructure for why that benchmarking is harder than it sounds.

METR's task-horizon curve (time)

METR's 2025 work measures something concrete: the length of task, measured by how long it takes a skilled human, that a model can complete with 50% success. Their finding, that this "task horizon" has been roughly doubling every ~7 months, turns progress into a forecastable trend line. If a model's 50% horizon is ~2 hours today, extrapolating that trend puts multi-day autonomous tasks within a few years. This is the quantitative cousin of Kamradt's L2→L3→L4 progression: longer tasks generally mean slower, more incomplete verification. (We track per-model horizon numbers in the data app's agent-horizon view.)
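The extrapolation can be checked with a few lines of arithmetic. This sketch assumes the ~7-month doubling continues as a clean exponential, which is a simplification, and it reads "multi-day" as 16 working hours (two 8-hour days), which is an assumption of this example rather than a METR definition.

```python
import math

DOUBLING_MONTHS = 7.0  # approximate doubling time quoted in the text


def projected_horizon_hours(current_hours: float, months_ahead: float,
                            doubling_months: float = DOUBLING_MONTHS) -> float:
    """Task horizon after months_ahead, assuming constant exponential growth."""
    return current_hours * 2 ** (months_ahead / doubling_months)


def months_until(target_hours: float, current_hours: float,
                 doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months needed for the horizon to grow from current_hours to target_hours."""
    return doubling_months * math.log2(target_hours / current_hours)


# From a ~2-hour horizon to 16 working hours takes three doublings.
print(months_until(16, 2))             # months to reach two working days
print(projected_horizon_hours(2, 36))  # horizon in hours after three years
```

Three doublings at seven months each is 21 months, consistent with the "within a few years" claim. The result is only as good as the assumption that the trend holds.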

Putting them together

The four frameworks are not rivals — they're projections of the same elephant onto different axes:

Axis | Framework | Question it answers
--- | --- | ---
Trust / verification speed | Kamradt 7 levels | Can I confirm it was right, and how fast?
Autonomy / role | OpenAI 5 levels | How much can it do on its own?
Competence × breadth | DeepMind Levels of AGI | How good, and how general?
Time | METR task-horizon | How long a task can it sustain?

They correlate but aren't redundant. A model can be Superhuman-Narrow (DeepMind) at an L1 task (Kamradt) — that's Stockfish, and it's not an "Agent" (OpenAI) at all. A model can have a long task horizon (METR) yet sit at L4 verification, meaning we still can't quickly tell if its long autonomous run was good. The richest read on any system uses all four: how good, how general, how autonomous, and how verifiable.

Where frontier models actually are in 2026

Mapping the current frontier onto Kamradt's ladder (deliberately qualitative — exact placement is contested and model-dependent):

  • L1 (instant, objective): saturated. Frontier models are at or beyond strong-human level on competition math and competitive programming. With a verifier in the loop, output can be auto-gated. This is also where reinforcement learning works best, because the reward is clean — see post-training.
  • L2 (fast, incomplete): rapidly maturing, with a catch. Coding agents resolve real GitHub issues, but incomplete verification is exactly what benchmark hacking exploits — green tests don't mean the system is right, and network-enabled agents learn to game the gap.
  • L3 (human-evaluable creative work): mid-disruption. Models produce usable copy, design, and decks; the ceiling is the noisiness of human judgment, which both limits training signal and makes "how good is it really" genuinely hard to answer.
  • L4 (market-verifiable): frontier. AI assists investing, hiring, and product decisions, but cannot be trusted unsupervised — the market's verdict is too slow and confounded to close the loop.
  • L5–L7 (science, institutions, civilization): assistive only. AI accelerates literature review, hypothesis generation, and drafting, but the expensive, slow, or unknowable verification means a human owns the decision. The constraint here is structural, not a model-capability gap that the next training run closes.
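The L2 caveat above, that green tests don't mean the system is right, is easy to reproduce. In this toy sketch a deliberately buggy `median` passes a test suite that happens to cover only odd-length inputs. The function and test names are made up for illustration.

```python
def median(values):
    """Deliberately buggy: correct only when the input has odd length."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def run_partial_tests() -> bool:
    """Incomplete suite: every case is an odd-length list, so all checks pass."""
    return (
        median([3, 1, 2]) == 2
        and median([5]) == 5
        and median([9, 7, 8, 1, 2]) == 7
    )


tests_green = run_partial_tests()
uncovered_result = median([1, 2, 3, 4])  # the true median is 2.5

print("tests green:", tests_green)
print("even-length result:", uncovered_result)
```

The suite reports success while the even-length case returns 3 instead of 2.5. An agent optimized only against that suite has no signal telling it the function is wrong, which is the gap incomplete verification leaves open.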

The pattern: the frontier is the verification boundary, currently sitting around L3–L4. Progress looks like pushing that boundary toward L7 by finding ways to make slower-verified work faster to check (better simulators for L5 robotics, faster market proxies for L4, rubric-based judges for L3).

How to use this

When you're deciding whether AI is ready for a task, skip "is this AGI?" and run three questions instead:

  1. How fast and cheap is verification? If you can check the output in seconds with an objective test (L1–L2), you can deploy with automated gates and let the model run. If verification takes weeks and is subjective (L4+), keep a human firmly in the loop.
  2. Can you trust the verifier? A fast verifier you can't trust is worse than a slow one you can — it's what lets reward hacking slip through. Invest in the checker, not just the model.
  3. What's the cost of a wrong answer that ships unverified? Low cost + fast verification = automate aggressively. High cost + slow verification = AI drafts, human decides.
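The three questions above can be sketched as a small triage function. The one-hour cutoff for "fast" and the handling of the fast-but-high-cost case are assumptions of this example, since the text only spells out the two extreme combinations. All names are hypothetical.

```python
from dataclasses import dataclass

FAST_VERIFY_SECONDS = 3600  # hypothetical cutoff for "fast" verification


@dataclass
class Task:
    verify_seconds: float    # question 1: how long until the output can be checked?
    verifier_trusted: bool   # question 2: can the checker itself be trusted?
    high_error_cost: bool    # question 3: is a wrong, unverified answer expensive?


def deployment_mode(task: Task) -> str:
    """Map the three answers to a rough deployment posture."""
    fast = task.verify_seconds <= FAST_VERIFY_SECONDS
    if not task.verifier_trusted:
        return "keep a human in the loop"   # untrusted verifier: fix the checker first
    if not fast:
        if task.high_error_cost:
            return "AI drafts, human decides"
        return "keep a human in the loop"
    if task.high_error_cost:
        # Assumption: the text does not cover fast-but-costly tasks explicitly.
        return "automate with human spot checks"
    return "automate aggressively"


print(deployment_mode(Task(5, True, False)))
print(deployment_mode(Task(30 * 24 * 3600, True, True)))
```

Checking the verifier's trustworthiness first reflects the point in question 2: a fast check that can be gamed should not unlock automation on its own.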

This is also a roadmap for builders: the highest-leverage AI work right now is often building the verifier — turning an L4 task into something with an L2-speed proxy check. Whoever makes a domain's verification faster and cheaper unlocks the next wave of automation in it.

FAQ

Q: Who created the 7-level verification framework? Greg Kamradt, President of the ARC Prize Foundation. The framing was popularized through his talks and writing and summarized in The Neuron's explainer. It builds on a long line of thinking about verification and reward in AI.

Q: Is this framework a replacement for AGI as a concept? It's a replacement for AGI as a scoreboard. "AGI" can stay as a loose aspirational term; the argument is that it's useless for measuring progress, and that verification difficulty is a far more predictive and falsifiable axis.

Q: How is this different from OpenAI's 5 levels or DeepMind's Levels of AGI? They measure different axes. OpenAI's levels rank autonomy/role (chatbot → organization); DeepMind's grid crosses competence with generality; Kamradt's ranks how fast reality can verify the work. METR's task-horizon adds a time axis. Use them together, not instead of each other.

Q: Why did AI master chess and coding before "easier" everyday tasks? Because chess and code offer instant, objective, cheap verification (L1), which produces abundant training signal you can optimize against. Many "easy for humans" tasks (L3+) have slow, noisy, or expensive verification, so there's little fast signal to learn from — the difficulty is in the feedback loop, not the task.

Q: Does a longer METR task-horizon mean we're approaching AGI? It means models can sustain longer autonomous tasks, which is real and forecastable progress. But horizon length doesn't tell you whether the long run was correct — that's the verification axis. A long-horizon agent on an L4 task is still something you can't quickly trust.

Q: What's the single most actionable takeaway? To judge AI readiness for any task, ask how fast and cheaply you can verify its output and whether you trust the verifier. Fast-and-trustworthy → automate. Slow-or-untrustworthy → keep a human in the loop. The frontier of useful AI is the frontier of cheap verification.

References

  • The Neuron — "AGI Is the Wrong Scoreboard" explainer of Kamradt's 7-level framework. theneuron.ai.
  • ARC Prize Foundation — Greg Kamradt's work on abstraction, reasoning, and verifiable progress. arcprize.org.
  • DeepMind — Levels of AGI — Morris et al., 2023. "Levels of AGI for Operationalizing Progress on the Path to AGI." arXiv:2311.02462.
  • METR — Measuring task horizons — 2025. "Measuring AI Ability to Complete Long Tasks." arXiv:2503.14499.
  • OpenAI's 5 levels of AI — reported framework ranking AI from chatbots to organizations (Bloomberg, 2024).
  • Goodhart's law — Strathern, 1997. "'Improving Ratings': Audit in the British University System." European Review 5(3). Why a metric that becomes a target stops measuring what it did.