AI Coding Agents: The Ultimate Guide (Cursor, Claude Code, Codex CLI, Devin, Aider, Cline, and the Stack Around Them)
Comprehensive 2026 guide to AI coding agents — the IDE stack (Cursor, Windsurf, Zed), the CLI stack (Claude Code, Codex CLI, Aider, OpenHands, Goose, Gemini CLI), the autonomous-agent stack (Devin, Manus, Lovable), the harnesses underneath (OpenClaw, SWE-agent), the model choices, the benchmarks (SWE-Bench Pro, Terminal-Bench 2, ClawEval, PinchBench), the economics, and how production teams actually compose them.
In 2023 the AI coding assistant story was Copilot: single-file completion inside VS Code, driven by a tuned GPT-3.5 derivative. In 2024 Cursor proved that forking the IDE could ship dramatically better UX than the plugin model, and Cognition's Devin promised autonomous "AI software engineer" agents. In 2025 the term "vibe coding" entered the lexicon, and Claude Code, Aider, OpenHands, and the early Codex CLI established the CLI agent as a legitimate primary work surface. By mid-2026 the landscape has matured: four categories of tools, each with 3-8 credible options, and a coherent stack of model + harness + IDE + protocol underneath. The teams that ship the most code with AI are not the teams using the single "best" tool; they are the teams running 2-3 of these in parallel, with explicit handoff patterns.
The take: in 2026 there are four practical coding-agent surfaces you'll choose from. IDEs (Cursor, Windsurf, Zed AI, Continue + VS Code) — best when you want a chat panel and a tab-completion model integrated with editor state. CLIs (Claude Code, Codex CLI, Aider, OpenHands, Gemini CLI, Goose, Cline, Roo, Qwen Code, Kimi CLI) — best for terminal-native workflows, agent loops that span repos, headless CI integration, and when you want to mix-and-match underlying models. Autonomous web agents (Devin, Manus, Lovable, GitWit, Magic Patterns) — best for "ship a feature from a ticket" workloads where you don't watch each step. App builders (Lovable, Bolt, v0, Replit Agent) — best for greenfield consumer-app prototypes. The choice between these is less about which is "best" and more about which fits your loop. This guide is the map: who ships what, the harness layer underneath (OpenClaw, SWE-agent), the model choices, the benchmarks that actually predict real-world utility (SWE-Bench Pro, Terminal-Bench 2, ClawEval, PinchBench, not HumanEval), the cost math, and the production patterns and anti-patterns that have shaken out.
Companion reading: open weights ultimate guide for the model side, agent protocols (MCP / A2A / ACP) for the connector layer underneath, agent serving infrastructure for the runtime, eval infrastructure for trace-based testing, post-training: RLHF and DPO for the underlying RL methods, and benchmark hacking for why most coding benchmarks are no longer trustworthy as-is.
Table of contents
- Key takeaways
- Mental model: four agent surfaces
- The IDE stack (Cursor, Windsurf, Zed, Continue, GitHub Copilot Workspaces)
- The CLI stack (Claude Code, Codex CLI, Aider, OpenHands, Gemini CLI, Goose, Cline, Roo, Qwen Code)
- The autonomous web-agent stack (Devin, Manus, Lovable, GitWit)
- The app-builder stack (Lovable, Bolt, v0, Replit Agent, Magic Patterns)
- Harnesses underneath: OpenClaw, SWE-agent, AutoCodeRover, MetaGPT
- Model choice: which LLM under which agent
- The benchmark reality check (SWE-Bench Pro, Terminal-Bench 2, ClawEval, PinchBench)
- The MCP integration layer (filesystem, git, GitHub, Linear, Sentry servers)
- Cost economics: per-task, per-developer, per-month
- Latency and feedback-loop design
- Where each tool wins (concrete recommendations)
- Production patterns that work in 2026
- Anti-patterns
- Security and supply-chain risks
- The 2026 → 2027 outlook
Key takeaways
- Four agent surfaces: IDE (Cursor, Windsurf, Zed AI, Continue), CLI (Claude Code, Codex CLI, Aider, OpenHands, Gemini CLI, Goose, Cline, Roo, Qwen Code), autonomous web (Devin, Manus, Lovable, GitWit), app builder (Lovable, Bolt, v0, Replit Agent, Magic Patterns). Most production teams use 2-3 in parallel.
- Cursor (Anysphere) leads the IDE category in user base; Windsurf (acquired by OpenAI, formerly Codeium) is the close second; Zed AI is the fastest-growing for terminal-native developers. Continue is the leading open-source extension.
- Claude Code (Anthropic) is the leading CLI agent by adoption in 2025-2026, with Codex CLI (OpenAI), Aider (oldest, open-source), and OpenHands (All Hands AI) as the most credible alternatives. Gemini CLI (Google) and the Chinese variants (Qwen Code, Kimi CLI) round out the field.
- Devin (Cognition) remains the standard autonomous-agent demo; it's competitive but expensive ($500/mo per seat). Manus (Butterfly Effect, China) is the Chinese counterpart, gaining traction in 2026.
- Lovable, Bolt, v0, Replit Agent, Magic Patterns dominate the prototype-an-app surface. Best for greenfield React/Next apps; rapidly improving but still weaker than human-built for production codebases.
- The harness layer matters. Claude Code uses Anthropic's native skills/plugins/hooks; OpenClaw (Xiaomi) is the open Claude-Code-compatible harness powering MiMo and others; SWE-agent (Princeton) is the academic standard underlying many evals. Choose the harness, then the model, then the surface.
- Benchmarks to trust in 2026: SWE-Bench Pro (replaces SWE-Bench Verified for frontier discrimination), Terminal-Bench 2.0 (agent-task realism), ClawEval / PinchBench (OpenClaw harness), Aider Polyglot. HumanEval and MBPP are saturated and barely discriminating at the frontier; ignore them.
- Model under each agent matters at least as much as the agent. Claude Opus 4.7 + Aider often beats GPT-5.5 + Cursor; same agent with different models can differ 10-20% on real tasks.
- MCP (Model Context Protocol) is the de-facto connector layer in 2026: every major IDE and CLI agent supports it, and the GitHub / Linear / Sentry / Notion / filesystem MCP servers are the most-installed integrations.
- Cost ranges roughly 50× across the category: $20-50/mo per developer for Cursor/Windsurf Pro, $50-200 for serious CLI usage paying per-token, $500/mo for Devin seats, and $1k or more for heavy enterprise Cursor Ultra / Codex Pro plans.
- Production teams use parallel loops: tab-complete model in IDE for routine, Claude Code or Codex CLI for multi-file refactors, Devin / Manus for bounded async tickets, and human review for everything above trivial size.
Mental model: four agent surfaces
A 2026 coding agent product fits in one of four categories, defined by how the developer interacts with it and what level of autonomy is implied:
1. IDE-resident agents (synchronous, in-editor): a chat panel + tab completion + selection-driven edits, integrated with your editor's state (open files, cursor position, diagnostics). You stay in the IDE; the agent assists move-by-move. Examples: Cursor, Windsurf, Zed AI, Continue, GitHub Copilot Workspaces.
2. CLI agents (synchronous, terminal-native): a terminal program you run that takes a natural-language task, plans, executes shell commands, edits files, and reports back. You watch (or skim) each step in your terminal. Examples: Claude Code, Codex CLI, Aider, OpenHands, Gemini CLI, Goose, Cline (technically VS Code-resident but CLI-flavored), Roo Code, Qwen Code, Kimi CLI.
3. Autonomous web agents (async, browser-or-cloud): a hosted agent that takes a ticket-shaped task, runs for minutes-to-hours, and returns a PR or report. You don't watch each step; you watch the rate of merged PRs. Examples: Devin (Cognition), Manus (Butterfly Effect), Lovable (for app projects), GitWit.
4. App builders / vibe coding (synchronous, conversational, greenfield): a chat-driven product that builds a Next.js / React app from scratch, deploys to its own infrastructure (or your Vercel), and lets you iterate by chat. Examples: Lovable, Bolt (StackBlitz), v0 (Vercel), Replit Agent, Magic Patterns, Webcrumbs, Trickle, Codev, Tldraw Computer, Open Lovable (Mendable).
The categories blur:
- Cursor has a "background agent" mode that's more like category 3.
- Devin can be driven via a chat UI that feels like category 4.
- Lovable can be invoked as a CLI for category-2-like workflows.
But the four-category split holds for most decisions.
The IDE stack
The fork-VS-Code era is mature; the plugin era is shrinking. The 2026 contenders:
Cursor (Anysphere, San Francisco) — VS Code fork. Composer mode (multi-file edit), Tab completion (Cursor's custom small model), Chat, and now Cursor Agents (background autonomous tasks). The leading paid AI IDE by user count. Pricing: $20/mo Pro, $40/mo Business, $200/mo Ultra. Strong on tab completion latency and multi-file context. Backed by a $9B+ valuation as of 2025-2026; deep model partnerships (Claude, GPT, Gemini, Grok).
Windsurf (acquired by OpenAI from Codeium, late 2025) — VS Code fork. Cascade is the flagship "flow-aware" agent that knows what you just did and plans accordingly. Now OpenAI-funded and tightly integrated with OpenAI models (Codex CLI shares engineering). Pricing converging with Cursor.
Zed AI (Zed Industries) — native Rust editor with built-in AI features. Smallest install, fastest IDE. Strong appeal for vim-flavored / minimalist developers. AI features include a chat side panel and an agent panel with MCP integration. Pricing: free + paid AI tiers.
Continue (Continue Dev) — open-source VS Code + JetBrains extension. The leading open IDE agent; bring your own model. Pricing: free OSS; managed cloud for teams.
GitHub Copilot Workspaces — the GitHub-native answer. Spec → plan → implementation → review flow. Tight GitHub Issues + Actions integration. Pricing: included in Copilot Business / Enterprise.
JetBrains AI Assistant + Junie — JetBrains' answer for the IntelliJ family. Junie is the autonomous-task agent. Strong for Java/Kotlin/Python shops on IntelliJ.
Amp (Sourcegraph) — Sourcegraph's evolution of Cody. Code-graph-aware (Sourcegraph's strength). Strong for enterprises with large monorepos.
Cody (Sourcegraph) — older Sourcegraph product, being superseded by Amp.
TabbyML — open-source, self-hosted IDE assistant for code completion. Privacy-first.
Picking among IDEs:
- Cursor for default — best UX, strongest community, most model options.
- Windsurf if you're an OpenAI shop or prefer Cascade's flow-aware UX.
- Zed AI for terminal-native / Rust / minimalist developers.
- Continue for self-hosted open weights + privacy.
- Copilot Workspaces for GitHub-Issue-driven shops.
- JetBrains Junie / AI Assistant for IntelliJ family.
- Amp for very large monorepos where code-graph matters.
The CLI stack
The biggest growth category of 2025-2026. The 2026 contenders:
Claude Code (Anthropic) — official Anthropic CLI. The category leader by adoption. Skills, plugins, hooks, subagents, MCP-native. Designed for Claude models but works with any Anthropic-API-compatible endpoint (which includes Kimi K2.6 via Moonshot's Anthropic-compat API). Pricing: included with Claude Pro / Team / Enterprise; pay-per-token for API users.
Codex CLI (OpenAI) — OpenAI's response to Claude Code. Open-source (MIT). Works with GPT-5 family. Strong on shell-task execution and structured output. Tight integration with the OpenAI Responses API.
Aider (Paul Gauthier) — the original. Open-source, Python. Minimal, model-agnostic, git-aware. Strong adoption among developers who want a small, hackable tool with their own model choice. The most-cited "still works great" tool in the space.
OpenHands (All Hands AI) — open-source. Originally OpenDevin. Browser + terminal + editor capabilities in one agent. Cloud and self-hosted versions; strong on benchmark performance (SWE-Bench Verified leaderboard regular).
Gemini CLI (Google DeepMind) — Google's CLI agent for Gemini models. Open-source. Tight Vertex AI / Google Cloud integration.
Goose (Block) — open-source. Strong on extensibility via "toolkits"; good MCP support. Polished UX from Block's design team.
Cline (open-source) — Claude-focused CLI / VS Code extension. Plan-act mode separation. Strong adoption in the open-source community.
Roo Code (open-source fork of Cline) — multi-mode operation; community-driven extension of Cline.
SWE-agent (Princeton NLP) — academic; the harness behind many published SWE-Bench results. Less polished UX but used as the benchmark-standard scaffolding.
Qwen Code (Alibaba) — Qwen-family CLI. Apache 2.0. Strong on Chinese-language coding tasks.
Kimi CLI (Moonshot) — Kimi K2.6 native CLI. Anthropic-compatible API so Claude Code also works against it.
OpenCode (Anomaly / SST) — newer entrant, MIT, fast-iterating.
Hermes-Agent (Nous Research) — open-source, designed for Nous Hermes models but works with most open weights.
Picking among CLI agents:
- Claude Code for default — most-polished UX, strongest plugin/skill ecosystem.
- Codex CLI for OpenAI shops or when you want OSS-licensed CLI from a major vendor.
- Aider for hackable / minimalist / model-agnostic workflows.
- OpenHands for browser+terminal+editor agent tasks and self-hosted deployment.
- Gemini CLI for Google-shop / Vertex AI integration.
- Goose for extensibility-first or Block-ecosystem teams.
- Cline / Roo for VS Code-integrated CLI experience.
- Qwen Code / Kimi CLI for Chinese-model-first workflows.
The autonomous web-agent stack
The "ticket → PR" autonomous category. Smaller than IDE / CLI but high-profile.
Devin (Cognition Labs) — the category-defining product, launched March 2024. Hosted in Cognition's cloud; you brief Devin via chat / Slack / Linear; it runs in its own VM and returns a PR. Pricing: $500/mo per seat (was higher at launch). Devin 3 (late 2025) is significantly more reliable than the launch version and is now strong on bounded refactors, dependency upgrades, and small features.
Manus (Butterfly Effect, China) — Chinese answer to Devin. Strong agentic capabilities. Often invoked from a chat interface. Released open API late 2024; growing global use. Manus 2 (mid-2025) is the current generation.
Lovable (Lovable AI) — primarily app-builder (category 4) but increasingly used for autonomous tickets in greenfield Next.js / Supabase projects.
GitWit — autonomous coding agent focused on the "complete a Linear ticket" workflow.
Code (Cognition's second-generation pro tool, late 2026) — Cognition's evolution beyond Devin; tighter ops integration; planning improvements.
OpenHands Cloud — managed version of OpenHands offering Devin-style async workflows.
Replit Agent (Async mode) — Replit's autonomous workflow tier.
Picking:
- Devin is still the best default for bounded autonomous tasks in production codebases, despite cost.
- OpenHands Cloud is the credible open-source-backed alternative.
- Manus for Chinese-market or cost-sensitive workflows.
- Lovable / Replit Agent for greenfield app territory.
The category is real but more limited than 2024 hype suggested. Most teams using these have 1-3 tickets per week per developer in flight, not 10+. The reliability profile rewards bounded, well-specified work and degrades on ambiguous or large-scale tasks.
The app-builder stack
Greenfield-prototype-an-app territory. The 2026 contenders:
- Lovable — leading conversational app builder. Generates full-stack Next.js + Supabase apps. Strong UX for non-engineers; growing engineering use for prototypes.
- Bolt (StackBlitz) — open-source; in-browser WebContainer execution; very fast iteration.
- v0 (Vercel) — UI-component-first; tight Next.js + Vercel deployment integration.
- Replit Agent — leverages Replit's hosting + DB stack for one-stop generation + deployment.
- Magic Patterns — design-system-aware; popular with teams that want consistent component output.
- Webcrumbs — newer entrant; lower barrier for non-coders.
- Trickle — workflow / agent-app builder; less code-focused.
- Codev — code-first; opinionated stack.
- Tldraw Computer — visual-canvas-driven app builder; experimental.
- Open Lovable (Mendable) — open-source Lovable clone.
- Same — newer competitor in the same space.
- GitWit — overlaps; both app-builder and autonomous-ticket.
Picking:
- Lovable for default — most-polished UX.
- v0 if you're a Vercel shop or want React-component-only output.
- Bolt for WebContainer-based ephemeral prototypes.
- Replit Agent for one-stop hosted prototypes with DB.
- Magic Patterns for design-system-consistent output.
Honest limitation: all of these are great for "build this prototype" and weak for "extend this 100k-line existing codebase." Don't use them as primary tools on large established repos.
Harnesses underneath
The harness is the agent loop architecture: how the agent plans, calls tools, observes results, and decides next steps. The user-facing tool is often a thin wrapper around a harness.
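To make the plan / call tools / observe / decide cycle concrete, here is a minimal sketch of a harness loop. It is not the implementation of any product named in this guide: `Harness`, `Step`, and `scripted_policy` are hypothetical names, and the policy function stands in for the model call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    tool: str  # name of the tool to call, or "done" to stop
    arg: str   # single string argument, kept simple for the sketch


@dataclass
class Harness:
    policy: Callable[[str, list[str]], Step]  # stand-in for the LLM call
    tools: dict[str, Callable[[str], str]]    # registered tools by name
    max_steps: int = 10                       # hard stop against runaway loops
    history: list[str] = field(default_factory=list)

    def run(self, task: str) -> list[str]:
        for _ in range(self.max_steps):
            step = self.policy(task, self.history)  # plan / decide next step
            if step.tool == "done":
                self.history.append(f"done: {step.arg}")
                break
            tool = self.tools.get(step.tool)
            if tool is None:
                observation = f"error: unknown tool {step.tool!r}"
            else:
                observation = tool(step.arg)        # act
            # observe: the result is fed back to the policy on the next turn
            self.history.append(f"{step.tool}({step.arg}) -> {observation}")
        return self.history


# Hypothetical scripted policy so the sketch runs without a model.
def scripted_policy(task: str, history: list[str]) -> Step:
    script = [Step("read_file", "app.py"), Step("run_tests", ""), Step("done", "tests pass")]
    return script[min(len(history), len(script) - 1)]


FILES = {"app.py": "print('hi')"}
harness = Harness(
    policy=scripted_policy,
    tools={
        "read_file": lambda path: FILES.get(path, "error: not found"),
        "run_tests": lambda _: "3 passed",
    },
)
trace = harness.run("make tests pass")
```

Real harnesses differ mainly in what surrounds this loop: how the history is compressed, which tools are registered, and when a human approval step interrupts the cycle.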
Claude Code's native harness (Anthropic) — built around Anthropic's tool-use, skills (reusable subagent profiles), and hooks (event-driven scripts). Closed but extensively documented.
OpenClaw (Xiaomi) — open-source Claude-Code-compatible harness. Powers MiMo coding deployments. Designed to be a drop-in for Claude Code with open weights underneath. Anthropic-API-compatible.
SWE-agent (Princeton NLP) — academic harness; the standard for SWE-Bench Verified leaderboard submissions. Minimal but well-instrumented; widely used in research.
AutoCodeRover — academic; spectrum-driven program repair harness.
MetaGPT — multi-agent harness with role-playing (PM, architect, engineer, QA). Older but still cited.
OpenHands harness — open-source, browser + terminal + editor; the underlying scaffolding behind OpenHands product and OpenHands Cloud.
Aider's harness — minimal git-diff-based; the most-imitated "small clean" harness.
Goose's toolkit harness (Block) — modular; tools registered as named "toolkits."
Devin's harness (Cognition, closed) — proprietary; details inferred from output behavior. Strong planner / replanner; persistent VM with file system, shell, browser.
Hermes-Agent harness (Nous Research) — open-source; designed for Nous models but model-agnostic.
Codex CLI harness (OpenAI) — open-source, MIT.
Choosing a harness matters when you're: building your own agent on top, evaluating quality across harnesses, or measuring why a model performs differently in two products. Most end-users don't pick the harness directly — they pick a product and inherit its harness.
Model choice: which LLM under which agent
The agent is half the equation; the model is the other half. Same agent + different model = 10-20% different outcomes on real tasks.
Best closed models for coding (May 2026):
- Claude Opus 4.7 — generally the strongest on complex multi-file refactors and reasoning-heavy work. Default for Claude Code.
- Claude Sonnet 4.6 — cost-efficient workhorse for routine work.
- GPT-5.5 / GPT-5.4 — top on competitive-programming-style problems; strong default for Codex CLI.
- Gemini 3.1 Pro — strong on long-context work (>1M tokens); best for "understand this enormous repo first" workflows.
- Grok 4.20 — strong on math/science-heavy code.
Best open weights for coding:
- DeepSeek V4 / V3.2 — best open-weight code model on most public benchmarks; serves cheaply.
- Qwen 3.6-Plus / 35B-A3B / 27B — strong; Apache 2.0; good multilingual code support.
- GLM-5.1 / 5V Turbo — strong on agentic harness benchmarks (ClawEval, OpenClaw).
- Kimi K2.6 — long-context flagship; works with Claude Code via Moonshot's Anthropic-compat API.
- MiMo V2.5 / V2.5-Pro (Xiaomi) — OpenClaw harness sweet-spot; specifically tuned for the harness.
- Codestral / Devstral (Mistral) — Apache; strong on routine code generation.
Model-agent affinity:
- Cursor / Windsurf / Zed: use multiple models, typically Claude for chat and refactors and an OpenAI or in-house model for tab completion. Cursor uses its own custom model for tab, and users pick the chat model.
- Claude Code: Claude family (Opus, Sonnet, Haiku) by default; supports Kimi K2.6 via Anthropic-API-compat.
- Codex CLI: GPT-5 family by default.
- Aider: model-agnostic; commonly run with Claude or GPT or DeepSeek.
- OpenHands: model-agnostic; benchmarks show DeepSeek + OpenHands competitive with Claude + Claude Code on SWE-Bench.
- Devin: Cognition-tuned; doesn't expose model choice in standard product.
- Gemini CLI: Gemini family.
- Qwen Code: Qwen family.
The benchmark reality check
Which benchmarks actually predict real-world performance in 2026?
Trustworthy (tier A-B):
- SWE-Bench Pro — harder + less-contaminated successor to SWE-Bench Verified. The benchmark to cite in 2026.
- SWE-Bench Multilingual — extends Verified beyond Python. Useful for polyglot teams.
- Terminal-Bench 2.0 / Terminus-2 — agentic terminal task realism. Strong correlation with real CLI-agent quality.
- ClawEval (Xiaomi) — OpenClaw-harness-specific; signals harness quality more than model quality.
- PinchBench (Xiaomi) — OpenClaw + cost-per-trajectory; usefully reveals token-efficiency differences.
- Aider Polyglot benchmark — Aider-specific; well-curated multi-language tasks.
- LiveCodeBench — monthly rotation of competitive-programming problems. Contamination-resistant.
Approaching saturation / contamination-suspected (tier B-C):
- SWE-Bench Verified — still cited; widely contaminated and Berkeley RDI showed harness-layer exploits ("Exploiting AI Agent Benchmarks", Apr 2026).
- HumanEval / MBPP — saturated, in training corpora. Skip.
- APPS — older, partially contaminated.
Domain-specific / specialty:
- OJBench — competitive-programming Olympiad problems.
- SciCode — scientific computing.
- Design2Code — UI mockup → React.
- BigCodeBench — broader programming task suite.
Caveat: even tier-A benchmarks correlate ~0.6-0.8 with real-world team productivity gains. The best measurement remains your own gold set of 20-50 representative tasks from your codebase.
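A gold set can be as simple as a list of prompts with pass/fail checks that you re-run against each candidate. The sketch below is one possible shape, not a standard tool: `GoldTask`, `pass_rate`, and `compare` are hypothetical names, and the two stub agents replace real agent invocations.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldTask:
    task_id: str
    prompt: str
    check: Callable[[str], bool]  # True if the agent's output is acceptable


def pass_rate(tasks, run_agent):
    """Run every task through one agent; return (pass rate, per-task results)."""
    results = {t.task_id: bool(t.check(run_agent(t.prompt))) for t in tasks}
    return sum(results.values()) / len(results), results


def compare(tasks, candidates):
    """Pass rate per candidate (agent + model combination) on the same gold set."""
    return {name: pass_rate(tasks, fn)[0] for name, fn in candidates.items()}


# Two toy tasks; a real gold set would hold 20-50 tasks from your own codebase.
GOLD = [
    GoldTask("rename-fn", "rename foo to bar",
             lambda out: "bar" in out and "foo" not in out),
    GoldTask("add-test", "add a test for parse()",
             lambda out: "def test_" in out),
]

# Stub agents standing in for real agent runs.
agent_a = lambda prompt: "def bar(): ...\ndef test_bar(): ..."
agent_b = lambda prompt: "def foo(): ..."

scores = compare(GOLD, {"agent_a": agent_a, "agent_b": agent_b})
```

In practice the `check` functions usually run the repository's test suite against the agent's diff rather than inspect strings.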
The MCP integration layer
In 2026 every serious coding agent supports MCP. The most-installed MCP servers in coding workflows:
- Filesystem MCP — reference server; every agent ships this by default.
- GitHub MCP — issues, PRs, code search, repo state.
- Git MCP — local git ops via standardized interface.
- Linear MCP — ticket / project management.
- Sentry MCP — error context for "fix this bug" tasks.
- Postgres / SQLite MCP — DB schema and query access.
- Slack MCP — pull discussion context.
- Notion MCP — doc context.
- Browser MCP (Playwright / Puppeteer) — for verification of UI changes.
- Stripe MCP, AWS MCP, Cloudflare MCP, Vercel MCP — for deployment-aware agents.
The "right" stack for most teams: filesystem + git + GitHub + Linear + Sentry. Skip the rest until you have a specific need.
See agent protocols (MCP / A2A / ACP) for the deep dive on MCP itself.
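One way to enforce the five-server default is an allowlist check over whatever servers a developer has configured. This is a sketch under assumptions: `audit_mcp_servers` is a hypothetical helper, and the server names are plain labels rather than real package identifiers.

```python
# The five servers recommended above as the default stack.
RECOMMENDED = {"filesystem", "git", "github", "linear", "sentry"}


def audit_mcp_servers(configured, allowlist=RECOMMENDED):
    """Split configured server names into allowed and needs-justification."""
    names = {name.lower() for name in configured}
    return sorted(names & allowlist), sorted(names - allowlist)


allowed, extra = audit_mcp_servers(
    ["filesystem", "GitHub", "stripe", "notion", "sentry"]
)
# `extra` lists the servers someone should be able to justify keeping.
```

Running a check like this in CI or a setup script keeps the installed set small without blocking a team that has a specific need for an extra server.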
Cost economics
Three cost regimes:
Per-developer flat-rate (most predictable, easiest to budget):
- Cursor: $20/mo Pro, $40 Business, $200 Ultra.
- Windsurf: similar ranges.
- Zed AI: free + $20-30/mo tiers.
- Continue: free OSS; small team fees.
- GitHub Copilot: $10 Individual, $19 Business, $39 Enterprise.
Per-token (CLI agents that pass through API costs):
- Claude Code: passes through your Anthropic API cost ($3/M input, $15/M output at Sonnet-class rates; Opus costs more). Heavy users spend $200-500/mo.
- Codex CLI: same shape with OpenAI ($1.25/M input, $10/M output for GPT-5).
- Aider: model-agnostic; usually $50-200/mo for moderate use with frontier models.
- OpenHands: same shape; self-hostable with open weights for near-zero token cost.
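The per-token figures translate into a monthly bill once you assume a token volume. The sketch below uses the per-million rates quoted in the list above; the 60M input / 8M output monthly volume is an assumption chosen for illustration, and the calculation ignores prompt caching and batch discounts.

```python
def monthly_token_cost(input_mtok, output_mtok, input_rate, output_rate):
    """Dollars per month; volumes in millions of tokens, rates in $ per million."""
    return input_mtok * input_rate + output_mtok * output_rate


# Assumed volume for one heavy CLI user: 60M input, 8M output tokens per month.
anthropic_quoted = monthly_token_cost(60, 8, input_rate=3.00, output_rate=15.00)
openai_quoted = monthly_token_cost(60, 8, input_rate=1.25, output_rate=10.00)
```

Agent loops re-send context on every step, so input tokens usually dominate volume. That is why the input rate matters more than the headline output rate for most CLI workloads.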
Per-seat for autonomous agents:
- Devin: $500/mo per seat.
- Cognition's Code: enterprise pricing.
- Manus: cheaper; per-task pricing in some tiers.
Hybrid heavy users: $20 Cursor + $200/mo Claude Code API + $500 Devin = $720/mo per developer if all-in. Compared to ~$10k/mo fully-loaded developer cost, this is ~7% — and the productivity multiplier is well-established at 1.5-3× for routine work. The cost question is rarely the limiter; the integration / culture / review-bottleneck questions are.
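The arithmetic behind that comparison, worked through with the figures from the paragraph above. `breakeven_uplift` is a hypothetical helper that expresses the same ratio as the productivity gain at which the tooling pays for itself.

```python
# Monthly figures taken from the paragraph above.
stack = {"Cursor Pro": 20, "Claude Code API": 200, "Devin seat": 500}
loaded_dev_cost = 10_000  # fully-loaded developer cost per month

total = sum(stack.values())      # all-in tooling spend per developer
share = total / loaded_dev_cost  # tooling as a fraction of developer cost


def breakeven_uplift(tool_cost, dev_cost):
    """Fractional productivity gain at which the tooling pays for itself."""
    return tool_cost / dev_cost


needed = breakeven_uplift(total, loaded_dev_cost)
```

Under these numbers the stack breaks even at roughly a 7% productivity gain, far below the multiplier quoted above.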
Latency and feedback-loop design
Coding-agent productivity is dominated by feedback-loop latency. Three regimes:
Tab-completion — must be <100ms p99. Cursor's tab uses a small custom model for this reason. GitHub Copilot, Continue, Zed: similar.
Chat / multi-file edit — 1-10s acceptable. Most agents land here for Claude / GPT calls.
Agent task — 30s-30min acceptable depending on scope. Background agents (Cursor Agents, Devin, Manus) live here.
Design heuristic: the IDE chat should feel snappy enough that you don't switch tabs; the CLI agent should be predictable enough that you can supervise without losing focus; the autonomous agent should hand back results well within your work block.
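To check a tool against these regimes you need a percentile over measured latencies, not an average. The sketch below uses the nearest-rank p99 and treats each regime's upper figure as an inclusive budget, which is a simplification. `BUDGET_MS`, `p99`, and `within_budget` are hypothetical names, and the samples are made up.

```python
# Upper bounds from the three regimes above, in milliseconds.
BUDGET_MS = {"tab": 100, "chat": 10_000, "agent": 30 * 60 * 1000}


def p99(samples_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(samples_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def within_budget(regime, samples_ms):
    return p99(samples_ms) <= BUDGET_MS[regime]


# Made-up measurements: one slow outlier is tolerated, two are not.
good = [40] * 98 + [90, 250]
bad = [40] * 97 + [90, 250, 300]
ok_tab = within_budget("tab", good)
slow_tab = within_budget("tab", bad)
```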
Where each tool wins (concrete recommendations)
If you live in VS Code and want maximum AI assistance: Cursor Pro ($20/mo).
If you want OpenAI deeply integrated and prefer Cascade: Windsurf.
If you're a minimalist / Rust / vim person: Zed AI.
If you want self-hosted open-weight AI in your IDE: Continue + a local Qwen 3.6-27B or DeepSeek V3.2 via Ollama or vLLM.
If your team works through GitHub Issues: GitHub Copilot Workspaces + Cursor for actual editing.
If you live in JetBrains: JetBrains AI Assistant + Junie.
If you want a CLI that's the de-facto standard: Claude Code.
If you want an OpenAI-shop CLI: Codex CLI.
If you want a small hackable CLI that's model-agnostic: Aider.
If you want browser + terminal + editor in one agent, self-hostable: OpenHands.
If you want Google-shop CLI: Gemini CLI.
If you want Block / extensibility-first: Goose.
If you want to use the Cline workflow: Cline (or Roo for the more modular fork).
For autonomous ticket completion: Devin (with OpenHands Cloud as the open-backed alternative).
For greenfield app prototypes: Lovable (with v0, Bolt, Replit Agent as alternatives).
For very large monorepos: Amp (Sourcegraph).
For PR review / code search across an org: Sourcegraph + Cursor or Greptile (specialist).
Production patterns that work in 2026
What we see in successful teams:
- Parallel surfaces: developers use 2-3 of (IDE agent + CLI agent + autonomous agent), each for the workflow it fits. Don't try to consolidate to one.
- CLI agent for refactors, IDE for live coding: Claude Code or Codex CLI for "rename X across N files," IDE for the move-by-move work.
- Devin for bounded async tickets (dependency upgrades, small feature work with clear acceptance criteria) — not for "improve the architecture."
- MCP server library: maintain an internal MCP server for company-specific integrations (deploy, monitoring, internal APIs). Single source of truth.
- Eval gold set: 20-50 representative tasks from your codebase; re-run when adopting a new model / agent / version. Companion: eval infrastructure.
- Human review remains mandatory above trivial scope. The 2026 productivity story is "AI drafts faster; human review still gates merges."
- Cost guardrails: per-developer monthly token caps + automatic model downgrades (Opus → Sonnet → Haiku) when budgets approach.
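The cost-guardrail pattern in the last bullet can be sketched as a function that picks a model tier from month-to-date spend. The 70% and 90% thresholds are assumptions for illustration, and `pick_model` is a hypothetical helper, not a feature of any agent named here.

```python
TIERS = ["opus", "sonnet", "haiku"]  # ordered from most to least expensive


def pick_model(spent, cap, thresholds=(0.7, 0.9)):
    """Return the tier to use given month-to-date spend against a positive cap.

    Returns None once the cap is reached, meaning requests should be blocked
    or escalated to a human for a budget decision.
    """
    if spent >= cap:
        return None
    used = spent / cap
    # Each threshold crossed moves one step down the tier list.
    tier = sum(used >= t for t in thresholds)
    return TIERS[tier]


choices = [pick_model(s, 500) for s in (100, 400, 460, 500)]
```

Downgrading by budget is coarse. Teams that care about quality on hard tasks often pair it with an explicit override so a developer can request the top tier for a single task.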
Anti-patterns
What burns teams:
- Forcing one tool on the whole team. Developer workflows vary; consolidation is a vanity goal that costs productivity.
- Adopting based on Twitter demos without your own eval on your codebase.
- Pure autonomous without review. Devin / Manus shipping PRs that bypass human review is the most common path to production incidents.
- Same model for everything. Tab completion needs a different model than refactor planning. Cursor's design assumes this; many teams don't follow through with CLI.
- Ignoring token cost. CLI agents on Opus + heavy use can quietly cost $1k+/developer/mo without dashboards.
- No MCP discipline. Installing 30 MCP servers globally creates context pollution and security surface.
- Treating SWE-Bench Verified as gospel. The Berkeley RDI paper (Apr 2026) showed harness-layer exploits. Use SWE-Bench Pro and your own evals.
Security and supply-chain risks
Real concerns that production deployments handle:
- Prompt injection via repo content (README files, tests, fixtures) can compromise an agent's actions. Sandboxes and approval workflows are mitigations.
- Tool-call abuse: an agent with broad shell access in a production repo can rm -rf. Limit blast radius; review tool call lists.
- MCP server supply chain: installing arbitrary MCP servers from npm/PyPI is code execution. Pin versions; audit.
- Token leakage: CLI agents can accidentally include API keys in context sent to model providers. Use secret-scanning in pre-prompt hooks.
- Output trust: generated code can include dependency confusion, typosquatting, or backdoors. CI must scan dependencies even (especially) for agent-introduced packages.
- Data residency: code shipped to a model provider crosses borders. Enterprise plans (Anthropic Enterprise, OpenAI Enterprise, Vertex AI) offer regional commitments.
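For the token-leakage item, a pre-prompt hook can redact likely secrets before context leaves the machine. The sketch below is illustrative only: the three patterns cover common key shapes but are far from exhaustive, `redact` is a hypothetical helper, and the sample keys are fake. A dedicated secret scanner is the better choice in production.

```python
import re

PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),            # AWS access key ID shape
    re.compile(r"gh[pousr]_[A-Za-z0-9]{36,}"),  # GitHub token shape
    # Generic "api_key: value" / "secret = value" assignments.
    re.compile(r"\b(?:api[_-]?key|secret|token)\b\s*[:=]\s*['\"]?[^\s'\"]{8,}",
               re.IGNORECASE),
]


def redact(text):
    """Replace likely secrets with a placeholder; return (clean_text, hit_count)."""
    hits = 0
    for pattern in PATTERNS:
        text, n = pattern.subn("[REDACTED]", text)
        hits += n
    return text, hits


# Fake values shaped like real secrets.
context = (
    'AWS_KEY = "AKIAABCDEFGHIJKLMNOP"\n'
    "api_key: sk-test-1234567890\n"
    'print("hello")'
)
clean, hits = redact(context)
```

The generic pattern will also redact harmless assignments such as `token = get_token()`. For a pre-prompt filter, over-redaction is usually the acceptable failure mode.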
The 2026 → 2027 outlook
- IDE category continues to consolidate around Cursor and Windsurf. Zed grows in the minimalist tier. Continue holds the open-source position.
- CLI agents proliferate further; expect Anthropic to push Claude Code as the de-facto standard, with strong OSS alternatives (Aider, OpenHands, Codex CLI, Goose).
- Autonomous agents mature; Devin gets cheaper, OpenHands Cloud grows, Manus expands outside China.
- The app-builder category continues to grow, both as the entry point for non-engineers and as a prototyping tool for engineers.
- MCP becomes assumed. Every IDE and CLI agent supports it by default. The MCP server ecosystem expands to 1000+ servers.
- Coding-agent benchmarks evolve: SWE-Bench Pro, Terminal-Bench 2.0, ClawEval continue to be the trusted set; SWE-Bench Verified deprecates as a frontier benchmark.
- Open weights close to within 5% of closed models on coding-agent reliability by mid-2027; the cost advantage of open weights dominates the economics.
- PR-review agents become a major category: agents that don't write code but review and improve diffs. Greptile, Coderabbit, and others are early movers.
Further reading
Internal:
- Open weights: the ultimate guide
- AI agent protocols (MCP, A2A, ACP)
- Agent serving infrastructure
- Eval infrastructure
- Post-training: RLHF and DPO
- Benchmark hacking: agent reward hacking
- Vector search & embeddings
- AI inference cost economics
- Production safety guardrails