
AI Inference Cost Economics: The Complete Guide

The 2026 dollar-and-cents reference for AI inference: cost per token at every precision, GPU TCO math, when to self-host vs use an API, reasoning-model premium, multimodal cost shapes, capacity planning, hidden costs (KV cache, prefix caching, retries), and the decision framework that determines whether your unit economics work.

By Prompt20 Editorial · 125 min read

A frontier model API priced at $5 per million input tokens looks cheap until you do the math at scale. A startup serving 100k daily active users with 20 messages each at 1500 tokens average is burning $15,000 per day on API costs alone. Whether to keep paying that bill or to self-host on a $400k cluster is one of the highest-leverage decisions in the modern AI product stack, and the cleanest answer comes from doing the cost math honestly. This guide is the dollar-and-cents reference.

The take. AI inference cost is a function of seven things: model size, precision, context length, output length, batch size, hardware, and traffic shape. In 2026 the API economy is healthy enough that for most products at <$5M/year in inference spend, hosted APIs are cheaper than self-hosting once you account for operational cost. The crossover happens around $5–10M/year and shifts depending on your traffic concentration, latency requirements, and whether you can run your own GPUs at >40% utilisation. Reasoning models break the simple math — they consume 10–100× the tokens of standard chat for hard tasks, and the per-token price often looks the same. Multimodal models break it differently — image tokens are a 5–50× multiplier on per-request cost. Get the unit economics right at product design time and you're fine. Get them wrong and you ship the kind of product where the smarter your users get, the faster you go bankrupt.

This guide covers the full cost stack: hosted API pricing for every major model in 2026, the math of per-token cost in your own infrastructure (GPU amortization, electricity, ops, depreciation), when each path wins, the throughput multipliers that change the answer (continuous batching, quantization, speculative decoding, prefix caching), how reasoning models and multimodal change the cost shape, capacity planning for variable load, and the failure modes (over-engineering, hidden retry cost, KV cache waste, idle clusters). Cross-links: LLM serving, KV cache, quantization tradeoffs, reasoning model serving, NVIDIA AI GPU lineup 2026, MoE serving.

Table of contents

  1. Key takeaways
  2. Mental model: inference cost in one minute
  3. The five cost levers
  4. Hosted API pricing in 2026
  5. Self-hosting cost: the per-token math
  6. Hosted vs self-hosted: where the crossover is
  7. Throughput multipliers that change the math
  8. Reasoning-model economics
  9. Multimodal cost shape
  10. Capacity planning for variable load
  11. Hidden costs that surprise teams
  12. Putting numbers on your product
  13. Cost optimization playbook
  14. Batch APIs and async inference economics
  15. Fine-tuning vs RAG vs prompting: cost comparison
  16. Benchmarking your own cost-per-task
  17. Per-provider 2026 pricing tear-down
  18. Reasoning-token pricing math (the o-series problem)
  19. Prompt caching pricing across providers
  20. Self-host capex/opex deep dive (B200, H200, GH200)
  21. Hidden cost vectors
  22. Enterprise procurement: Bedrock vs Azure OpenAI vs Vertex vs direct
  23. Inference at scale: Inferentia2, Trainium2, and custom silicon pricing
  24. Per-million-token unit economics across 15 models
  25. Multi-tenant cost-allocation patterns
  26. FinOps for LLMs
  27. The bottom line
  28. FAQ
  29. Extended FAQ
  30. Glossary
  31. References
  32. Per-provider 2026 pricing tear-down: every model, every tier
  33. Reasoning-token deep math: when 25× hides in plain sight
  34. Prompt caching deep dive: OpenAI vs Anthropic vs Google
  35. Self-host break-even: B200 vs H200 worked example
  36. Hidden cost catalogue: egress, observability, retries, eval
  37. Model routing for cost: which router pattern saves what
  38. Long-context economics: when KV cache dominates
  39. Spot-vs-on-demand market in 2026
  40. Inference cost benchmarks: BENCH vs REAL prices
  41. Reasoning-effort budget: ROI optimisation

Key takeaways

  • 2026 frontier API pricing: roughly $0.075–$15 per million input tokens, $0.30–$75 per million output. Small open-weight models hosted via 3rd-party APIs go as low as $0.05 per million.
  • Self-hosting cost (GPU rental only, before ops): an 8× H100 node at 50% utilisation serves a 70B model at roughly $0.73 per million input tokens and $3.66 per million output tokens. Comparable to hosted open-weight pricing on input; cheaper than the API only at high QPS and high utilisation.
  • Crossover from API to self-hosted is roughly $5–10M/year in inference spend for a typical SaaS workload. Below that, APIs win on simplicity. Above, self-hosted starts to pay off.
  • The four-lever stack: quantization (2× cheaper), continuous batching (2× cheaper), speculative decoding (1.5–2× cheaper for decode-heavy), prefix caching (5–10× cheaper for repeated prompts).
  • Reasoning models: per-token price is similar to standard chat, but a hard query may emit 5000 thinking tokens vs 200 for a chat model. 25× the per-request cost. Budget reasoning explicitly.
  • Multimodal: a 1024×1024 high-detail image is 700–1500 prompt tokens — 3–10× the cost of a text-only query for the same response.
  • Hosted-API providers run at 60–80% gross margin in 2026. Self-hosting captures most of that, at the cost of operational overhead.
  • The single biggest cost mistake: assuming average load. Plan for P95 traffic; budget for the spike, not the median.

Mental model: inference cost in one minute

The named problem is the input/output asymmetry. Generating each output token costs roughly 3–5× reading an input token, because output is memory-bandwidth-bound autoregressive decode while input is compute-bound parallel prefill. Reasoning models amplify this 10–100× by emitting thousands of hidden thinking tokens that all bill as output. Pricing pages list "per million input / output tokens" side-by-side as if they're symmetric; they're not.

Think of it as an electricity bill with peak-rate hours. Input tokens are off-peak: cheap, parallel, processed in bulk. Output tokens are on-peak: serial, bandwidth-bottlenecked, and the meter spins faster the longer the model talks. Reasoning models are an industrial freezer running through dinner — same per-kWh price, very different bill.

| Dimension | Hosted API | Self-hosted |
| --- | --- | --- |
| Up-front cost | $0 | $200k–$500k cluster or $15–$40/hour rental |
| Effective $/M output tokens | $0.30–$75 (model-dependent) | $1.80–$10 (utilisation-dependent) |
| Break-even point | Below ~$5M/year spend | Above ~$10M/year spend |
| Ops headcount | None | 1–3 platform engineers ($500k/year loaded) |
| Best-fit traffic | Bursty, unpredictable | Steady, >40% sustained utilisation |
| Failure mode | Surprise bills on retry loops | Idle GPUs at 20% utilisation |

The pseudocode of a sane cost calculator is one line: monthly_cost = users × messages_per_user × (input_tokens × input_price + output_tokens × output_price) × retry_factor. The production one-liner most teams forget: cap max_completion_tokens and set reasoning_budget explicitly — leaving either unbounded is how $5/day products become $50/day overnight.
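As a minimal sketch, the one-line calculator above can be written as a function. The function and parameter names are illustrative assumptions, not any provider's API; prices are in dollars per million tokens, and `retry_factor` is the average number of attempts per message. The first call reproduces the intro's example (100k daily users, 20 messages each, 1,500 tokens at $5/M), treating all tokens as input for simplicity.

```python
def period_inference_cost(
    users: int,
    messages_per_user: float,
    input_tokens: float,
    output_tokens: float,
    input_price_per_m: float,
    output_price_per_m: float,
    retry_factor: float = 1.0,
) -> float:
    """Dollar spend for one period (day or month, matching messages_per_user)."""
    per_message = (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000
    return users * messages_per_user * per_message * retry_factor


# Intro example: 100k daily users x 20 messages x 1,500 tokens at $5/M.
# Assumption: all 1,500 tokens are billed at the input price.
daily = period_inference_cost(100_000, 20, 1500, 0, 5.0, 0.0)
print(f"${daily:,.0f} per day")  # $15,000 per day

# The same traffic with an average of 1.2 attempts per message.
with_retries = period_inference_cost(
    100_000, 20, 1500, 0, 5.0, 0.0, retry_factor=1.2
)
print(f"${with_retries:,.0f} per day with retries")  # $18,000 per day with retries
```

Keeping `retry_factor` as an explicit argument makes the retry overhead visible in the forecast instead of leaving it to show up on the invoice.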

Sticky benchmark to memorise: GPT-5 API costs roughly $5–$15 per million input tokens and $20–$60 per million output tokens in mid-2026. Gemini Flash 2.5 is roughly 70–200× cheaper at $0.075 / $0.30. The roughly 200× spread between the cheapest and most expensive tier for the same observable request is the lever that decides whether your unit economics work.


The five cost levers

Every dollar of AI inference cost is determined by some combination of these five levers. Pull any of them and the bill changes by a measurable amount.

1. Model size. A 405B model costs ~10–20× a 7B model per token of compute. Use the smallest model that meets your quality bar. Most production workloads don't need frontier; they need consistent.

2. Precision. FP16 → FP8 halves the bytes read per weight and roughly doubles throughput on the same hardware. FP8 → INT4 cuts again. With minimal quality loss in 2026 (see quantization tradeoffs) you can run at INT4 weights / FP8 KV / BF16 compute and get 3–4× the throughput of a FP16-everywhere baseline.

3. Context and output length. Cost scales linearly in input tokens and linearly in output tokens. A 16k-input, 500-token-output query costs roughly 9× a 1k-input, 200-output query when output tokens are priced at 5× input (as on Sonnet 4.6). Truncate inputs aggressively; cap outputs.

4. Batch size and utilisation. A GPU at batch 1 leaves roughly 90% of its compute idle during decode, because each step streams the full weights from memory to produce a single token. The same GPU at batch 64 hits 80% utilisation. Continuous batching (vLLM, SGLang, TGI) closes most of the gap automatically. Self-hosting at <40% average utilisation is throwing money away.

5. Hardware choice. An L40S at $1.50/hour serving small models at moderate throughput is dramatically cheaper per token than a B200 at $10/hour serving the same model. Match hardware to model size and concurrency requirements (see NVIDIA AI GPU lineup 2026).

Each lever stacks multiplicatively. A team that gets all five right is paying 1/20th of a team that gets none of them right, for the same observable quality.

Why the levers compound, not add

If you imagine each lever as a multiplier on baseline cost, the stack is 0.5× (smaller model) × 0.5× (FP8) × 0.7× (output cap) × 0.4× (batch 64) × 0.6× (right hardware) = 0.042× — about 24× cheaper than the naïve baseline. This isn't theoretical: published vLLM benchmarks on Llama-3.3-70B show 20–30× throughput differences between worst-case (batch 1, FP16, full context) and best-case (batch 64, FP8, prefix cache hot) on the same H100 hardware. Real teams sit somewhere in the middle, which is why "how much does inference cost" has no single answer.
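A short sketch of that arithmetic, using only the multipliers quoted above. The dictionary keys are labels for illustration; nothing here is measured data.

```python
from math import prod

# Per-lever cost multipliers from the paragraph above (1.0 = no change).
lever_multipliers = {
    "smaller model": 0.5,
    "FP8": 0.5,
    "output cap": 0.7,
    "batch 64": 0.4,
    "right hardware": 0.6,
}

combined = prod(lever_multipliers.values())
print(f"combined multiplier: {combined:.3f}x")        # combined multiplier: 0.042x
print(f"cheaper than baseline: {1 / combined:.1f}x")  # cheaper than baseline: 23.8x

# If the savings merely added up, they would exceed 100% of the bill,
# which is why the levers have to be treated as multipliers.
additive_savings = sum(1 - m for m in lever_multipliers.values())
print(f"naive additive savings: {additive_savings:.0%}")  # naive additive savings: 230%
```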


Hosted API pricing in 2026

Reference prices as of mid-2026. Always check the provider's current page — pricing changes fast.

Frontier closed models.

| Model | Input ($/M tokens) | Output ($/M tokens) | Notes |
| --- | --- | --- | --- |
| OpenAI GPT-5 / o3 | $5–$15 | $20–$60 | o3 thinks 5–50× more tokens than GPT-5. |
| OpenAI GPT-4o / o-mini | $0.15–$5 | $0.60–$20 | Tiered by capability. |
| Anthropic Claude Opus 4.x | $15 | $75 | Frontier; reasoning-heavy. |
| Anthropic Claude Sonnet 4.6 | $3 | $15 | Most popular Anthropic tier. |
| Anthropic Claude Haiku 4 | $0.80 | $4 | Fast / cheap tier. |
| Google Gemini 2.5 Pro | $1.25–$2.50 | $5–$10 | Cheap; tiered by context length. |
| Google Gemini Flash 2.5 | $0.075 | $0.30 | The cheapest frontier-class option. |

Open-weight models hosted via 3rd party (Together, Fireworks, DeepInfra, Anyscale, Replicate, Groq, Cerebras).

| Model | Input ($/M tokens) | Output ($/M tokens) | Notes |
| --- | --- | --- | --- |
| Llama 3.3 70B | $0.60–$0.90 | $0.60–$0.90 | Cheapest at the workhorse tier. |
| Llama 4 (when shipped) | $0.50–$2.00 | $0.50–$5.00 | New family. |
| Qwen 3 | $0.30–$1.00 | $0.30–$1.50 | Strong open-weight; cheaper than equivalent Llama. |
| DeepSeek V3 / R1 | $0.27 / $0.55 | $1.10 / $2.20 | Reasoning model at standard model prices. |
| Mixtral 8x22B | $1.20 | $1.20 | MoE; commonly available. |
| Llama 3.1 8B | $0.05–$0.20 | $0.05–$0.20 | Cheapest viable model for most chat. |

Specialty providers.

| Provider | Differentiator | Pricing |
| --- | --- | --- |
| Groq | Custom LPU hardware; very fast decode | ~$0.10–$1 per M tokens for major open-weight models. |
| Cerebras | Wafer-scale chip; even faster decode | Pricing per model, comparable to Groq. |
| SambaNova | Custom hardware; reasoning models | Specialty; enterprise-priced. |

Pricing trends:

  • Input has dropped ~10× since 2023, output ~5×. The trend continues.
  • Reasoning models maintain roughly per-token parity with non-reasoning models. The cost difference comes from how many tokens they produce.
  • The cheapest frontier-class options (Gemini Flash, GPT-4o-mini, Haiku, DeepSeek) are now within 2–3× of the cheapest non-frontier options. The "expensive frontier vs cheap mediocre" dichotomy has narrowed.
  • Free tiers are unpredictable in pricing terms — many providers monetise by either training on conversations or upselling Pro plans rather than charging API customers.

Cost per million tokens: side-by-side comparison

| Tier | Representative model | Input ($/M) | Output ($/M) | Best for |
| --- | --- | --- | --- | --- |
| Cheap-fast | Gemini Flash 2.5 | $0.075 | $0.30 | High-volume chat, classification, extraction |
| Cheap open-weight | Llama 3.1 8B via Together | $0.05–$0.20 | $0.05–$0.20 | Background async, embeddings, fallback |
| Mid open-weight | Llama 3.3 70B | $0.60–$0.90 | $0.60–$0.90 | Most production workloads |
| Mid frontier | Claude Sonnet 4.6 | $3 | $15 | Customer-facing agents, complex reasoning |
| Mid frontier | GPT-4o | $2.50 | $10 | Mixed workloads, multimodal |
| Reasoning | DeepSeek R1 | $0.55 | $2.20 | Hard reasoning at standard prices |
| Reasoning frontier | OpenAI o3 | $15 | $60 | Hardest reasoning tasks |
| Frontier | Claude Opus 4.x | $15 | $75 | Mission-critical analysis |

A 1500-input / 300-output query costs $0.0002 on Gemini Flash, $0.009 on Sonnet 4.6, $0.045 on Opus 4.x, and $0.04 on o3 (more with thinking tokens). That is a roughly 200× spread between the cheapest and most expensive tier for the same observable request shape. Pick the tier deliberately.

Pricing as a moving target

API pricing dropped 80% from 2023 to 2026 on equivalent capability tiers. The drivers are competition (Gemini Flash dragged the floor down), efficiency improvements (Hopper → Blackwell at the inference layer), and architectural advances (mixture-of-experts brought effective compute down; see MoE serving). Budget 5–10× per year of further compression on cost-per-quality through 2027; don't budget the same nominal prices.


Self-hosting cost: the per-token math

The reference math, made concrete. We'll cost out a Llama-3.3 70B on H100 hardware.

Cluster setup: 8× H100 SXM (one HGX node). FP8 weights via NVIDIA Transformer Engine. vLLM serving. Continuous batching.

Capital cost:

  • 8× H100 SXM with full HGX baseboard + system: ~$300,000 to purchase.
  • Or: rent at $25–$40/hour managed on a cloud (AWS p5.48xlarge, GCP A3, Azure ND), which is $220k–$350k/year for 24×7.
  • Or: rent at $15–$25/hour from specialists (Lambda, CoreWeave, RunPod), $130k–$220k/year.
  • Or: rent at $5–$15/hour from decentralized (io.net, Akash, Vast.ai), $44k–$130k/year — see decentralized GPU economics.

We'll use $20/hour as the working number (mid-range specialist). 24×7 = $175k/year.

Throughput at moderate concurrency (50 concurrent requests, average 1500 input + 300 output tokens, FP8):

  • ~3000 output tokens/second sustained on this cluster.
  • ~15000 input tokens/second (prefill is faster per token).

Cost per million tokens:

  • $175,000/year ÷ 365 ÷ 86,400 = $0.0055/second.
  • Output: $0.0055 / 3000 tokens/sec = $0.0000018 per output token = $1.83 per million output tokens.
  • Input: $0.0055 / 15000 tokens/sec = $0.000000367 per input token = $0.37 per million input tokens.

Compare to API price (Llama-3.3 70B via hosted API): ~$0.60–$0.90 per M tokens. At 100% utilization, self-hosted input at $0.37 is roughly 40–60% of the hosted price, while self-hosted output at $1.83 is still about 2–3× the hosted price.

The catch: utilization. The math assumes 100% utilization, which doesn't happen. Realistic numbers:

  • 50% average utilization: 2× the per-token cost = $3.66 output / $0.73 input.
  • 30% average utilization: 3.3× = $6.04 output / $1.21 input.
  • 20% average utilization: 5× = $9.16 output / $1.84 input.

At 30% utilization (typical for SaaS without aggressive batch optimization), self-hosting costs more than the hosted API. The hosted provider runs at 60–80% utilization across their fleet — that's where their margin comes from.
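The per-token arithmetic above can be reproduced with a small helper. This is a sketch under the section's stated assumptions ($20/hour, 3,000 output and 15,000 input tokens/second); the function name is illustrative. Results land within a few cents of the figures above, which round the per-second rate to $0.0055 and the 30% case to 3.3×.

```python
SECONDS_PER_HOUR = 3600


def cost_per_million_tokens(
    hourly_rate: float, tokens_per_sec: float, utilisation: float = 1.0
) -> float:
    """Dollars per million tokens for a cluster billed by the hour.

    utilisation is the fraction of wall-clock time spent serving traffic;
    idle time is still paid for, so cost scales with 1 / utilisation.
    """
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    dollars_per_sec = hourly_rate / SECONDS_PER_HOUR
    return dollars_per_sec / (tokens_per_sec * utilisation) * 1_000_000


for util in (1.0, 0.5, 0.3, 0.2):
    out = cost_per_million_tokens(20, 3000, util)
    inp = cost_per_million_tokens(20, 15000, util)
    print(f"{util:.0%} utilisation: output ${out:.2f}/M, input ${inp:.2f}/M")
```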

Plus operational cost:

  • One ML platform engineer: $300k/year fully loaded.
  • One backend engineer (partial allocation): $150k/year.
  • Monitoring, logs, on-call: $50k/year.
  • Total ops: $500k/year for a small but functional team.

Add ops to a single-cluster self-host: $175k + $500k = $675k/year. At 100% utilization that's $7 per million output tokens; at 30% it's $23.

You don't self-host on one cluster. The economics only make sense at 10+ clusters where the engineering team is amortised. Then the per-token cost approaches the raw GPU cost again.
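To see how the ops bill amortises, here is a rough sketch using the numbers above. It assumes, purely for illustration, that the $500k/year team stays fixed as clusters are added; in practice ops cost grows with fleet size, just more slowly than GPU cost.

```python
SECONDS_PER_YEAR = 365 * 86_400


def all_in_cost_per_m_output(
    n_clusters: int,
    gpu_cost_per_cluster: float = 175_000,     # $/year, from the $20/hour figure
    ops_cost: float = 500_000,                 # $/year, assumed fixed (simplification)
    tokens_per_sec_per_cluster: float = 3000,  # sustained output tokens/sec
    utilisation: float = 1.0,
) -> float:
    """All-in dollars per million output tokens across a fleet of clusters."""
    total_cost = n_clusters * gpu_cost_per_cluster + ops_cost
    tokens_per_year = (
        n_clusters * tokens_per_sec_per_cluster * utilisation * SECONDS_PER_YEAR
    )
    return total_cost / tokens_per_year * 1_000_000


for n in (1, 3, 10, 30):
    print(f"{n:>2} clusters: ${all_in_cost_per_m_output(n):.2f}/M output tokens")
```

With one cluster the result is about $7.13/M; by ten clusters it falls to roughly $2.38/M, approaching the raw GPU figure of about $1.85/M.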

Hardware choice changes the math more than people expect

The per-token math above assumes 8× H100 SXM at $20/hour. The full hardware landscape in 2026 is wider, and matching hardware to workload is one of the larger cost levers:

| Hardware | Hourly cost | Best workload | Est. $/M output tokens (70B, 50% util) |
| --- | --- | --- | --- |
| 8× B200 SXM | $50–$80/hour | Reasoning models, long context | ~$2.50–$4 |
| 8× H200 SXM | $25–$40/hour | Standard 70B at high concurrency | ~$2.00–$3.50 |
| 8× H100 SXM | $15–$25/hour | Baseline 70B serving | ~$3.50–$6 |
| 8× L40S | $4–$8/hour | Small models (≤13B), batch | ~$1.50–$3 |
| 8× MI300X (AMD) | $12–$20/hour | Memory-bound 70B+ | ~$2–$4 |
| Groq LPU | Billed per M tokens, not per hour | Latency-sensitive decode | ~$1–$3 |

The "right" hardware depends on whether your bottleneck is memory bandwidth (decode-heavy), compute (prefill-heavy), or interconnect (training/very large models). See NVIDIA AI GPU lineup 2026 for the matching guide.

Power, cooling, and the line items nobody mentions

Renting at $/hour bakes in power and cooling. Buying ($300k for an 8× H100 HGX) doesn't. Power for an 8× H100 SXM node at full load is ~10 kW. At $0.10/kWh that's $24/day, $8,760/year — under 5% of hardware cost. Cooling adds 30–50% to power. Colo space at $300–$1000/kW-month adds more. A bought-and-racked 8× H100 deployment runs ~$25k/year in power and colo, on top of $300k capex and amortisation. Self-hosting math without these line items is wrong by 10–15%.


Hosted vs self-hosted: where the crossover is

The decision rule, distilled:

Stay on hosted APIs if:

  • Annual inference spend is under $5M.
  • Traffic is bursty or unpredictable.
  • You don't have an ML platform team.
  • You need access to multiple frontier models and route between them.
  • You have strict latency SLAs that a hosted provider already meets (hosted providers sometimes run faster than you would).
  • You haven't yet validated unit economics with paying customers.

Consider self-hosting if:

  • Annual inference spend is over $10M.
  • Traffic is steady enough to hold >40% sustained utilization.
  • You have or can hire an ML platform team.
  • Privacy or data residency requires data not leaving your infrastructure.
  • You're running an open-weight model with no closed equivalent.
  • You can deploy on specialist or decentralized GPU pricing (<$15/hour).

The grey zone, $5–10M: make the decision based on the operational story. Most teams discover that the hosted API cost is annoyingly large but the operational ceiling of self-hosting is also annoyingly high. The arithmetic is close; the qualitative factors decide.

Hybrid is common. Many teams use hosted APIs for low-volume routes (long-tail customers, complex queries) and self-host for high-volume routes (a primary chat endpoint with predictable load). This captures most of the cost savings without committing the entire stack.
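The checklist above can be condensed into a first-pass triage function. This is a sketch only: the thresholds come from the lists above, but the way they are combined (any "stay hosted" condition wins) is an assumption, and the function name is hypothetical. The qualitative factors still decide the grey zone.

```python
def hosting_recommendation(
    annual_spend_usd: float,
    sustained_utilisation: float,
    has_platform_team: bool,
    bursty_traffic: bool = False,
) -> str:
    """Map the checklist onto 'hosted', 'self-host', or 'grey zone'."""
    # Any single "stay hosted" condition is treated as decisive (assumption).
    if annual_spend_usd < 5_000_000 or bursty_traffic or not has_platform_team:
        return "hosted"
    # Self-hosting needs both the spend and the sustained utilisation.
    if annual_spend_usd > 10_000_000 and sustained_utilisation > 0.40:
        return "self-host"
    return "grey zone"


print(hosting_recommendation(2_000_000, 0.60, True))   # hosted
print(hosting_recommendation(7_000_000, 0.50, True))   # grey zone
print(hosting_recommendation(12_000_000, 0.55, True))  # self-host
```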


Throughput multipliers that change the math

The cost numbers above assume "vanilla" inference. Production stacks layer on several multipliers that change the effective per-token cost dramatically.

Continuous batching (vLLM PagedAttention). 2–5× throughput vs static batching. Already on by default in vLLM, SGLang, TGI. If you're not using it, you're at 1/3 of your hardware's potential. See LLM serving.

Quantization. INT4 weights + FP8 KV: ~2× throughput vs FP16 baseline. ~3× vs FP32. Minimal quality cost in 2026.

Speculative decoding. EAGLE-2 with a small draft model: 1.5–2× decode speedup. Free win for decode-heavy workloads. See speculative decoding.

Prefix caching. For repeated prompts (system prompts, document context, common conversation prefixes), 5–10× speedup on the cached portion. Effectively free.

Disaggregated prefill/decode (disaggregated inference). 1.5–3× throughput at high concurrency by giving prefill and decode different hardware. Operational complexity is real.

KV cache quantization. FP8 or INT8 KV halves KV memory per token versus FP16, so the same hardware holds roughly 2× the concurrent context. Often more impactful than weight quantization at high concurrency.
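A back-of-envelope sketch of why KV precision matters for concurrency. The model shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128) is an assumption matching a Llama-3-70B-style architecture, and the 40 GiB KV budget is a hypothetical figure for illustration.

```python
def kv_bytes_per_token(
    n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_value: int
) -> int:
    """KV cache bytes stored per token of context."""
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Assumed 70B-class shape: 80 layers, 8 KV heads (GQA), head_dim 128.
fp16 = kv_bytes_per_token(80, 8, 128, bytes_per_value=2)
fp8 = kv_bytes_per_token(80, 8, 128, bytes_per_value=1)
print(f"FP16 KV: {fp16 / 1024:.0f} KiB/token")  # FP16 KV: 320 KiB/token
print(f"FP8 KV:  {fp8 / 1024:.0f} KiB/token")   # FP8 KV:  160 KiB/token

# Hypothetical budget: 40 GiB of GPU memory left for KV after weights.
kv_budget = 40 * 1024**3
print(f"tokens in cache at FP16: {kv_budget // fp16:,}")  # 131,072
print(f"tokens in cache at FP8:  {kv_budget // fp8:,}")   # 262,144
```

Doubling the tokens that fit in cache roughly doubles how many conversations of a given length can be batched together, which is where the throughput gain comes from.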

Multi-LoRA (multi-tenant fine-tunes). Serve 100+ adapters on one base model with ~10% throughput hit. Effective cost-per-customer drops 50×. See multi-tenant LoRA.

Stacked, these change the math substantially. A team running all of the above gets roughly 5–10× the throughput of a naïve baseline. Translation: 5–10× cheaper per token at the same hardware cost. This is why production stacks look different from research baselines.

Stacking the multipliers in production: a worked example

Start from a vanilla deployment: 8× H100, Llama-3.3 70B, FP16 weights, batch size 1, no caching, no speculative decoding. Measured throughput: ~400 output tokens/sec sustained. Cost basis: $20/hour rental is $0.0056/second, so ~$13.90/M output tokens at 100% utilisation. At a realistic 50% utilisation: ~$28/M.

Now apply the stack:

  1. Enable continuous batching (vLLM defaults). Throughput climbs to ~1,200 tokens/sec at batch 32. ~$9.30/M at 50% util.
  2. Quantize weights to FP8 via Transformer Engine. ~2,000 tokens/sec at batch 32. ~$5.60/M at 50% util.
  3. Enable FP8 KV cache. Memory savings let batch grow to 64. ~2,800 tokens/sec. ~$4.00/M.
  4. Enable prefix caching for system prompts. For 70% prefix-hit traffic, effective input cost drops ~3×. Combined effective throughput: ~3,500 tokens/sec equivalent. ~$3.20/M.
  5. Enable speculative decoding (EAGLE-2 with a 1B draft model). 1.7× decode speedup for the 80% of traffic that benefits. Effective ~4,500 tokens/sec. ~$2.50/M.
  6. Add disaggregated prefill/decode. P95 latency improves; aggregate throughput at 60% utilisation now reaches 5,500 tokens/sec. ~$1.70/M.

End state: roughly 16× cheaper per token than the vanilla deployment, at the same hardware spend. The hosted-API cost for Llama-3.3 70B is $0.88/M (Together) or $0.59/$0.79 (Groq). Self-hosting at ~$1.70/M is still roughly 2× more expensive per output token than the cheapest hosted option, because Together and Groq have applied the same stack and run at higher fleet-wide utilisation. The lesson: stacking multipliers is necessary but not sufficient to beat hosted prices on its own. Utilisation scale is the final lever.


Reasoning-model economics

Reasoning models (OpenAI o3, DeepSeek R1, Claude with extended thinking, Gemini Deep Think) break the simple per-token math because their output token count is the wild card.

Standard chat: user asks a question, model emits 100–500 output tokens. Per-request cost: $0.001–$0.01.

Reasoning model: user asks a question, model thinks 1000–10000 thinking tokens (visible or hidden depending on the model), then emits 100–500 visible output tokens. Per-request cost: $0.02–$0.50.

The per-token price for o3 input/output is similar to GPT-5. The total bill is 10–50× higher because of the thinking tokens, billed as output.

Reasoning effort tiers and what they cost

| Model | Effort tier | Avg thinking tokens | Per-query cost (1k input, 500 visible output) |
| --- | --- | --- | --- |
| o3 | low | ~500 | $0.075 |
| o3 | medium | ~2000 | $0.165 |
| o3 | high | ~8000 | $0.525 |
| Claude Opus 4.x (extended thinking) | budget=2000 | ~2000 | ~$0.20 |
| Claude Opus 4.x (extended thinking) | budget=16000 | ~12000 | $0.945 |
| DeepSeek R1 | default | ~4000 | $0.0099 |
| Gemini 2.5 Deep Think | default | ~3000 | $0.020 |

DeepSeek R1 at $0.0099 for the same query shape o3 charges $0.525 for is a 50× price gap, and DeepSeek R1 is within 5–15% of o3 on most reasoning benchmarks (MATH, GPQA, AIME). For products where cost matters and the marginal accuracy difference doesn't, open-weight reasoning is the cost winner of 2026.
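The per-query figures for the o3 tiers follow from billing thinking tokens at the output price. A minimal sketch, assuming the $15 / $60 per-million o3 prices used in this guide and the 1k-input, 500-visible-output query shape; the function name is illustrative.

```python
def reasoning_query_cost(
    input_tokens: int,
    thinking_tokens: int,
    visible_output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Dollar cost of one query when thinking tokens bill as output."""
    billed_output = thinking_tokens + visible_output_tokens
    return (
        input_tokens * input_price_per_m + billed_output * output_price_per_m
    ) / 1_000_000


# o3 effort tiers with the average thinking-token counts quoted above.
for tier, thinking in {"low": 500, "medium": 2000, "high": 8000}.items():
    cost = reasoning_query_cost(1000, thinking, 500, 15.0, 60.0)
    print(f"o3 {tier}: ${cost:.3f}")
# o3 low: $0.075
# o3 medium: $0.165
# o3 high: $0.525
```

The visible answer is the same 500 tokens in every tier; the 7× difference between low and high comes entirely from tokens the user never sees.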

When the math works:

  • Hard reasoning tasks where the model's accuracy genuinely matters: legal, medical, scientific, complex coding. Spending $0.50 to get a correct answer beats $0.01 for the wrong answer.
  • Low-volume / high-value queries: a research assistant that runs 100 queries per day at $0.50 each costs $50/day. A frontier engineer's salary makes that trivial.

When it doesn't:

  • High-volume consumer queries where most don't need reasoning. Routing 100% of traffic to o3 burns money on questions like "what's the weather."
  • Latency-sensitive applications. Reasoning models take 5–60 seconds; chat models take 1–3.

Routing. The production pattern is to classify queries upfront — simple chat → cheap chat model, hard reasoning → reasoning model. A small classifier (or a cheap LLM call) makes the routing decision. Saves 80%+ of reasoning-model cost in mixed workloads.
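A quick way to sanity-check the routing claim is to compute the blended per-query cost. In this sketch the 15% hard-query share and the $0.005 chat cost are hypothetical inputs; the $0.165 reasoning cost is the o3 medium figure from the table above.

```python
def blended_cost(
    hard_fraction: float,
    chat_cost: float,
    reasoning_cost: float,
    classifier_cost: float = 0.0,
) -> float:
    """Expected per-query cost when only hard queries reach the reasoning model."""
    return (
        classifier_cost
        + hard_fraction * reasoning_cost
        + (1 - hard_fraction) * chat_cost
    )


all_reasoning = 0.165  # every query goes to the reasoning model
routed = blended_cost(hard_fraction=0.15, chat_cost=0.005, reasoning_cost=0.165)
saving = 1 - routed / all_reasoning
print(f"routed: ${routed:.3f}/query, saving {saving:.0%}")  # routed: $0.029/query, saving 82%
```

The saving is dominated by `hard_fraction`, so a classifier that over-routes to the reasoning model erodes it quickly; that share is worth measuring on real traffic rather than assuming.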

Thinking budget controls. Most reasoning APIs expose a budget parameter that caps how many thinking tokens the model can emit, which sets a ceiling on per-request cost. OpenAI's max_completion_tokens (alongside reasoning_effort) and the budget_tokens field of Anthropic's thinking parameter both serve this purpose. Use them: leaving reasoning unlimited in production is a budgeting accident waiting to happen.


Multimodal cost shape

Multimodal models (vision, audio, video) shift the cost balance toward input.

Text-only chat:

  • Average 200 input tokens × $1/M = $0.0002.
  • Average 500 output tokens × $5/M = $0.0025.
  • Total: ~$0.003 per request, output-dominated.

Vision query with one 1024×1024 image:

  • Image: ~1000 prompt tokens × $1/M = $0.001.
  • Text prompt: 30 tokens × $1/M = $0.00003.
  • Output: 200 tokens × $5/M = $0.001.
  • Total: ~$0.002 per request, input-balanced.

The image roughly doubles the cost of the same query sent without it (about $0.001 for the text prompt and output alone).

Video understanding (5-minute clip):

  • ~5000 video tokens × $1/M = $0.005.
  • Text prompt: 30 tokens.
  • Output: 500 tokens × $5/M = $0.0025.
  • Total: ~$0.008 per request, input-dominated.
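The three request shapes above can be compared side by side. This sketch uses the same illustrative $1/M input and $5/M output prices and reports what share of each bill is input, which is the shift this section describes.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float = 1.0,
    output_price_per_m: float = 5.0,
) -> tuple[float, float]:
    """Return (total dollars, fraction of the total that is input cost)."""
    input_cost = input_tokens * input_price_per_m / 1_000_000
    output_cost = output_tokens * output_price_per_m / 1_000_000
    total = input_cost + output_cost
    return total, input_cost / total


shapes = {
    "text-only": (200, 500),
    "vision (1 image)": (1000 + 30, 200),   # image tokens + text prompt
    "video (5 min)": (5000 + 30, 500),      # video tokens + text prompt
}
for name, (inp, out) in shapes.items():
    total, input_share = request_cost(inp, out)
    print(f"{name}: ${total:.4f} per request, {input_share:.0%} input")
```

Text-only requests come out under 10% input cost, the vision query is about half input, and the video query is about two-thirds input.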

Audio in (1 minute of audio with Whisper + LLM):

  • Whisper transcription: ~$0.006 per minute.
  • LLM call on transcript: ~$0.003.
  • Total: ~$0.009 per minute.

Audio in (native, GPT-4o voice):

  • Audio is billed directly, at a higher per-minute price than the Whisper + LLM path.
  • Total: ~$0.02–$0.06 per minute depending on tier.

Practical pattern. Keep multimodal models off the default route: only send a request to a vision/audio model when the user actually attaches media. For high-volume text-only traffic, the multimodal models are 5–10× the cost of a text-only equivalent, which is significant at scale. See multimodal serving for the full architecture story.


Capacity planning for variable load

Average load isn't what matters. Production AI traffic has shape that breaks naïve capacity sizing.

The shapes:

  • Daily curve. 4× peak-to-trough is typical for consumer products. Office-hours skewed for B2B.
  • Weekly curve. Weekday/weekend variance is 2–3× for consumer, sometimes higher for productivity tools.
  • Launch spikes. Marketing event traffic can be 10–20× baseline for hours.
  • Viral incidents. Hacker News front page or a tweet can be 50–100× baseline for hours, decaying over days.

Sizing strategies:

  • Size to P95. Capacity = projected P95 traffic + 30% headroom. Idle at average; survives normal peaks.
  • Auto-scale on hosted APIs. Most hosted APIs scale transparently. Cost scales linearly with usage. The default for early-stage.
  • Reserve + on-demand on self-hosted. Reserve enough capacity for steady traffic; burst to on-demand cloud for spikes. Adds complexity but caps the cost.
  • Hybrid hosted + self-hosted. Self-host for steady traffic; spill to hosted APIs for spikes. Works if the spillover model is API-compatible with your self-hosted model (Llama base via your servers and via Together API, same Llama-3.3-70B).

The cheapest-but-wrong strategy: size to average traffic. You'll OOM or time out during every peak. Customers will see the slowness; some won't come back. False economy.

The most-expensive-but-right strategy: size to P99 plus 50%. You're paying for a lot of idle capacity. Acceptable if downtime is catastrophic (medical, financial real-time).

Most teams should target P95 + 30% in 2026. Adjust based on your domain's tolerance for latency degradation.
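A minimal sketch of the sizing rule, using a nearest-rank percentile over observed load samples. The sample data is synthetic and the function names are illustrative; real inputs would be per-interval peak requests/sec (or tokens/sec) from your own metrics.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with >= pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def required_capacity(
    samples: list[float], pct: int = 95, headroom: float = 0.30
) -> float:
    """Capacity target: chosen percentile of observed load plus fractional headroom."""
    return percentile(samples, pct) * (1 + headroom)


# Synthetic load samples: requests/sec observed in 100 intervals, 1..100.
rps = [float(x) for x in range(1, 101)]
average = sum(rps) / len(rps)
print(f"average load: {average:.1f} rps")                           # 50.5 rps
print(f"P95 + 30%:    {required_capacity(rps):.1f} rps")            # 123.5 rps
print(f"P99 + 50%:    {required_capacity(rps, 99, 0.50):.1f} rps")  # 148.5 rps
```

On this synthetic curve, sizing to the average would leave capacity at well under half of the P95 + 30% target.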

The 2026 capacity table

| Workload shape | Sizing rule | Why |
| --- | --- | --- |
| Steady B2B SaaS | P95 + 20% | Predictable; weekday curve dominates |
| Consumer chat | P95 + 30%, plus spike reserve | 4× daily curve, viral risk |
| Voice / real-time agents | P99 + 50% | Latency degradation = call drop |
| Internal eval / batch | P50 | Run during off-peak; tolerate queueing |
| Healthcare / financial real-time | P99 + 100% | Downtime is catastrophic |
| Free tier / trial | P90 + 0% | Acceptable to throttle during spikes |

The expensive part is that "spike reserve" — keeping hot capacity for 10× traffic spikes that happen once a quarter. Hosted APIs handle this implicitly through auto-scaling; self-hosters often pay for reserved capacity that sits at <10% utilisation 99% of the time and at 100% during a viral incident. The hybrid pattern (own steady, burst to hosted) optimises this; few teams implement it well.


Hidden costs that surprise teams

The cost stack has corners that don't show up in the pricing pages.

Retries on failure. A request that fails and retries costs 2×. Configure retry policies carefully — exponential backoff with jitter, max 2 retries. A bug that retries 10× silently turned a $5/day product into $50/day overnight; this is real.
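One way to implement the retry policy described above (exponential backoff with jitter, at most 2 retries) is sketched below. The wrapper name and defaults are assumptions; it catches every exception for brevity, whereas production code would retry only on errors known to be transient (timeouts, 429s, 5xx responses).

```python
import random
import time


def call_with_retries(
    fn,
    max_retries: int = 2,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep=time.sleep,
    rng=random.random,
):
    """Call fn(); on failure retry up to max_retries times with full-jitter backoff."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= max_retries:
                raise  # budget exhausted: surface the error instead of looping
            # Exponential cap (base * 2^attempt), then a uniform random
            # fraction of it so that clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2**attempt) * rng()
            sleep(delay)
            attempt += 1


# Usage: a call that fails once with a transient error, then succeeds.
attempts = []


def flaky():
    attempts.append(1)
    if len(attempts) < 2:
        raise TimeoutError("transient")
    return "ok"


result = call_with_retries(flaky, base_delay=0.01)
print(result, "after", len(attempts), "attempts")  # ok after 2 attempts
```

The hard cap matters for cost as much as for reliability: with `max_retries=2` a request can be billed at most three times, no matter what the upstream does.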

Prompt bloat. "Let me copy this whole conversation history into every turn" is the typical agent pattern. A 10-turn agent that adds 1500 tokens per turn and resends the full history bills 1500 × (1 + 2 + … + 10) = 82,500 input tokens, which is 55× a single 1500-token turn rather than 10×. Use compaction, sliding windows, summaries (see agent serving infrastructure).

System prompt growth. A system prompt grows over time as new instructions get added. Each request pays for the system prompt every time. Audit and prune.

Idle KV cache. For self-hosted, KV cache pages allocated to abandoned conversations are dead weight. Aggressive timeout policies free them; conservative policies waste GPU memory and reduce concurrency.

Function-call rounds. Tool-calling LLMs make 2–5 LLM calls per "user message" (one to decide to call a tool, one to consume the tool result, sometimes more). Per-message cost is 2–5× a chat-only equivalent.

Speculative decoding misses. When the draft model proposes 4 tokens and only 2 are accepted, the rejected work is wasted compute. At low acceptance rates the technique can hurt rather than help.

Failed long-context queries. A 1M-token prompt that fails because of an error costs the full prefill. Stream and abort early on error.

Embeddings. Often forgotten. Re-embedding a corpus on every model update at $0.0001/embedding × 100M chunks = $10,000. Plan re-embedding cadence.

Image encoder cold-start. Vision serving requires the vision encoder. If you serve it inelastically (always-on dedicated GPU), the encoder's idle cost is real.

Eval cost. Running RAGAS-style automated eval on every release and every prompt change adds up: 1000 test cases at $0.05/eval is $50 per release, and 100 releases/year makes it $5,000. Real budget item.

API throughput limits. Hosted providers cap RPS and TPS per account. Hitting the cap turns into a business-line outage. Negotiate enterprise tiers if your peak load exceeds the default.

The cost of hitting rate limits

OpenAI, Anthropic, and Google all enforce per-organisation RPM and TPM (requests- and tokens-per-minute) caps. Defaults: OpenAI Tier 1 caps GPT-4o at 500 RPM / 30k TPM; Anthropic Tier 1 caps Sonnet at 50 RPM / 50k input TPM / 10k output TPM; Gemini caps Pro at 360 RPM. Real production loads breeze past these by 10× during normal operation.

Path to higher tiers: spend qualifies you. OpenAI Tier 5 (highest) requires $5,000 cumulative spend + 30 days. Anthropic Tier 4 requires $400 deposit and ~$400 monthly. Vertex defaults are more conservative; you raise via support ticket.

What it costs to hit limits without a plan: requests start returning HTTP 429, your retry logic stacks them up, latency degrades, customer-facing features fail. The cost is reputational; it doesn't appear on the bill. Pre-negotiate enterprise tiers before launch, not after the incident.


Putting numbers on your product

The unit-economics question for an AI product is: does the revenue per user exceed the inference cost per user?

Worked example: a SaaS chat product.

  • Average user sends 50 messages per month.
  • Average message: 800 input tokens (history + new message), 300 output tokens.
  • Model: Claude Sonnet 4.6 ($3 input, $15 output).
  • Per-message cost: 800 × $3/M + 300 × $15/M = $0.0024 + $0.0045 = $0.0069 ≈ $0.007.
  • Per-user-per-month cost: 50 × $0.007 = $0.35.
  • Charge $9.99/month: 96% gross margin on inference.

Worked example: a customer-support agent.

  • Average ticket: 5 LLM-tool rounds × 2000 input + 500 output tokens each.
  • Model: Sonnet 4.6.
  • Per-ticket cost: 5 × (2000 × $3/M + 500 × $15/M) = 5 × $0.0135 = $0.068.
  • At 1M tickets/year, that's $68,000. Significant but not catastrophic.

Worked example: a reasoning-heavy research tool.

  • Average query: 1000 input tokens, 5000 reasoning tokens (output).
  • Model: o3 ($15 input, $60 output).
  • Per-query cost: 1000 × $15/M + 5000 × $60/M = $0.015 + $0.30 = $0.315.
  • 100 queries/user/month × $0.315 = $31.50.
  • Charge $99/month: still profitable, but you need other moats too — pure cost margin is only 68%.

Worked example: image-heavy product (visual search).

  • Average query: 1 image (1000 tokens) + 30 token prompt → 200 token answer.
  • Model: GPT-4o vision.
  • Per-query cost: 1030 × $5/M + 200 × $20/M = $0.005 + $0.004 = $0.009.
  • High-volume free tier serving 1M queries/month: $9,000/month. Need real monetisation path.

Worked example: AI inside an enterprise product (B2B SaaS).

  • 1000 customers × 100 employees × 200 AI-touched actions/employee/month = 20M actions/month.
  • Average action: 500 input + 200 output tokens, Sonnet 4.6.
  • Cost per action: $0.0045.
  • Monthly cost: $90,000.
  • Customer pays $50/user/month: $5,000,000 revenue, $90k cost, 98% gross margin. B2B SaaS economics shine on AI when usage is bounded per user.

The pattern: per-user revenue must exceed per-user AI cost by 3–10× to sustain a healthy business after support, sales, R&D, and other costs. Run the math at product design time, not after launch.
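The worked examples above all follow the same arithmetic, so it is worth keeping as a small helper. A minimal sketch, assuming per-million-token prices and ignoring caching and retries; the function names are illustrative:

```python
def request_cost(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Dollar cost of one request given per-million-token prices."""
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000


def gross_margin(revenue, cost):
    """Inference gross margin as a fraction of revenue."""
    return (revenue - cost) / revenue


# SaaS chat example: 50 messages/month, 800 in / 300 out, $3 / $15 per million tokens.
per_message = request_cost(800, 300, 3.00, 15.00)
per_user_month = 50 * per_message

print(f"per message:    ${per_message:.4f}")
print(f"per user-month: ${per_user_month:.3f}")
print(f"margin at $9.99: {gross_margin(9.99, per_user_month):.1%}")
```

Swapping in the token counts and prices from the other examples reproduces their per-ticket and per-query figures.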

Sensitivity analysis: which inputs move the bottom line

For the SaaS chat example above ($0.35/user/month at $9.99 charge), here's how the math shifts on each input change:

| Change | New per-user cost | Margin impact |
|---|---|---|
| Sonnet 4.6 → Haiku 4.5 | $0.092 | +2.6 percentage points |
| Sonnet 4.6 → Opus 4.x | $1.73 | -14 percentage points |
| Add prefix caching (70% of input tokens read from cache at 10% of input price) | $0.27 | +0.8 pp |
| Double avg conversation length (100 msg) | $0.70 | -3.5 pp |
| 10× viral spike for 1 month | $3.50 | -32 pp (that month) |
| Switch to o3-medium routing | $13.25 | catastrophic |

The implication: the cost-per-user math is robust to ~2× error on the input assumptions for most well-routed products. It is not robust to a model upgrade decision that swaps Sonnet for Opus or a routing bug that sends everything to a reasoning model. Build the product-economics tracking before the routing changes can land in production.

Free-tier and trial economics

Most consumer AI products run a free tier. The trap: free users consume only ~30–50% of the paid-tier inference volume per user, because they get less context retention and come back to the product less often, but they still cost real money. A common rule: free tier costs ≤30% of paid-conversion revenue. If your free-to-paid conversion is 5%, each paid user supports 19 free users. Taking the full paid-tier figure of $0.35/user/month as a conservative upper bound, 19 free users × $0.35 = $6.65 of free-tier cost per paid user (roughly $2.00–$3.30 at the 30–50% usage level). The paid user's revenue must exceed that plus the paid user's own inference cost plus all non-inference costs. Tight margins in chatbot products with low conversion rates are typically a free-tier cost problem, not a paid-tier problem.
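The free-tier burden per paying user follows directly from the conversion rate. A minimal sketch, assuming every non-converting user stays on the free tier; the function name is illustrative:

```python
def free_tier_burden(conversion_rate, free_cost_per_user):
    """Return (free users per paid user, monthly free-tier cost per paid user)."""
    free_users_per_paid = (1 - conversion_rate) / conversion_rate
    return free_users_per_paid, free_users_per_paid * free_cost_per_user


users, burden = free_tier_burden(conversion_rate=0.05, free_cost_per_user=0.35)
print(f"{users:.0f} free users per paid user, ${burden:.2f}/month carried by each")
```

Halving the conversion rate roughly doubles the burden, which is why low-conversion chat products feel the free tier first.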


Cost optimization playbook

The ranked list of cost levers, in order of typical impact.

1. Use a smaller model. A 7B model that meets your quality bar is 10–20× cheaper than a 70B model. Test smaller variants every release cycle.

2. Route by query difficulty. Easy queries → cheap fast model. Hard queries → expensive smart model. A simple classifier or first-pass LLM judges the routing. 70%+ of traffic is easy.

3. Cap output tokens. Most production tasks don't need 4000-token responses. Set max_tokens to 500 unless you specifically need more.

4. Truncate inputs. Long system prompts, full conversation history, redundant context — trim them. Compaction and summarisation tools (LangChain, custom) reduce per-turn input cost.

5. Enable prefix caching. If your prompts share prefixes (system prompt, document context), turn on prefix caching in your serving stack. 5–10× speedup on cached portions.

6. Use a reasoning budget. For reasoning models, set explicit token caps. Prevents tail latency and runaway cost.

7. Cache LLM responses. For deterministic queries (FAQ-like patterns), cache the response in Redis. Cheap and effective.

8. Quantize. FP8 or INT4 weights, FP8 KV. 2–4× throughput improvement. Test quality before flipping.

9. Speculative decoding. EAGLE-2 or similar for decode-heavy traffic. 1.5–2× speedup.

10. Batch when possible. Async / batch workflows can run at higher batch sizes than real-time. Use batch APIs from OpenAI / Anthropic for 50% discount on the same model.

11. Negotiate enterprise pricing. At >$50k/month spend on any hosted provider, ask for committed-use pricing. 20–40% discounts are common.

12. Move steady traffic self-hosted. When you reach the $5M/year inference spend threshold, evaluate self-hosting on specialist GPU pricing ($15–25/hour for an 8× H100 node). Hybrid is usually optimal.

13. Use the right reasoning depth. o3 with reasoning_effort: low is often as good as reasoning_effort: high at 1/5 the cost. Test before defaulting to max.

14. Audit your retries. Bugs that retry too aggressively are the #1 cause of cost spikes. Set max_retries=2 and log retries.

15. Off-peak batch processing. For non-real-time workloads, run during off-peak hours when GPUs are cheaper.
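Lever 14 (audit your retries) is easy to state and easy to get wrong. A minimal sketch of a bounded, logged retry wrapper in plain Python; `call_with_retries` is a hypothetical helper, not any provider SDK's API, and real code should catch only retryable errors such as 429s and timeouts:

```python
import logging
import time

logger = logging.getLogger("inference.retries")


def call_with_retries(call, max_retries=2, base_delay=0.5, sleep=time.sleep):
    """Run call(); retry at most max_retries times with exponential backoff."""
    attempt = 0
    while True:
        try:
            return call()
        except Exception as exc:  # narrow this to retryable errors in real code
            if attempt >= max_retries:
                logger.error("giving up after %d retries: %s", attempt, exc)
                raise
            attempt += 1
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning("retry %d/%d in %.1fs: %s", attempt, max_retries, delay, exc)
            sleep(delay)
```

With `max_retries=2`, a request is billed at most three times, and every retry leaves a log line you can aggregate next to cost-per-task.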


Batch APIs and async inference economics

Batch APIs are the cheapest path for non-real-time work and are systematically underused.

What batch APIs are and what they cost

OpenAI, Anthropic, and Google all offer batch tiers: submit a JSONL file of requests, receive results within 24 hours, pay 50% of the synchronous price. OpenAI's Batch API caps at 50,000 requests / 100 MB per file; Anthropic's Message Batches API supports up to 100,000 requests with 256 MB caps; Google's Gemini batch API runs through Vertex with similar limits. All three guarantee completion within 24 hours; most batches finish in 1–4 hours in practice.

A 50% discount stacks with everything else. A Sonnet 4.6 query that costs $0.009 synchronously costs $0.0045 in batch. At 1M queries/month, that's $4,500 saved with zero engineering work beyond switching the endpoint.

When batch wins and when it doesn't

Batch wins for: nightly analytics, eval runs, document processing, dataset generation, embedding refreshes, content moderation backfills, summarising historical chat logs. Batch loses for: anything with a user waiting, anything with real-time data dependencies, anything where the result feeds another LLM call in the same flow.

Hybrid pattern: route latency-insensitive work (the 30% of LLM calls in most products that are background analytics or async enrichment) to batch. Saves a hard 15% of total inference spend in typical SaaS.
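The 15% figure is the batch share multiplied by the batch discount. A small sketch, assuming spend is proportional to call volume and that background calls cost the same per call as interactive ones:

```python
def blended_spend(total_spend, batch_share, batch_discount=0.50):
    """Spend after moving batch_share of calls to a discounted batch tier."""
    return total_spend * (1 - batch_share * batch_discount)


before = 100_000.0
after = blended_spend(before, batch_share=0.30)
print(f"saving: {1 - after / before:.0%}")
```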

Self-hosted async: even cheaper

If you self-host and have idle capacity overnight, run async work then. Some teams use a "batch queue" pattern: jobs submitted during business hours queue up; the inference cluster drains them at 80%+ utilisation during off-peak hours when chat traffic is low. Effective cost: marginal — you're using GPUs that were sitting idle.


Fine-tuning vs RAG vs prompting: cost comparison

The "should I fine-tune?" question has a cost dimension that often dominates the technical one.

Prompting (zero-shot or few-shot in the prompt)

Setup cost: zero. Per-call cost: pay for the input tokens carrying your examples or instructions. For 5 examples of 200 tokens each = 1000 extra input tokens per call. At Sonnet 4.6 rates that's $0.003/call extra. At 1M calls/month: $3,000/month. Pros: no training pipeline, no model versioning, fast iteration. Cons: examples eat context budget; large prompts add latency.

RAG (retrieval-augmented generation)

Setup cost: embedding the corpus ($0.0001/embedding × 1M chunks = $100 one-time, plus storage). Per-call cost: pay for retrieved context (typically 1–3k tokens) on every call. At Sonnet 4.6 rates, 2000 extra input tokens = $0.006/call extra. At 1M calls/month: $6,000/month, plus ~$500/month for vector store. Pros: source-grounded, easier to keep current, citations. Cons: retrieval quality matters; longer context. See RAG production architecture.

Fine-tuning (LoRA, full fine-tune, distillation)

Setup cost: training data prep ($10k–$100k of human-curated examples) + training compute ($500–$5,000 for a LoRA on an open-weight model, $5k–$50k for a full fine-tune; a closed-model fine-tune API has no compute to rent but bills $25–$100 per million training tokens). Per-call cost: same as base model, possibly cheaper if fine-tuning lets you drop to a smaller model. At 1M calls/month with a 13B fine-tune replacing a 70B base, cost drops 5×. Pros: smallest per-call cost, lowest latency, style baked in. Cons: high setup cost, brittle to data drift, model versioning operational burden.

When each wins

Prompting wins below ~100k calls/month. RAG wins when freshness or sourcing matters, regardless of volume. Fine-tuning wins above ~5M calls/month on a stable task. Many production stacks layer all three: a fine-tuned smaller model + RAG for current info + few-shot examples for edge cases. See multi-tenant LoRA serving for the per-customer fine-tune pattern.

Cost-per-quality crossover example

For a customer-support classifier processing 10M tickets/year:

  • Prompting GPT-4o with 5 examples: $10/M tokens × ~1500 tokens/call × 10M = $150,000/year.
  • RAG with retrieved similar tickets: similar input cost + retrieval infra = ~$170,000/year.
  • Fine-tuned Llama-3.1 8B (LoRA): $5k training + $0.10/M × 500 tokens × 10M = $5,000 + $500 = $5,500 in the first year.

The fine-tune costs roughly 1/27th as much and runs faster. For sufficiently high-volume, stable-task production workloads, fine-tuning is the obvious answer.
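The comparison reduces to a fixed setup cost plus a per-call cost, so the break-even volume can be computed directly. A minimal sketch using the per-token prices from the bullets; it leaves out data-prep cost, as the example above does, so the real break-even sits higher:

```python
def annual_cost(setup, price_per_m, tokens_per_call, calls):
    """Setup cost plus per-token inference cost for a year of calls."""
    return setup + price_per_m * tokens_per_call * calls / 1_000_000


def breakeven_calls(setup_a, per_call_a, setup_b, per_call_b):
    """Call volume at which option B (more setup, cheaper calls) matches A."""
    return (setup_b - setup_a) / (per_call_a - per_call_b)


calls = 10_000_000
prompting = annual_cost(0, 10.00, 1500, calls)
fine_tune = annual_cost(5_000, 0.10, 500, calls)
be = breakeven_calls(0, 10.00 * 1500 / 1e6, 5_000, 0.10 * 500 / 1e6)

print(f"prompting: ${prompting:,.0f}  fine-tune: ${fine_tune:,.0f}")
print(f"break-even on training compute alone: {be:,.0f} calls")
```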


Benchmarking your own cost-per-task

Provider pricing pages tell you per-token cost. Your application's cost-per-business-outcome is different and matters more.

Define the task, then measure

Pick the unit your business cares about: cost-per-resolved-ticket, cost-per-generated-article, cost-per-classification, cost-per-user-day. Instrument inference calls with structured logs: model, input tokens, output tokens, latency, tool calls, retries, end-to-end task success. Aggregate weekly.

A typical instrumentation: a single span per task with attributes model, input_tokens, output_tokens, cached_tokens, tool_calls, total_cost_usd, task_success. OpenTelemetry semantic conventions for GenAI (opentelemetry.io GenAI conventions) standardise the field names.
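One way to start is a plain structured log line per task before wiring up a tracing backend. The sketch below uses a dataclass, not the OpenTelemetry SDK; the field names mirror the attribute list above and are not guaranteed to match the official GenAI semantic-convention names:

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TaskSpan:
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    tool_calls: int
    total_cost_usd: float
    task_success: bool


def emit(span: TaskSpan) -> str:
    """Serialise one task span as a JSON log line and return it."""
    line = json.dumps(asdict(span), sort_keys=True)
    print(line)
    return line


emit(TaskSpan("example-model", 2000, 500, 1500, 3, 0.0135, True))
```

Aggregating these lines weekly by task type gives cost-per-resolved-ticket without any provider-specific tooling.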

Real benchmark suites worth running

  • HELM (Stanford, crfm.stanford.edu/helm) — broad model quality eval; useful for picking model tier per task.
  • MT-Bench and AlpacaEval 2.0 — chat quality benchmarks, useful for distinguishing tiers within a model family.
  • HumanEval, MBPP, SWE-bench — code tasks; if you're shipping code AI, run these on your candidate models.
  • GAIA, AgentBench, SWE-agent — agent benchmarks; for tool-use products.
  • RAGAS — automated faithfulness/answer-relevance metrics for RAG pipelines.

Run your candidate models on your eval set. Compute cost-per-correct-answer, not cost-per-call. Sonnet 4.6 at $0.009/call with 95% accuracy may beat Gemini Flash at $0.0001/call with 70% accuracy if a wrong answer costs you a customer.
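Cost-per-correct-answer can be made explicit by adding the downstream cost of a wrong answer. A minimal sketch; the $1.00 cost of a wrong answer is a hypothetical placeholder you would replace with your own estimate:

```python
def cost_per_correct(cost_per_call, accuracy, cost_of_wrong=0.0):
    """Expected spend per correct answer, including the cost of errors."""
    return (cost_per_call + (1 - accuracy) * cost_of_wrong) / accuracy


for wrong in (0.0, 1.0):  # dollars lost per wrong answer (assumed)
    strong = cost_per_correct(0.009, 0.95, wrong)
    cheap = cost_per_correct(0.0001, 0.70, wrong)
    print(f"wrong answer costs ${wrong:.2f}: strong ${strong:.4f}, cheap ${cheap:.4f}")
```

When errors are free the cheap model wins easily; once a wrong answer costs even a dollar, the ordering flips.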

Production budget guardrails

Set per-tenant, per-user, per-day spend caps in your API gateway. Page when daily spend exceeds 1.5× the trailing 7-day average. Alert on unusual model mix (a sudden shift to o3 from Sonnet usually means a bug or a misrouted request). Most inference cost incidents are runaway loops, not gradual growth — guardrails catch them in hours, not weeks.
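The paging rule above fits in a few lines. A minimal sketch, assuming you already have a list of daily spend totals; the function name and defaults are illustrative:

```python
from statistics import mean


def should_page(today_spend, trailing_daily_spend, multiplier=1.5, window=7):
    """True when today's spend exceeds multiplier x the trailing-window average."""
    recent = trailing_daily_spend[-window:]
    if not recent:
        return False  # no baseline yet, nothing to compare against
    return today_spend > multiplier * mean(recent)


history = [100, 110, 95, 105, 100, 98, 102]
print(should_page(140, history))  # below 1.5x the ~$101 average
print(should_page(400, history))  # runaway loop territory
```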


Per-provider 2026 pricing tear-down

The headline tables earlier in this guide compress nine providers into a few rows. The cost-per-task answer at scale depends on details the headline table hides: which tier of each model you're paying for, what surcharges apply for long context, what discounts apply for batch and caching, and what the actual per-second throughput looks like in production. The next ten subsections take each major provider apart.

OpenAI: GPT-5, o3, o4, GPT-4o family

OpenAI's 2026 pricing matrix is the most multi-tiered of the frontier providers. Six dimensions matter.

Model tiers and headline prices (mid-2026):

| Model | Input ($/M) | Output ($/M) | Cached input ($/M) | Notes |
|---|---|---|---|---|
| GPT-5 | $5.00 | $20.00 | $1.25 | Frontier general-purpose |
| GPT-5 mini | $0.60 | $2.40 | $0.15 | Cheaper GPT-5 family |
| o3 | $15.00 | $60.00 | $7.50 | Reasoning, high-effort tier |
| o3-mini | $1.10 | $4.40 | $0.55 | Cheaper reasoning |
| o4-preview | $7.50 | $30.00 | $1.88 | New reasoning tier |
| GPT-4o | $2.50 | $10.00 | $1.25 | Older general; still widely used |
| GPT-4o mini | $0.15 | $0.60 | $0.075 | Cheap general; high-volume default |
| GPT-4o vision | $2.50 | $10.00 | $1.25 | Same as text + image tile cost |
| GPT-4o audio (Realtime) | $40 audio / $5 text | $80 audio / $20 text | | Per million audio tokens |
| TTS-1 | $15/M chars | | | Standard voice |
| TTS-1 HD | $30/M chars | | | High-quality voice |

Batch API discount: flat 50% off both input and output. Caps at 50,000 requests / 100 MB per file. 24-hour SLA, typically completes in 1–4 hours.

Prompt caching: automatic when a prefix of ≥ 1024 tokens is reused. Cached input is billed at ~25–50% of the regular input price, depending on the model (the table column above). No explicit cache_control markers needed. Cache TTL ~5–10 minutes idle. Best-in-class developer ergonomics: it works without code changes.

Reasoning surcharge: o3 with reasoning_effort: high typically emits 5,000–20,000 thinking tokens per response. All thinking tokens bill as output at $60/M. A query that costs $0.075 on o3-low can cost $1.20+ on o3-high for the same prompt. Always specify effort level explicitly.

Long context surcharge: GPT-5 has 1M context at the base price (no surcharge). o3 caps at 200k. Vision adds tile-based input tokens (more in Section 9 of multimodal serving).

Enterprise tier: ChatGPT Enterprise + API committed-use discounts at $50k+/month range from 15–25%. Azure OpenAI parity pricing is within 5%; Azure provides Microsoft enterprise agreement leverage.

Anthropic: Claude Opus 4.x, Sonnet 4.6, Haiku 4.5

Anthropic's 2026 pricing is simpler — three primary tiers — but with the most aggressive prompt-cache discount in the market.

| Model | Input ($/M) | Output ($/M) | Cache write ($/M) | Cache read ($/M) |
|---|---|---|---|---|
| Claude Opus 4.x | $15.00 | $75.00 | $18.75 | $1.50 |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $3.75 | $0.30 |
| Claude Haiku 4.5 | $0.80 | $4.00 | $1.00 | $0.08 |

Cache mechanics: explicit cache_control: {type: "ephemeral"} block markers (up to 4 cache breakpoints per request). Cache TTL is 5 minutes by default, 1-hour optional tier at 2× the write cost. Cache reads at ~10% of regular input price — the largest cache discount in the industry. For a 50k-token system prompt reused across 10,000 daily messages, Anthropic's caching cuts input cost by ~85%.

Batch API (Message Batches): flat 50% discount on input and output, 100k requests/batch, 256 MB limit. 24-hour SLA. Cache-aware: cache hits stack with the batch discount.

Extended thinking: Claude Opus 4.x and Sonnet 4.6 support thinking_budget parameter ranging 0–32,000 (Opus) or 0–16,000 (Sonnet). Thinking tokens bill as output. A thinking_budget: 16000 Opus call worst-case adds $1.20 per response. Setting it explicitly is the difference between predictable cost and runaway bills.

Vision pricing: images are priced by area after resizing to fit the model's limits, at roughly (width × height) / 750 input tokens. The long edge is capped at 1568 px, and a full-size image comes to about 1,600 prompt tokens; smaller images scale down with pixel count.

Google: Gemini 2.5 Pro, Flash, Flash-Lite, Nano

Google's pricing has the widest spread between cheapest and most expensive tier of any major provider.

| Model | Input ($/M) | Output ($/M) | Cached input ($/M) | Context / notes |
|---|---|---|---|---|
| Gemini 2.5 Pro (≤200k tokens) | $1.25 | $5.00 | $0.31 | 1M context |
| Gemini 2.5 Pro (>200k tokens) | $2.50 | $10.00 | $0.625 | Same model, surcharge tier |
| Gemini 2.5 Flash | $0.075 | $0.30 | $0.019 | 1M context |
| Gemini 2.5 Flash-Lite | $0.025 | $0.10 | $0.006 | 32k context |
| Gemini Nano (on-device) | $0 | $0 | | Pixel devices |
| Imagen 4 | $0.04/image | | | Image generation |
| Veo 3 | $0.50/second | | | Video generation |

Long-context surcharge: Gemini 2.5 Pro charges 2× rate above 200k tokens. A 500k-token request bills 200k at $1.25/M + 300k at $2.50/M = $1.00 — significant when context-stuffing.

Batch API: Vertex AI batch prediction at ~50% discount. Files up to 10 GB; SLA 24 hours.

Context caching: Google's cachedContents API with explicit TTL (default 1 hour, configurable). Cached read at ~25% of regular input. Cache storage billed separately at $1/M tokens/hour for Flash, $4.50/M tokens/hour for Pro — a cost vector that surprises teams. A 100k-token cache held for 24 hours on Flash: 0.1 × 24 × 1 = $2.40 storage, plus per-read at $0.019/M tokens. Compare to OpenAI/Anthropic where storage is implicit.
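Because storage is billed per token-hour, it is worth computing before choosing a TTL. A minimal sketch using the storage rates quoted above; the function name is illustrative:

```python
def cache_storage_cost(tokens, hours, price_per_m_token_hour):
    """Storage fee for holding a cached prefix for a number of hours."""
    return tokens / 1_000_000 * hours * price_per_m_token_hour


flash_day = cache_storage_cost(100_000, 24, 1.00)
pro_day = cache_storage_cost(200_000, 24, 4.50)
print(f"Flash, 100k tokens, 24h: ${flash_day:.2f}")
print(f"Pro,   200k tokens, 24h: ${pro_day:.2f}")
```

A short TTL that is refreshed only while traffic is live keeps this line item close to zero overnight.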

Free tier: Google's free tier on Gemini 2.5 Flash via AI Studio is generous (1,500 RPD, 1M TPM). The catch: free-tier requests train future models unless you opt out (paid tier does not train).

Mistral: Large, Codestral, Pixtral, Mistral Small/Nemo

| Model | Input ($/M) | Output ($/M) | Notes |
|---|---|---|---|
| Mistral Large 2 (123B) | $2.00 | $6.00 | Frontier dense |
| Codestral 2 | $0.20 | $0.60 | Code-specialised |
| Pixtral Large | $2.00 | $6.00 | Vision frontier |
| Mistral Nemo (12B) | $0.15 | $0.15 | Cheap dense |
| Mistral Small 3 | $0.20 | $0.60 | Cheap-mid tier |
| Codestral Embed | $0.10/M | | Code embeddings |

Mistral's open-weight licensing means most models are also available self-hosted or via Together/Fireworks at ~50% of Mistral's direct API price. La Plateforme (Mistral's API) wins on EU data residency; for non-EU use cases the price-performance edge is narrow.

Cohere: Command R+, Command R, Aya

| Model | Input ($/M) | Output ($/M) | Notes |
|---|---|---|---|
| Command R+ | $2.50 | $10.00 | Frontier RAG-tuned |
| Command R | $0.15 | $0.60 | Cheap-mid RAG |
| Aya 32B | $0.50 | $1.50 | Multilingual |
| Embed v3 (English) | $0.10 | | Best-in-class embeddings |
| Rerank v3 | $2.00 / 1k searches | | Rerank API |

Cohere's competitive moat is enterprise RAG. Command R+ ships with tool-use, structured outputs, and citation grounding built in. The Rerank v3 endpoint is widely used as the second stage in production RAG stacks regardless of which LLM does the generation.

xAI: Grok 3, Grok 3 mini

| Model | Input ($/M) | Output ($/M) | Notes |
|---|---|---|---|
| Grok 3 | $3.00 | $15.00 | Frontier tier |
| Grok 3 mini | $0.30 | $0.50 | Cheap tier |
| Grok 3 image | $0.07/image | | Image input |

Grok's distinguishing feature in 2026 is real-time X/Twitter data integration — useful for trend-aware applications. Token prices are roughly Anthropic Sonnet 4.6 parity.

DeepSeek: V3, R1, and the price-disruption story

| Model | Input ($/M) | Output ($/M) | Cached input ($/M) | Notes |
|---|---|---|---|---|
| DeepSeek V3 (671B MoE / 37B active) | $0.27 | $1.10 | $0.07 | Off-peak: $0.14/$0.55 |
| DeepSeek R1 | $0.55 | $2.19 | $0.14 | Off-peak: $0.28/$1.10 |
| DeepSeek Coder V3 | $0.27 | $1.10 | $0.07 | Code-tuned |

DeepSeek introduced two pricing innovations that the industry partially copied: explicit off-peak pricing (50% discount 16:30–00:30 UTC) and per-token caching that the user does not have to mark up (automatic prefix detection). R1 at $0.55 input is roughly 1/30 of o3 for comparable reasoning benchmark performance — the single largest price/quality disruption of 2024–2026.

Meta Llama models via 3rd-party hosts

Llama-3.1, Llama-3.2 Vision, Llama-3.3, and Llama-4 are open-weight. Prices vary by host.

Model Together Fireworks DeepInfra Bedrock Groq
Llama 3.1 8B $0.18/$0.18 $0.20/$0.20 $0.06/$0.06 $0.22/$0.22 $0.05/$0.08
Llama 3.1 70B $0.88/$0.88 $0.90/$0.90 $0.35/$0.40 $0.99/$0.99 $0.59/$0.79
Llama 3.1 405B $3.50/$3.50 $3.00/$3.00 $2.70/$2.70 $5.32/$16.00 n/a
Llama 3.2 11B Vision $0.18/$0.18 $0.20/$0.20 $0.16/$0.16 n/a
Llama 3.2 90B Vision $1.20/$1.20 $0.90/$0.90 $2.00/$2.00 n/a
Llama 3.3 70B $0.88/$0.88 $0.90/$0.90 $0.35/$0.40 $0.72/$0.72 $0.59/$0.79

(All input/output $/M tokens, mid-2026.)

The pattern: DeepInfra and Groq are cheapest. Bedrock is most expensive but bundled with AWS enterprise agreements. Together/Fireworks sit in the middle with the best DX (LoRA hosting, prompt caching, JSON-mode support). Pick by your priority: cost (DeepInfra), latency (Groq, Cerebras), features (Together, Fireworks), enterprise (Bedrock).

Specialty silicon: Groq, Cerebras, SambaNova

| Provider | Hardware | Speed advantage | Pricing example (Llama 70B) |
|---|---|---|---|
| Groq | LPU | 6–10× faster decode than H100 | $0.59 input / $0.79 output per M |
| Cerebras | WSE-3 | 8–15× faster decode than H100 | $0.60 / $0.80 per M for Llama 70B |
| SambaNova | RDU | 5× faster, large context | Enterprise pricing |

The cost equation: per-token they're often cheaper than H100 hosts, and they unlock latency budgets (sub-100ms time-to-first-token at 1k context) that H100 stacks cannot reach. For voice and real-time agents, specialty silicon is increasingly the only path to acceptable UX.

What's missing: pricing volatility and dated quotes

Every number in this section is mid-2026. Prices for open-weight models on third-party hosts move monthly; frontier model prices drop at major release cadence (quarterly to semi-annually). For investment-grade decisions, pull current prices from the provider's pricing page on the day of the decision. The relative shape (frontier ~$3–$15 input, cheap-tier ~$0.05–$0.20 input, reasoning ~$15–$60 output, batch 50% off, cache 80–95% off) is stable; the absolute numbers are not.


Reasoning-token pricing math (the o-series problem)

Reasoning models look like normal API endpoints on the pricing page. They aren't. They emit thinking tokens that bill as output but don't appear in the visible response, and the count is unpredictable. Three concrete examples make the problem visible.

Example 1: a coding question on o3 vs Sonnet 4.6

Prompt: "Refactor this 200-line Python function to use async and reduce DB roundtrips." Input ~500 tokens (the code).

Sonnet 4.6 response: 800 visible output tokens, no thinking.

  • Cost: 500 × $3/M + 800 × $15/M = $0.0015 + $0.012 = $0.0135.

o3 response (effort: medium): 4,500 thinking tokens + 800 visible output = 5,300 output tokens billed.

  • Cost: 500 × $15/M + 5,300 × $60/M = $0.0075 + $0.318 = $0.326.

Same input, same useful output. o3 costs 24× more. If your coding workload is 1M queries/year, that's $13,500 on Sonnet vs $326,000 on o3-medium. The question is whether o3 gets the answer right 24× more often. For routine refactoring, no. For complex debugging across an unfamiliar codebase, sometimes — and routing based on that distinction is where the savings live.

Example 2: thinking budget caps

Anthropic's thinking_budget: 8000 on Opus 4.x for a complex analysis prompt:

  • Worst case thinking: 8,000 tokens × $75/M = $0.60.
  • Plus regular output: 1,500 tokens × $75/M = $0.1125.
  • Plus input: 2,000 tokens × $15/M = $0.030.
  • Total per query: $0.74.

Same prompt on Opus without thinking enabled: 2,000 × $15/M + 1,500 × $75/M = $0.143, five times cheaper. The thinking budget sets a $0.60 ceiling on thinking cost; the expected cost depends on how much of the budget the model actually uses. For Opus extended thinking on hard math, that averages ~75% of the budget. Practical rule: plan on 75% × cap as the expected thinking cost, here 0.75 × $0.60 = $0.45, for an expected total of about $0.59 per query.

Example 3: DeepSeek R1 vs o3 at the same task

A GPQA-style scientific reasoning question, 800 tokens input.

| Model | Thinking tokens | Visible output | Total cost |
|---|---|---|---|
| o3 (effort: high) | ~12,000 | 500 | 800 × $15/M + 12,500 × $60/M = $0.762 |
| o3 (effort: medium) | ~3,500 | 500 | 800 × $15/M + 4,000 × $60/M = $0.252 |
| o3 (effort: low) | ~800 | 500 | 800 × $15/M + 1,300 × $60/M = $0.090 |
| DeepSeek R1 | ~4,000 | 500 | 800 × $0.55/M + 4,500 × $2.19/M = $0.010 |
| Claude Opus 4.x thinking=16k | ~9,000 | 500 | 800 × $15/M + 9,500 × $75/M = $0.725 |
| Gemini 2.5 Pro Deep Think | ~5,000 | 500 | 800 × $1.25/M + 5,500 × $5/M = $0.029 |

For products where every penny counts and the marginal benchmark accuracy difference is within tolerance, DeepSeek R1 and Gemini 2.5 Pro Deep Think win by 25–75×. For mission-critical reasoning where the cost of a wrong answer dominates the model price, o3-high or Opus thinking are worth the surcharge. Routing makes both true at the same time.

The thinking-token explosion math

The arithmetic that determines reasoning economics is the ratio of thinking tokens to visible output tokens. Define R = thinking / visible. Reasoning model cost is roughly input × p_in + (visible × (1 + R)) × p_out. For R = 10 (o3-medium), this is 11× the output-cost contribution. For R = 30 (o3-high on hardest tasks), 31×. The visible output costs become rounding error.

Implication: when comparing reasoning models, R differs by a factor of 3–10× across providers for the same task. Cheap reasoning (DeepSeek R1) can run a high R at a low per-token price. Frontier reasoning (o3-high) runs a high R at a high per-token price. Output cost scales with (1 + R) × the per-token price, so the two factors multiply.
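The R-based formula is short enough to keep as a helper for comparing models. A minimal sketch; the R values below are derived from the thinking and visible token counts in the Example 3 table (12,000 / 500 and 4,000 / 500):

```python
def reasoning_cost(input_tokens, visible_tokens, r, price_in_per_m, price_out_per_m):
    """Cost of one reasoning call where r = thinking tokens / visible tokens."""
    billed_output = visible_tokens * (1 + r)
    return (input_tokens * price_in_per_m + billed_output * price_out_per_m) / 1_000_000


o3_high = reasoning_cost(800, 500, r=24, price_in_per_m=15.00, price_out_per_m=60.00)
r1 = reasoning_cost(800, 500, r=8, price_in_per_m=0.55, price_out_per_m=2.19)
print(f"o3-high: ${o3_high:.3f}   DeepSeek R1: ${r1:.4f}")
```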

The reasoning routing pattern in production

The best-practice 2026 production routing for reasoning workloads:

  1. Classifier first. A small (1B–8B) classifier or cheap LLM call decides: is this query reasoning-heavy or chat-simple? Cost: $0.0001–$0.001 per classification.
  2. Easy queries → Sonnet 4.6 / Gemini Flash / Haiku 4.5. Cost: $0.001–$0.01.
  3. Medium reasoning → DeepSeek R1 / Gemini 2.5 Pro Deep Think. Cost: $0.01–$0.05.
  4. Hardest reasoning → o3-high / Opus thinking. Cost: $0.30–$1.00.
  5. Budget caps everywhere. No model ever runs with unbounded thinking.

A typical mix on a customer-support agent: 70% easy, 25% medium, 5% hard. Blended cost: 0.70 × $0.005 + 0.25 × $0.03 + 0.05 × $0.50 = $0.036/query. If you naively routed everything to o3-high: $1.00/query. 28× difference for the same blended quality.
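The blended figure is a traffic-weighted average, which is easy to recompute when the mix shifts. A minimal sketch using the shares and per-tier costs from the paragraph above:

```python
def blended_cost(mix):
    """mix: list of (traffic_share, cost_per_query); shares must sum to 1."""
    total_share = sum(share for share, _ in mix)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError(f"traffic shares sum to {total_share}, expected 1.0")
    return sum(share * cost for share, cost in mix)


routed = blended_cost([(0.70, 0.005), (0.25, 0.03), (0.05, 0.50)])
naive = 1.00  # everything sent to the most expensive tier
print(f"routed: ${routed:.3f}/query, naive is {naive / routed:.0f}x more")
```

Note how the 5% hard tier contributes most of the blended cost; a classifier that over-escalates by a few points moves the total more than any price change on the cheap tier.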


Prompt caching pricing across providers

Prompt caching is the largest single cost lever for stable-system-prompt workloads — chat assistants, agents with long instructions, RAG systems with shared corpus context. The mechanics differ by provider; the dollar math differs by 5× across them.

How each provider's cache works

| Provider | Cache trigger | Cache TTL | Read price | Write price | Min prefix |
|---|---|---|---|---|---|
| OpenAI | Automatic (1024+ tokens) | 5–10 min idle | ~25% of input | Same as input | 1024 tokens |
| Anthropic | Explicit cache_control blocks | 5 min default; 1 hour optional | ~10% of input | 1.25× input | None |
| Google | Explicit cachedContents API | 1 hour default, configurable | ~25% of input | Same as input + storage fee | 32k tokens (Pro), 1k (Flash) |
| DeepSeek | Automatic prefix detection | Hours | ~25% of input | Same as input | None published |
| Fireworks | Automatic | Configurable | 50% of input | Same as input | None |
| vLLM (self-host) | Automatic radix cache | Configurable; VRAM-bound | Free | Free | None |

The dimension that surprises people most is Google's storage fee — the only major provider that charges for cache occupancy. A 200k-token Gemini Pro cache held for 24 hours costs $21.60 in storage alone (0.2 × 24 × 4.5). Practical: only cache if you expect ≥20 reads per cache write.

Worked cost example: a 50,000-token system prompt across 100k daily messages

A customer-service AI with a 50k-token system prompt (policies, FAQs, tone guide) serving 100k daily messages. Input on every message ~500 tokens (the user query). Output ~300 tokens.

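No caching (baseline):

| Provider | Daily cost |
|---|---|
| Sonnet 4.6 | 100k × (50,500 × $3/M + 300 × $15/M) = $15,600 |
| GPT-4o | 100k × (50,500 × $2.50/M + 300 × $10/M) = $12,925 |
| Gemini 2.5 Pro | 100k × (50,500 × $1.25/M + 300 × $5/M) = $6,463 |
| DeepSeek V3 | 100k × (50,500 × $0.27/M + 300 × $1.10/M) = $1,397 |

With caching enabled (the cache stays warm all day, so cache writes are negligible; output is billed as in the baseline):

| Provider | First call (cache write) | Subsequent call (cache read), input only | Daily cost incl. output |
|---|---|---|---|
| Sonnet 4.6 (5-min cache) | 50,000 × $3.75/M + 500 × $3/M = $0.189 | 50,000 × $0.30/M + 500 × $3/M = $0.0165 | 100k × ($0.0165 + $0.0045) = $2,100 |
| GPT-4o auto-cache | $0.127 | 50,000 × $0.625/M + 500 × $2.50/M = $0.0325 | 100k × ($0.0325 + $0.003) = $3,550 |
| Gemini 2.5 Pro | $0.064, plus storage: 0.05M tokens × $4.50/M/hr × 24 = $5.40/day | 50,000 × $0.31/M + 500 × $1.25/M = $0.016125 | 100k × ($0.016125 + $0.0015) + $5.40 = $1,768 |
| DeepSeek V3 | $0.014 | 50,000 × $0.07/M + 500 × $0.27/M = $0.003635 | 100k × ($0.003635 + $0.00033) = $397 |

Savings from caching:

| Provider | Without cache (annual) | With cache (annual) | Annual savings |
|---|---|---|---|
| Sonnet 4.6 | $5.69M | $767k | $4.93M |
| GPT-4o | $4.72M | $1.30M | $3.42M |
| Gemini 2.5 Pro | $2.36M | $645k | $1.71M |
| DeepSeek V3 | $510k | $145k | $365k |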

Caching pays for itself in the first hour of operation. Not enabling it is a five-to-seven-figure annual mistake.

Cache invalidation: the hidden tax

Every modification to the cached prefix invalidates the cache. Adding a new policy line, swapping the date in the system prompt, including the user ID in the prefix — all force a re-write at full cost. Practical engineering rules:

  1. Stable prefix first. Put truly stable content at the start of the prompt; volatile content (user context, current date, conversation history) at the end.
  2. Anthropic's 4 cache breakpoints. Use them to mark the boundary between stable and volatile sections. Cache reads up to the last unchanged breakpoint.
  3. Don't include user IDs in cached prefix. Each user gets their own cache, which becomes uneconomical at high tenant counts. Pass user identity through tool calls or a separate context section.
  4. Avoid timestamps in the prefix. "Current time: 2026-05-16 14:23:01" forces a cache miss every second.
  5. Version your system prompt explicitly. When you change it, expect 1–2 hours of cache miss penalty as the new version warms across regions.
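Rules 1, 3, and 4 can be enforced in the prompt builder itself. A minimal sketch in plain Python, with no provider SDK involved; `build_prompt` and the fingerprint are hypothetical helpers for checking that the cached prefix really is byte-stable across requests:

```python
import hashlib


def build_prompt(stable_sections, volatile_sections):
    """Return (prompt, prefix_fingerprint) with stable content always first."""
    prefix = "\n\n".join(stable_sections)
    suffix = "\n\n".join(volatile_sections)
    fingerprint = hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]
    return prefix + "\n\n" + suffix, fingerprint


stable = ["POLICIES: ...", "FAQ: ...", "TONE GUIDE: ..."]
_, fp_a = build_prompt(stable, ["user_id: 17", "time: 14:23:01", "Q: refund?"])
_, fp_b = build_prompt(stable, ["user_id: 42", "time: 14:23:02", "Q: shipping?"])
print(fp_a == fp_b)  # same fingerprint means the cacheable prefix did not change
```

Logging the fingerprint with each request makes an accidental prefix change show up as a new value rather than as an unexplained cost spike.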

Cache hit rate as a unit-economics metric

Production stacks track cache_hit_rate = cached_input_tokens / total_input_tokens. Good stacks hit 80–95%. Bad stacks hit <30% because of one of the engineering anti-patterns above. The metric belongs on the same dashboard as cost-per-task; a sudden drop indicates a system prompt edit or a routing change worth investigating.
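The metric is token-weighted, not request-weighted, so it should be computed from summed token counts. A minimal sketch, assuming each logged request carries the two counts under the hypothetical keys shown:

```python
def cache_hit_rate(requests):
    """Token-weighted hit rate over dicts with cached_input_tokens / input_tokens."""
    requests = list(requests)
    cached = sum(r["cached_input_tokens"] for r in requests)
    total = sum(r["input_tokens"] for r in requests)
    return cached / total if total else 0.0


logs = [
    {"input_tokens": 50_500, "cached_input_tokens": 50_000},
    {"input_tokens": 50_500, "cached_input_tokens": 50_000},
    {"input_tokens": 50_500, "cached_input_tokens": 0},  # prefix changed: miss
]
print(f"{cache_hit_rate(logs):.1%}")
```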


Self-host capex/opex deep dive (B200, H200, GH200)

The headline self-host math earlier in this guide used 8× H100 SXM at $20/hour as the working example. The reality in 2026 is that H100 is no longer the default new build: B200 and H200 dominate new orders, and GH200/GB200 superchips are entering production. The economics shift accordingly.

Capex: what each platform costs to buy in 2026

| Platform | Configuration | Street price (mid-2026) | Power draw |
|---|---|---|---|
| 8× H100 SXM HGX | 80 GB HBM3, 700W TDP each | $230k–$280k | ~10 kW peak |
| 8× H200 SXM HGX | 141 GB HBM3e, 700W each | $300k–$360k | ~10 kW peak |
| 8× B200 SXM HGX | 192 GB HBM3e, 1000W each | $450k–$550k | ~14 kW peak |
| 1× GH200 Grace Hopper | 96 GB HBM3 + 480 GB LPDDR5X | $40k–$55k | ~1 kW |
| 1× GB200 NVL2 | 2× B200 + Grace CPU | $130k–$160k | ~3.5 kW |
| GB200 NVL72 rack | 72× B200 + 36× Grace | $3.0M–$3.5M | ~120 kW |
| 8× MI300X (AMD) | 192 GB HBM3 each, 750W | $200k–$260k | ~9 kW peak |
| 8× MI325X (AMD) | 256 GB HBM3e, 1000W | $260k–$330k | ~12 kW peak |
| 8× Intel Gaudi 3 | 128 GB HBM2e, 600W | $170k–$220k | ~7 kW peak |
| 8× L40S | 48 GB GDDR6, 350W | $80k–$110k | ~4 kW peak |

Used market (mid-2026):

  • H100 SXM used: $20k–$28k per GPU, down from $35k peak in 2023.
  • H200 used: $32k–$38k per GPU, limited supply.
  • A100 80GB used: $9k–$13k. The A100 is now a value play for non-frontier inference.
  • L40S used: $7k–$10k. Sweet spot for small-model fleets.

Opex: power, cooling, colo

For the bought-and-racked case:

  • Power. An 8× H100 box at full load draws ~10 kW, about 87,600 kWh/year. At enterprise electricity rates ($0.08–$0.12/kWh in the US, $0.20–$0.35/kWh in Europe, $0.04–$0.08/kWh in Texas/Iceland), annual power for one 8× H100 box is typically $7,000–$25,000, and as low as ~$3,500 at $0.04/kWh. For B200 at 14 kW, $10,000–$36,000.
  • Cooling. Air-cooled adds 30–50% to power draw via CRAC efficiency. Liquid-cooled (now standard for B200 and GB200) adds ~10% but requires facility-grade water loops.
  • Colo. Tier-3 colo at $300–$1,000 per kW-month. An 8× H100 box at 10 kW costs $3k–$10k/month in colo. B200 at 14 kW: $4.2k–$14k/month.
  • Networking. InfiniBand HDR/NDR switching adds $30k–$100k per pod. NVLink within the box is included; cross-node bandwidth costs extra.
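The power and colo bullets are straight multiplication, which makes them easy to rerun for your own site. A minimal sketch, assuming the box runs at full load around the clock; the function names are illustrative:

```python
HOURS_PER_YEAR = 24 * 365


def annual_power_cost(kw, price_per_kwh, load_factor=1.0):
    """Yearly electricity cost for a box drawing kw at the given load factor."""
    return kw * HOURS_PER_YEAR * load_factor * price_per_kwh


def monthly_colo_cost(kw, price_per_kw_month):
    """Colo fee when space is billed per provisioned kW."""
    return kw * price_per_kw_month


for rate in (0.04, 0.08, 0.12, 0.30):
    print(f"${rate:.2f}/kWh: 10 kW box costs ${annual_power_cost(10, rate):,.0f}/year")
print(f"colo at $300/kW-month: ${monthly_colo_cost(10, 300):,.0f}/month")
```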

Realistic 3-year TCO for one 8× H100 box owned and racked:

| Line item | Year 1 | Year 2 | Year 3 | 3-year total |
|---|---|---|---|---|
| Hardware (amortised) | $85k | $85k | $85k | $255k |
| Power | $15k | $16k | $17k | $48k |
| Cooling | $5k | $5k | $6k | $16k |
| Colo space | $60k | $63k | $66k | $189k |
| Maintenance + support | $25k | $25k | $25k | $75k |
| Ops engineer (allocated) | $80k | $84k | $88k | $252k |
| Total | $270k | $278k | $287k | $835k |

Versus 3 years of 24×7 rental at $20/hour: $175k × 3 = $525k. Owned-and-racked costs 60% more than rental in this baseline. The break-even shifts in favour of ownership when:

  • You can secure $0.04/kWh electricity (drops power+cooling to ~$10k/year).
  • You're in a market where $20/hour H100 rental isn't available (rental at $30+/hour swings the math).
  • You already have datacenter capacity ($0 colo).
  • You spread the ops engineer across 10+ boxes ($25k/box instead of $80k).
  • You can run the hardware 4+ years (extends amortisation).
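The owned-versus-rented comparison above can be re-run with your own inputs. This is a sketch using the table's figures; the yearly opex values are the table's year totals minus the $85k hardware slice, and the helper names are hypothetical:

```python
HOURS_PER_YEAR = 24 * 365


def rental_total(usd_per_hour, years):
    """Cost of renting 24x7 for the whole period."""
    return usd_per_hour * HOURS_PER_YEAR * years


def owned_total(hardware_usd, yearly_opex_usd):
    """Purchase price plus the sum of each year's operating costs."""
    return hardware_usd + sum(yearly_opex_usd)


def breakeven_hourly_rate(owned_usd, years):
    """Rental rate at which renting costs exactly as much as owning."""
    return owned_usd / (HOURS_PER_YEAR * years)


# Opex per year = table total minus the $85k amortised hardware line.
owned = owned_total(255_000, [185_000, 193_000, 202_000])
rented = rental_total(20, 3)
print(f"owned ${owned:,.0f} vs rented ${rented:,.0f}")
print(f"ownership breaks even at ${breakeven_hourly_rate(owned, 3):.2f}/hour rental")
```

Changing one input at a time (cheaper power, zero colo, a shared ops engineer) shows which of the listed levers moves the break-even rate most for your situation.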

GH200 vs B200 break-even

Grace Hopper (GH200) packs a single H100 with 480 GB of LPDDR5X memory addressable over NVLink-C2C at 900 GB/s. The cost-per-token math vs an 8× B200 HGX system depends on workload.

Long-context inference (>200k context): GH200's 576 GB unified memory pool fits an 8B–13B model in HBM with hundreds of GB of KV cache spillover into LPDDR5X. For 1M-context workloads, GH200 wins on cost-per-token because B200 boxes hit KV cache eviction earlier.

Standard 70B inference at moderate context: B200's HBM bandwidth (8 TB/s per GPU) crushes GH200 (3 TB/s effective when KV spills to LPDDR). B200 wins by 3–5× per token.

405B serving: B200 with 192 GB × 8 = 1536 GB fits 405B at FP8 (~400 GB weights) with room for batching. GH200 needs ≥8 nodes to do the same, paying for NVLink switch cost. B200 dominates.

MoE serving: Mixed. Mixtral-style sparse-MoE benefits from GH200's deep memory; dense expert layers want B200 bandwidth. A 50/50 mix is common in 2026 fleets.

Break-even rule of thumb: under 64k context, B200. Over 256k context, GH200. Mixed workloads, hybrid fleets with routing.

Real per-million-token cost on each platform at production utilisation

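Assuming 60% utilisation (typical for a well-tuned production fleet), Llama-3.3 70B at FP8, 1500-input/300-output query shape. Tokens/sec is the sustained average throughput at that utilisation, so $/M output tokens = hourly cost ÷ (tokens/sec × 3,600) × 1,000,000: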

Platform Hourly cost amortised Tokens/sec $/M output tokens
8× H100 SXM (rental, $20/hr) $20 3,000 $1.85
8× H100 SXM (owned) $32 3,000 $2.96
8× H200 SXM (rental, $30/hr) $30 4,500 $1.85
8× B200 SXM (rental, $60/hr) $60 9,000 $1.85
8× MI300X (rental, $15/hr) $15 2,800 $1.49
1× GH200 (rental, $4/hr) $4 800 $1.39
8× L40S (rental, $6/hr) $6 600 $2.78

The per-million-token cost on rental hardware clusters between roughly $1.40 and $1.85 for datacenter-class accelerators (the L40S at $2.78 is the outlier) because rental pricing equilibrates against the marginal token throughput. The differences open up at the edges: ownership, longer contracts, specialty workloads, and underutilisation.
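The $/M column follows from hourly cost and throughput. A small sketch, assuming the Tokens/sec column is already a sustained average (the table's rows reproduce under that reading); the `utilisation` parameter is an added convenience for peak-throughput figures:

```python
def usd_per_million_tokens(hourly_usd, tokens_per_sec, utilisation=1.0):
    """Cost per million output tokens.

    Use utilisation=1.0 when tokens_per_sec is already a sustained average.
    Use e.g. 0.6 when tokens_per_sec is a peak figure that the fleet only
    delivers 60% of the time.
    """
    tokens_per_hour = tokens_per_sec * utilisation * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


platforms = {
    "8x H100 rental": (20, 3000),
    "8x H100 owned": (32, 3000),
    "8x MI300X rental": (15, 2800),
    "1x GH200 rental": (4, 800),
}
for name, (hourly, tps) in platforms.items():
    print(f"{name}: ${usd_per_million_tokens(hourly, tps):.2f}/M output tokens")
```

If 3,000 tokens/sec were instead a peak number for the H100 box, the same call with `utilisation=0.6` gives about $3.09/M, which is why it matters which throughput figure a benchmark reports.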


Hidden cost vectors

The pricing pages give per-token rates. On top of those rates sit seven cost categories that don't appear on any pricing page.

Egress and data-transfer

Sending images, audio, and large prompts out of your VPC to a hosted API has cost. AWS egress is $0.05–$0.09/GB depending on region; Azure and GCP are similar. For a video-understanding workload processing 100k 30-second clips per day (~10 MB each), egress is 1 TB/day × $0.09 = $90/day = $33k/year. Same workload via Bedrock (no egress) saves the $33k.

Hosted providers in cloud-private-link or VPC-peered setups avoid egress. AWS Bedrock, Azure OpenAI Service, and Google Vertex are inside cloud boundaries. Direct OpenAI/Anthropic via public API is not.

Observability and logging

Tracing every LLM call (prompt, response, tokens, latency, retries) at 1 KB per span × 10M calls/month = 10 GB/month of spans. Datadog at $1.27 per million spans: $13/month. Sounds trivial. The catch: full-fidelity prompt+response logging is 5–50 KB per call, not 1 KB. At 50 KB × 10M calls that is 500 GB/month. Raw ingest at $0.10/GB is still only ~$50/month; the bill grows once that volume is indexed, retained, and searched, which vendors meter separately from ingest. Observability costs run 5–15% of inference spend on AI-first products.

Eval infrastructure

Continuous eval pipelines (RAGAS, LangSmith, BrainTrust, custom) run regression suites on every prompt/model change. A 500-prompt eval × 3 candidate models × 4 metrics × 100 releases/year = 600k extra LLM calls/year. At $0.01/call average: $6k/year. Add eval-judge calls (LLM-as-judge at $0.05/call for grading): another $30k/year. Eval cost = 2–10% of inference spend for teams that take eval seriously.

Guardrail layer

Input/output safety filters (OpenAI Moderation, Anthropic's constitutional classifier, Lakera Guard, Protect AI, NeMo Guardrails) cost per-call. OpenAI Moderation is free; Lakera Guard is ~$0.50–$1/M tokens. For a 100M-call/year product running guardrails on every input + every output: 200M calls × $0.75/M tokens × ~500 tokens/call avg = $75k/year. See production safety guardrails.

Retries and fallbacks

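A 1% failure rate with one retry attempt adds about 1% to cost. With independent failures, allowing three retries barely changes that (~1.01%), because a second retry is only needed when the first one also fails. The 3% case is the persistent failure: the call fails every time and burns all three retries. A misconfigured agent that retries 10× on a tool failure that never clears can add 10% silently. Production stacks should log retry rates as a first-class metric and alert on anomalies.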

Fallback routing — when the primary model fails, route to a secondary — adds cost too: the failed call is paid for, plus the fallback. For 99.9% reliability across two providers (each 99.5%), expected cost is ~1.005× single-provider.
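The retry and fallback overheads can be estimated with a few lines. This sketch assumes every attempt, including failed ones, is billed at full price; the independent-failure and persistent-failure cases are two bounding assumptions, not measurements:

```python
def expected_attempts(failure_rate, max_retries):
    """Expected billed attempts per call when failures are independent.

    Attempt k+1 only happens if the previous k attempts all failed.
    """
    return sum(failure_rate ** k for k in range(max_retries + 1))


def worst_case_attempts(failure_rate, max_retries):
    """Billed attempts per call when a failing call fails on every retry."""
    return 1.0 + failure_rate * max_retries


def fallback_cost_multiplier(primary_failure_rate, fallback_price_ratio=1.0):
    """Cost relative to a single provider when failed calls are re-sent once.

    fallback_price_ratio is the fallback's price divided by the primary's.
    """
    return 1.0 + primary_failure_rate * fallback_price_ratio


print(f"independent, 3 retries: {expected_attempts(0.01, 3):.6f}x")
print(f"persistent, 3 retries:  {worst_case_attempts(0.01, 3):.2f}x")
print(f"fallback at 99.5% primary uptime: {fallback_cost_multiplier(0.005):.3f}x")
```

The gap between the two retry figures is the reason to log retry counts per call rather than only the overall failure rate.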

Vendor lock-in cost

Switching from OpenAI to Anthropic isn't free: prompts that were tuned for GPT-5's quirks may underperform on Sonnet 4.6 by 5–15% on your eval. Re-tuning a production prompt costs engineer time ($300–$1,000) and the eval cycle ($500–$5,000 per major prompt). Multiply by your prompt count to estimate switching cost.

The implied "stay" cost: vendor lock-in means you don't capture future price drops from competitors. If a competing provider cuts prices 30% and you don't switch, you are forgoing that 30% saving as the price of inertia.

Cold-start and idle capacity

For self-hosted: a model loaded into VRAM is consuming GPU rent. A 70B model takes 30–60 seconds to cold-load; aggressive scale-to-zero saves money but adds latency. Production stacks typically keep one replica warm always (paying for it) and scale up for load.

For hosted APIs: rare, but cold-start manifests as occasional first-call latency spikes on niche models. Doesn't show on the bill.


Enterprise procurement: Bedrock vs Azure OpenAI vs Vertex vs direct

For organisations spending $250k+/year on inference, procurement path matters as much as model choice. Five paths, five tradeoffs.

AWS Bedrock

Coverage: Claude (Anthropic), Llama (Meta), Mistral, Cohere, Stability, and AWS's own Amazon Titan and Amazon Nova models at competitive prices. No OpenAI.

Pricing: Generally 0–10% premium over direct provider rates. Llama-3.3 70B at $0.99/M on Bedrock vs $0.88/M on Together. Claude Sonnet 4.6 at $3.00/$15.00 on Bedrock (parity with Anthropic direct).

Advantages: In-VPC inference (no egress), AWS PrivateLink, IAM-based access control, CloudTrail audit logs, committed-use discounts via AWS EDP. Inferentia2 and Trainium2 deployment for cost-sensitive workloads.

Disadvantages: Limited to Bedrock-supported models. No GPT-5/o3. Adapter and fine-tune options vary by model. New Anthropic releases often lag direct by 2–6 weeks.

Azure OpenAI Service

Coverage: OpenAI's full catalogue (GPT-5, o3, GPT-4o, embeddings, DALL-E, Whisper, TTS). Microsoft adds Phi family models on the same platform.

Pricing: Parity with OpenAI direct. Microsoft Enterprise Agreement discounts apply (5–25% typical).

Advantages: SOC 2, HIPAA, FedRAMP coverage. EA leverage for committed-use. Azure VNet integration. Customer data not used for training (default).

Disadvantages: Per-deployment capacity must be requested and approved. Region availability lags OpenAI direct by weeks. New model availability lags by 1–4 weeks. Quota approval can take days.

Google Cloud Vertex AI

Coverage: Gemini family + third-party models (Claude via Anthropic on Vertex, Llama, Mistral). Vertex Model Garden hosts 100+ open-weight models.

Pricing: Gemini parity with AI Studio. Third-party models priced ~5–10% above direct.

Advantages: TPUs for self-managed deployments (TPU v5p and Trillium are competitive with H100 for inference). Strong batch/async tooling via Vertex Pipelines. BigQuery and Spanner integrations.

Disadvantages: Documentation quality lower than AWS/Azure. Quota management opaque. Smaller community.

Direct provider APIs (OpenAI, Anthropic, etc.)

Pricing: Reference rates. No cloud-bundle discount.

Advantages: First access to new models. Best documentation. Direct support relationships. No quota approval friction.

Disadvantages: No VPC isolation by default (private deployments available at enterprise tier). Egress costs from cloud. Separate billing relationship from cloud spend.

Multi-cloud LLM gateways (LiteLLM, OpenRouter, Portkey, Vercel AI Gateway)

Coverage: 100+ models across all providers behind one API.

Pricing: OpenRouter takes ~5% spread; LiteLLM and Portkey are self-hosted (free) or SaaS with subscription. Vercel AI Gateway: per-call surcharge.

Advantages: Single API, easy provider switching, automatic fallback routing, usage analytics. Hedges against any one provider's outage or price hike.

Disadvantages: Extra latency (10–100 ms). Cost overhead. Some providers' newest features (Anthropic's prompt caching, OpenAI's structured outputs) may take time to propagate.

When to pick which

Scenario Pick
Already deep on AWS Bedrock
Already deep on Azure Azure OpenAI
GCP-native; want TPU option Vertex
Need bleeding-edge models day one Direct (OpenAI, Anthropic)
Multi-provider strategy Gateway (LiteLLM/OpenRouter/Portkey)
Strict compliance (HIPAA, FedRAMP) Cloud bundle (Bedrock, Azure, Vertex)
Open-weight at lowest cost Direct to host (Together, Fireworks, DeepInfra)
Specialty silicon Direct (Groq, Cerebras)

The realistic enterprise stack uses 2–3 of the above. Cloud bundle for compliance-bound workflows, direct or gateway for experimentation, specialty silicon for latency-critical paths.


Inference at scale: Inferentia2, Trainium2, and custom silicon pricing

Hyperscaler-designed inference silicon has crossed the threshold from "interesting niche" to "production cost-saver" in 2026. Three flavours matter.

AWS Inferentia2 and Trainium2

Inferentia2 instances (inf2.24xlarge, inf2.48xlarge) host AWS-designed inference chips at 30–50% lower per-token cost than equivalent H100 instances for supported models. The catch: model support is limited. Llama, Mistral, Stable Diffusion are well-supported. Custom architectures need Neuron SDK porting (engineering cost: 1–3 weeks per model family).

Trainium2 (trn2.48xlarge) targets training but supports inference for the same model families at competitive rates. Bedrock uses Trainium2 under the hood for some Amazon Nova model deployments.

Pricing example: Llama-3.3 70B on inf2.48xlarge (12× Inferentia2 chips) at $9/hour reserved (vs ~$22/hour for equivalent p5.4xlarge H100). At 60% utilisation, $0.65/M output tokens vs $1.85/M on H100. About 65% cheaper.

When it pays off: Bedrock-routed Llama or Titan at scale. Mistral and Cohere on Bedrock partly run on Inferentia2 transparently.

Google TPU v5p and Trillium (v6e)

TPU v5p pods are price-competitive with H100 for inference workloads on JAX/XLA-friendly architectures. Trillium (v6e) raises the bar: Google quotes 4.7× peak compute per chip over the v5e generation.

Pricing example: TPU v5p slice (4 chips) at $4.20/hour. Llama-3.3 70B inference at ~2,500 tokens/sec at FP8 = $0.47/M output tokens. About 75% cheaper than H100 rental.

The catch: software stack. PyTorch/XLA works but isn't seamless; vLLM and SGLang have varying TPU support. Best for teams that can invest in JAX or are running Gemini variants natively on Google's stack.

Anthropic on Trainium2

Anthropic's published deal with AWS: significant Claude inference capacity running on Trainium2 clusters. This is invisible to API users (you call Anthropic's API, the metal underneath is mixed). The relevance: it's what allows Anthropic to price Sonnet 4.6 at $3/$15 with healthy margins — Trainium2 is cheaper per FLOP than H100 for Anthropic at their volume.

Microsoft Maia and Azure Cobalt

Microsoft's first-gen Maia 100 AI accelerator entered production for Azure OpenAI in late 2024. Maia 200 (rumoured 2026) extends capacity. Customer-facing pricing on Azure OpenAI doesn't differentiate Maia vs H100 deployments; the savings flow to Microsoft's gross margin.

Meta MTIA, Tesla Dojo, Cerebras, Groq, Tenstorrent

Meta MTIA: internal-only for Meta's own inference. Tesla Dojo: speculative. Cerebras WSE-3 and Groq LPU: customer-facing, priced per-token. Tenstorrent: enterprise sales, custom deployments.

For external customers, only AWS (Inferentia2/Trainium2), GCP (TPU), Groq, and Cerebras offer custom silicon at API or rental tiers in 2026. The 30–60% cost advantages over H100 are real for supported models.

The custom-silicon decision matrix

Workload Best custom-silicon path Savings vs H100
Llama/Mistral on AWS at scale Inferentia2 30–50%
Gemini-family workloads Vertex on TPU 20–40%
Latency-sensitive small-model Groq 20–60% (and 5–10× lower latency)
Massive context (1M+) Cerebras 30–50%
Frontier proprietary (Claude, GPT) Use the API; silicon is hidden Already priced in

Per-million-token unit economics across 15 models

The single most-requested table for 2026 cost planning. Reference prices, plus production throughput numbers from published benchmarks and our own measurements. All entries mid-2026; all dollars per million tokens unless noted.

Model Tier Input $/M Output $/M Tokens/sec (decode) Hardware Best for
GPT-5 Frontier API $5.00 $20.00 ~80 Hidden Hardest general
GPT-4o Mid API $2.50 $10.00 ~140 Hidden Mixed multimodal
GPT-4o mini Cheap API $0.15 $0.60 ~180 Hidden High-volume chat
o3 (medium effort) Reasoning $15.00 $60.00 ~50 Hidden Hard reasoning
Claude Opus 4.x Frontier API $15.00 $75.00 ~70 Hidden Mission-critical
Claude Sonnet 4.6 Mid API $3.00 $15.00 ~110 Hidden Production default
Claude Haiku 4.5 Cheap API $0.80 $4.00 ~200 Hidden Fast chat
Gemini 2.5 Pro Mid API $1.25 $5.00 ~140 TPU v5p Multimodal mid
Gemini 2.5 Flash Cheap API $0.075 $0.30 ~280 TPU v5p Cheapest frontier-class
DeepSeek V3 (API) Cheap API $0.27 $1.10 ~80 (MoE) Hidden Cheap general
DeepSeek R1 (API) Reasoning $0.55 $2.19 ~60 Hidden Cheap reasoning
Llama 3.3 70B (Groq) Open-weight $0.59 $0.79 ~280 LPU Latency-sensitive
Llama 3.3 70B (Together) Open-weight $0.88 $0.88 ~85 H100 Default open-weight
Llama 3.1 8B (DeepInfra) Open-weight $0.06 $0.06 ~250 H100 Cheapest viable
Mixtral 8x22B (Fireworks) Open-weight MoE $1.20 $1.20 ~120 (MoE) H100 Cheap MoE

How to read this table: the per-million-token rate is the headline cost. Tokens-per-second decode determines latency and per-second throughput, which determines self-host break-even. Two models priced at $0.50/M cost the same per token whether they decode at 50 or 250 tokens/sec, but the second one ships responses 5× faster, and UX matters for user-facing products.

Cost per typical query (1500 input + 300 output)

Model Cost per query Daily cost at 100k queries
GPT-5 $0.0135 $1,350
GPT-4o $0.00675 $675
GPT-4o mini $0.000405 $40.50
o3-medium (incl. thinking) $0.265 $26,500
Claude Opus 4.x $0.045 $4,500
Claude Sonnet 4.6 $0.009 $900
Claude Haiku 4.5 $0.0024 $240
Gemini 2.5 Pro $0.003375 $337.50
Gemini 2.5 Flash $0.000203 $20.25
DeepSeek V3 $0.000735 $73.50
DeepSeek R1 (incl. thinking) $0.0108 $1,080
Llama 3.3 70B (Together) $0.00158 $158

The three-order-of-magnitude spread (roughly 1,300×) between Gemini 2.5 Flash and o3-medium for the same query shape is the single biggest cost lever in the stack. The right answer is rarely "always use the cheapest" or "always use the best"; it's routing.
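Per-query cost and the effect of routing are both one-line formulas. A sketch using three rows from the price table; the 70/30 traffic split is a hypothetical mix for illustration, and the model keys are just labels:

```python
# (input $/M tokens, output $/M tokens), copied from the table above.
PRICES = {
    "gpt-5": (5.00, 20.00),
    "claude-sonnet-4.6": (3.00, 15.00),
    "gemini-2.5-flash": (0.075, 0.30),
}


def query_cost(model, input_tokens=1500, output_tokens=300):
    """Dollar cost of one request at list price, ignoring caching and retries."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def blended_cost(traffic_mix):
    """Average cost per query when traffic is split across models.

    traffic_mix maps model name to its share of queries; shares must sum to 1.
    """
    if abs(sum(traffic_mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * query_cost(model) for model, share in traffic_mix.items())


always_frontier = query_cost("gpt-5")
routed = blended_cost({"gemini-2.5-flash": 0.7, "gpt-5": 0.3})
print(f"always GPT-5: ${always_frontier:.5f} per query")
print(f"70/30 routed: ${routed:.5f} per query")
```

Under these assumptions the routed mix costs about a third of sending everything to the frontier model, before accounting for the router's own cost or misrouted queries.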


Multi-tenant cost-allocation patterns

If you run a B2B product with N customers, "what does Customer X cost us this month" is not a question your inference bill answers directly. Six patterns for allocation.

Pattern 1: per-call tagging

Add metadata.customer_id to every LLM call. Most provider SDKs support metadata fields (OpenAI's user, Anthropic's metadata). Aggregate by tag for monthly attribution. Granular but requires consistent tagging discipline.
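Once calls carry a customer tag, attribution is a group-by over the call log. A minimal sketch over plain dictionaries; the record fields (`customer_id`, `model`, `input_tokens`, `output_tokens`) and the short model labels are assumptions about your own logging schema, not any provider's format:

```python
from collections import defaultdict


def cost_by_customer(calls, prices):
    """Sum list-price cost per customer from tagged call records.

    prices maps model name to (input $/M tokens, output $/M tokens).
    """
    totals = defaultdict(float)
    for call in calls:
        input_price, output_price = prices[call["model"]]
        cost = (
            call["input_tokens"] * input_price
            + call["output_tokens"] * output_price
        ) / 1_000_000
        totals[call["customer_id"]] += cost
    return dict(totals)


prices = {"sonnet": (3.00, 15.00), "flash": (0.075, 0.30)}
calls = [
    {"customer_id": "acme", "model": "sonnet", "input_tokens": 1500, "output_tokens": 300},
    {"customer_id": "acme", "model": "flash", "input_tokens": 1500, "output_tokens": 300},
    {"customer_id": "globex", "model": "sonnet", "input_tokens": 3000, "output_tokens": 600},
]
report = cost_by_customer(calls, prices)
for customer, usd in sorted(report.items()):
    print(f"{customer}: ${usd:.6f}")
```

Calls with a missing tag raise a `KeyError` here, which is a reasonable default: untagged spend is the main way this pattern silently drifts.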

Pattern 2: dedicated API keys per tenant

Issue one API key per customer (or per logical tenant). Bills come pre-segmented. Works at small tenant counts (10–500); doesn't scale to 10k tenants. Operational burden: key rotation, revocation, monitoring.

Pattern 3: gateway-level metering

An LLM gateway (LiteLLM, Portkey, Helicone) intercepts every call, records customer_id from headers, and writes to a billing database. Decouples cost tracking from provider-specific features. Industry standard for B2B AI-as-a-feature.

Pattern 4: cost-per-feature accounting

Don't allocate by customer; allocate by product feature. "Summarisation costs $40k/month, chat costs $180k/month, agentic flows cost $90k/month." Useful for engineering prioritisation; less useful for customer-success conversations.

Pattern 5: token budget per tenant

Cap each tenant's monthly token budget. When approaching the cap, throttle or upgrade. Common in productivity tools (Copilot, Cursor): "X messages per day on Pro tier."

Pattern 6: marginal-cost markup pricing

If a customer's usage costs you $30/month, charge them $90 (3× markup). Industry standard for B2B SaaS AI features. Margins compress as inference prices drop; review markups quarterly.

The fairness problem in shared self-hosted

If you self-host one cluster shared across customers, allocation by token count looks fair but ignores the bursty-tenant subsidy effect. A customer that drives 80% of P95 capacity is more expensive to serve than their token count suggests because they force you to over-provision. Production allocation often blends 70% token-share + 30% peak-share.
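The blended allocation can be written as a weighted sum of two shares. A sketch assuming you already have each tenant's share of total tokens and share of peak-hour load as fractions that sum to 1; the tenant names and numbers are invented:

```python
def allocate_cluster_cost(total_cost, token_share, peak_share, token_weight=0.7):
    """Split a shared cluster's cost across tenants.

    token_share and peak_share map tenant -> fraction (each sums to 1).
    token_weight is the portion allocated by tokens; the rest goes by peak.
    """
    peak_weight = 1.0 - token_weight
    return {
        tenant: total_cost
        * (token_weight * token_share[tenant] + peak_weight * peak_share[tenant])
        for tenant in token_share
    }


# A bursty tenant: 20% of tokens but 80% of peak-hour load.
allocation = allocate_cluster_cost(
    total_cost=100_000,
    token_share={"bursty": 0.2, "steady": 0.8},
    peak_share={"bursty": 0.8, "steady": 0.2},
)
for tenant, usd in allocation.items():
    print(f"{tenant}: ${usd:,.0f}")
```

Pure token-share would charge the bursty tenant $20k; the 70/30 blend charges $38k, which is closer to what their peaks cost in provisioned capacity.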

Showback vs chargeback

Showback: report cost-per-customer internally for visibility, but don't bill. Most early-stage B2B AI products. Cheapest to implement.

Chargeback: customers see their consumption and pay accordingly. Requires accurate per-call cost calculation including cache hits, prefix discounts, retry overhead. Operationally heavy but the only honest model for variable AI usage.

Internal teams and intercompany allocation

For internal AI consumers (the marketing team uses LLMs for content; product uses them for in-app features), enterprises usually allocate inference cost through cost centers. Maintaining accurate allocation requires every prompt-issuing service to declare its cost center. Add this as a metadata field on day one; retrofitting after 18 months is painful.


FinOps for LLMs

The FinOps Foundation has formalised cloud cost discipline since 2019. LLM inference inherits most of those practices and adds a few that are LLM-specific.

The five FinOps disciplines applied to LLM spend

1. Inform. Track inference spend in a finance-readable system (Datadog, Vantage, CloudHealth, Apptio). Tag every call with customer, feature, environment. Build a dashboard that answers "what did each model cost this week" in <30 seconds. Update daily.

2. Optimise. Run the cost-optimisation playbook above (smaller model, route by difficulty, cap outputs, cache, quantize). Maintain a cost-optimisation backlog like a security backlog. Score each item by expected savings.

3. Operate. Set guardrails: per-tenant spend caps, anomaly alerts at 1.5× trailing 7-day average, model-mix alerts (if o3 traffic spikes 5×, something's misrouted). Treat cost spikes like incidents — page someone.

4. Plan. Forecast next-quarter inference spend with explicit assumptions. Re-plan when traffic changes by 50% or when major prices change.

5. Govern. Decide which teams can use which models. Production o3 access requires approval. Frontier-tier defaults to off; teams opt in with justification.
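The anomaly rule from the Operate step (alert at 1.5× the trailing 7-day average) is simple enough to run as a daily job. A sketch over a list of daily spend totals; the function name and sample data are illustrative:

```python
def spend_alerts(daily_spend, window=7, threshold=1.5):
    """Return indices of days whose spend exceeds threshold x trailing average.

    The baseline for day i is the mean of the previous `window` days, so the
    first `window` days can never alert.
    """
    alerts = []
    for day in range(window, len(daily_spend)):
        baseline = sum(daily_spend[day - window:day]) / window
        if baseline > 0 and daily_spend[day] > threshold * baseline:
            alerts.append(day)
    return alerts


spend = [100, 100, 100, 100, 100, 100, 100, 120, 200, 110]
flagged = spend_alerts(spend)
print("days to investigate:", flagged)
```

Because the spike itself enters the trailing window the next day, a sustained doubling alerts only for its first few days; pair the rule with an absolute cap if that matters.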

Unit economics dashboards that matter

Three dashboards every AI-first product team should run:

Per-task economics: cost per resolved ticket, cost per generated artifact, cost per user-session. Plotted weekly; trended monthly.

Cost-per-active-user: total inference spend ÷ MAU. Tracks whether you're winning the cost-efficiency battle as your user base grows.

Cost-per-revenue-dollar: inference cost / revenue. Watch this trend toward 5–15% in healthy AI-first SaaS. Above 30%, the product economics are at risk.

Reserved capacity vs on-demand

Hosted APIs: most providers offer committed-use discounts at 6-month or 1-year terms. OpenAI, Anthropic, Google all do 15–30% discounts at $250k+/year commit. Negotiate at renewal; don't sign 3-year terms in a market with 5–10×/year price compression.

Self-hosted: 1-year H100 reserves on AWS p5 are 30–50% off on-demand. Lambda 1-year reserves on H100 are 20–40% off. Risk: you're locked in even if model preferences change.

The right reservation level: cover P50 traffic with reserves; burst above P50 with on-demand. Captures most of the discount with most of the flexibility.
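The P50 rule can be sanity-checked against your own demand curve by pricing a few reservation levels. A sketch assuming hourly demand in whole capacity units, reserved units paid for every hour whether used or not, and a flat discount; all numbers are invented:

```python
import statistics


def capacity_cost(hourly_demand, reserved_units, on_demand_rate, reserved_discount):
    """Total cost over the period for a given reservation level."""
    reserved_rate = on_demand_rate * (1.0 - reserved_discount)
    total = 0.0
    for demand in hourly_demand:
        burst_units = max(0, demand - reserved_units)
        # Reserved capacity is billed even in hours when it sits idle.
        total += reserved_units * reserved_rate + burst_units * on_demand_rate
    return total


demand = [2, 2, 4, 4, 6, 8]          # units needed in each hour
p50 = statistics.median(demand)       # 4.0
for level in (0, p50, max(demand)):
    cost = capacity_cost(demand, level, on_demand_rate=20, reserved_discount=0.4)
    print(f"reserve {level:g} units: ${cost:,.0f}")
```

In this toy curve, reserving at P50 beats both extremes. Whether that holds for your traffic depends on the discount depth and how spiky demand is, so sweep the level rather than assuming it.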

Treating inference cost as a product KPI

The companies that optimise inference cost best treat it as a top-five product metric, not a finance line item. Cost-per-resolved-task moves on every product change, every model upgrade, every prompt edit. Tracking it weekly catches regressions early; reviewing it monthly informs roadmap prioritisation.

The teams that fail at this treat inference cost as "the finance team's problem." They discover the issue when the CFO asks why infra spend doubled this quarter. By then it's an emergency, not an optimisation. Move the metric upstream.


The bottom line

The problem is input/output asymmetry compounded by reasoning amplification and traffic shape variance — the same model can cost 20× more or less per resolved task depending on choices you make at design time, not after launch. The solution is treating inference as a unit-economics question first and a quality question second: route by query difficulty, cap outputs, enable prefix caching, and pick the cheapest tier that meets your quality bar. The biggest lever is model routing — sending the 70% of easy queries to a model 10–100× cheaper than your frontier tier typically beats every other optimisation combined.

  • Forecast cost at product design with the formula users × messages × tokens × price, then multiply by 2 for safety and retries.
  • The hosted-vs-self-hosted crossover sits at $5–10M/year inference spend; below that, the operational tax dominates the hardware savings.
  • Reasoning models are 10–50× the per-request cost of standard chat — gate them behind an explicit router, not a default.
  • Batch APIs give an unconditional 50% discount for anything that doesn't need a user waiting; most products waste this.
  • Audit retry policies before auditing anything else; runaway loops cause more cost incidents than gradual growth.

For the privacy side of the same decisions, see AI chatbot privacy. For the safety controls that often live in the same gateway as your cost guardrails, see production safety guardrails.


FAQ

Should I use OpenAI, Anthropic, or open-weight? For most products in 2026, mixed: cheaper queries on Gemini Flash / GPT-4o-mini / Claude Haiku / open-weight via Together, harder queries on the frontier tier. Pure-OpenAI or pure-Anthropic is fine for simplicity; pure-open-weight needs commitment to operational maturity.

When does self-hosting actually pay off? $5–10M/year inference spend is the rough threshold. Below: hosted APIs win on simplicity. Above: self-hosting starts to look attractive if you have steady traffic and operational capacity.

Are reasoning models worth the cost? For high-value, hard tasks: yes. For everyday chat: no. Route by query type.

How do I forecast cost during product design? Estimate: average tokens per request × requests per user per month × users × token price. Multiply by 2 for safety. Compare to revenue per user. If revenue covers forecast inference cost by less than 3:1, the product math is shaky.
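The forecast formula is easy to keep as a small function next to the product spec. A sketch using the Sonnet 4.6 list prices from this guide; the user count, request volume, and $20/user revenue are hypothetical inputs:

```python
def monthly_inference_cost(
    users,
    requests_per_user_per_month,
    input_tokens,
    output_tokens,
    input_price_per_m,
    output_price_per_m,
    safety_factor=2.0,
):
    """Forecast monthly spend, padded by safety_factor for retries and drift."""
    per_request = (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000
    return users * requests_per_user_per_month * per_request * safety_factor


def revenue_to_cost_ratio(revenue_per_user_per_month, users, monthly_cost):
    """How many dollars of revenue each dollar of inference has to support."""
    return revenue_per_user_per_month * users / monthly_cost


cost = monthly_inference_cost(
    users=10_000,
    requests_per_user_per_month=200,
    input_tokens=1500,
    output_tokens=300,
    input_price_per_m=3.00,
    output_price_per_m=15.00,
)
ratio = revenue_to_cost_ratio(20.0, 10_000, cost)
print(f"forecast: ${cost:,.0f}/month, revenue-to-cost {ratio:.1f}:1")
```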

What's the cheapest way to test ideas before paying? Use the free tiers of major providers (GPT-4o-mini free tier, Claude free tier, Gemini free tier) for prototypes. For development, $20/month chat subscriptions cover most usage. Pay-as-you-go API tiers have no minimum.

How do hosted-API providers make money at these prices? Volume + utilization. They run at 60–80% sustained utilization across a fleet, well above what any single customer can achieve. Their per-token cost is below what they charge; the margin is the gap. As open-weight models commoditise the base, providers compete on speed, latency, and specialty features.

Is Gemini Flash really that cheap? Yes. $0.075/M input is 1/200 of Claude Opus and about 1/67 of GPT-5. For high-volume chat that doesn't need frontier reasoning, it's the price-performance leader in 2026.

Are caches secure for sensitive data? Prefix caching reuses cached KV across requests. For multi-tenant deployments, ensure cache is keyed by tenant + adapter to prevent cross-tenant leakage. Most production stacks handle this correctly.

How do I budget for spikes? Set up cost alerts at 1.5× normal daily. Auto-scaling on hosted APIs scales transparently but you pay for it; self-hosted needs proactive capacity. Reserve a per-day spend cap to prevent runaway bills.

Should I use AWS Bedrock, Azure OpenAI, or direct provider APIs? Direct providers are usually 5–10% cheaper. Cloud-hosted (Bedrock, Azure OpenAI) is worth it if you're already deep in a cloud ecosystem, have compliance requirements that need cloud-VPC residency, or have committed cloud spend you need to use.

Is the Groq LPU / Cerebras wafer-scale stuff cost-effective? Per-token, often yes. Per-token at high speed is the differentiator. If your workload values latency (real-time agents, voice), they're cheaper than achieving the same latency on H100s.

What's the right ratio of LLM cost to other infrastructure cost? For an AI-first product, LLM inference is typically 30–60% of all infrastructure cost. Higher than that and you have a margin problem. Lower than that and you're either using cheaper models than you should or your traffic isn't fully AI-dependent.

How do I reduce token usage in agents? Compaction (summarise old turns), sliding window history (keep last N turns), structured memory (store extracted facts, not raw turns), and parallel tool calls (issue many in one turn rather than serially). Can reduce per-task cost by 3–10×.

Are batch APIs worth using? Yes. OpenAI, Anthropic, and Google offer batch APIs at 50% discount with 24-hour latency. For non-realtime work (offline processing, dataset generation, eval), the discount is free money.

Cost trajectory: will this all keep getting cheaper? Yes, at 5–10× per year compounding through 2026. The trend continues; budget conservatively for 1-year ahead, aggressively for 3-year ahead.

How do I estimate input vs output token mix before launch? Run a representative sample (100–1000 requests) through your candidate model and log token counts. Most chat workloads land at 3–10× more input than output (history + context dominates). Reasoning workloads invert that — output (including thinking tokens) is 5–10× input. Get this ratio right before forecasting; getting input/output ratio wrong by 2× changes monthly cost by 30–50%.

Should I cache embeddings or regenerate them? Cache. Embeddings are deterministic for a given model version and input. Store them in a vector DB or KV store keyed by content hash. Regenerating costs $0.0001/embedding × tens of millions of chunks = thousands of dollars. The only time to regenerate is on embedding-model upgrade.

What's prefix caching and how much does it actually save? Prefix caching reuses cached KV state for shared prompt prefixes (a 2000-token system prompt repeated across millions of calls). Anthropic offers it as cache_control with up to 90% discount on cached tokens; OpenAI auto-caches similar prefixes with 50% discount; self-hosted vLLM caches automatically. Real savings: 20–40% of total input cost on stacks with stable system prompts. Free money — turn it on.

How does Mixture-of-Experts pricing differ from dense models? MoE models (Mixtral, DeepSeek V3, Gemini's MoE architectures) have many parameters total but activate only a fraction per token. DeepSeek V3 is 671B total params, 37B active, so it is priced closer to a 37B dense model than a 671B one. The activation ratio determines per-token compute and price. See MoE serving for the implementation details.

What about embedding model costs separately? Embeddings are 10–100× cheaper than chat. OpenAI's text-embedding-3-small is $0.02/M tokens, Cohere's embed-v3 is $0.10/M, Voyage AI is $0.12/M. For a 1M-document RAG corpus at 500 tokens average per chunk, embedding is $10–$60 one-time. Re-embedding on model upgrades or chunking changes is the recurring cost.

Can I negotiate enterprise discounts realistically? Yes. At $20k+/month spend, every major provider has account managers and will discuss committed-use pricing. Typical 2026 discounts: 15–25% at $50k/month, 30–40% at $250k/month, 50%+ at $1M+/month. Negotiate the floor price; the ceiling rarely matters because you wouldn't hit it.

Is there real cost difference between API key types (consumer Pro vs API)? Yes. ChatGPT Plus / Claude Pro / Gemini Advanced ($20–$30/month) are unlimited-ish chat for one user; they're the cheapest path for individual usage. API access bills per token and has no monthly cap; cheaper than Pro at low volume, more expensive at very high single-user volume. Programmatic use requires API; chat use rarely.

How do I budget for spikes vs steady state separately? Two line items: a steady-state floor based on your P50 daily load, and a spike reserve at P95–P99. Hosted APIs handle this automatically (you pay per call). Self-hosted needs explicit capacity planning. A common pattern: own enough hardware for steady state, burst to hosted APIs for spikes — works if your model is API-compatible.

Does multimodal cost more on output too, or just input? Mostly input. Text-output from vision/audio queries is priced like text output. Some models charge an "image generation" output rate separately ($0.04/image for DALL-E 3 at 1024×1024). For voice output (TTS), OpenAI charges $15/M characters for tts-1, $30/M for HD. See multimodal serving for cost structure across modalities.


Extended FAQ

Why does o3-high cost so much more than o3-medium for the same prompt? o3's reasoning_effort parameter scales the model's internal thinking budget. Low: ~500 thinking tokens. Medium: ~2,000–3,500. High: ~8,000–20,000. Each thinking token bills as output at $60/M. A high-effort response can emit 30× more output tokens than low-effort for marginal accuracy gains on most tasks. Test the effort tier on your eval before defaulting to high in production.

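Is Anthropic's 1-hour cache TTL worth the 2× write cost? Math: write cost is 2× regular input rate (instead of 1.25×). Cache reads are still ~10% of input. The 1-hour tier therefore costs an extra 0.75× of the input rate per write. Each expiry on the 5-minute tier forces a 1.25× re-write where a 0.1× read would have sufficed, so the longer TTL pays for itself once it prevents a single expiry within the hour. That happens when gaps between calls regularly exceed five minutes; a prefix that is hit continuously keeps the 5-minute entry refreshed and stays cheaper there. For agents with stable prompts and pauses between calls, 1-hour cache is almost always cheaper.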

How do I prevent runaway costs from a single buggy customer? Set per-tenant spend caps in your LLM gateway. LiteLLM, Portkey, and Helicone all support this natively. Hard-fail at 1.5× the customer's daily plan limit. Alert at 1.2×. Page someone if a single tenant exceeds $1k/day or 10× their 7-day trailing average.

What's the cheapest path for embeddings at billion-scale? DeepInfra hosts bge-large at $0.005/M tokens (1/20 of OpenAI's $0.10). Self-hosting bge-large or e5-mistral on a single L40S ($6/hour) at 5,000 embeddings/sec handles 432M embeddings/day at $144/day. For >1B/day, self-host wins; below 100M/day, hosted is cheaper after you factor ops.

Should I tune temperature to save cost? No direct cost effect — temperature affects sampling, not token count. Indirect: lower temperature produces more deterministic outputs that cache better in application-level response caching. If you cache LLM responses by prompt-hash, deterministic settings boost hit rates and reduce calls.

Are reasoning models worth it for code review specifically? Mostly yes for hard reviews (architecture, security, race conditions); mostly no for style/lint review where deterministic linters dominate. Production pattern: run a cheap model first to triage, then escalate the 5–15% of complex reviews to a reasoning model. Cost-per-review drops 60–80% vs always-reasoning.

What's the right cost-per-task target for an AI-first B2B product? Aim for inference cost ≤10% of customer ACV. A $5,000/year ACV customer can support up to $500/year in inference cost = ~$1.40/day. For chat-style consumption (50 messages/day), that's $0.028/message, which is Sonnet 4.6 territory. For agent workloads (5 LLM calls/task, 10 tasks/day), that's $0.14/task or $0.028/call, which means routing to cheap models for most steps.
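
The same budget arithmetic as a short sketch, assuming the 10% share and the usage figures from the paragraph above. The function name is illustrative only.

```python
def daily_inference_budget(acv_usd: float, share: float = 0.10) -> float:
    """Daily inference budget in USD for one customer at a given share of ACV."""
    return acv_usd * share / 365


daily = daily_inference_budget(5_000)

per_message = daily / 50   # chat: 50 messages/day
per_task = daily / 10      # agents: 10 tasks/day
per_call = per_task / 5    # agents: 5 LLM calls per task

print(f"daily ${daily:.2f}, message ${per_message:.3f}, "
      f"task ${per_task:.3f}, call ${per_call:.3f}")
```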

How mature is OpenTelemetry GenAI semantic convention adoption in 2026? Mature. Major SDKs (Anthropic, OpenAI, Vertex, Bedrock) emit OTel spans natively or via lightweight wrappers (Langfuse, Helicone, LangSmith). Standard attributes: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.cached_input_tokens. Standardised cost analytics across providers became trivially possible in 2025; if your stack doesn't emit OTel-shaped spans yet, that's the highest-leverage observability investment.

Can I get discounts on Anthropic / OpenAI by buying through a reseller? Rarely worth it. Direct enterprise discounts from the providers themselves are typically deeper than reseller margins. Resellers add value via consolidated billing across many SaaS products, FX management, or in-country compliance. The pricing arbitrage opportunity has narrowed since 2023.

What's the per-call overhead of a multi-cloud LLM gateway? LiteLLM (self-hosted): ~5–15 ms added latency, ~$200/month at 10M calls for hosting. Portkey (managed): 20–50 ms added latency, $0.01–$0.05 per 1k calls. OpenRouter: 50–150 ms added latency, 5% markup. The overhead is justified by single-API simplicity and automatic fallback routing; quantify against your latency budget.

How do I model the cost of fine-tuning vs continuing to prompt? Fine-tuning amortises one-time training cost over N future calls. Break-even: (training cost) / (per-call savings) = N calls before fine-tune pays off. Example: $5k LoRA training on Llama 3.1 8B saves $0.005/call vs prompting GPT-4o. Break-even at 1M calls. If you expect 5M+ calls in the next 12 months, fine-tune. Otherwise, stay on prompting.
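
The break-even rule can be sketched in a few lines. The function names are illustrative, and the inputs are the example figures from the paragraph above.

```python
def breakeven_calls(training_cost_usd: float, savings_per_call_usd: float) -> float:
    """Number of calls after which a one-time training cost is repaid."""
    if savings_per_call_usd <= 0:
        raise ValueError("fine-tuning never pays off without a per-call saving")
    return training_cost_usd / savings_per_call_usd


def should_fine_tune(expected_calls: float, training_cost_usd: float,
                     savings_per_call_usd: float) -> bool:
    """True when forecast volume exceeds the break-even call count."""
    return expected_calls > breakeven_calls(training_cost_usd, savings_per_call_usd)


print(f"{breakeven_calls(5_000, 0.005):,.0f} calls to break even")
print(should_fine_tune(5_000_000, 5_000, 0.005))
```

This ignores ongoing hosting cost for the fine-tuned model and any retraining; fold those into the training cost or the per-call saving if they are material.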

Is there a price war coming for reasoning models? Already in progress. DeepSeek R1 at $0.55/$2.19 is roughly 3.6× cheaper than o3 at $2/$8 list, for comparable performance on most benchmarks. Anthropic's pricing on extended-thinking Sonnet 4.6 is competitive with mid-tier reasoning. Expected trajectory: reasoning prices drop 5–10× by mid-2027 as more open-weight reasoners ship. Don't sign multi-year reasoning model commitments.

What's the worst inference cost antipattern you see in production? Tied. (a) Agents that don't compact history, paying for the entire conversation transcript every turn. (b) RAG systems that retrieve 20 chunks of 1k tokens each and stuff them all into the prompt instead of reranking down to 3–5. Each can be a 3–10× cost multiplier vs the disciplined version.

How do batch APIs affect latency-sensitive workloads? Batch APIs (50% off list) complete asynchronously within 24 hours. Not suitable for chat or real-time agents. Best for: nightly eval runs, bulk content generation, embedding generation for ingestion. Architect the data pipeline to use batch for everything that can tolerate the SLA.

Should I commit to Provisioned Throughput Units (PTU) on Azure OpenAI? Only if your traffic is consistent enough to keep PTUs busy >60% of the time. PTUs lock in fixed capacity at fixed monthly rate; below ~60% utilisation, on-demand per-token billing is cheaper. PTU pricing favours teams with predictable steady-state load.

What's the cheapest way to host Llama 3.3 70B in 2026? Self-host on an H100 PCIe pair (~$15k capex, $20k/year fully loaded) if you have ops capacity and ≥10B tokens/year traffic. Otherwise: Groq Cloud ($0.59 input / $0.79 output per million) for high tokens-per-second; Together AI / Fireworks at similar rates with lower latency variance.

Are there free-tier API options worth using in production? Sparingly. Google Gemini 2.5 Flash has a generous free tier; OpenAI and Anthropic have small free credits but no ongoing free tier. Free tiers are good for prototyping and low-traffic personal projects, not production — rate limits and ToS make them inappropriate for paying customers.

What's the typical gross margin a hosted-API provider runs at? Estimated 60–80% on standard chat models, less on reasoning models (higher compute cost per request). Anthropic and OpenAI don't publish margins; estimates derive from token economics + GPU costs. Self-hosting captures most of that margin minus operational overhead — typically 30–60% net savings at scale.

How do I split costs across products in a multi-product company? Per-call tagging at the gateway: every LLM call carries a product_id, feature_id, user_id. Helicone, Langfuse, and Datadog all support this. Roll up by tag for chargeback / showback. Decide upfront whether to only report each product's cost to its team (showback) or actually bill it to the product's P&L (chargeback).

What's the right size of a finance/FinOps team for LLM cost management? At $1–5M annual LLM spend: 0.25 FTE of a product engineer's time on weekly review. At $5–50M: a dedicated FinOps analyst. At $50M+: a small team (2–4 people) including a FinOps lead and observability engineers.

Does using cheaper models lower my per-customer cost or my margins? Both, depending on your pricing. If your contract is fixed-price per customer, cheaper models improve margins directly. If you charge per-call or per-token, cheaper models let you offer lower prices to win customers. Most production teams split: pass some savings to customers, keep some as margin improvement.

How does prompt caching interact with structured outputs? Cleanly. Caching applies to input tokens; structured outputs apply to output tokens. The two are independent. A common pattern: stable system prompt with output schema is fully cacheable; only the user-input portion varies per call. Cache hit rate stays high; structured output enforcement preserves the response format.

What's the API cost of OpenAI Realtime per minute of voice? Approximately $0.06–0.15 per minute of audio input (depending on voice and tier) and $0.24–$0.40 per minute of audio output. Text portions priced at GPT-4o rates. For a typical 1-minute voice exchange (30 sec user, 30 sec assistant): $0.20–0.50 per minute. Verify on current pricing — Realtime pricing has shifted multiple times.

Should I use Anthropic models via AWS Bedrock or the direct Anthropic API? Direct Anthropic API: lowest cost, fastest feature access, simpler billing. Bedrock: AWS billing integration, IAM-based auth, multi-region failover, regional data hosting. Choose direct for cost; Bedrock for AWS-native enterprise contexts.

What's the cost difference between SOC-2-compliant and standard endpoints? Provider-dependent. OpenAI Enterprise: typically no premium for SOC-2 once enterprise contract is in place. Anthropic: SOC-2 included with all paid plans. Bedrock / Azure OpenAI: SOC-2 inherited from the cloud provider's compliance.

How often do hosted-API prices change? Frontier model prices drop 50–80% over the 12 months after launch as the provider optimises. New model launches often re-anchor prices. Multi-year API contracts at fixed pricing are usually a bad deal for the customer; rolling 6-month contracts capture price drops.

What's the impact of speculative decoding on per-token cost? Speculative decoding accelerates generation by 1.5–3× at a small extra compute cost. Per-token cost drops roughly in line. Most production serving stacks (vLLM, SGLang, TensorRT-LLM) support it; benefit is biggest for long-output decode workloads.

How do agentic workflows change cost economics? Agent runs make 5–50 LLM calls per task. A "single user query" can balloon to $0.50–$5 in API cost. Budget per-task, not per-call. Cap step counts; cache intermediate state; route step-by-step (e.g., simple subtasks on cheap models, hard subtasks on expensive ones).

What's the cost of running a 24/7 voice agent with OpenAI Realtime? At $0.30/minute of conversation and 24×60=1440 minutes/day, a single concurrent voice agent costs ~$432/day = $158k/year. For multi-tenant SaaS, multiplex carefully or use cheaper cascaded pipelines ($0.10/minute) for cost-sensitive use cases.

Are there cost optimisations specific to RAG workloads? Yes. (1) Rerank retrieved chunks down to 3–5; don't stuff 20. (2) Cache the system prompt + few-shot context across calls. (3) Embed at the cheaper provider (DeepInfra, Voyage); generate at the quality provider. (4) Use semantic caching for near-duplicate queries — saves 20–40% on repeat workloads.

How do I forecast LLM spend 12 months out? Anchor to traffic projections × current cost-per-call × expected mix. Assume price compression: frontier prices drop 30–50% on 12-month horizon. Reserve 25% for unexpected workload growth and 10% for hidden costs. Revise quarterly.

What's the typical breakdown of LLM spend by feature in a SaaS? Wildly variable. Common pattern at $5M/year spend: ~40% chat / conversational features, ~25% summarisation and content generation, ~20% RAG and search, ~10% agent workflows, ~5% embeddings. Reasoning-model spend dominates agent + research features once enabled.

Do "free" models on hosted platforms actually cost nothing? No. Most free tiers train on your data. Even when they don't, free tiers have rate limits that constrain production use. The "free" path is usable for prototyping and individual exploration; production economics never use free tiers at scale.

How fast are prices actually dropping? At the cheap-tier frontier (Gemini Flash, GPT-4o-mini, Haiku, DeepSeek V3), input prices dropped 8–12× from 2023 to 2026. Frontier prices (Opus, o3) dropped 2–4×. Open-weight on third-party hosts dropped 6–10×. Forward-looking budget: assume 3× drop in cheap-tier per year, 2× drop in frontier per year, through 2027. Reserved-capacity commitments beyond 1 year are rarely worth it.


Glossary

  • Active parameters — for MoE, the parameters that activate per token (vs total).
  • Batch API — non-realtime API tier offering ~50% discount for 24-hour-latency batch processing.
  • Capex / Opex — capital expense (buying hardware) vs operating expense (renting it).
  • Continuous batching — serving technique that dynamically merges requests of different lengths. The default in 2026.
  • Hosted API — paying a provider per token rather than running your own GPUs.
  • Per-million-token price — the standard pricing unit for hosted LLMs.
  • Prefill — the compute-bound first phase of inference; processes the input prompt.
  • Decode — the bandwidth-bound second phase of inference; generates output tokens one at a time.
  • Self-hosting — running models on your own GPUs.
  • TCO — total cost of ownership; includes hardware, electricity, ops, depreciation.
  • Utilization — fraction of hardware time spent doing useful work. Determines effective cost per token.


Per-provider 2026 pricing tear-down: every model, every tier

A model-by-model run-down of mid-2026 pricing across the major providers. Verify against the live pricing pages before committing.

OpenAI

  • GPT-5 (standard tier) — $5/M input, $15/M output. Cached input $0.50/M (10% of input price). 400k context.
  • GPT-5 long-context (>400k input) — $10/M input, $30/M output. Cached input $1/M.
  • GPT-5-mini — $0.50/M input, $1.50/M output. Cached input $0.05/M.
  • GPT-5-nano — $0.10/M input, $0.40/M output.
  • o3 reasoning — pricing on a per-thinking-token basis; about $2/M input, $8/M output, with thinking tokens billed at output rate. Cost amplification 10–100× over a chat call on hard problems.
  • o4-mini — cheaper reasoning model; about $1.10/M input, $4.40/M output.
  • gpt-4o-realtime / Realtime API — separate audio token meters: ~$40/M audio input, ~$80/M audio output, with text portions at GPT-4o rates. Audio cached input discounted.
  • Batch API — 50% off input and output, 24-hour completion SLA.

Anthropic

  • Claude Opus 4.x — $15/M input, $75/M output. Prompt caching: 5-minute cache write $18.75/M, 5-minute cache hit $1.50/M (10%); 1-hour cache write $30/M, hit $1.50/M.
  • Claude Sonnet 4.6 — $3/M input, $15/M output. Cache write $3.75/M (5m) or $6/M (1h), hit $0.30/M.
  • Claude Haiku 4.5 — $1/M input, $5/M output. Cache write $1.25/M (5m) or $2/M (1h), hit $0.10/M.
  • Batch API — 50% off.
  • Extended thinking — billed as output tokens; the configurable budget_tokens setting in the thinking parameter caps spend.

Google Gemini

  • Gemini 2.5 Pro — $1.25/M input (≤200k context), $5/M output. Long context (>200k input): $2.50/M input, $10/M output. Cached input ~$0.31/M.
  • Gemini 2.5 Flash — $0.075/M input, $0.30/M output. Cached input ~$0.019/M.
  • Gemini 2.5 Flash-Lite — even cheaper; $0.038/M input, $0.15/M output (approximate).
  • Gemini Deep Think — extended reasoning tier; verify current pricing.
  • Live API (audio) — separate audio token meters.
  • Batch API — 50% off.

Mistral

  • Mistral Large 2 — $2/M input, $6/M output.
  • Codestral — $0.20/M input, $0.60/M output.
  • Mistral Saba (regional) — pricing varies by region.
  • Smaller open-weight — Mistral 7B, Mixtral 8x7B, Mixtral 8x22B via Mistral API and 3rd-party hosts.

Cohere

  • Command-R+ — $2.50/M input, $10/M output (latest version).
  • Command-R — $0.15/M input, $0.60/M output.
  • Rerank — usage-based pricing; integrates with RAG pipelines.

xAI Grok

  • Grok-3 — $3/M input, $15/M output (approximate).
  • Grok-3-mini — $0.30/M input, $0.50/M output.

DeepSeek

  • DeepSeek-V3.5 — $0.27/M input, $1.10/M output (official API). Off-peak (UTC 16:30–00:30) pricing tier ~50% discount.
  • DeepSeek-R1 — reasoning model; $0.55/M input, $2.19/M output.
  • Western-hosted alternatives: Together AI, Fireworks, DeepInfra — comparable pricing with regional data hosting.

Open-weight model hosting

  • Together AI — Llama 3.3 70B at ~$0.88/M input/output (combined). Llama 3.1 405B at ~$3.50/M.
  • Fireworks — Llama models at similar pricing; quantised variants at lower rates.
  • Groq — Llama 3.3 70B at $0.59/M input, $0.79/M output. Strong on tokens/sec.
  • AWS Bedrock — Llama, Claude, Mistral hosting with AWS pricing layer. Provisioned throughput option for committed capacity.
  • Azure OpenAI — GPT models with Azure-specific pricing; PTU (Provisioned Throughput Units) for committed capacity.
  • NVIDIA NIM (NVIDIA AI Foundation) — Llama Nemotron and other models on NVIDIA-hosted endpoints.

Quick comparison: 1M tokens of mixed traffic

For a mix of 700k input + 300k output (roughly typical chat):

Model | Input cost (700k tokens) | Output cost (300k tokens) | Total per 1M mixed tokens
GPT-5 standard | $3.50 | $4.50 | $8.00
GPT-5-mini | $0.35 | $0.45 | $0.80
Claude Opus 4.x | $10.50 | $22.50 | $33.00
Claude Sonnet 4.6 | $2.10 | $4.50 | $6.60
Claude Haiku 4.5 | $0.70 | $1.50 | $2.20
Gemini 2.5 Pro | $0.875 | $1.50 | $2.375
Gemini 2.5 Flash | $0.0525 | $0.09 | $0.143
DeepSeek V3.5 | $0.189 | $0.33 | $0.519
Llama 3.3 70B (Groq) | $0.413 | $0.237 | $0.65

Cheapest overall: Gemini 2.5 Flash. Cheapest at premium quality: Gemini 2.5 Pro, then Sonnet 4.6. Most expensive: Claude Opus, justified for the hardest tasks.
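
To rerun the comparison with your own traffic mix, a sketch like the following is enough. The prices are copied from the list prices in this section and should be refreshed from the live pricing pages; the `mixed_cost` helper is an illustrative name.

```python
# USD per million tokens as (input, output), taken from the list prices above.
PRICES = {
    "GPT-5 standard": (5.00, 15.00),
    "GPT-5-mini": (0.50, 1.50),
    "Claude Opus 4.x": (15.00, 75.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "Gemini 2.5 Pro": (1.25, 5.00),
    "Gemini 2.5 Flash": (0.075, 0.30),
    "DeepSeek V3.5": (0.27, 1.10),
    "Llama 3.3 70B (Groq)": (0.59, 0.79),
}


def mixed_cost(model: str, input_tokens: int = 700_000,
               output_tokens: int = 300_000) -> float:
    """USD cost of a given input/output token mix at list price."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Print models from cheapest to most expensive for the default 700k/300k mix.
for model in sorted(PRICES, key=mixed_cost):
    print(f"{model:<22} ${mixed_cost(model):.3f}")
```

Changing the default mix matters: output-heavy workloads widen the gap between models whose output price is 5× their input price and those closer to 1.3×.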


Reasoning-token deep math: when 25× hides in plain sight

Reasoning models charge for hidden thinking tokens at the output rate. The math is unforgiving.

Example: a hard math problem with o3

  • Visible response: 150 tokens.
  • Hidden thinking: 4,500 tokens.
  • Total output tokens billed: 4,650.
  • Per-request output cost at $8/M: $0.037.
  • Same problem on GPT-5 chat: ~300 tokens output × $15/M = $0.0045.
  • Reasoning premium: ~8.2× for this single request.

Example: deep research session with extended thinking

  • Visible response: 2,000 tokens.
  • Hidden thinking: 30,000 tokens.
  • Output billed: 32,000 tokens.
  • Cost at Claude Opus thinking-output rate ($75/M): $2.40.
  • Same task as a chat session (~3000 output tokens): $0.225.
  • Reasoning premium: ~11× for this session.
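
Both examples reduce to the same calculation: bill visible plus hidden tokens at the output rate, then divide by the chat-model cost. A minimal sketch using the token counts and prices from the two examples above (the helper name is illustrative):

```python
def output_cost(visible_tokens: int, thinking_tokens: int, price_per_m: float) -> float:
    """USD output cost when hidden thinking tokens bill at the output rate."""
    return (visible_tokens + thinking_tokens) * price_per_m / 1_000_000


# Hard math problem: o3 with hidden thinking vs a GPT-5 chat answer.
o3_cost = output_cost(150, 4_500, 8.0)
gpt5_chat_cost = output_cost(300, 0, 15.0)

# Deep research session: extended thinking vs plain chat at the Opus output rate.
opus_thinking_cost = output_cost(2_000, 30_000, 75.0)
opus_chat_cost = output_cost(3_000, 0, 75.0)

print(f"o3 premium:   {o3_cost / gpt5_chat_cost:.1f}x")
print(f"Opus premium: {opus_thinking_cost / opus_chat_cost:.1f}x")
```

Input tokens are left out on purpose: they are the same on both sides of each comparison, so they dilute the premium slightly but do not change its shape.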

Budget guardrails

Every reasoning-model API exposes a budget parameter:

  • OpenAI reasoning_effort: low | medium | high. Low caps at ~512–2k thinking tokens; high allows 10k+.
  • Anthropic budget_tokens (inside the thinking parameter), in tokens. Set explicitly; default is generous.
  • Google — Deep Think exposes similar budgeting in the Vertex AI configuration.
  • DeepSeek — R1 exposes a max_thinking_tokens analogue.

Production wisdom: route by query type. Easy queries → chat model. Hard queries → reasoning model with effort=medium. Very hard queries (mathematical proofs, complex planning) → reasoning model with effort=high. Without routing, cost balloons unpredictably.

Quality-cost frontier for reasoning

Reasoning effort | Typical thinking tokens | Cost amplification | Quality gain vs chat
None (chat model) | 0 | baseline | n/a
Low | 500–2k | 3–10× | +5–15% on hard tasks
Medium | 2k–8k | 10–30× | +10–25% on hard tasks
High | 8k–40k | 30–100× | +15–35% on hard tasks

Curves are steep at the top: low-effort reasoning captures most of the benefit at a fraction of the cost. Reserve high-effort for tasks where the marginal quality matters and budget allows.


Prompt caching deep dive: OpenAI vs Anthropic vs Google

Prompt caching is the single highest-leverage cost optimisation when you reuse long prompts. The three providers implement it differently.

OpenAI prompt caching

  • Trigger: automatic for prompts where the prefix matches a previous request from the same organisation within ~5–10 minutes.
  • Granularity: prompts must be at least 1,024 tokens to be cached; beyond that, the cached prefix grows in 128-token increments.
  • Discount: cached input tokens cost ~10% of the standard rate.
  • TTL: ~5–10 minutes, refreshed on reuse. No persistent cache.
  • Configuration: zero — automatic.
  • Limitations: hit/miss is only visible after the fact, via the cached-token count in the response usage; can't pin or pre-warm.

Anthropic prompt caching

  • Trigger: explicit cache_control: { "type": "ephemeral" } markers on prompt blocks.
  • Tiers: 5-minute cache (default) and 1-hour cache (premium price for write).
  • Discount: cached read at 10% of normal input price; cache write at 1.25× normal (5m) or 2× normal (1h).
  • Granularity: per-block; you control exactly which prompt sections cache.
  • TTL: 5 minutes (refreshable) or 1 hour.
  • Configuration: explicit; cacheable section markers in the request.
  • Benefit: most explicit control; aggressive cache hit rates possible.

Google Gemini caching

  • Implicit caching: automatic; similar to OpenAI's approach.
  • Explicit caching (Vertex AI): create a CachedContent resource with a TTL of minutes to hours; reference it in requests.
  • Discount: cached tokens bill at ~25% of the normal input rate.
  • TTL: configurable up to several hours.
  • Configuration: explicit caches are first-class objects in Vertex AI.

Caching cost example

A 50k-token system prompt + RAG context reused across 1000 daily calls:

  • Without caching, Sonnet 4.6: 50k × 1000 × $3/M = $150/day.
  • With Anthropic caching (5m, average 4 cache hits per write): writes 200 × $3.75/M × 50k = $37.50/day for writes + 800 × $0.30/M × 50k = $12/day for reads = $49.50/day total. Saves ~67%.
  • With Anthropic caching (1h, average 49 cache hits per write): writes 20 × $6/M × 50k = $6/day for writes + 980 × $0.30/M × 50k = $14.70/day for reads = $20.70/day total. Saves ~86%.

For repeated prompts, caching cuts prompt cost 3–7× in this example, approaching 10× as hits per write rise. The strategy: design prompts with the stable prefix at the top so as much as possible caches.
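
The example generalises to any write/read split. A sketch under the section's stated multipliers (writes at 1.25× or 2× the input price, reads at 10%); the function name and argument layout are illustrative:

```python
def daily_prompt_cost(prompt_tokens: int, calls: int, writes: int,
                      input_price: float, write_multiplier: float,
                      read_multiplier: float = 0.10) -> float:
    """Daily USD cost of a reused prompt prefix with explicit caching.

    `writes` calls pay the cache-write rate; the remaining calls are cache reads.
    """
    reads = calls - writes
    prompt_m = prompt_tokens / 1_000_000
    write_cost = writes * prompt_m * input_price * write_multiplier
    read_cost = reads * prompt_m * input_price * read_multiplier
    return write_cost + read_cost


uncached = 1_000 * 50_000 / 1_000_000 * 3.0
five_minute = daily_prompt_cost(50_000, 1_000, writes=200, input_price=3.0,
                                write_multiplier=1.25)
one_hour = daily_prompt_cost(50_000, 1_000, writes=20, input_price=3.0,
                             write_multiplier=2.0)

for label, cost in [("no cache", uncached), ("5m tier", five_minute), ("1h tier", one_hour)]:
    print(f"{label:<9} ${cost:7.2f}/day  saving {1 - cost / uncached:.0%}")
```

The number of writes per day is the input that decides the outcome, so measure it from real traffic gaps rather than assuming it.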

When caching doesn't help

  • One-off prompts that don't repeat: no cache to hit.
  • Highly variable prompts: prefix doesn't match across calls.
  • Tiny prompts: caching overhead doesn't pay off below ~1k tokens.
  • Workloads where the input is always different (each user has their own document): cache misses dominate.

Self-host break-even: B200 vs H200 worked example

Worked example for self-hosting a 70B model at scale.

Hardware capex

  • 8× H200 SXM HGX node (~$280k capex; $25–35k per GPU plus chassis). 5-year depreciation: $56k/year.
  • 8× B200 SXM HGX node (~$350k capex; $40–50k per GPU plus chassis). 5-year depreciation: $70k/year.
  • Power and cooling: 14 kW per node × 8760 hours × $0.10/kWh = $12.3k/year.
  • Rack space, networking, security: ~$15k/year colo.
  • Ops fraction: 0.3 FTE platform engineer × $250k loaded = $75k/year (or pro-rated higher for first node, lower past 10 nodes).

Total fully-loaded annual cost for one 8-GPU H200 node: ~$158k/year. For B200: ~$172k/year.

Throughput

Llama 3.3 70B in FP8:

  • H200 8-GPU: ~5k QPS sustained (with continuous batching, 30% mean batch size 16, 200 in / 300 out per request).
  • B200 8-GPU: ~12k QPS sustained on same workload.

Tokens per year

  • H200: 5k QPS × 86400 sec × 365 days × 500 tokens/request × 50% utilisation = ~3.95 × 10^13 tokens/year ≈ 39.5 trillion tokens.
  • B200: ~94 trillion tokens.

Cost per million tokens

  • H200: $158k / (39.5 × 10^6 M tokens) = $0.004/M tokens.
  • B200: $172k / (94 × 10^6 M tokens) = $0.0018/M tokens.

At 50% utilisation, self-hosting on H200 hits ~$0.004/M tokens, roughly 750× cheaper than Sonnet 4.6 at $3/M input. The catch: actual utilisation in production is rarely 50%; effective cost is 2–5× higher because of underutilised hours.

Realistic self-host cost at 25% utilisation

  • H200 effective: $0.008/M tokens.
  • B200 effective: $0.0036/M tokens.

Still dramatically cheaper than API. Self-hosting wins on raw cost at scale.
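
The whole chain from capex to cost per million tokens fits in two functions. This sketch uses this section's figures as defaults; the QPS values in particular are the section's assumptions and should be replaced with throughput you have measured on your own workload.

```python
SECONDS_PER_YEAR = 86_400 * 365
HOURS_PER_YEAR = 8_760


def annual_node_cost(capex: float, depreciation_years: int = 5,
                     power_kw: float = 14, usd_per_kwh: float = 0.10,
                     colo: float = 15_000, ops: float = 75_000) -> float:
    """Fully loaded yearly USD cost of one 8-GPU node."""
    depreciation = capex / depreciation_years
    power = power_kw * HOURS_PER_YEAR * usd_per_kwh
    return depreciation + power + colo + ops


def cost_per_million_tokens(annual_cost: float, qps: float,
                            tokens_per_request: int, utilisation: float) -> float:
    """USD per million tokens served at a given average utilisation."""
    tokens_per_year = qps * SECONDS_PER_YEAR * tokens_per_request * utilisation
    return annual_cost / (tokens_per_year / 1_000_000)


for name, capex, qps in [("H200", 280_000, 5_000), ("B200", 350_000, 12_000)]:
    yearly = annual_node_cost(capex)
    for utilisation in (0.50, 0.25):
        per_m = cost_per_million_tokens(yearly, qps, 500, utilisation)
        print(f"{name} at {utilisation:.0%}: ${yearly:,.0f}/yr, ${per_m:.4f}/M tokens")
```

Because cost per token scales inversely with both throughput and utilisation, an error in either input moves the result by the same factor.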

Break-even traffic

At Sonnet 4.6 pricing of $3/M input + $15/M output (average $8/M mixed), self-hosting an H200 node at $158k/year breaks even at $158k / $8 = ~20 billion tokens/year (about 55M tokens/day) at API pricing. Below that traffic, the API is cheaper after operational overhead.

By the throughput estimate above, the 70B model on one H200 node has roughly 40 trillion tokens/year of achievable capacity (about 94 trillion on B200). So self-host wins economically above ~20B tokens/year, and a single node covers far more than that before you need to add nodes.
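
The break-even volume is one division. A sketch using the $158k/year node cost and the $8/M blended API price from the paragraph above (the function name is illustrative):

```python
def breakeven_tokens_per_year(annual_selfhost_cost: float,
                              api_price_per_m: float) -> float:
    """Yearly token volume at which API spend equals the self-host cost."""
    return annual_selfhost_cost / api_price_per_m * 1_000_000


tokens = breakeven_tokens_per_year(158_000, 8.0)
print(f"{tokens / 1e9:.2f}B tokens/year, {tokens / 365 / 1e6:.1f}M tokens/day")
```

The blended price is the sensitive input: an output-heavy mix on the same model raises it and lowers the break-even volume.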


Hidden cost catalogue: egress, observability, retries, eval

A complete catalogue of costs that don't show on the pricing page.

Network egress

Hyperscaler egress for AI workloads: $0.05–$0.12/GB. For 100M API calls/day at 5kB request + 10kB response = 1.5 TB/day = $75–180/day in egress. Annualised: $27–66k.

Observability and tracing

LangSmith, Helicone, Langfuse, Datadog LLM Observability: typically $0.0001–$0.001 per logged request. For 100M req/day: $10k–100k/day. Most teams sample; even 10% sampling runs $1k–10k/day, or roughly $0.4–3.7M/year.

Eval and regression testing

Running a 1000-example eval suite weekly against 3 candidate models, at 4 calls per example: 3000 × 4 = 12,000 expensive calls × $0.01 = $120/week × 52 = ~$6k/year. Bigger eval suites scale linearly.

Guardrail layer

Pre-LLM content classifier (e.g., Llama Guard, OpenAI Moderation, custom): $0.0001–$0.001 per request. At 100M req/day: $10k–100k/day. Often consolidated with a small classifier model self-hosted on cheap hardware ($1–10k/day at scale).

Retry and fallback

Failed requests retried up to N times consume N× the tokens. Production retry rates of 1–5% are typical; cost overhead 1–5%.

Vendor lock-in

Migrating between providers (OpenAI → Anthropic, etc.) costs eval, prompt re-engineering, and parallel running during cutover. Budget 2–6 engineer-weeks per major migration.

Compliance and audit

SOC 2, ISO 27001, HIPAA: $50–300k/year additional for AI-specific scope. PII redaction layer: $0.0001 per token at scale via specialised services or $50k+ to build internally.

Total hidden cost as a percentage

For a typical SaaS at $5M/year API spend, hidden costs add 10–25% on top: $500k–$1.25M. Plan for it.


Model routing for cost: which router pattern saves what

Routing requests across multiple models by query complexity is the highest-impact cost optimisation after caching.

Patterns

  • Difficulty classifier: a tiny model (or rule) classifies each query as easy / medium / hard; routes to GPT-5-nano / Sonnet 4.6 / Opus 4.x. Saves 50–80% with quality preserved on most workloads.
  • Provider arbitrage: route same-quality models by current pricing or latency. Saves 10–30%.
  • Cascade: try cheap model first; if confidence is low, escalate to expensive model. Saves 60–85% but adds latency on escalations.
  • Skill-based routing: route by query domain (coding → DeepSeek-Coder or Codestral; math → reasoning model; chat → Haiku). Saves 40–70%.

Routers and gateways

  • OpenRouter — gateway with per-call routing; charges a small markup for the abstraction.
  • LiteLLM — open-source proxy with provider abstraction and routing rules.
  • Portkey — gateway with semantic caching and routing.
  • Martian — model router with cost-quality optimisation.

Quality controls

Routing without eval drift detection is dangerous. Maintain a holdout eval set; sample 1% of production traffic for shadow runs on alternative routes; alert on quality drops.

Worked example

A SaaS receiving 100M queries/day, where 70% are easy chat queries, 25% are medium analysis, 5% are hard reasoning:

  • No routing (all on Sonnet 4.6): 100M × $0.005 mean per call = $500k/day.
  • With routing (70% Flash, 25% Sonnet, 5% Opus): 70M × $0.0005 + 25M × $0.005 + 5M × $0.05 = $35k + $125k + $250k = $410k/day. Saves $90k/day or 18%.
  • Aggressive routing (70% Flash, 25% Haiku, 5% reasoning): 70M × $0.0005 + 25M × $0.002 + 5M × $0.05 = $35k + $50k + $250k = $335k/day. Saves $165k/day or 33%.

Saving 33% on a $500k/day spend is $60M/year. Routing earns its keep.
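
The three scenarios are weighted sums over a traffic mix, which makes them easy to re-run with your own shares and per-call costs. A minimal sketch (the `daily_cost` helper is an illustrative name; the numbers are the ones used above):

```python
def daily_cost(total_queries: float, mix: list[tuple[float, float]]) -> float:
    """Daily USD cost for a routing mix of (traffic share, cost per call) pairs."""
    if abs(sum(share for share, _ in mix) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(total_queries * share * cost for share, cost in mix)


QUERIES = 100_000_000

baseline = daily_cost(QUERIES, [(1.00, 0.005)])
routed = daily_cost(QUERIES, [(0.70, 0.0005), (0.25, 0.005), (0.05, 0.05)])
aggressive = daily_cost(QUERIES, [(0.70, 0.0005), (0.25, 0.002), (0.05, 0.05)])

for label, cost in [("no routing", baseline), ("routed", routed), ("aggressive", aggressive)]:
    print(f"{label:<11} ${cost:>9,.0f}/day  saving {1 - cost / baseline:.0%}")
```

In both routed scenarios the 5% of hard queries account for $250k/day, so the per-call cost of the top tier dominates the total.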


Long-context economics: when KV cache dominates

For very long contexts, the KV cache becomes the dominant cost driver — sometimes exceeding the model parameters themselves.

KV cache size formula

For a transformer with L layers, hidden size H, and N attention heads of dimension D = H/N:

KV size per token = 2 (K and V) × L × H × bytes_per_element

For Llama 3.3 70B (L=80, H=8192, FP16): 2 × 80 × 8192 × 2 = 2.6 MB per token. At 128k context: 333 GB. At 1M context: 2.6 TB.

Cost implications

KV cache lives in GPU VRAM. By the unoptimised formula above, a 70B model at 128k context needs ~333 GB of KV cache for a single request, more than four H100 80GB cards can hold. Concurrent users multiply: 100 users at 128k context each need 100 × 333 GB = 33 TB of KV cache, which is impossible on a single node.

In practice:

  • GQA / MQA: Llama 3 uses Grouped-Query Attention with 8 KV heads vs 64 Q heads, cutting KV cache 8×. At 128k context: 42 GB per request.
  • Quantised KV: INT8 KV cache halves memory; INT4 KV quarters it with some quality loss.
  • PagedAttention / continuous batching: shares KV pages across requests when contexts overlap; reduces effective per-request KV.
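
The formula and the GQA and quantisation reductions can be checked with a short sketch. It writes the hidden size as KV heads × head dimension so the same function covers full multi-head attention (64 KV heads) and GQA (8 KV heads); the head dimension of 128 follows from H=8192 and 64 query heads. Function names are illustrative.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_element: int) -> int:
    """Bytes of K and V stored per token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element


def kv_cache_gb(context_tokens: int, **shape) -> float:
    """KV cache size in GB (10^9 bytes) for one request."""
    return context_tokens * kv_bytes_per_token(**shape) / 1e9


CONTEXT = 128_000

full_mha = kv_cache_gb(CONTEXT, layers=80, kv_heads=64, head_dim=128, bytes_per_element=2)
gqa_fp16 = kv_cache_gb(CONTEXT, layers=80, kv_heads=8, head_dim=128, bytes_per_element=2)
gqa_int8 = kv_cache_gb(CONTEXT, layers=80, kv_heads=8, head_dim=128, bytes_per_element=1)

print(f"full MHA, FP16: {full_mha:6.1f} GB")
print(f"GQA, FP16:      {gqa_fp16:6.1f} GB")
print(f"GQA, INT8:      {gqa_int8:6.1f} GB")
```

The results land within about 1% of the rounded figures in the text, which multiply a rounded 2.6 MB per token.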

Long-context pricing

Long-context tiers (>200k tokens on Gemini, >400k on GPT-5) charge 2× input rate because the serving cost is dominated by KV cache, not parameter compute. Justified by the math.

Long-context cost worked example

Reading a 500k-token document on Gemini 2.5 Pro:

  • Long-context input: 500k × $2.50/M = $1.25.
  • Output of 5k tokens: 5k × $10/M = $0.05.
  • Total per query: ~$1.30.

With caching (TTL 1 hour, 10 queries per document):

  • First query: $1.25 input + $0.05 output = $1.30.
  • Subsequent 9 queries: 500k × $0.625/M (cached) + 5k × $10/M = $0.36 each.
  • Total over 10 queries: $1.30 + 9 × $0.36 = $4.54.
  • Without caching: 10 × $1.30 = $13.00.
  • Saving: $8.46 (65%).

Long contexts amplify the value of caching dramatically. Production teams working with long contexts cache aggressively.
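
A sketch of the same session arithmetic, assuming the long-context Gemini 2.5 Pro prices used above and ignoring any cache storage fee. The function name is illustrative.

```python
def session_cost(queries: int, doc_tokens: int, output_tokens: int,
                 input_price: float, repeat_input_price: float,
                 output_price: float) -> float:
    """USD cost of several queries against one document.

    The first query pays `input_price` for the document; later queries pay
    `repeat_input_price` (the cached rate, or the full rate without caching).
    """
    output = output_tokens * output_price / 1_000_000
    first = doc_tokens * input_price / 1_000_000 + output
    repeat = doc_tokens * repeat_input_price / 1_000_000 + output
    return first + (queries - 1) * repeat


uncached = session_cost(10, 500_000, 5_000, 2.50, 2.50, 10.0)
cached = session_cost(10, 500_000, 5_000, 2.50, 0.625, 10.0)

print(f"uncached ${uncached:.2f}, cached ${cached:.2f}, "
      f"saving {1 - cached / uncached:.0%}")
```

Unrounded, the cached total comes to about $4.56 rather than $4.54, because the text rounds each repeat query to $0.36; the saving is still about 65%.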

KV-cache-aware routing

Some platforms route by current KV cache state — if a user's context is already cached on a particular GPU, route subsequent calls to that GPU. Significantly improves cache hit rate; reduces effective cost.


Spot-vs-on-demand market in 2026

GPU spot pricing reflects supply-demand for compute. Knowing the market helps capacity planning.

Hyperscaler spot vs on-demand discount

  • AWS Spot for H100 (P5): 60–80% off on-demand. Eviction risk: 5–20%/month.
  • Azure Spot for H100/H200: 50–75% off. Eviction notice typically 30 seconds.
  • GCP Spot VMs for H100/H200: 60–80% off. Less predictable supply.

Specialised cloud spot

  • CoreWeave preemptible: 40–60% off; eviction depends on enterprise demand.
  • Lambda On-Demand vs Reserved: reserved 30–40% off on-demand; no formal spot tier.
  • Crusoe spot-like tiers: 30–50% off; tied to stranded power availability.

Spot economics

For batch workloads (overnight eval, training jobs that can checkpoint), spot saves 60–80%. For latency-sensitive serving, eviction risk is incompatible with SLAs unless paired with on-demand fallback.

The 2026 supply-demand cycle

H100 supply tightened severely in 2024, normalised through 2025, and is now adequate in mid-2026. B200 supply tightened from 2024 launch through mid-2025, easing by Q2 2026. H200 has been steadily available throughout. Pricing trajectories: H100 on-demand has fallen ~25% from 2024 peaks; H200 has fallen ~15%; B200 has fallen ~10% from initial pricing.

Expect continued price compression as B300 ramps and Rubin launches. Reserve commitments should hedge against price drops — don't lock in 5-year deals at 2026 prices when 2027 prices will likely be 20–35% lower.


Inference cost benchmarks: BENCH vs REAL prices

Benchmark headline tokens/sec figures vs real production throughput diverge widely.

Why benchmarks overstate

  • Benchmark prompts are typically short (128–512 tokens). Production prompts run 1k–10k.
  • Benchmark batch sizes are tuned (16, 32, 64). Production batches are mixed-length.
  • Benchmark caches are warm. Production has prefix-cache miss patterns.
  • Benchmark hardware is dedicated. Production may be multi-tenant.

Typical adjustment factor

Bench → real: divide benchmark throughput by 1.5–3×.

Benchmark claim | Realistic production
Llama 70B on H100 at 100 tok/s/req | 30–60 tok/s/req
B200 at 200 tok/s/req | 80–130 tok/s/req
Groq LPU at 500 tok/s/req | 300–450 tok/s/req (smaller gap)

Real benchmarks to trust

  • Artificial Analysis (artificialanalysis.ai) — independent throughput / quality / price benchmarks updated weekly across hosted APIs.
  • OpenLLM-Leaderboard pricing-aware variants.
  • Helicone / OpenRouter usage data — aggregated real production cost-per-token.

Use bench numbers as ceilings, not as plan inputs

When planning capacity, divide vendor-published numbers by ~2 and ensure the result still meets SLO. Build in headroom for hot periods.
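
As a planning aid, the derating rule can be turned into a replica count. This is a sketch only: the `replicas_needed` name, the 25% headroom default, and the example throughput numbers are assumptions for illustration, not figures from a vendor.

```python
import math


def replicas_needed(peak_tokens_per_sec: float, vendor_tokens_per_sec: float,
                    derate: float = 2.0, headroom: float = 0.25) -> int:
    """Replicas required once vendor throughput is derated and headroom is added."""
    realistic = vendor_tokens_per_sec / derate
    return math.ceil(peak_tokens_per_sec * (1 + headroom) / realistic)


# Hypothetical inputs: 10k tokens/s at peak, vendor claims 2.5k tokens/s per replica.
print(replicas_needed(10_000, 2_500))
# 10
```

Planning on the undiscounted vendor figure with no headroom would give 4 replicas for the same inputs, which is the gap this section warns about.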


Reasoning-effort budget: ROI optimisation

Reasoning models reward query-level effort tuning. The decision: how much thinking budget to spend per query.

The quality-cost curve

Most reasoning models show diminishing returns past a threshold. Past 8k thinking tokens, additional thinking gives 2–5% incremental quality at 3× cost.

Optimisation strategies

  • Static low: default to reasoning_effort=low. Use case: queries that mostly don't need reasoning.
  • Dynamic by query: classifier sets effort per query. Use case: mixed workload; most queries cheap, hard ones get more budget.
  • Iterative escalation: try low; if confidence low, retry medium; escalate to high if needed. Use case: latency-tolerant workloads where cost matters most.
  • Hard ceiling: cap thinking at N tokens; let model fail gracefully if it can't finish. Use case: when failures are recoverable.
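
For the iterative-escalation strategy, expected cost per query depends on how often each tier has to hand off. A sketch under loudly hypothetical assumptions: the escalation probabilities (30% and 20%) are invented for illustration, and the per-attempt costs borrow the o3-class low, medium, and high figures from this section's budget table.

```python
def expected_escalation_cost(tiers: list[tuple[float, float]]) -> float:
    """Expected USD cost per query for try-cheap-then-escalate.

    `tiers` is an ordered list of (cost_per_attempt, p_escalate), where
    p_escalate is the probability that the tier's answer is rejected and the
    next tier is tried. The last tier's p_escalate is ignored.
    """
    expected = 0.0
    p_reach = 1.0  # probability that a query reaches the current tier
    for cost, p_escalate in tiers:
        expected += p_reach * cost
        p_reach *= p_escalate
    return expected


# Hypothetical: 30% of low-effort answers escalate, then 20% of medium ones.
tiers = [(0.005, 0.30), (0.02, 0.20), (0.08, 0.0)]
print(f"${expected_escalation_cost(tiers):.4f} per query")
```

Escalated queries pay for every attempt they pass through, so the strategy only beats static routing when the low tier resolves most queries.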

Worked example

A coding assistant where 80% of queries are simple (autocomplete, short edits) and 20% are complex (multi-file refactors, debugging).

  • No reasoning anywhere (100% chat model): pass-rate 70%. Cost per query: $0.001.
  • All queries on reasoning effort=high: pass-rate 90%. Cost per query: $0.10.
  • Routed: 80% chat, 20% reasoning medium: pass-rate 84%. Cost per query: $0.015.

Routing captures most of the quality lift at a fraction of the cost. The breakdown to track: per-class pass-rate, per-class cost, blended cost vs blended quality.

Per-query reasoning budget table

A reference for setting reasoning_effort or thinking_budget per query class:

Query class | Recommended effort | Expected thinking tokens | Per-query cost (o3-class)
Trivial fact lookup | none (use chat) | 0 | $0.0001
Standard coding question | low | 500–1500 | $0.005
Multi-step math | medium | 2k–6k | $0.02
Complex debugging | high | 8k–20k | $0.08
Multi-document research | high | 15k–40k | $0.20
Proof or hard planning | high (capped) | 20k–50k | $0.30

The table is for orientation; tune per your actual quality requirements and model.