Open Weights: The Ultimate Guide (2026 Edition)
Everything you need to know about open-weight LLMs in 2026 — what 'open' actually means, the license taxonomy, the 2026 frontier roster (DeepSeek V4, Qwen 3.6, GLM-5.1, Kimi K2.6, Llama 4, Mistral, Gemma 3, MiniMax M2.7), the China-vs-US openness gap, how to choose between closed APIs and self-hosted weights, serving stacks (vLLM, SGLang, TensorRT-LLM), fine-tuning and distillation, cost economics, license compliance, and the strategic risks.
In 2023 the open-weight story was Llama 2: a permissive but not-quite-Apache release from Meta that gave the rest of the industry something credible to deploy when GPT-4 felt too risky or too expensive. In 2024 it was Mistral, Mixtral, and the brief Llama 3 dominance. In 2025 it became unambiguous that the open-weight frontier had moved to China — DeepSeek V3 and V3.1 were genuinely competitive with closed flagships at a fraction of the inference cost; Qwen 2.5 and 3 dominated the small-model leaderboards; GLM-4.5 and Kimi K2 showed long-context and tool-use parity. In 2026 the gap has closed further: DeepSeek V4, Qwen 3.6 Plus, GLM-5.1, Kimi K2.6, MiniMax M2.7 are all open-weight and benchmark within a few points of Claude Opus 4.7 and GPT-5.5 on most public evals — at 1/10th to 1/30th the per-token cost when self-hosted. Meanwhile US frontier labs (OpenAI, Anthropic, Google DeepMind) remain closed, with Meta's Llama 4 and Google's Gemma 3 as the headline US open releases — both behind the Chinese frontier by a half-generation.
The take: in 2026, "should I use open weights?" is no longer a values question — it's a unit economics and control question. For agent workloads, RAG, classification, and most enterprise use cases, a self-hosted Qwen 3.6 or DeepSeek V3.2 on a Hopper-class GPU will outperform GPT-4o-class APIs on cost per token by 10-30x while matching quality. For frontier reasoning tasks (HLE, FrontierMath, GDPval), closed flagships still win — but the gap is months not years. The decision framework is: closed APIs when you need the absolute best on hard reasoning or when you can't operate inference infrastructure; open weights when you need cost control, data residency, fine-tuning, or genuine privacy. This guide is the map: what "open weights" really means (it is not "open source"), the 2026 license taxonomy you need to read before deploying, the actual frontier roster ranked by capability and licence permissiveness, the serving stacks (vLLM, SGLang, TensorRT-LLM, MLX), fine-tuning patterns, the cost math, and the risks that aren't on the marketing pages.
Companion reading: vLLM PagedAttention for the serving runtime that almost every open-weight deployment lands on, KV cache math for the memory budget that decides whether you can host a given model, FP8 training tradeoffs for the precision regime that the latest open weights ship in, MoE serving for the architecture pattern that dominates 2026 open weights (DeepSeek, Qwen, GLM, Kimi all MoE), quantization tradeoffs for the INT4/INT8 paths that make consumer-GPU hosting work, post-training RLHF/DPO for fine-tuning, synthetic data and distillation for the closed → open knowledge transfer that underwrites most "open" frontier models, AI inference cost economics for the full TCO math, production safety guardrails for the licence-and-jailbreak layer you have to add around any open-weight deployment, and the agent protocols guide for the MCP/A2A layer that sits on top.
Table of contents
- Key takeaways
- What "open weights" actually means
- The openness ladder: API → weights → weights+code → weights+code+data
- The 2026 license taxonomy
- The 2026 open-weight frontier (the roster)
- Why China dominates open weights in 2026
- How to choose: closed API vs open weights vs hybrid
- Serving open weights: the stack
- Hardware sizing: what runs where
- Quantization regimes and their tradeoffs
- Fine-tuning: LoRA, QLoRA, full FT, RFT
- Distillation: closed → open knowledge transfer
- The cost math: hosted-inference-of-OW vs self-hosted vs API
- Inference providers for open weights (Together, Fireworks, DeepInfra, Cerebras, Groq)
- Bittensor and the decentralized networks built on open weights
- A closer look: the Chinese open-weight labs
- Provenance, watermarking, and licence compliance
- Where open weights match or beat closed (benchmark reality check)
- Where open weights still lose
- Strategic risks: licence rug pulls, deprecation, security audits
- Geopolitics: export controls, sanctions, and the openness backlash
- The "open-source AI" definitional fight (OSI, Llama-as-not-open)
- Patterns that work in 2026 production
- Patterns that don't (anti-patterns)
- 2026 → 2027 outlook
- Further reading
Key takeaways
- "Open weights" is not "open source". You get the trained-model parameters and usually the inference code, but rarely the training data, training scripts, or filtered datasets. The OSI's 2024 Open Source AI Definition (OSAID 1.0) requires all four; almost no major release qualifies. This matters legally and reputationally; assume "open weights" by default.
- 2026 frontier open weights are within striking distance of closed flagships on most evals. DeepSeek V4, Qwen 3.6 Plus, GLM-5.1, Kimi K2.6, and MiniMax M2.7 trade blows with GPT-5.4-class models on GPQA, MMLU-Pro, SWE-Bench, and Arena ELO. The gap to absolute frontier (Opus 4.7, GPT-5.5, Gemini 3.1 Pro) is real on hard reasoning (HLE, FrontierMath, ARC-AGI-2), but ~6-12 months wide, not years.
- China runs the open-weight frontier. DeepSeek, Alibaba (Qwen), Z.ai (GLM), Moonshot (Kimi), MiniMax, Tencent (Hunyuan), Xiaomi (MiMo), ByteDance (Seed), Baidu (ERNIE), and the smaller InclusionAI / StepFun / iFlytek labs ship more frontier open weights per month than the rest of the world combined. The US "open" answer is Meta (Llama 4), Mistral (French but US-funded), Google (Gemma 3), and NVIDIA (Nemotron).
- The licence is half the decision. Apache 2.0 (Qwen, Mistral base, Gemma 3) and MIT (DeepSeek, GLM) are real open. Llama Community Licence and Llama 3 Acceptable Use Policy are open-with-restrictions. RAIL / OpenRAIL-M variants add use-case bans. Always grep the LICENSE file for "non-commercial", "research only", "competitive", and "shall not"; you'd be surprised what's hidden.
- Self-hosting open weights is 10-30× cheaper than equivalent closed APIs on per-token math — when you have steady throughput. Bursty workloads tip back toward APIs because of GPU idle time.
- The serving stack is dominated by vLLM, with SGLang gaining for agent workloads, TensorRT-LLM for NVIDIA-shop max-throughput, and MLX for Apple silicon. Almost nobody runs raw
transformersin production any more. - MoE dominates the 2026 frontier. DeepSeek V3.x/V4 (671B / 37B active), Qwen 3 (235B / 22B active), GLM-4.5/5 (355B / 32B active), Kimi K2 (1T / 32B active), MiniMax M2.7 (230B / 10B active) — all sparse. Active params drive latency; total params drive VRAM. Plan for both.
- Fine-tuning is mostly LoRA / QLoRA in 2026, not full FT. RFT (reinforcement fine-tuning) is the headline trend for reasoning models. Closed → open distillation is now the standard recipe for matching closed-model quality at lower inference cost.
- License rug pulls are real. Stability AI moved from CreativeML OpenRAIL-M to non-commercial subscription terms. Mistral split licences across model lines (Apache for "open" tier, MNPL for "research"). Always pin to a specific commit hash on HuggingFace; "the same model" can ship under different terms a year later.
- Watermarking / provenance is now an open-weight concern. SynthID-Text can be applied to any LLM logits at decode time (DeepMind released an open implementation); C2PA Content Credentials are emerging for image gen. The EU AI Act Article 50 makes detectable-AI-output a legal requirement from 2026.
What "open weights" actually means
The phrase has three common meanings, often used interchangeably and incorrectly:
1. Open weights (the actual common usage). The trained model parameters are downloadable, usually as safetensors files on Hugging Face. Inference code is typically also released (so you can actually load and run the model). Training data, training scripts, evaluation harnesses, and intermediate checkpoints are usually not released. Examples: Llama 4, DeepSeek V3.2, Qwen 3.6, GLM-5.1, Kimi K2.6, Mistral Large 2, Gemma 3.
2. Open source (the OSI / OSAID 1.0 definition, since Oct 2024). To qualify under the OSI's Open Source AI Definition, a model must provide: (a) trained-model weights, (b) source code sufficient to train an equivalent model, (c) detailed information on training data (sources, processing, lineage), and (d) the freedoms to use, study, modify, and share for any purpose. Almost no major frontier release meets this bar. OLMo (Allen AI) is the highest-profile genuine open-source AI release; Pythia (EleutherAI) historically; DCLM for fully reproducible pre-training. Llama, DeepSeek, Qwen, GLM, Kimi, Mistral, Gemma — none meet OSAID. They're open weights, not open source.
3. Open-access API. Some vendors call their hosted API "open" because you can pay-per-token instead of going through a sales process. This is not openness in any technical sense; it's just SaaS pricing. Ignore the marketing.
The practical impact: when you read "Meta open-sourced Llama 4," what they actually did was release the weights under a custom community licence. You cannot legally retrain it; you cannot prove what's in it; you have to follow the Acceptable Use Policy. That's much closer to "freely downloadable proprietary software" than to Linux. This distinction matters for:
- Legal review: open source has clear precedent; open weights with custom licences require per-licence analysis.
- Reproducibility: you can't verify safety claims without the training data.
- Forkability: you can fine-tune but not retrain from scratch in a different direction.
- Provenance: you have no way to audit what data the model memorized.
The openness ladder: API → weights → weights+code → weights+code+data
A more useful framework than the binary open/closed split:
| Level | What you get | Examples |
|---|---|---|
| 1. Closed API | Inference only, vendor-controlled | GPT-5.x, Claude 4.x, Gemini 3.x |
| 2. Open access API | Same as level 1, no sales process needed | OpenAI public tier, Anthropic public tier |
| 3. Hosted open weights | Run on a third-party endpoint, but underlying model is downloadable | DeepSeek V3.2 on Together, Qwen 3.6 on Fireworks |
| 4. Open weights | Weights downloadable; inference code provided | Llama 4, DeepSeek V3.2, Qwen 3.6, GLM-5.1 |
| 5. Open weights + training code | Weights + how-to-train scripts | Mixtral, OLMo-2 (partial), DCLM |
| 6. Open weights + training code + filtered data | All of above + the actual pre-training corpus | OLMo, Pythia, DCLM |
| 7. Truly reproducible | All of above + deterministic seed/build | Almost none at frontier scale |
Most "open weight" discussion conflates levels 3-5. Levels 6 and 7 are rare and mostly academic. When choosing a model, ask: what level do you actually need? For inference-only deployment, level 4 is enough. For fine-tuning, level 4-5. For safety auditing or research reproducibility, level 6-7.
The 2026 license taxonomy
The licence determines whether you can deploy commercially, redistribute, fine-tune, or even study the model. The 2026 landscape:
Permissive (true open, commercial-OK, derivatives-OK):
- Apache 2.0 — Mistral 7B / 8x7B / Large 2 (some), Qwen 2.5 base, Qwen 3 base, Gemma 3, Phi-4. Includes a patent grant. Compatible with most commercial use.
- MIT — DeepSeek V3 / V3.1 / V3.2 / V4, GLM-4 / 4.5 / 4.6 / 5.1. No patent grant but simpler.
- BSD-3-Clause — some research models.
Source-available with restrictions (community licences):
- Llama Community Licence (3.x, 4.x) — Meta's licence. Free for most commercial use, but requires >700M-MAU companies to request a separate licence. Includes Acceptable Use Policy (no weapons development, no CSAM, no critical infrastructure abuse). You must include the licence and "Built with Llama" attribution. Generally treated as commercial-OK for most users, but not OSI-approved.
- Gemma Terms of Use — similar shape: free commercial use, prohibited use clause, no special licence threshold for MAU but requires distributing licence and prohibited-use policy with derivatives.
- OpenRAIL-M / RAIL (BLOOM, some Stability releases) — adds enumerated use-case prohibitions (military, surveillance, etc.). Functionally enforces a usage policy via licence terms.
Non-commercial (research only):
- CC-BY-NC-4.0 — used by some smaller research labs (Hugging Face's older release lines, some academic labs).
- MNPL (Mistral Non-Production Licence) — applies to some "premier" Mistral releases that are downloadable but not commercially usable without a separate agreement.
- Custom non-commercial — frequent for image-gen models (SDXL Turbo originally, some Stability releases).
License triage rules I use:
- Open the actual
LICENSEorLICENSE.txtin the repo. Don't trust the README or marketing. - Search for: "non-commercial", "research only", "compete", "MAU", "monthly active", "shall not", "without prior written".
- Check the Acceptable Use Policy if linked — it often adds restrictions outside the licence text.
- Check the model card — sometimes a stricter AUP lives there.
- Pin to a commit hash on HF (
revision=...) so you can prove what licence you accepted.
Practical license map for the 2026 frontier:
| Model | Licence | Commercial? | Derivatives? | Notes |
|---|---|---|---|---|
| DeepSeek V3.2 / V4 | MIT | ✅ | ✅ | Cleanest. |
| Qwen 3.6 base (35B-A3B, 27B) | Apache 2.0 | ✅ | ✅ | Cleanest. |
| Qwen 3.6 Plus | API only, weights TBA | API: ✅ | n/a | Closed API tier; base variants are Apache. |
| GLM-5.1 | MIT | ✅ | ✅ | Cleanest. |
| Kimi K2.6 | Modified MIT | ✅ | ✅ | Restriction: >100M MAU + >$20M ARR must show "Kimi K2" attribution. |
| MiniMax M2.7 | MiniMax Non-Commercial Licence | ❌ | ✅ research | Commercial use requires separate agreement. |
| Llama 4 Maverick / Scout | Llama Community Licence 4 | ✅ (<700M MAU) | ✅ | AUP applies. |
| Mistral Large 2 | MNPL (Mistral Non-Production) | ❌ | ✅ research | Or pay for commercial. |
| Mistral 7B / 8x7B / Codestral | Apache 2.0 (most) | ✅ | ✅ | Codestral has a separate research licence; check per-file. |
| Gemma 3 (1B, 4B, 12B, 27B) | Gemma Terms of Use | ✅ | ✅ | AUP applies. |
| Phi-4 | MIT | ✅ | ✅ | Cleanest. |
| Nemotron 4 / 4.5 | NVIDIA Open Model Licence | ✅ | ✅ | Specific terms around derivatives. |
| OLMo / OLMo-2 | Apache 2.0 (model + data + code) | ✅ | ✅ | Genuine OSAID-compliant. |
Note the pattern: the cleanest licences (MIT, Apache) are on the Chinese frontier models. US frontier (Llama, Gemma) use custom community terms that are "almost-open." US research labs (OLMo, Phi) use true open licences.
The 2026 open-weight frontier (the roster)
Ranked roughly by frontier capability as of mid-2026 (May), with quick notes. Scores cited are public reports; for live data see /leaderboard/text and /leaderboard/code.
Tier S — within striking distance of closed flagships:
- DeepSeek V4 (DeepSeek, China). MoE successor to V3.2; announced April 2026. MIT-licensed weights forthcoming. Published specs vary by source (somewhere in 1-1.6T total, 40-50B active); the public HF repo
deepseek-ai/DeepSeek-V4is not yet downloadable at time of writing — treat capability claims as preview-grade until weights ship. Reported training cost in the ~$5-10M range, in line with the V3 efficiency story. - Qwen 3.6 Plus (Alibaba, China). 397B sparse + dense variants (35B-A3B, 27B). April 2026. Apache 2.0 on base, API-only on Plus tier. Best multilingual coverage; tool-use parity with closed.
- GLM-5.1 (Z.ai / Zhipu, China). ~750B sparse (256 routed experts + 1 shared, 8 active per token; hidden 6144, 78 layers; HF
createdAt: 2026-04-03). MIT. The GLM-5V Turbo variant adds native vision. Tops several coding benchmarks; OpenClaw-compatible agent harness scoring. - Kimi K2.6 (Moonshot, China). ~1T sparse, ~32B active (384 routed experts, 8 active per token; HF
createdAt: 2026-04-14). Modified MIT. Long-context flagship (256K). Natively multimodal (vision + text viaKimiK25ForConditionalGeneration). Reasoning variant K2.5-Thinking ships as a separate release. - MiniMax M2.7 (MiniMax, China). 230B sparse, 10B active. March 2026. Non-commercial weights. Open weights but not commercial-OK by default.
Tier A — strong, slightly behind:
- Llama 4 Maverick / Scout (Meta, US). 400B / 70B dense. April 2025. Llama Community Licence. The US open-weight flagship; behind the Chinese frontier by ~6 months on most evals.
- DeepSeek V3.2-Exp (DeepSeek). 671B MoE, 37B active. Sept 2025. MIT. Workhorse of the late-2025 / early-2026 open-weight scene (HF
createdAt: 2025-09-29). - Mistral Large 3 (Mistral). Late 2025. Mistral commercial / API.
- Hunyuan 3 / Hunyuan-Vision (Tencent, China). Mid 2025. Apache 2.0 on most variants.
- Gemma 3 27B (Google, US). March 2025. Gemma Terms. Multimodal. The strongest small open model; competitive with Llama 3 70B at much smaller scale.
Tier B — strong specialists / small models:
- Phi-4 (Microsoft, US). 14B dense. Late 2024 / early 2025. MIT. Reasoning-focused; punches above its weight.
- Nemotron 4 / 4.5 (NVIDIA). 340B / 70B dense and MoE variants. NVIDIA Open Model Licence. Strong on reasoning post-train.
- Qwen 3.6-27B / 35B-A3B (Alibaba). Apache 2.0. The best "small" open models in 2026.
- Ling-2.6 (InclusionAI / Ant Group). Apache 2.0. 1T MoE. April 2026.
- MiMo-V2.5 / V2.5-Pro (Xiaomi). MIT-style. Strong on the ClawEval / OpenClaw harness benchmarks.
- StepFun Step-3 — mid-tier sparse MoE; growing presence.
- Codestral / Devstral (Mistral) — code specialists, Apache.
Tier C — historical and small:
- Llama 3.1 405B / 3.3 70B — still widely deployed.
- Qwen 2.5 series.
- DeepSeek V2 / V2.5.
- BLOOM / BLOOMZ.
- Falcon series.
Specialist open weights:
- OLMo 2 32B (Allen AI) — true OSAID-compliant.
- DCLM (DataComp) — research-grade fully reproducible.
- MPT, Falcon, Pythia — historical, mostly academic now.
- BiologicalLLMs — ESM-3, AlphaFold, RoseTTAFold series.
- CodeGen / StarCoder2 — code specialists.
Why China dominates open weights in 2026
The pattern of every 2025-2026 monthly release is the same: a Chinese lab ships an open-weight model at near-frontier quality with a permissive licence; US frontier labs ship a closed product; Meta and Mistral catch up six months later. Why?
1. Domestic competition dynamics. Chinese labs are competing for domestic enterprise adoption (where API access to US models is unreliable) and for talent and prestige. Open-weight releases generate goodwill, recruit researchers, and bypass having to build global API infrastructure under uncertain export-control regimes.
2. The DeepSeek effect. DeepSeek V3's December 2024 release with an MIT licence and a reported ~$5.5M training cost reset expectations industry-wide. The follow-on releases (V3.1, V3.2, V4) compounded credibility. Other labs (Qwen, Z.ai, Moonshot, MiniMax) recognized that the only durable answer to DeepSeek was matching openness.
3. US export-control bypass. China-developed open weights can be deployed inside China without depending on US-controlled infrastructure (H100/H200 export bans, US cloud KYC). They can also be deployed outside China by anyone who can host them, sidestepping App-Store-style platform control.
4. Lower opportunity cost. US frontier labs (OpenAI, Anthropic, Google DeepMind) make billions in API revenue and have business reasons to keep weights closed. Chinese frontier labs make less from API and more from enterprise contracts and government compute leases — the marginal revenue lost by open-weighting is smaller.
5. Open-weight as a fund-raising signal. Several Chinese labs (DeepSeek, Moonshot, MiniMax) have used open-weight prestige to raise large rounds at frontier valuations, suggesting investors see open-weight credibility as a path to long-term enterprise position even without short-term API monetization.
6. Faster derivative ecosystem. The Chinese open-weight ecosystem now has end-to-end derivative chains: Qwen base → InclusionAI Ling fine-tunes → community RFT variants → ModelScope hosting → Tongyi / DingTalk integration. The US Llama ecosystem has the same shape but slower cadence.
Implications:
- Most "best open weight at quality X" answers in 2026 are Chinese models.
- Sovereign-AI initiatives in EU / Middle East / India increasingly start from Qwen or DeepSeek bases, not Llama.
- The "China can't reach frontier without H100s" thesis is empirically wrong; it has been reached using H800, A800, Huawei Ascend 910B, and software efficiency (DeepSeek's DualPipe, FP8 training, MoE).
This does not mean every team should switch to Chinese open weights. There are real geopolitical, supply-chain, and security-review reasons many enterprises won't. But pretending the frontier is closed-US is now factually wrong.
How to choose: closed API vs open weights vs hybrid
Use this decision tree:
1. Is your workload steady-state and high-volume (>100M tokens/day sustained)?
- Yes → self-hosted open weights almost always wins on cost.
- No → reconsider; API may be cheaper because of GPU idle time.
2. Do you need data residency, on-prem, or air-gapped deployment?
- Yes → open weights, no other option.
- No → API is fine.
3. Do you need to fine-tune?
- Yes → open weights (or use OpenAI / Anthropic fine-tuning APIs, but those have limits on customization and base-model choice).
- No → either.
4. Is your use case in the "closed-model strength zone"? (Frontier reasoning, long-context retrieval over millions of tokens, agentic tool-use chains >50 steps, complex math/code at the frontier.)
- Yes → closed APIs still lead by a measurable margin; pay the premium.
- No → open weights match or exceed.
5. Do you have ops capacity to run inference infrastructure?
- Yes → self-host (vLLM/SGLang).
- No → use a hosted-open-weight provider (Together, Fireworks, DeepInfra, Cerebras, Groq).
6. Are you in a regulated industry where you need to audit what's in the model?
- Yes → only OSAID-compliant releases (OLMo, DCLM) give you full data provenance. Otherwise treat any model as "auditable to the licence text, not the data."
7. Is your latency budget <100ms p50?
- Yes → Groq / Cerebras / SambaNova for open weights, or vendor-specific low-latency endpoints. vLLM standard deployment usually misses sub-100ms on frontier-class models.
The hybrid pattern that works: use a frontier closed API (Claude / GPT / Gemini) for the hardest 5-10% of calls, and a self-hosted open-weight model for the 90% that's classification, RAG retrieval, code generation, simple agents, or chat completion. Route via a thin classifier or rule-based heuristic. This combines cost economics of open weights with quality ceiling of closed.
Serving open weights: the stack
In 2026, almost every production deployment lands on one of:
vLLM — the default. PagedAttention, continuous batching, prefix caching, speculative decoding, FP8 / AWQ / GPTQ quantization, multi-LoRA serving, distributed serving via Ray. Best community support; supports almost every architecture released within weeks. Companion reading: vLLM PagedAttention.
SGLang — second-most-popular, gaining for agent workloads. RadixAttention (better prefix sharing than vLLM's), faster structured generation (regex/JSON-schema), strong on long-context. Same authors as the original LMQL. Pulls ahead for workloads with heavy prefix reuse (agents, multi-turn).
TensorRT-LLM — NVIDIA's path. Max throughput on NVIDIA hardware; complex deployment. Use when you're an all-NVIDIA shop with serious throughput needs and ops capacity.
MLX — Apple Silicon. Surprisingly good for local dev and small production deployments on Mac Studios. M3 Ultra 192GB can comfortably serve a 70B 4-bit model.
llama.cpp + GGUF — local / consumer. The reason laptop inference exists. Worth tracking even for production teams because the GGUF format and quantization research feed back into vLLM/SGLang.
ExLlamaV2 — niche, but the best path for max-throughput single-GPU inference of quantized models.
TGI (Hugging Face Text Generation Inference) — once dominant, now mostly legacy; HF themselves recommend vLLM for new deployments.
Friendli / Bento / Modal / RunPod — managed serving layers that wrap vLLM/SGLang and handle scaling. Use when you don't want to operate Kubernetes for inference.
Picking the runtime:
- Default to vLLM unless you have a specific reason not to.
- Move to SGLang if your workload is agentic with heavy prefix reuse.
- Use TensorRT-LLM only if you're an NVIDIA-only shop maxing throughput.
- Use MLX for dev / small-scale prod on Apple silicon.
- Use llama.cpp for local / edge.
Hardware sizing: what runs where
The honest cheat-sheet, for FP16/BF16 (full precision) inference of the most-deployed open weights. Halve for FP8, quarter for INT4.
| Model | Active params | Total params | Approx VRAM (BF16) | Min config | Comfortable config |
|---|---|---|---|---|---|
| Llama 3.1 8B | 8B | 8B | 16 GB | 1× A100 40GB | 1× H100 80GB |
| Qwen 3.6 27B | 27B | 27B | 54 GB | 1× H100 80GB | 2× H100 80GB |
| Llama 3.3 70B | 70B | 70B | 140 GB | 2× H100 | 4× H100 |
| DeepSeek V3.2 | 37B active | 671B | 1.3 TB | 8× H200 141GB | 8× B200 |
| Qwen 3.6-Plus / 397B | ~40B active | 397B | 800 GB | 8× H100 | 8× H200 |
| Kimi K2.6 | 32B active | 1T | 2 TB | 16× H100 | 8× B200 + NVLink |
| GLM-5.1 | 32B active | 754B | 1.5 TB | 8× H200 | 16× H200 |
| Mistral 7B | 7B | 7B | 14 GB | 1× A10 24GB | 1× L40S |
| Gemma 3 27B | 27B | 27B | 54 GB | 1× H100 | 2× L40S |
| Phi-4 | 14B | 14B | 28 GB | 1× L40S | 1× H100 |
A few non-obvious points:
- MoE total params drive VRAM, not active params. You need to fit all the experts even if you only activate a fraction. A 671B MoE with 37B active still needs 1.3TB of weight storage.
- KV cache adds substantially at long context — see KV cache math. For Kimi K2.6 at 256K context with batch size 8, KV cache alone can exceed 1TB.
- Speed scales with active params, not total. A 37B-active MoE serves at roughly the speed of a 37B dense model.
- FP8 halves VRAM with minimal quality loss on most modern open weights — many were trained natively in FP8 (DeepSeek V3.x, parts of Qwen 3).
- INT4 (AWQ, GPTQ, EXL2) quarters VRAM at small but measurable quality loss. Generally fine for 70B+ models, more painful at <13B.
Quantization regimes and their tradeoffs
Quantization is what makes consumer-GPU and Mac-Studio inference work, and it's also how production teams stretch a single H100 to serve a model that would otherwise need two.
FP16 / BF16 — full precision. Reference quality. Use as baseline.
FP8 (E4M3 / E5M2) — half the memory, near-zero quality loss on models trained with FP8 awareness (DeepSeek V3.x, parts of Qwen 3, Llama 4). Native on H100/H200/B200; emulated on older hardware. The 2026 default for new deployments.
INT8 — half memory, small quality loss. Useful when FP8 isn't available (consumer GPUs).
INT4 (AWQ, GPTQ, EXL2, NF4, IQ4) — quarter memory, 1-3% quality loss on large models, more on small. Multiple competing formats:
- AWQ (Activation-aware Weight Quantization) — best quality at INT4 for most modern transformers. vLLM-native.
- GPTQ — older but still widely used. Mostly being replaced by AWQ.
- EXL2 — best throughput on single-GPU consumer setups (ExLlamaV2 runtime).
- NF4 (BitsAndBytes) — used in QLoRA fine-tuning more than serving.
- IQ4 / IQ3 (k-quants) — llama.cpp / GGUF formats. Strong for CPU/Apple Silicon.
INT2 / INT3 / 1.58-bit — research / curiosity. BitNet b1.58 (Microsoft, 2024) showed near-FP16 quality at 1.58 bits if you train natively at that precision, but no major frontier model has shipped this in 2026.
Practical guidance:
- Default to FP8 for production on H100+ hardware. It's the sweet spot.
- Use AWQ INT4 when you need to fit a model on smaller hardware or run two models on one GPU.
- Avoid INT4 on <13B models if quality matters.
- Always benchmark on your actual eval — published "INT4 = -1% MMLU" claims don't always translate to your use case.
Fine-tuning: LoRA, QLoRA, full FT, RFT
In 2026, "fine-tuning open weights" is mostly:
LoRA (Low-Rank Adaptation) — the default. Train small low-rank matrices alongside frozen base weights. ~0.5% of the parameter count. Trains on a single H100 for most 70B models. Serves via multi-tenant LoRA at near-zero marginal cost per adapter.
QLoRA — LoRA on top of a 4-bit quantized base. Trains 70B models on a single consumer GPU (RTX 4090 / 6000 Ada). Most popular path for cost-conscious fine-tunes. Slightly more brittle than LoRA on FP16/BF16 base.
DoRA (Weight-Decomposed Low-Rank Adaptation) — LoRA variant; consistent ~0.5-1% quality bump over LoRA in published benchmarks.
Full FT — train all parameters. Required for large changes to model behavior (domain pre-training, new languages). Needs 4-8× the VRAM of inference. Mostly used at frontier labs and well-funded sovereign-AI projects.
Continued pre-training — adapt a base model to a new domain by running additional pre-training data through it (medical, legal, code). Doesn't change architecture, doesn't add new behaviors directly; sets up for stronger SFT/RLHF afterwards.
SFT (Supervised Fine-Tuning) — train on labeled completion examples. Standard for instruction-following, task-specific adaptation. Companion reading: Post-training: RLHF and DPO.
DPO (Direct Preference Optimization) — preference learning without a separate reward model. Simpler than RLHF, often sufficient. Standard for most open-weight post-training in 2025-2026.
RFT (Reinforcement Fine-Tuning) — the headline 2026 trend. RL on verifiable rewards (code passes tests, math answer correct). Used by DeepSeek R1, Qwen QwQ, Kimi K2 reasoning variants. Requires a verifiable reward signal; not applicable to all tasks. OpenAI's o-series and Anthropic's extended-thinking are closed-model analogs.
Distillation (see next section) — train a smaller model on outputs of a larger one. Often combined with SFT.
The practical recipe for most teams in 2026: take a strong open-weight base (Qwen 3.6-27B, Llama 3.3 70B, DeepSeek V3.2), apply QLoRA + DPO on your domain data, serve via vLLM with the LoRA hot-swapped. Total cost: a few hundred to a few thousand dollars for the training run; serving cost amortizes per-token.
Distillation: closed → open knowledge transfer
Almost every 2025-2026 frontier open-weight release used some form of distillation from closed-model outputs. The standard recipe:
- Generate large quantities of high-quality outputs from a strong closed model (GPT-4 / Claude 3.5+ / Gemini 2.5) on a diverse prompt distribution.
- Filter for quality (model-as-judge, rule-based, human review for a sample).
- SFT a smaller open base on this synthetic distillation set.
- Optional: layer DPO with preferences sourced similarly.
- Optional: RFT on verifiable subset.
This is how DeepSeek V3 hit GPT-4-class quality at much lower training cost; how Qwen and Llama derivatives improve on their base models; how InclusionAI's Ling series compete despite being a smaller lab.
Licence implications: most closed-model APIs (OpenAI, Anthropic, Google) prohibit using outputs to train competing models in their TOS. Enforcement is patchy; the practical observation is that everyone does it and the lawsuits have so far been narrow (the Bytedance / OpenAI ban in late 2023; the Anthropic / Quora-Poe dispute). Companion reading: Synthetic data and distillation.
Quality ceiling: distilled models can match the behavior of the source on common tasks but typically lag on the source's strongest capabilities (frontier reasoning, novel problem-solving). Distillation transfers procedural knowledge well; it transfers the deepest reasoning capacity poorly.
The cost math: hosted-inference-of-OW vs self-hosted vs API
The TCO comparison that actually matters in 2026. Sample numbers; your mileage will vary by region, contract terms, and utilization.
Closed-API GPT-5 class (Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Pro):
$3 per 1M input / $15 per 1M output ($5 blended at typical 5:1 input/output ratio)- Fully managed, no infra
- Latency: typically 1-3 seconds first token, 50-100 tok/s decode
Hosted open weights (Together, Fireworks, DeepInfra serving DeepSeek V3.2, Qwen 3.6, Kimi K2):
- ~$0.20-0.50 per 1M input / $0.50-2 per 1M output
- 5-10× cheaper than closed APIs at the same model class
- Same managed convenience; slightly worse latency in most cases
- Quality matches or exceeds GPT-4o / Claude Haiku class on most tasks
Self-hosted open weights (run DeepSeek V3.2 / Qwen 3.6 on your own H100/H200 cluster):
- Hardware cost: 8× H100 80GB ≈ $30k/month rented from CoreWeave/Crusoe, or ~$300k capex
- At ~5000 tokens/s sustained throughput from a well-tuned vLLM cluster
- Per-million-token cost at 70% utilization: ~$0.10-0.20
- 10-30× cheaper than closed APIs at full utilization
- Catastrophic on cost at <30% utilization (idle GPU)
The break-even math:
- Self-hosted breaks even with hosted-OW at ~30-50M tokens/day sustained.
- Self-hosted breaks even with closed APIs at ~5-10M tokens/day sustained.
- Hosted-OW always beats closed-API on per-token; the question is whether your other constraints (data residency, fine-tuning) push you further toward self-host.
Cost-control levers that move the numbers a lot:
- Prefix caching (KV cache reuse across calls) can cut input cost 10×+ on agent workloads. Companion: KV cache math.
- Speculative decoding can 2-3× throughput on appropriate workloads. Companion: speculative decoding.
- Batch APIs (OpenAI / Anthropic batch tier) — 50% discount for async, 24h SLA. Available for closed and increasingly for hosted-OW.
- Disaggregated inference (separate prefill from decode pools). Companion: disaggregated inference.
- MoE serving (only active experts touched). Companion: MoE serving.
Inference providers for open weights
The "hosted-OW" tier deserves its own treatment because most teams will land here before they ever self-host.
General-purpose (broad model selection, OpenAI-API-compatible):
- Together AI — broadest model menu; first to host most new Chinese releases; competitive pricing.
- Fireworks AI — fast, broad menu, strong FP8/INT8 support, custom fine-tunes.
- DeepInfra — cheapest list prices in most categories; less fine-tune support.
- OpenRouter — aggregator routing across the above plus closed APIs; one API key, many backends.
- Replicate — broad menu including image/video models; pay-per-second container model.
- Hyperbolic — newer entrant; strong on the latest Chinese open weights.
Hardware-specialized (faster latency, narrower menu):
- Groq — LPU silicon; sub-50ms first-token on supported models. Limited model menu (currently Llama, Qwen, GPT-OSS, Kimi, Whisper).
- Cerebras — wafer-scale; fastest output throughput available (1000+ tokens/s on Llama 70B class).
- SambaNova — RDU silicon; competitive on long-context and large models.
Vendor-affiliated:
- Anthropic API — Anthropic-hosted closed models; doesn't host open weights.
- OpenAI API — same.
- Google Vertex AI — hosts Gemini (closed), Gemma (open), and third-party open weights as a Marketplace offering.
- Azure AI — hosts OpenAI, Llama, Mistral, DeepSeek, others.
- AWS Bedrock — hosts Claude, Llama, Mistral, DeepSeek, Cohere, Stability, others.
- Alibaba Cloud Model Studio — strong on Qwen ecosystem; primary path for serving Qwen in Asia.
Picking:
- Default to Together or Fireworks for broad model menus.
- Use Groq or Cerebras when latency is the constraint.
- Use OpenRouter when you want one integration that fans out to many backends and routes by price/availability.
- Use the hyperscaler (AWS / Azure / GCP) when you need enterprise SLAs or already have spend commitments.
Bittensor and the decentralized networks built on open weights
An entire layer of the open-weight ecosystem sits in decentralized / onchain compute and inference networks. They exist because open weights exist — closed APIs can't be deployed on a permissionless network. The 2026 roster:
Bittensor (bittensor.com) — the biggest. A Layer-1 chain (finney) with ~100+ "subnets," each a separate market for a specific AI workload. Validators score miners on quality; miners run open-weight models (mostly Llama, Qwen, DeepSeek, Mistral variants) and earn TAO tokens proportional to their relative scores. Notable subnets:
- Subnet 1 (Apex) — text inference. Validators send prompts; miners reply; quality scored against a reference.
- Subnet 9 (Pretraining) — competitive pre-training. Miners train models; best loss wins block reward.
- Subnet 56 (Gradients) — fine-tuning marketplace.
- Subnet 19 (Inference Subnet) — large-scale open-weight inference.
- Subnet 18 (Cortex.t) — synthetic data generation.
- Subnet 27 (Compute) — raw GPU rental.
- Many more for vision, audio, code, agents, prediction markets. Every subnet requires open weights — closed APIs can't be permissionlessly evaluated by validators.
Akash Network — decentralized GPU compute marketplace. Common workload: run vLLM serving DeepSeek/Qwen/Llama on a tenant's H100 lease. Pay-per-block in AKT.
Ritual — onchain inference: smart contracts trigger model calls (typically open-weight Llama / Mistral / DeepSeek), with verifiable execution via zkML or optimistic verification.
Gensyn — decentralized training network. Submit a training job (typically continued pretraining or fine-tune of an open-weight base); the network distributes it across volunteer GPUs and verifies via probabilistic redundant computation.
Nous Research — Psyche / Hermes — decentralized pre-training of open-weight base models across geographically distributed clusters. Psyche framework released open-source 2024-2025; Hermes-Agent / Nous Hermes 3 series are popular open-weight derivatives of Llama.
Prime Intellect — decentralized training of frontier open weights (INTELLECT-1, INTELLECT-2). 10B-class models trained across multiple datacenters globally. Open-weight by construction.
io.net — decentralized GPU compute (3M+ GPUs claimed, mostly consumer); hosts open-weight inference services.
Hyperbolic — hybrid: centralized inference of open weights (DeepSeek, Qwen, Kimi) plus a decentralized GPU marketplace. Some of the lowest open-weight token prices in mid-2026.
Ora — onchain inference oracle network.
Allora Network — decentralized prediction-market AI; uses open weights for inference.
SaharaAI, Lilypad, Atoma, Pondhouse — newer entrants in the same space.
Bagel, OpenLedger, Sentient — emerging frameworks for "verifiable open-weight serving" — proving that the model run was the one claimed (often via TEE attestation or zkML).
Why this layer exists:
- Open weights are the only models that can be permissionlessly deployed (you can't sell access to a closed API on a public chain — the API key is centralized).
- Open weights enable verifiable inference: anyone can re-run the model and check the result.
- Decentralized inference enables data residency, censorship-resistance, and compliance use cases that closed APIs cannot.
The decentralized inference economics:
- Decentralized prices are typically 30-70% below hyperscaler hosted-OW prices on the same model.
- Reliability is lower (uptime variance across operators, occasional incorrect responses from misbehaving miners).
- Latency is mixed: best operators match centralized; worst are 5-10× slower.
- Best fit: workloads that are price-sensitive, batch-tolerant, and verification-friendly.
Chinese open weights on decentralized networks: Qwen, DeepSeek, GLM, Kimi all run on Bittensor subnets, Akash deployments, Hyperbolic, io.net. The combination of "Chinese open-weight frontier + decentralized inference" is increasingly the cheapest path to GPT-4-class quality. This is not yet widely discussed in mainstream AI media but is a major underlying current of the 2026 open-weight economy.
Companion reading: Decentralized GPU economics, Verifiable inference: proof of sampling.
A closer look: the Chinese open-weight labs
Because the 2026 frontier is so dominated by Chinese labs, it's worth being explicit about who they are, who funds them, and what they ship. (Capability tier as of May 2026 in parentheses.)
DeepSeek (deepseek.com, Hangzhou; spun from High-Flyer quant hedge fund). DeepSeek V3 / V3.1 / V3.2 / V4. MIT licence. The single biggest mover in open-weight credibility in 2024-2026. Their published papers (V3, R1) are foundational reading for any modern open-weight team. Tier S.
Alibaba / Qwen (qwen.ai, qwenlm.github.io, Hangzhou). Qwen 2.5 / 3 / 3.5 / 3.6 series. Apache 2.0 on most base models. Broadest open-weight model family (text, vision, audio, code, math specialists; sizes from 0.5B to 397B). Strong multilingual. Tier S.
Z.ai / Zhipu (z.ai, Beijing; close ties to Tsinghua University). GLM-4 / 4.5 / 4.6 / 5.1 + GLM-5V Turbo (vision). MIT. Strong on agentic harness benchmarks (ClawEval), coding, and Chinese-language tasks. Tier S.
Moonshot AI (kimi.com, Beijing; Alibaba-backed). Kimi K2 / K2.5 / K2.6 / K2.5-Thinking. Modified MIT. Long-context flagship (128K-256K). K2 Thinking series competitive on frontier reasoning. Tier S.
MiniMax (minimax.io, Shanghai). MiniMax M1 / M2 / M2.7. Non-commercial weights licence on most releases. Strong on speech, video gen, and reasoning. Tier S on capability, restricted on licence. The "Speech-02-HD" voice model is widely deployed.
Tencent Hunyuan (hunyuan.tencent.com, Shenzhen). Hunyuan 3 series. Apache 2.0 on most. Multimodal flagship + vision. Tier A.
Xiaomi MiMo (mimo.xiaomi.com, Beijing). MiMo V2 / V2.5 / V2.5-Pro + the OpenClaw agent harness. MIT-style. Strong on agentic capabilities; ClawEval benchmark suite is theirs. Tier A.
ByteDance Seed (seed.bytedance.com, Beijing). Seed-OSS, Seed-TTS series. Open-weight tier; closed for ByteDance internal frontier. Strong on speech, video gen, and ASR. Tier A.
InclusionAI / Ant Group (inclusion-ai.org, Hangzhou; affiliated with Ant). Ling 2.6 series, 1T MoE. Apache 2.0. Tier A.
Baidu ERNIE (ernie.baidu.com). ERNIE 4 / 5 series. Mostly closed; selected open-weight releases. Tier B (open-weight strength).
Huawei Noah Ark / Pangu (huawei.com). Pangu series. Mostly internal / partial open-weight. Tier B.
iFlytek (iflytek.com). Spark series. Speech-strong; selected open-weight releases. Tier B.
StepFun (stepfun.com, Shanghai). Step series. Open-weight mid-tier MoE. Tier B.
OpenBMB / MiniCPM (openbmb.ai, Tsinghua-affiliated). MiniCPM small-model series. Strong on small (1-4B) open weights. Tier B (specialist).
Smaller / specialist Chinese open-weight labs:
- 01.AI / Yi (Lee Kai-fu's lab; Yi-34B was a major 2024 release, less frontier-active in 2026)
- Skywork (Kunlun Tech; Skywork-13B and successors)
- Cohere Embed multilingual (Toronto-based but heavy China-market focus on multilingual coverage)
Where to track them:
- HuggingFace org pages:
deepseek-ai,Qwen,zai-org,moonshotai,MiniMaxAI,tencent,XiaomiMiMo,bytedance-research,openbmb,stepfun-ai,inclusionAI - ModelScope (
modelscope.cn) — Alibaba's HF equivalent; many Chinese labs publish here first - See /leaderboard/text and /news for live model and release tracking, including a labs-CN news category.
The pattern across this list: most ship under MIT or Apache 2.0; most release on Hugging Face within hours of internal launch; most publish technical reports with actual numbers (in contrast to US labs' system cards that often elide pretraining details). The combined release velocity from this group is what's driving the "open-weight catches closed every six months" pattern.
Provenance, watermarking, and licence compliance
Three things you have to think about when you ship anything based on open weights:
1. Licence compliance:
- Read the licence; pin to a commit hash; include attribution where required (Llama: "Built with Llama", Gemma: distribute licence with derivatives, Kimi K2 at >$20M ARR: show "Kimi K2" attribution).
- Check AUPs separately from licences. They often add restrictions.
- For commercial deployment, get sign-off from legal — particularly for the "almost-open" community licences and any Non-Commercial releases.
2. Watermarking and provenance:
- SynthID-Text (Google, open-sourced via Hugging Face Transformers in 2024) — applies an imperceptible logit-level watermark to LLM output. Works with any HF-compatible model. Companion: Provenance & Watermarking Standards on the data side.
- C2PA Content Credentials — for image/video generation; cryptographically signed manifests that travel with the file.
- EU AI Act Article 50 — from August 2026, AI-generated content must be machine-detectable. Open-weight image / video models that don't ship a watermark detection path are now a legal liability for downstream products in the EU.
3. Security review:
- Open weights are unverified by default. You should:
- Validate checksums on download (HF provides
safetensorshash). - Scan for code execution paths in custom
modeling_*.pyfiles (trust_remote_code=Trueis a code execution surface). - Watch for model-poisoning attacks (less common at frontier scale, real at fine-tune-on-untrusted-data scale).
- Treat outputs as untrusted, especially for safety-critical or agentic deployments.
- Validate checksums on download (HF provides
Where open weights match or beat closed (benchmark reality check)
As of May 2026, the honest comparison on public benchmarks:
Open weights match or beat closed:
- General-purpose chat (MT-Bench, Arena ELO): tier-S open weights within 30-50 ELO of frontier closed.
- Code generation (HumanEval, SWE-Bench Verified): DeepSeek V4, Qwen 3.6, GLM-5.1 within 5-10 points of GPT-5.4.
- Multilingual: Qwen and DeepSeek lead Chinese; Mistral and Aya lead European multilingual.
- Long-context retrieval (RULER, NIAH variants): Kimi K2 and Gemini both strong; open weights competitive at 256K.
- Cost-per-token at matched quality: open weights win by 10-30×.
Closed still leads:
- Frontier reasoning (HLE, FrontierMath, ARC-AGI-2): Opus 4.7 and GPT-5.5 still ahead by 5-15 points.
- Complex multi-step agent workflows (GAIA full set, BrowseComp, Toolathlon): closed leads by margin but gap shrinking.
- GDPval (economic-value tasks): GPT-5.5 at 84.9, Opus 4.7 at 80.3, top open weight (Kimi K2.6 + DeepSeek V4) around 65-72.
- Latency at the very low end (<100ms p50): closed APIs with custom hardware (Groq + closed model partnerships) edge out commodity open-weight serving.
Where the public benchmarks are misleading:
- Real-world agent reliability isn't captured by single-task benchmarks. Closed models still have a measurable edge in long-running agent loops that public evals don't surface well.
- Tool-use fidelity (correct JSON schema, function-call correctness, refusal-of-impossible-tool-call) is closer than benchmarks suggest in either direction.
- Hallucination rates are not well-measured publicly; both closed and open lie often on out-of-distribution domain queries.
See /leaderboard/research and /leaderboard/code for live benchmark data.
Where open weights still lose
A frank list:
Hardest reasoning tasks — Opus 4.7 / GPT-5.5 hold a 5-15 point edge on HLE / FrontierMath / GDPval. Open weights are within ~12 months of closing this; for now, when the task is "solve this novel problem nobody has seen," closed wins.
Tool-use reliability in long agent loops — closed models are still more reliable at "make 50 tool calls without going off the rails." Open weights catch the easy cases; the failure modes diverge after step ~20.
Built-in safety training — closed flagships ship with extensive red-teaming and safety post-training. Open weights vary widely; some are aggressively safety-trained (Anthropic's prior Claude OSS work, Meta's safety post-train), some have minimal alignment (most Chinese open releases ship with much lighter refusal). For consumer-facing or high-stakes use, factor in your own safety layer.
Multi-modal frontier (long video, mixed audio+vision+text) — closed flagships still lead. Gemini 3 Flash, GPT-5 multimodal, Claude vision have measurable lead over Qwen-VL / GLM-V / Kimi multimodal.
Function-calling APIs and SDK ergonomics — closed providers have invested years in SDK quality. Self-hosted open weights inherit OpenAI-compatible APIs but with rougher edges (less reliable JSON-schema enforcement, less elegant streaming, etc.).
First-party fine-tuning UX — OpenAI / Anthropic / Google offer one-click fine-tuning with managed evaluation. Open-weight fine-tuning is more powerful but requires more ops capacity.
Compliance certifications — closed APIs ship SOC2 / HIPAA / FedRAMP packets. Open-weight self-host requires you to bring your own.
Strategic risks: licence rug pulls, deprecation, security audits
Licence rug pulls: vendors can change licence terms on new releases. Examples:
- Stability AI moved from open OpenRAIL-M licences on early Stable Diffusion releases to subscription-based commercial terms.
- Mistral split licences: Mistral 7B and Codestral Apache, Mistral Large MNPL non-commercial.
- Llama introduced the 700M MAU clause in Llama 2; tightened Acceptable Use Policy over time.
- Cohere Command-R released open weights, but commercial deployment requires API.
Mitigation: pin to specific commits; archive the licence text at the time of download; budget for the possibility of needing to switch.
Deprecation / orphaning: a model you depend on may be superseded and lose community support. Mistral 8x22B is barely maintained. Llama 2 is functionally dead. Plan for migration.
Security audits: open weights are not pre-audited. You should:
- Avoid
trust_remote_code=Truefrom untrusted publishers. - Validate the
config.jsonandmodeling_*.pyfor unusual code paths. - Check for embedded URLs, command execution, file-write patterns in inference code.
- For safety-critical deployment, run a red-team eval pass on your specific use case before launch.
Supply chain:
- Hugging Face hosts most open weights. If HF is unavailable (sanctioned, deprecated, billing dispute), you have a single point of failure. Mirror to your own object store.
- Some Chinese models are also hosted on ModelScope (Alibaba) and Hugging Face mirrors. Treat HF as primary.
Reputation risk: a model with controversial outputs traces back to you when you ship it. Run safety evals; document mitigations; have an incident response plan.
Geopolitics: export controls, sanctions, and the openness backlash
The 2026 picture:
- US export controls (BIS rule sets) restrict shipment of H100/H200/B200 to China and other restricted destinations. Chinese labs have adapted with H800/H200/H20 (export-compliant variants), Huawei Ascend 910B, and improved training efficiency (DeepSeek's MoE + DualPipe + FP8 work).
- EU AI Act (in force in stages 2024-2027) treats general-purpose AI models above thresholds (10²⁵ FLOPs) as "with systemic risk" — regulatory burden falls on the model provider. Open-weight releases above this threshold trigger registration, evaluation, and incident-reporting requirements. Article 50 mandates labeling of AI-generated content from 2026.
- UK / Australia / Canada / India: generally less restrictive but signal-tracking US and EU.
- China's own rules (Cyberspace Administration of China algorithm regulation) require security review and content filtering for models offered as services in China; open-weight downloads are less restricted but using them in deployed services triggers compliance.
- Sanctions (Russia, Iran, North Korea, etc.) apply to US-origin technology. Open-weight downloads from US vendors to sanctioned jurisdictions are restricted; Chinese open weights are de facto unrestricted.
Practical implications:
- US companies generally avoid Chinese open weights for production deployment (procurement / legal / press risk) even though it costs them on cost and capability.
- EU companies are increasingly willing to use Chinese open weights for self-hosted enterprise deployment.
- Sovereign-AI initiatives (UAE, Saudi, India, Brazil, Indonesia) start from Chinese bases more often than US bases in 2026.
- Watch the "AI sovereignty" framing — it's increasingly used as a marketing wrapper around "we use open Chinese weights" by non-US, non-Chinese teams.
The "open-source AI" definitional fight (OSI, Llama-as-not-open)
In October 2024 the Open Source Initiative published OSAID 1.0 — the Open Source AI Definition. Among other things, it requires that data information be available in enough detail that an equivalent training corpus can be assembled. This explicitly excludes Llama, Gemma, Mistral, Qwen, DeepSeek, and most other "open" frontier models.
This caused a multi-month debate:
- Meta's position: Llama is open; OSI definition is too narrow.
- OSI's position: without training data, the model cannot be studied or reproduced; calling it open source is misleading.
- Industry usage: "open weights" gained traction as the more honest term; "open source AI" is now widely understood to mean OSAID-compliant.
OSAID-compliant releases at frontier scale: very few. OLMo and OLMo 2 (Allen AI), DCLM (DataComp consortium), Pythia (EleutherAI, historical), some smaller research models. None of these are at the absolute frontier on capability.
The practical takeaway: use "open weights" when you mean "weights downloadable, training data not necessarily disclosed" (which is what most releases are). Reserve "open source AI" for OSAID-compliant releases. This terminology hygiene saves arguments and saves your legal team confusion.
Patterns that work in 2026 production
What I see successful teams doing in 2026:
- Hybrid routing: classifier or rule-based router sends queries to closed API (5-10% hard cases) or self-hosted open weights (90%). Cost-effective and quality-preserving.
- Prefix caching aggressively. With agent workloads where system prompts and tool schemas are large and reused, prefix caching cuts effective input cost 5-10×.
- LoRA-per-tenant serving via vLLM multi-tenant. One base model + many small adapters. Companion: Multi-tenant LoRA.
- Distillation pipelines: generate gold outputs from a frontier closed model, train an open-weight specialist on them. Common for code-completion, classification, RAG-answer-generation tasks.
- Self-hosted for steady-state, hosted-OW for spikes. Reserve baseline capacity on your cluster; spill over to Together/Fireworks during traffic bursts.
- Model rotation discipline: re-evaluate the open-weight roster every 2-3 months. The frontier moves; whatever you deployed 6 months ago is probably superseded.
- Document licence acceptance at deployment time. Save the licence text + commit hash; record who approved it.
Patterns that don't (anti-patterns)
What burns teams in 2026:
- Deploying a frontier closed-model proof-of-concept and never re-evaluating against open weights. Costs run 10× higher than necessary.
- Self-hosting at low utilization. A $30k/month 8× H100 cluster serving 10M tokens/day is a bad trade vs paying $1k/month to Fireworks.
- Ignoring licence terms in the rush to ship. Mistral MNPL and MiniMax non-commercial weights regularly get deployed in commercial products by teams that didn't read.
trust_remote_code=Truefrom arbitrary publishers. This is code execution. Treat as untrusted.- Fine-tuning on uncleaned data and then deploying. Catastrophic backdoors and PII memorization happen.
- Skipping eval re-runs after a new base release. Models that worked at v2 sometimes regress at v3 on specific tasks.
- Treating open weights as a permanent free lunch. Every model deprecates; every licence can change.
2026 → 2027 outlook
Best-guess trajectory:
- Open-weight frontier continues to close the gap. By end-2026 I expect the absolute frontier closed-vs-open gap on HLE / FrontierMath / GDPval to narrow to 5-8 points (from 15+ today). By mid-2027 it may be within noise on most tasks.
- Chinese labs continue to dominate open-weight releases. No structural reason this changes; rate of release continues at ~1-2 frontier-class open releases per month.
- Meta's Llama 5 will be the next major US open-weight reset. Expect early 2027; expect significant ecosystem ripple.
- OpenAI's "open" tier (GPT-OSS) will likely expand from the small models released in 2025 to mid-tier in 2026-2027 as competitive pressure mounts.
- Inference cost continues to fall. The combination of FP8 training, MoE, speculative decoding, and prefix caching has compounded a ~10× cost reduction per year for matched quality since 2023. Expect another ~5-10× by end-2027.
- EU AI Act enforcement begins biting in 2026-2027. Watermarking, transparency reports, and incident reporting become operational requirements for anyone deploying open weights at scale into EU markets.
- Sovereign AI builds on open weights become the default. EU, UAE, India, Indonesia, Brazil increasingly fork from Qwen/DeepSeek/GLM rather than Llama, both for capability and political reasons.
- The OSAID-vs-community-licence terminology fight settles. By 2027, "open source AI" will broadly mean OSAID-compliant, and "open weights" will be the standard term for the rest. The press will catch up by 2028.
- A major licence rug pull or security incident will reset the conversation. The pattern of every 18 months is some open-weight provider changes terms or ships compromised weights; the industry adapts. Plan for it.
Further reading
Internal:
- vLLM PagedAttention and continuous batching
- KV cache: the inference memory math
- FP8 training tradeoffs
- Mixture-of-Experts serving
- Quantization tradeoffs (INT4, INT8, FP8)
- Post-training: RLHF and DPO
- Synthetic data and distillation
- Multi-tenant LoRA serving
- AI inference cost economics
- Speculative decoding
- Disaggregated inference: prefill and decode
- Reasoning model serving
- Eval infrastructure
- Agent serving infrastructure
- Production safety guardrails
- AI agent protocols (MCP, A2A, ACP)
External:
- Open Source AI Definition 1.0 (OSAID)
- Hugging Face Open LLM Leaderboard
- Artificial Analysis
- LMSys Arena
- SemiAnalysis
- DeepSeek technical reports
- Qwen technical reports
- Mistral release notes
- Meta Llama documentation
- Google Gemma
- Allen AI OLMo
- Live data: /leaderboard/text · /leaderboard/code · /leaderboard/research · /news