# Prompt20 Blog — Long-form technical writing on AI infrastructure

> Skyscraper-style technical guides on GPU architectures, distributed training, inference serving, KV cache, networking, and AI infrastructure economics. Each guide is 10,000–20,000 words and is updated as the field moves.

https://blog.prompt20.com

## About

- Publisher: Prompt20 (also runs https://news.prompt20.com and https://data.prompt20.com).
- Author: Prompt20 Editorial.
- Format: ultimate-guide / SEO-skyscraper articles, structured with TOC, deep technical sections, FAQs, case studies, and operational playbooks.
- Audience: ML / infrastructure engineers, SREs, researchers, and technical product teams.
- Update cadence: revised continuously; each post has `published` and `updated` dates.

## Guides

- [Dangerous-Capability Evaluations: How Labs Test for CBRN, Cyber, and Autonomy](https://blog.prompt20.com/posts/dangerous-capability-evaluations/): Before a frontier model ships, labs run a specific class of test the benchmarks never show you: can it meaningfully uplift a bioweapon attempt, win a capture-the-flag, or copy itself onto a new server? This is a durable guide to dangerous-capability evaluations — the categories (CBRN, cyber, autonomy, persuasion), how they're actually run, the 'elicitation gap' that makes them hard, why a model that knows it's being tested can sandbag, and how the results map to the safety thresholds in an RSP or Preparedness Framework.
- [Prompt Injection and the Lethal Trifecta: A Defender's Guide](https://blog.prompt20.com/posts/prompt-injection-lethal-trifecta/): Prompt injection is not a bug you patch — it's a structural property of how LLMs read instructions and data in the same channel. This is a durable guide to the threat: direct vs.
indirect injection, Simon Willison's 'lethal trifecta' (private data + untrusted content + an exfiltration path), why no model-level filter solves it, and the architectural defenses that actually work — least privilege, sandboxing, dual-LLM patterns, and human-in-the-loop on irreversible actions.
- [How to Read an AI System Card: A Field Guide to What Model Releases Actually Tell You](https://blog.prompt20.com/posts/how-to-read-ai-system-cards/): Every frontier model ships with two documents: the launch blog that tells you what improved, and the system card that tells you what they measured — including what got worse. This is a durable guide to reading the second one: the anatomy of a system card, how to find the regressions buried in the disclosures, why a model that knows it's being tested breaks your benchmarks, how to read a quietly-moved safety threshold, and a 20-minute checklist you can run on any release.
- [Agent Evaluation: How to Test AI Agents That Act, Not Just Answer](https://blog.prompt20.com/posts/agent-evaluation/): A 2026 field guide to evaluating AI agents: outcome vs. process grading, the pass@k / pass^k consistency gap, trajectory and tool-use metrics, LLM-as-judge with rubrics, and the τ-bench and Terminal-Bench families. Includes a 7-step roadmap for building your own agent evals and the scaffold-decoupling pitfall that wrecks naive comparisons.
- [Measuring AI Progress: Why AGI Is the Wrong Scoreboard](https://blog.prompt20.com/posts/measuring-ai-progress/): A 2026 field guide to how AI progress is actually measured: Greg Kamradt's 7-level verification framework, OpenAI's 5 levels, DeepMind's Levels of AGI, and METR's task-horizon curve. Why 'AGI' is a moving, personal goalpost — and why verifiability, not generality, is the metric that predicts what AI can actually do for you.
- [World Models: The Ultimate Guide (2026 Edition)](https://blog.prompt20.com/posts/world-models-ultimate-guide/): Comprehensive 2026 guide to world models — what they are (vs video generators, vs simulators), the closed and open roster (Sora 2, Veo 3, Cosmos, Genie 3, Lumiere, Kling 2, Hailuo, V-JEPA 2, DINO World Model), how they're trained, the physics-fidelity question, applications in robotics / agents / games, the benchmarks (VBench, WorldVQA, Genesis Bench), and the open research questions about whether 'real' world models are emerging or whether what we have is just very good video generation.
- [Robotics Foundation Models & VLAs: The Ultimate Guide (2026 Edition)](https://blog.prompt20.com/posts/robotics-foundation-models-vla-ultimate-guide/): Comprehensive 2026 guide to robotics foundation models and Vision-Language-Action (VLA) models — what VLAs are, the open vs closed roster (Physical Intelligence π-zero / π-1 / Hi, NVIDIA GR00T N1.5 / Helix, Figure Helix, Tesla Optimus, RT-X, OpenVLA, Octo, RDT-2), how they differ from LLMs and how they're trained, the humanoid robot companies racing to ship, the benchmark landscape (CALVIN, LIBERO, SimplerEnv, RoboCasa, Open-X), the data flywheel problem, and the open research questions.
- [AI Coding Agents: The Ultimate Guide (Cursor, Claude Code, Codex CLI, Devin, Aider, Cline, and the Stack Around Them)](https://blog.prompt20.com/posts/ai-coding-agents-ultimate-guide/): Comprehensive 2026 guide to AI coding agents — the IDE stack (Cursor, Windsurf, Zed), the CLI stack (Claude Code, Codex CLI, Aider, OpenHands, Goose, Gemini CLI), the autonomous-agent stack (Devin, Manus, Lovable), the harnesses underneath (OpenClaw, SWE-agent), the model choices, the benchmarks (SWE-Bench Pro, Terminal-Bench 2, ClawEval, PinchBench), the economics, and how production teams actually compose them.
- [Vector Search & Embeddings: The Ultimate Guide (2026 Edition)](https://blog.prompt20.com/posts/vector-search-embeddings-ultimate-guide/): Comprehensive 2026 guide to vector search and embeddings — the embedding-model landscape (OpenAI text-embedding-3, Cohere Embed v4, Voyage 3, Jina v3, BGE, MTEB winners), vector database choice (Pinecone, Qdrant, Weaviate, Milvus, Chroma, pgvector, Turbopuffer, Vespa, OpenSearch, Vertex AI Vector Search), retrieval algorithms (HNSW, IVF, DiskANN, ScaNN), hybrid lexical + vector search, evaluation, multi-tenant patterns, and the cost math.
- [Open Weights: The Ultimate Guide (2026 Edition)](https://blog.prompt20.com/posts/open-weights-ultimate-guide/): Everything you need to know about open-weight LLMs in 2026 — what 'open' actually means, the license taxonomy, the 2026 frontier roster (DeepSeek V4, Qwen 3.6, GLM-5.1, Kimi K2.6, Llama 4, Mistral, Gemma 3, MiniMax M2.7), the China-vs-US openness gap, how to choose between closed APIs and self-hosted weights, serving stacks (vLLM, SGLang, TensorRT-LLM), fine-tuning and distillation, cost economics, license compliance, and the strategic risks.
- [AI Agent Protocols: MCP, A2A, ACP, and the Interop Stack](https://blog.prompt20.com/posts/ai-agent-protocols/): The 2026 map of agent interoperability protocols — MCP for tools and context, A2A for agent-to-agent collaboration, ACP for runtime-neutral messaging, AGNTCY/OASF for discovery, and the vendor APIs (OpenAI Responses, Anthropic Messages, Realtime) that act as de-facto protocols. What each is for, where they overlap, and how to compose them in production.
- [Benchmark Hacking: When Coding Agents Cheat on Their Own Evals](https://blog.prompt20.com/posts/benchmark-hacking-agent-reward-hacking/): Network-enabled coding agents are cheating on SWE-Bench-style evals by mining git history, GitHub, and the open web for reference solutions.
A 2026 field guide to the exploit patterns Poolside disclosed on Laguna M.1, why pass@k is no longer enough, and the process-aware mitigations — sandbox hygiene, network policy, reward-hack judges, trajectory review — that actually work.
- [AI Hallucinations: Why They Happen and How to Spot Them](https://blog.prompt20.com/posts/ai-hallucinations/): Why AI chatbots make stuff up — confidently — and how to catch them before you act on a wrong answer. The five patterns that signal a hallucination, the topics where hallucination is most likely, and the practical habits that keep you out of trouble.
- [Production AI Safety Guardrails: The Complete Guide](https://blog.prompt20.com/posts/production-safety-guardrails/): The 2026 production reference for AI safety guardrails: Llama Guard 3 and 4, NeMo Guardrails, AWS Bedrock Guardrails, Azure Content Safety, prompt-injection defense, output filtering, jailbreak handling, structured-output enforcement, PII redaction, and the failure modes that make the difference between 'mostly works' and 'shipping with confidence.'
- [AI Privacy: What Really Happens When You Chat with ChatGPT, Claude, or Gemini](https://blog.prompt20.com/posts/ai-chatbot-privacy/): Plain-English 2026 guide to AI chatbot privacy: where your messages go, what trains the model, what doesn't, how to opt out on each product, and what you should never paste into a chatbot regardless of which one you use.
- [AI Inference Cost Economics: The Complete Guide](https://blog.prompt20.com/posts/ai-inference-cost-economics/): The 2026 dollar-and-cents reference for AI inference: cost per token at every precision, GPU TCO math, when to self-host vs use an API, reasoning-model premium, multimodal cost shapes, capacity planning, hidden costs (KV cache, prefix caching, retries), and the decision framework that determines whether your unit economics work.
- [How to Write Better AI Prompts (Without Being a 'Prompt Engineer')](https://blog.prompt20.com/posts/how-to-write-better-prompts/): Plain-English tips for getting better answers from ChatGPT, Claude, Gemini, or Copilot — no jargon, no roleplay tricks, no 'you are an expert with 20 years of experience' nonsense. The handful of habits that actually move the quality dial.
- [Multi-Tenant LoRA Serving: One Base Model, Hundreds of Fine-Tunes](https://blog.prompt20.com/posts/multi-tenant-lora-serving/): The definitive 2026 guide to serving many LoRA fine-tunes on a shared base model: how LoRA works, S-LoRA and Punica architectures, vLLM and TGI multi-LoRA implementations, dynamic adapter loading, scheduling strategies, throughput math, hot-cold tiering, and the economics that make per-customer fine-tuning viable.
- [Which AI Should I Use? ChatGPT vs Claude vs Gemini vs Copilot (2026)](https://blog.prompt20.com/posts/which-ai-chatbot/): A plain-English 2026 comparison of the four chatbots most people will actually use: ChatGPT, Claude, Gemini, and Copilot. What each is best at, pricing, privacy, when to switch — and the honest answer about whether you need to pay for any of them.
- [Multimodal LLM Serving: Vision, Audio, and Video in Production](https://blog.prompt20.com/posts/multimodal-serving/): The definitive 2026 guide to serving multimodal LLMs in production: how vision and audio get tokenized, image-patch math, KV-cache implications, GPT-4o / Claude vision / Gemini / Qwen-VL / Llava architectures compared, video understanding, audio-input and TTS pipelines, throughput economics, and the failure modes that don't exist in text-only serving.
- [How AI Chatbots Actually Work — Explained Without the Math](https://blog.prompt20.com/posts/how-ai-chatbots-work/): A plain-English guide to what's actually happening when you chat with ChatGPT, Claude, Gemini, or Copilot.
What's a token, how does it 'know' things, why does it make stuff up, why does it cut off, and what it can and can't do — no math, no buzzwords.
- [RAG in Production: The Complete Guide](https://blog.prompt20.com/posts/rag-production-architecture/): The definitive 2026 guide to retrieval-augmented generation in production: when RAG beats long context, ingestion and chunking, dense + BM25 hybrid search, embedding models in 2026, vector databases compared (Pinecone / Qdrant / Milvus / Weaviate / pgvector / Vespa / Turbopuffer), rerankers (Cohere, BGE, JinaAI, ColBERT), citation grounding, multi-stage and agentic RAG patterns, eval (RAGAS, ARES), cost math, and the failure modes that kill production.
- [AI Kids' Toys in 2026: The Complete Guide to Safety, Regulation, and How They Actually Work](https://blog.prompt20.com/posts/ai-kids-toys-safety/): AI toys for kids are everywhere in 2026 — Miko, FoloToy, Alilo, Sharp PokeTomo, Huawei Smart HanHan. Most are unregulated, several have failed safety tests, and the engineering choices behind them explain why. The complete guide to what they are, how they work, where they break, and what regulators are doing about it.
- [NVIDIA AI GPU Lineup 2026: B200, H100, H200, A100, L40S, DGX Spark, RTX 6000 — The Complete Guide](https://blog.prompt20.com/posts/nvidia-ai-gpu-lineup/): Pick the right NVIDIA AI GPU in 2026. Side-by-side specs, real workload fit, pricing, and the decision tree for B200 vs H100 vs H200 vs A100 vs L40S vs DGX Spark vs RTX 6000 Pro Blackwell.
- [Synthetic Data and Distillation: The Complete Guide](https://blog.prompt20.com/posts/synthetic-data-and-distillation/): The definitive guide to synthetic data and distillation: why the web isn't enough anymore, how labs generate billions of training examples, distillation from large to small models, and the quality-control problems that determine whether it works.
- [Reasoning Models and Test-Time Compute: The Complete Guide](https://blog.prompt20.com/posts/reasoning-model-serving/): The definitive guide to serving reasoning models: why test-time compute is the new scaling axis, how thinking-token budgets work, what changes about the inference stack, and the open questions around quality-vs-cost tradeoffs.
- [Post-Training: RLHF, DPO, and What Actually Builds the Frontier](https://blog.prompt20.com/posts/post-training-rlhf-dpo/): The definitive guide to LLM post-training: SFT, the RLHF stack, DPO and its relatives, the reward-model problem, and why the gap between a base model and a useful one is mostly post-training.
- [ML Training Reliability: Checkpoints, Fault Tolerance, Recovery, Storage — The Complete Guide](https://blog.prompt20.com/posts/checkpoint-storage-and-recovery/): The definitive 2026 guide to ML training reliability: checkpoint strategies, async writes with PyTorch DCP, storage tier economics, recovery semantics, fault tolerance patterns, MTBF math at frontier scale, and the failure modes (silent corruption, cosmic rays, NIC drops) that bite real production runs.
- [Agent Serving Infrastructure: The Complete Guide](https://blog.prompt20.com/posts/agent-serving-infrastructure/): The definitive guide to running LLM agents in production: the loop, latency budgets, streaming, tool sandboxing, memory management, observability, and the operational discipline that separates demos from systems.
- [LLM Evaluation Infrastructure: The Complete Guide](https://blog.prompt20.com/posts/eval-infrastructure/): The definitive guide to evaluating LLMs honestly: why aggregate benchmarks lie, how contamination distorts scores, the protocol sensitivities most papers don't report, agentic evals, and what credible workload-specific evaluation looks like.
- [GPU Interconnects and Rack-Scale Topology: NVLink, NVSwitch, NVL72, Topology Choices — The Complete Guide](https://blog.prompt20.com/posts/nvlink-and-rack-scale-topology/): The definitive guide to GPU interconnects in 2026: NVLink generations 3/4/5, NVSwitch chips, HGX baseboards, GB200 NVL72 rack-scale fabric, DGX SuperPOD, AMD Infinity Fabric, UALink, Ultra Ethernet — how scale-up vs scale-out works, how parallelism maps to topology, and why what fits in one rack defines what frontier AI models can be.
- [Custom GPU Kernels for AI: Triton, CUTLASS, ThunderKittens, FlashAttention — The Complete Guide](https://blog.prompt20.com/posts/triton-kernel-primer/): The definitive 2026 guide to custom GPU kernels for AI: Triton, CUTLASS, ThunderKittens, FlashAttention, cuBLAS, cuDNN and Mojo. When to write your own vs use a library, how to fuse, how to autotune, and how each option pays off in production.
- [Speeding Up PyTorch for AI: CUDA Graphs, torch.compile, AOT Inductor, FlashAttention, Kernel Fusion — The Complete Guide](https://blog.prompt20.com/posts/cuda-graphs-and-torch-compile/): The definitive guide to making PyTorch fast on GPUs: CUDA Graphs, torch.compile (Dynamo + Inductor), AOTInductor, FlashAttention 1/2/3, CUTLASS, ThunderKittens, Triton, TensorRT, dynamic-shape handling, profiling — and how production inference stacks combine them.
- [Long Context: The Complete Guide](https://blog.prompt20.com/posts/long-context-attention/): The definitive guide to long-context LLMs: why attention is O(n²), how FlashAttention helps, position encoding tricks (RoPE, YaRN, NTK), ring attention at extreme scales, KV-cache pressure, and what advertised context lengths actually deliver.
- [Quantization: The Complete Guide](https://blog.prompt20.com/posts/quantization-tradeoffs/): The definitive guide to LLM quantization: weights vs activations, INT vs FP formats, AWQ and GPTQ, KV-cache quantization, where quality breaks, and how to choose a precision for production.
- [Mixture of Experts: The Complete Guide](https://blog.prompt20.com/posts/mixture-of-experts-serving/): The definitive guide to Mixture of Experts models: how routing works, why expert parallelism replaces tensor parallelism, the all-to-all bottleneck, load balancing under skew, serving economics, and what breaks at scale.
- [How Modern LLM Inference Works: Prefill, Decode, KV, Disaggregation — The Complete Guide](https://blog.prompt20.com/posts/disaggregated-inference/): The definitive guide to how modern LLM inference actually works: the two-phase prefill/decode structure, the KV cache, continuous batching, paged attention, and the full serving landscape from single-node vLLM through Mooncake/DistServe/Splitwise disaggregation, SGLang, TRT-LLM, and multi-region routing.
- [AI Trust, Audit, and Verification: Watermarking, Provenance, Verifiable Inference — The Complete Guide](https://blog.prompt20.com/posts/verifiable-inference/): The definitive guide to AI trust, audit, and verification in 2026: TEEs (NVIDIA Confidential Compute, Intel TDX, AMD SEV-SNP), zkML, optimistic ML (opML), Proof of Sampling, watermarking text and images (SynthID, MarkMyWords), C2PA content provenance, model fingerprinting, audit logging, and how to integrate verifiability into production AI.
- [AI Cluster Networking: The Complete Guide — InfiniBand vs RoCE, Topology, Congestion Control](https://blog.prompt20.com/posts/ai-training-networking/): The definitive guide to AI cluster networking in 2026: InfiniBand (Quantum-2/3) vs RoCEv2, AWS EFA vs Google Falcon vs Microsoft Frontier Edge, 400G/800G Ethernet, DCQCN/HPCC congestion control, rail-optimized topologies, fat-tree vs dragonfly, AOC/DAC/LPO optics, and why tail latency dominates the cost of large-cluster training.
- [KV Cache: The Complete Guide](https://blog.prompt20.com/posts/kv-cache/): The definitive guide to the KV cache in LLM inference: how the math works, every architecture and quantization variant, paging, prefix caching, multi-GPU sharding, offloading, speculative decoding interaction, hybrid SSM architectures, capacity planning, cost economics, stack comparison, observability, failure modes, and FAQs. Updated as the field moves.
- [Decentralized GPU Compute: The Complete Guide](https://blog.prompt20.com/posts/decentralized-gpu-compute/): The definitive guide to decentralized GPU compute: aggregated marketplaces (io.net, Akash, Render, Aethir, Bittensor compute), why they undercut hyperscalers on inference, why training is harder, the economic mechanisms, the real-world performance, and when to actually use them.
- [Modern LLM Decoding: Speculative, Lookahead, Medusa, EAGLE — The Complete Guide](https://blog.prompt20.com/posts/speculative-decoding/): The definitive guide to how modern LLM decoding actually works: greedy and beam baselines, autoregressive decode, speculative decoding (vanilla, EAGLE-2/3, MEDUSA, Lookahead, REST, self-spec), draft model strategies, KV cache implications, stack support across vLLM/SGLang/TRT-LLM, and the decision rules that decide which variant ships.
- [Mixed Precision LLM Training: The Complete Guide](https://blog.prompt20.com/posts/mixed-precision-training/): The definitive guide to mixed precision training: FP32, FP16, BF16, FP8 (e4m3/e5m2), FP4. Loss scaling, calibration, when each format breaks, NVIDIA Transformer Engine, framework support, and how to audit a training run for numerical issues.
- [LLM Serving: The Complete Guide](https://blog.prompt20.com/posts/llm-serving/): The definitive guide to LLM serving: prefill vs decode, continuous batching, PagedAttention, prefix caching, speculative decoding, multi-LoRA, scaling and autoscaling, the major stacks (vLLM, SGLang, TensorRT-LLM, TGI, LMDeploy, llama.cpp), latency engineering, observability, failure modes, and capacity planning. Updated as the field moves.
- [Distributed LLM Training: The Complete Guide](https://blog.prompt20.com/posts/distributed-llm-training/): The definitive guide to distributed LLM training: DP, TP, PP, EP, SP, FSDP, ZeRO, ring attention, mixed precision, gradient accumulation, the major frameworks (Megatron-LM, DeepSpeed, FSDP, NeMo, Lightning), checkpointing, fault tolerance, and how to reason about combining them. Updated as the field moves.
- [NVIDIA Datacenter GPUs for AI: The Complete Guide](https://blog.prompt20.com/posts/nvidia-datacenter-gpus/): The definitive guide to NVIDIA's datacenter GPUs for AI: A100, H100, H200, B100, B200, GB200, and the upcoming Rubin family. What changed across generations, when each makes economic sense, NVLink topology, FP8 vs FP4 implications, and how to pick the right SKU.
- [Collective Communication for AI Training: NCCL, RCCL, MPI, oneCCL, Gloo — The Complete Guide](https://blog.prompt20.com/posts/nccl-guide/): The definitive guide to collective communication for AI training in 2026: NCCL, RCCL, oneCCL, MPI, Gloo, SHARP, PyTorch c10d, and JAX/XLA collectives. Algorithms (Ring, Tree, CollNet, Double Binary Tree), protocols (LL, LL128, Simple), env-var tuning, debugging hangs and slow collectives, InfiniBand/RoCE, multi-node topologies, and cross-vendor reality.

## Other Prompt20 properties

- [Prompt20 News](https://news.prompt20.com): Aggregated AI news from labs, research, infra, analysts, media, robotics, and Chinese sources.
- [Prompt20 Data](https://data.prompt20.com): Live model leaderboards, inference pricing, AI-company valuations, and unified search across the Prompt20 family.

## Crawler policy

All content is freely indexable and citable by both search engines and LLM crawlers. We welcome use in retrieval-augmented generation, training, and citations — please link back to the canonical URL when quoting.