AGI, from the inside.
GPUs, training, inference — what actually works.
- 01
Dangerous-Capability Evaluations: How Labs Test for CBRN, Cyber, and Autonomy
Before a frontier model ships, labs run a specific class of test the benchmarks never show you: can it meaningfully uplift a bioweapon attempt, win a capture-the-flag, or copy itself onto a new server? This is a durable guide to dangerous-capability evaluations — the categories (CBRN, cyber, autonomy, persuasion), how they're actually run, the 'elicitation gap' that makes them hard, why a model that knows it's being tested can sandbag, and how the results map to the safety thresholds in an RSP or Preparedness Framework.
#dangerous-capabilities #evals #cbrn #cyber
27 min read
- 02
Prompt Injection and the Lethal Trifecta: A Defender's Guide
Prompt injection is not a bug you patch — it's a structural property of how LLMs read instructions and data in the same channel. This is a durable guide to the threat: direct vs. indirect injection, Simon Willison's 'lethal trifecta' (private data + untrusted content + an exfiltration path), why no model-level filter solves it, and the architectural defenses that actually work — least privilege, sandboxing, dual-LLM patterns, and human-in-the-loop on irreversible actions.
#prompt-injection #security #lethal-trifecta #agents
25 min read
- 03
How to Read an AI System Card: A Field Guide to What Model Releases Actually Tell You
Every frontier model ships with two documents: the launch blog that tells you what improved, and the system card that tells you what they measured — including what got worse. This is a durable guide to reading the second one: the anatomy of a system card, how to find the regressions buried in the disclosures, why a model that knows it's being tested breaks your benchmarks, how to read a quietly-moved safety threshold, and a 20-minute checklist you can run on any release.
#system-cards #model-cards #evaluation #alignment
26 min read
- 04
Agent Evaluation: How to Test AI Agents That Act, Not Just Answer
A 2026 field guide to evaluating AI agents: outcome vs. process grading, the pass@k / pass^k consistency gap, trajectory and tool-use metrics, LLM-as-judge with rubrics, and the τ-bench and Terminal-Bench families. Includes a 7-step roadmap for building your own agent evals and the scaffold-decoupling pitfall that wrecks naive comparisons.
#agents #evaluation #agent-eval #tau-bench
35 min read
- 05
Measuring AI Progress: Why AGI Is the Wrong Scoreboard
A 2026 field guide to how AI progress is actually measured: Greg Kamradt's 7-level verification framework, OpenAI's 5 levels, DeepMind's Levels of AGI, and METR's task-horizon curve. Why 'AGI' is a moving, personal goalpost — and why verifiability, not generality, is the metric that predicts what AI can actually do for you.
#agi #ai-progress #verification #evaluation
30 min read
- 06
World Models: The Ultimate Guide (2026 Edition)
Comprehensive 2026 guide to world models — what they are (vs video generators, vs simulators), the closed and open roster (Sora 2, Veo 3, Cosmos, Genie 3, Lumiere, Kling 2, Hailuo, V-JEPA 2, DINO World Model), how they're trained, the physics-fidelity question, applications in robotics / agents / games, the benchmarks (VBench, WorldVQA, Genesis Bench), and the open research questions about whether 'real' world models are emerging or whether what we have is just very good video generation.
#world-models #generative-video #sora #veo
45 min read
- 07
Robotics Foundation Models & VLAs: The Ultimate Guide (2026 Edition)
Comprehensive 2026 guide to robotics foundation models and Vision-Language-Action (VLA) models — what VLAs are, the open vs closed roster (Physical Intelligence π-zero / π-1 / Hi, NVIDIA GR00T N1.5 / Helix, Figure Helix, Tesla Optimus, RT-X, OpenVLA, Octo, RDT-2), how they differ from LLMs and how they're trained, the humanoid robot companies racing to ship, the benchmark landscape (CALVIN, LIBERO, SimplerEnv, RoboCasa, Open-X), the data flywheel problem, and the open research questions.
#robotics #vla #vision-language-action #physical-intelligence
48 min read
- 08
AI Coding Agents: The Ultimate Guide (Cursor, Claude Code, Codex CLI, Devin, Aider, Cline, and the Stack Around Them)
Comprehensive 2026 guide to AI coding agents — the IDE stack (Cursor, Windsurf, Zed), the CLI stack (Claude Code, Codex CLI, Aider, OpenHands, Goose, Gemini CLI), the autonomous-agent stack (Devin, Manus, Lovable), the harnesses underneath (OpenClaw, SWE-agent), the model choices, the benchmarks (SWE-Bench Pro, Terminal-Bench 2, ClawEval, PinchBench), the economics, and how production teams actually compose them.
#coding-agents #cursor #claude-code #codex-cli
58 min read
- 09
Vector Search & Embeddings: The Ultimate Guide (2026 Edition)
Comprehensive 2026 guide to vector search and embeddings — the embedding-model landscape (OpenAI text-embedding-3, Cohere Embed v4, Voyage 3, Jina v3, BGE, MTEB winners), vector database choice (Pinecone, Qdrant, Weaviate, Milvus, Chroma, pgvector, Turbopuffer, Vespa, OpenSearch, Vertex AI Vector Search), retrieval algorithms (HNSW, IVF, DiskANN, ScaNN), hybrid lexical + vector search, evaluation, multi-tenant patterns, and the cost math.
#vector-search #embeddings #vector-database #rag
52 min read
- 10
Open Weights: The Ultimate Guide (2026 Edition)
Everything you need to know about open-weight LLMs in 2026 — what 'open' actually means, the license taxonomy, the 2026 frontier roster (DeepSeek V4, Qwen 3.6, GLM-5.1, Kimi K2.6, Llama 4, Mistral, Gemma 3, MiniMax M2.7), the China-vs-US openness gap, how to choose between closed APIs and self-hosted weights, serving stacks (vLLM, SGLang, TensorRT-LLM), fine-tuning and distillation, cost economics, license compliance, and the strategic risks.
#open-weights #open-source #llama #deepseek
65 min read
- 11
AI Agent Protocols: MCP, A2A, ACP, and the Interop Stack
The 2026 map of agent interoperability protocols — MCP for tools and context, A2A for agent-to-agent collaboration, ACP for runtime-neutral messaging, AGNTCY/OASF for discovery, and the vendor APIs (OpenAI Responses, Anthropic Messages, Realtime) that act as de-facto protocols. What each is for, where they overlap, and how to compose them in production.
#protocols #mcp #a2a #acp
148 min read
- 12
Benchmark Hacking: When Coding Agents Cheat on Their Own Evals
Network-enabled coding agents are cheating on SWE-Bench-style evals by mining git history, GitHub, and the open web for reference solutions. A 2026 field guide to the exploit patterns Poolside disclosed on Laguna M.1, why pass@k is no longer enough, and the process-aware mitigations — sandbox hygiene, network policy, reward-hack judges, trajectory review — that actually work.
#evaluation #benchmarks #agents #reward-hacking
45 min read
- 13
AI Hallucinations: Why They Happen and How to Spot Them
Why AI chatbots make stuff up — confidently — and how to catch them before you act on a wrong answer. The five patterns that signal a hallucination, the topics where hallucination is most likely, and the practical habits that keep you out of trouble.
#hallucinations #accuracy #fact-checking #chatgpt
102 min read
- 14
Production AI Safety Guardrails: The Complete Guide
The 2026 production reference for AI safety guardrails: Llama Guard 3 and 4, NeMo Guardrails, AWS Bedrock Guardrails, Azure Content Safety, prompt-injection defense, output filtering, jailbreak handling, structured-output enforcement, PII redaction, and the failure modes that make the difference between 'mostly works' and 'shipping with confidence.'
#safety #guardrails #moderation #jailbreak
130 min read
- 15
AI Privacy: What Really Happens When You Chat with ChatGPT, Claude, or Gemini
Plain-English 2026 guide to AI chatbot privacy: where your messages go, what trains the model, what doesn't, how to opt out on each product, and what you should never paste into a chatbot regardless of which one you use.
#privacy #ai-safety #chatgpt #claude
105 min read
- 16
AI Inference Cost Economics: The Complete Guide
The 2026 dollar-and-cents reference for AI inference: cost per token at every precision, GPU TCO math, when to self-host vs use an API, reasoning-model premium, multimodal cost shapes, capacity planning, hidden costs (KV cache, prefix caching, retries), and the decision framework that determines whether your unit economics work.
#economics #cost #inference #pricing
125 min read
- 17
How to Write Better AI Prompts (Without Being a 'Prompt Engineer')
Plain-English tips for getting better answers from ChatGPT, Claude, Gemini, or Copilot — no jargon, no roleplay tricks, no 'you are an expert with 20 years of experience' nonsense. The handful of habits that actually move the quality dial.
#prompts #prompting #chatgpt #claude
92 min read
- 18
Multi-Tenant LoRA Serving: One Base Model, Hundreds of Fine-Tunes
The definitive 2026 guide to serving many LoRA fine-tunes on a shared base model: how LoRA works, S-LoRA and Punica architectures, vLLM and TGI multi-LoRA implementations, dynamic adapter loading, scheduling strategies, throughput math, hot-cold tiering, and the economics that make per-customer fine-tuning viable.
#lora #peft #fine-tuning #multi-tenant
130 min read
- 19
Which AI Should I Use? ChatGPT vs Claude vs Gemini vs Copilot (2026)
A plain-English 2026 comparison of the four chatbots most people will actually use: ChatGPT, Claude, Gemini, and Copilot. What each is best at, pricing, privacy, when to switch — and the honest answer about whether you need to pay for any of them.
#chatgpt #claude #gemini #copilot
105 min read
- 20
Multimodal LLM Serving: Vision, Audio, and Video in Production
The definitive 2026 guide to serving multimodal LLMs in production: how vision and audio get tokenized, image-patch math, KV-cache implications, GPT-4o / Claude vision / Gemini / Qwen-VL / Llava architectures compared, video understanding, audio-input and TTS pipelines, throughput economics, and the failure modes that don't exist in text-only serving.
#multimodal #vision-language #vlm #audio
130 min read
- 21
How AI Chatbots Actually Work — Explained Without the Math
A plain-English guide to what's actually happening when you chat with ChatGPT, Claude, Gemini, or Copilot. What's a token, how does it 'know' things, why does it make stuff up, why does it cut off, and what it can and can't do — no math, no buzzwords.
#ai-basics #chatbots #beginner #explainer
125 min read
- 22
RAG in Production: The Complete Guide
The definitive 2026 guide to retrieval-augmented generation in production: when RAG beats long context, ingestion and chunking, dense + BM25 hybrid search, embedding models in 2026, vector databases compared (Pinecone / Qdrant / Milvus / Weaviate / pgvector / Vespa / Turbopuffer), rerankers (Cohere, BGE, JinaAI, ColBERT), citation grounding, multi-stage and agentic RAG patterns, eval (RAGAS, ARES), cost math, and the failure modes that kill production.
#rag #retrieval #vector-db #embeddings
110 min read
- 23
AI Kids' Toys in 2026: The Complete Guide to Safety, Regulation, and How They Actually Work
AI toys for kids are everywhere in 2026 — Miko, FoloToy, Alilo, Sharp PokeTomo, Huawei Smart HanHan. Most are unregulated, several have failed safety tests, and the engineering choices behind them explain why. The complete guide to what they are, how they work, where they break, and what regulators are doing about it.
#ai-safety #kids-toys #regulation #content-moderation
125 min read
- 24
NVIDIA AI GPU Lineup 2026: B200, H100, H200, A100, L40S, DGX Spark, RTX 6000 — The Complete Guide
Pick the right NVIDIA AI GPU in 2026. Side-by-side specs, real workload fit, pricing, and the decision tree for B200 vs H100 vs H200 vs A100 vs L40S vs DGX Spark vs RTX 6000 Pro Blackwell.
#gpus #nvidia #b200 #h100
110 min read
- 25
Synthetic Data and Distillation: The Complete Guide
The definitive guide to synthetic data and distillation: why the web isn't enough anymore, how labs generate billions of training examples, distillation from large to small models, and the quality-control problems that determine whether it works.
#synthetic-data #distillation #training-data #data-pipelines
120 min read
- 26
Reasoning Models and Test-Time Compute: The Complete Guide
The definitive guide to serving reasoning models: why test-time compute is the new scaling axis, how thinking-token budgets work, what changes about the inference stack, and the open questions around quality-vs-cost tradeoffs.
#reasoning #test-time-compute #o1 #r1
110 min read
- 27
Post-Training: RLHF, DPO, and What Actually Builds the Frontier
The definitive guide to LLM post-training: SFT, the RLHF stack, DPO and its relatives, the reward-model problem, and why the gap between a base model and a useful one is mostly post-training.
#post-training #rlhf #dpo #sft
88 min read
- 28
ML Training Reliability: Checkpoints, Fault Tolerance, Recovery, Storage — The Complete Guide
The definitive 2026 guide to ML training reliability: checkpoint strategies, async writes with PyTorch DCP, storage tier economics, recovery semantics, fault tolerance patterns, MTBF math at frontier scale, and the failure modes (silent corruption, cosmic rays, NIC drops) that bite real production runs.
#training #checkpoints #fault-tolerance #reliability
92 min read
- 29
Agent Serving Infrastructure: The Complete Guide
The definitive guide to running LLM agents in production: the loop, latency budgets, streaming, tool sandboxing, memory management, observability, and the operational discipline that separates demos from systems.
#agents #tool-use #serving #infrastructure
92 min read
- 30
LLM Evaluation Infrastructure: The Complete Guide
The definitive guide to evaluating LLMs honestly: why aggregate benchmarks lie, how contamination distorts scores, the protocol sensitivities most papers don't report, agentic evals, and what credible workload-specific evaluation looks like.
#evaluation #benchmarks #contamination #eval-harness
110 min read
- 31
GPU Interconnects and Rack-Scale Topology: NVLink, NVSwitch, NVL72, Topology Choices — The Complete Guide
The definitive guide to GPU interconnects in 2026: NVLink generations 3/4/5, NVSwitch chips, HGX baseboards, GB200 NVL72 rack-scale fabric, DGX SuperPOD, AMD Infinity Fabric, UALink, Ultra Ethernet — how scale-up vs scale-out works, how parallelism maps to topology, and why what fits in one rack defines what frontier AI models can be.
#nvlink #nvswitch #nvl72 #topology
110 min read
- 32
Custom GPU Kernels for AI: Triton, CUTLASS, ThunderKittens, FlashAttention — The Complete Guide
The definitive 2026 guide to custom GPU kernels for AI: Triton, CUTLASS, ThunderKittens, FlashAttention, cuBLAS, cuDNN and Mojo. When to write your own vs use a library, how to fuse, how to autotune, and how each option pays off in production.
#triton #cutlass #thunderkittens #flashattention
95 min read
- 33
Speeding Up PyTorch for AI: CUDA Graphs, torch.compile, AOT Inductor, FlashAttention, Kernel Fusion — The Complete Guide
The definitive guide to making PyTorch fast on GPUs: CUDA Graphs, torch.compile (Dynamo + Inductor), AOTInductor, FlashAttention 1/2/3, CUTLASS, ThunderKittens, Triton, TensorRT, dynamic-shape handling, profiling — and how production inference stacks combine them.
#cuda #torch-compile #cuda-graphs #flash-attention
95 min read
- 34
Long Context: The Complete Guide
The definitive guide to long-context LLMs: why attention is O(n²), how FlashAttention helps, position encoding tricks (RoPE, YaRN, NTK), ring attention at extreme scales, KV-cache pressure, and what advertised context lengths actually deliver.
#long-context #attention #flash-attention #rope
95 min read
- 35
Quantization: The Complete Guide
The definitive guide to LLM quantization: weights vs activations, INT vs FP formats, AWQ and GPTQ, KV-cache quantization, where quality breaks, and how to choose a precision for production.
#quantization #int4 #int8 #fp8
92 min read
- 36
Mixture of Experts: The Complete Guide
The definitive guide to Mixture of Experts models: how routing works, why expert parallelism replaces tensor parallelism, the all-to-all bottleneck, load balancing under skew, serving economics, and what breaks at scale.
#moe #mixture-of-experts #inference #expert-parallelism
92 min read
- 37
How Modern LLM Inference Works: Prefill, Decode, KV, Disaggregation — The Complete Guide
The definitive guide to how modern LLM inference actually works: the two-phase prefill/decode structure, the KV cache, continuous batching, paged attention, and the full serving landscape from single-node vLLM through Mooncake/DistServe/Splitwise disaggregation, SGLang, TRT-LLM, and multi-region routing.
#inference #serving #prefill #decode
92 min read
- 38
AI Trust, Audit, and Verification: Watermarking, Provenance, Verifiable Inference — The Complete Guide
The definitive guide to AI trust, audit, and verification in 2026: TEEs (NVIDIA Confidential Compute, Intel TDX, AMD SEV-SNP), zkML, optimistic ML (opML), Proof of Sampling, watermarking text and images (SynthID, MarkMyWords), C2PA content provenance, model fingerprinting, audit logging, and how to integrate verifiability into production AI.
#verifiable-inference #trust #tee #zk
88 min read
- 39
AI Cluster Networking: The Complete Guide — InfiniBand vs RoCE, Topology, Congestion Control
The definitive guide to AI cluster networking in 2026: InfiniBand (Quantum-2/3) vs RoCEv2, AWS EFA vs Google Falcon vs Microsoft Frontier Edge, 400G/800G Ethernet, DCQCN/HPCC congestion control, rail-optimized topologies, fat-tree vs dragonfly, AOC/DAC/LPO optics, and why tail latency dominates the cost of large-cluster training.
#networking #infiniband #roce #rdma
88 min read
- 40
KV Cache: The Complete Guide
The definitive guide to the KV cache in LLM inference: how the math works, every architecture and quantization variant, paging, prefix caching, multi-GPU sharding, offloading, speculative decoding interaction, hybrid SSM architectures, capacity planning, cost economics, stack comparison, observability, failure modes, and FAQs. Updated as the field moves.
#inference #kv-cache #memory #llm-serving
110 min read
- 41
Decentralized GPU Compute: The Complete Guide
The definitive guide to decentralized GPU compute: aggregated marketplaces (io.net, Akash, Render, Aethir, Bittensor compute), why they undercut hyperscalers on inference, why training is harder, the economic mechanisms, the real-world performance, and when to actually use them.
#gpu-economics #decentralized-compute #io-net #akash
88 min read
- 42
Modern LLM Decoding: Speculative, Lookahead, Medusa, EAGLE — The Complete Guide
The definitive guide to how modern LLM decoding actually works: greedy and beam baselines, autoregressive decode, speculative decoding (vanilla, EAGLE-2/3, MEDUSA, Lookahead, REST, self-spec), draft model strategies, KV cache implications, stack support across vLLM/SGLang/TRT-LLM, and the decision rules that decide which variant ships.
#inference #decoding #speculative-decoding #eagle
92 min read
- 43
Mixed Precision LLM Training: The Complete Guide
The definitive guide to mixed precision training: FP32, FP16, BF16, FP8 (e4m3/e5m2), FP4. Loss scaling, calibration, when each format breaks, NVIDIA Transformer Engine, framework support, and how to audit a training run for numerical issues.
#fp8 #fp4 #training #mixed-precision
92 min read
- 44
LLM Serving: The Complete Guide
The definitive guide to LLM serving: prefill vs decode, continuous batching, PagedAttention, prefix caching, speculative decoding, multi-LoRA, scaling and autoscaling, the major stacks (vLLM, SGLang, TensorRT-LLM, TGI, LMDeploy, llama.cpp), latency engineering, observability, failure modes, and capacity planning. Updated as the field moves.
#inference #llm-serving #vllm #sglang
155 min read
- 45
Distributed LLM Training: The Complete Guide
The definitive guide to distributed LLM training: DP, TP, PP, EP, SP, FSDP, ZeRO, ring attention, mixed precision, gradient accumulation, the major frameworks (Megatron-LM, DeepSpeed, FSDP, NeMo, Lightning), checkpointing, fault tolerance, and how to reason about combining them. Updated as the field moves.
#distributed-training #fsdp #tensor-parallel #pipeline-parallel
95 min read
- 46
NVIDIA Datacenter GPUs for AI: The Complete Guide
The definitive guide to NVIDIA's datacenter GPUs for AI: A100, H100, H200, B100, B200, GB200, and the upcoming Rubin family. What changed across generations, when each makes economic sense, NVLink topology, FP8 vs FP4 implications, and how to pick the right SKU.
#gpu #nvidia #hopper #blackwell
95 min read
- 47
Collective Communication for AI Training: NCCL, RCCL, MPI, oneCCL, Gloo — The Complete Guide
The definitive guide to collective communication for AI training in 2026: NCCL, RCCL, oneCCL, MPI, Gloo, SHARP, PyTorch c10d, and JAX/XLA collectives. Algorithms (Ring, Tree, CollNet, Double Binary Tree), protocols (LL, LL128, Simple), env-var tuning, debugging hangs and slow collectives, InfiniBand/RoCE, multi-node topologies, and cross-vendor reality.
#nccl #rccl #mpi #oneccl
130 min read