
Post-Training: RLHF, DPO, and What Actually Builds the Frontier

The definitive guide to LLM post-training: SFT, the RLHF stack, DPO and its relatives, the reward-model problem, and why the gap between a base model and a useful one is mostly post-training.

By Prompt20 Editorial · 88 min read

The base model from pretraining is fluent and bad at being useful. It will complete prompts plausibly but won't follow instructions, refuse appropriately, or do the things you actually want. Closing that gap — turning a pretrained model into one a user wants to talk to — is post-training, and it's roughly where most of the field's recent capability gains have come from.

The take: pretraining gets the press; post-training does the work. The capability difference between GPT-3 (2020) and a well-aligned modern chat model is mostly post-training, not parameter count. Most teams underinvest here, treating it as a fine-tuning afterthought. The labs that win are the ones that treat post-training as a multi-stage system with its own infrastructure, evaluation, and discipline.

The mental model worth carrying through the rest of this guide: a frontier 2026 post-training run is not a single algorithm but a directed graph of six to ten stages — SFT into preference learning into reasoning RL into a final SFT pass with replay from earlier stages, with safety post-training and constitutional anchors layered on top. SFT, RLHF/PPO, DPO/IPO/KTO/ORPO, GRPO, RLAIF, Constitutional AI, iterated distillation, RLVR — these aren't competing alternatives, they're different tools applied at different stages, sometimes simultaneously. Pretraining is a long single run; post-training is a portfolio of short runs, and the bottleneck is iteration speed, not raw FLOPs.

A second frame: post-training is now the dominant axis along which open-weight models close the gap to closed ones. Llama 3.x, Qwen 2.5/3, DeepSeek-V3/R1, and Tülu 3 are built on base models, some not even frontier-class on raw pretraining, yet they approach or match closed frontier models after careful post-training. Pretraining is still the long pole for the highest capability; for most useful workloads, the post-training delta dominates.

Table of contents

  1. Key takeaways
  2. Mental model: post-training in one minute
  3. The post-training landscape in 2026
  4. Why post-training exists
  5. Supervised fine-tuning
  6. The RLHF stack
  7. Reward models and the labeling problem
  8. Reward model training in 2026
  9. PPO at language-model scale
  10. DPO and its relatives
  11. PPO vs DPO vs GRPO — when each wins
  12. Iterative and online preference learning
  13. Iterated post-training: rejection sampling + SFT loop
  14. Constitutional AI and AI feedback
  15. Reasoning fine-tuning and process supervision
  16. Reasoning-specific post-training: R1-Zero, RLVR, process rewards
  17. Mixing stages and ablation discipline
  18. Infrastructure differences from pretraining
  19. Evaluation during post-training
  20. Open problems
  21. Cost and compute budgets for post-training
  22. Safety post-training and red-teaming
  23. Open-source tooling: TRL, OpenRLHF, verl, Axolotl
  24. Common failure modes and recovery
  25. GRPO deep dive: the math, the memory, the gotchas
  26. The preference-data zoo: UltraFeedback, HH-RLHF, Nectar, and friends
  27. Reward hacking taxonomy and mitigations
  28. KL coefficient tuning: worked examples
  29. Verifiable rewards: math, code, and beyond
  30. Process reward models: PRM800K, ProcessBench, and step-level supervision
  31. Safety post-training in depth: Constitutional, deliberative, Llama Guard
  32. Post-training compute economics by stage
  33. PPO vs DPO vs GRPO vs SimPO vs ORPO: full comparison
  34. Mode collapse, length bias, sycophancy: failure-mode catalog
  35. The 2026 post-training playbook
  36. The bottom line
  37. FAQ
  38. Glossary
  39. References

Key takeaways

  • SFT (supervised fine-tuning) is the first stage. Curated instruction-response pairs. Cheap, fast, and the largest single quality jump from a base model.
  • RLHF trains a reward model on human preferences, then optimizes the policy against it with PPO. Powerful, expensive, finicky.
  • DPO and relatives sidestep the reward model and PPO loop by formulating preference learning as a direct loss on the policy. Cheaper, often competitive.
  • Reward models are the bottleneck. Their quality, robustness, and over-optimization behavior largely determine RLHF outcomes.
  • Reasoning post-training (process supervision, verifiable rewards) is the active frontier and the engine behind the 2024-2026 reasoning-model wave.
  • Infra: post-training shares pretraining's distributed-training stack but adds inference for reward models, preference data pipelines, and human/AI labeling infrastructure. It's a multi-system problem.
  • Recommendation: invest in SFT and DPO before chasing full PPO-based RLHF. The marginal quality gain from PPO is real but small relative to the engineering cost.

Quick comparison: post-training methods

| Method | Data needs | Compute cost | Stability | Best for |
| --- | --- | --- | --- | --- |
| SFT | Curated (prompt, response) pairs | Low | Very high | Format, style, refusals; first stage |
| RLHF (PPO) | Preference pairs + reward model | Very high | Low (finicky) | Frontier alignment with large label spend |
| DPO | Preference pairs only | Low-medium | High | Most teams; competitive with PPO |
| IPO / KTO | Preferences (KTO needs only binary) | Low-medium | High | Noisy or unpaired feedback data |
| RLAIF / CAI | AI-generated preferences + rubric | Medium | Medium | Scaling labels beyond human throughput |
| GRPO | Verifiable rewards (math, code) | High | Medium | Reasoning models with checkable outputs |
| Rejection-sampling FT | Best-of-N from a reward model | Medium | Very high | Cheap upgrade over plain SFT |

Mental model: post-training in one minute

The trade-off has a name: the alignment tax. A pretrained base model is fluent but unhelpful: it completes prompts plausibly, ignores instructions, refuses nothing, and shifts register at random. Post-training makes it helpful, but every stage of helpfulness shaping (RLHF, DPO, safety SFT, refusal training) trades a small slice of raw capability for a much larger slice of usefulness. That capability cost is the tax, and the job is to keep it small while extracting the usefulness gain.

The cleanest analogy is a preference compiler: SFT teaches the model the target language (instruction-following format); the reward model defines the spec (what humans actually like); PPO/DPO/GRPO is the optimizer that compiles policy weights against that spec. Each stage either learns the target distribution (SFT, rejection-sampling FT) or shapes the policy toward it (preference learning, RL).

| Aspect | Base model only | Base + full post-training |
| --- | --- | --- |
| Instruction following | Inconsistent | Reliable |
| Refusals on unsafe prompts | Rare | Calibrated |
| Style and format | Drifts | Stable |
| Helpfulness on chat | Low | High |
| Raw capability on probing tasks | Slight edge | Small tax |
| Production deployable | No | Yes |

The production one-liner depends on which trade-off you want. With trl:

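from trl import DPOConfig, DPOTrainer, PPOConfig, PPOTrainer

# Schematic: policy, ref, rm, value_model, tokenizer, pref_pairs, and prompts
# are assumed to be loaded already. Argument names follow recent trl releases
# and have changed between versions, so check the docs for the one you run.

# DPO: pairs only, no reward model, no rollouts
dpo = DPOTrainer(model=policy, ref_model=ref,
                 args=DPOConfig(output_dir="dpo-out", beta=0.1),
                 train_dataset=pref_pairs, processing_class=tokenizer)
dpo.train()

# PPO: full RLHF, i.e. rollouts + reward model + KL penalty.
# train() runs the loop: generate completions, score them with rm,
# subtract the KL penalty against ref, then update policy and value model.
ppo = PPOTrainer(args=PPOConfig(output_dir="ppo-out", kl_coef=0.05),
                 model=policy, ref_model=ref, reward_model=rm,
                 value_model=value_model, train_dataset=prompts,
                 processing_class=tokenizer)
ppo.train()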

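The sticky number: DPO is commonly reported to land within about 0.3 MT-Bench points of PPO at roughly 10× less compute. Treat that as a rule of thumb from later replications rather than a headline result of the original paper (Rafailov et al., 2023), which evaluated on sentiment control, summarization, and single-turn dialogue. That gap is why most teams should start with DPO and only invest in PPO when DPO plateaus on workload-specific evals.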


The post-training landscape in 2026

The post-training space has bloomed into a zoo of methods. Most teams encounter them as a confusing list of acronyms. The fastest way to make sense of it is to organize them by what objective they are optimizing and what signal they consume.

The method zoo, organized

Supervised stage (imitation).

  • SFT — imitate curated (prompt, response) examples. Cross-entropy on next-token prediction. The first stage of every modern post-training pipeline.
  • Rejection-sampling fine-tuning (RFT, "RSFT") — generate N candidates per prompt with the current model, keep the best (by reward model or verifier), and SFT on the survivors. The simplest "RL-flavored" method. Iterated RSFT is the workhorse of frontier post-training in 2026 because it composes cleanly with the rest of the SFT infrastructure.

Reward-model RL (RLHF family).

  • PPO — the original RLHF algorithm. Schulman et al., 2017 (arXiv:1707.06347). Used in InstructGPT (Ouyang et al., 2022 — arXiv:2203.02155) and most pre-2024 RLHF pipelines.
  • GRPO — Group Relative Policy Optimization. DeepSeek's simplification of PPO that removes the critic by using group-relative advantages over multiple rollouts per prompt. Shao et al., 2024 (arXiv:2402.03300). Now the dominant RL algorithm in published reasoning recipes.
  • REINFORCE / RLOO / ReMax — even simpler variance-reduction variants. Used in some open-source pipelines.
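
To make the "group-relative advantage" idea concrete, here is a minimal sketch of the normalization step only, not a full GRPO trainer. The function name, the `eps` constant, and the choice of population standard deviation are illustrative assumptions; implementations differ on those details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against the other rollouts for the same prompt.

    No learned critic is involved: the group mean plays the role of the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt, scored 1.0 when the final answer verified.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

If every rollout in a group gets the same reward, all advantages are zero and the prompt contributes no gradient, which is one reason prompt difficulty matters for this family of methods.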

Direct preference optimization (reward-model-free).

  • DPO — Direct Preference Optimization. Rafailov et al., 2023 (arXiv:2305.18290). Reformulates preference learning as a closed-form loss on the policy. No reward model, no rollout loop.
  • IPO — Identity Preference Optimization. Azar et al., 2023 (arXiv:2310.12036). Addresses DPO's tendency to overfit on deterministic preferences.
  • KTO — Kahneman-Tversky Optimization. Ethayarajh et al., 2024 (arXiv:2402.01306). Uses unpaired binary feedback ("this response is good" or "bad") rather than ranked pairs — much easier to collect at scale.
  • ORPO — Odds Ratio Preference Optimization. Hong et al., 2024 (arXiv:2403.07691). Folds SFT and preference learning into a single loss; skips the reference model entirely.
  • SimPO, CPO, sDPO, Iterative DPO — a long tail of refinements addressing specific DPO failure modes.

AI-feedback variants.

  • RLAIF — Reinforcement Learning from AI Feedback. Lee et al., 2023 (arXiv:2309.00267). Replace human labelers with a model judge.
  • Constitutional AI — Bai et al., 2022 (arXiv:2212.08073). A specific RLAIF recipe with an explicit written constitution governing the judge.
  • Self-Rewarding — Yuan et al., 2024 (arXiv:2401.10020). The model judges its own outputs and uses those judgments as the reward signal for its own training. A blurring of generator and reward model.

Verifiable-reward RL (the reasoning track).

  • RLVR — Reinforcement Learning with Verifiable Rewards. The umbrella term for skipping the reward model entirely and using ground-truth checks (test suites, equation solvers, formal verifiers) as the reward signal. Best exemplified by DeepSeek-R1 (DeepSeek-AI, 2025 — arXiv:2501.12948).
  • R1-Zero-style — pure RL from a base model with no SFT cold start. Shows that long-chain reasoning behavior can emerge from RLVR alone. Practically, almost everyone still does a small SFT cold start because it stabilizes early training.
  • Process Reward Models (PRMs) — Lightman et al., 2023 ("Let's Verify Step by Step", arXiv:2305.20050). Reward each reasoning step, not just the final answer. Best when outcome supervision is too sparse.
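
A verifiable reward can be as small as a string check. The sketch below assumes a math-style task where the last number in the completion is taken as the final answer. Both helper names and that extraction rule are hypothetical simplifications; real pipelines use stricter answer formats and symbolic equivalence checks.

```python
import re


def extract_final_answer(completion):
    """Return the last number in the completion as a string, or None if absent."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return matches[-1] if matches else None


def verifiable_reward(completion, reference):
    """1.0 if the extracted answer matches the reference numerically, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if abs(float(answer) - float(reference)) < 1e-9 else 0.0


print(verifiable_reward("Adding the parts, the total is 1,234.", "1234"))
print(verifiable_reward("I think the answer is 7", "8"))
```

The reward is binary and has no learned parameters, so there is no reward model for the policy to exploit; the weak point moves to the answer-extraction rule instead.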

Self-improvement and distillation as post-training.

  • Iterated Distillation — generate from a strong model, filter, and train a (possibly smaller) student on the survivors. Often the cheapest way to close a gap. Tightly intertwined with synthetic data and distillation.
  • Self-play and self-rewarding loops — generator and judge are the same model family; data flywheel without humans in the inner loop.

What the frontier labs actually do

Public information is incomplete and changes fast, but the rough shape of each lab's post-training stack as of 2026:

  • OpenAI. Started the field with InstructGPT (SFT + PPO RLHF). The o-series reasoning models layer RLVR on top of a heavily post-trained chat model, with proprietary process-style supervision. Heavy investment in human red-team data, AI-feedback for scale, and capability-specific fine-tunes.
  • Anthropic. Constitutional AI is the public signature: a written constitution drives an RLAIF judge. Recent Claude generations layer reasoning RL on top. Strong emphasis on safety post-training as a first-class stage rather than a final tweak.
  • Google DeepMind. Gemini's post-training is the least publicly documented of the big four, but visible signals point to large-scale RLHF, AI feedback, and reasoning-specific RL with verifiers — likely with internal infrastructure inherited from the AlphaGo/AlphaZero lineage.
  • DeepSeek. The most transparent of the frontier labs in 2024–2025. Public recipes for V3 and R1 describe GRPO, verifiable rewards on math and code, an R1-Zero ablation showing pure-RL emergence, and distillation from R1 into a family of smaller open-weight models.
  • Meta (Llama). Public Llama 3 recipe describes a multi-stage pipeline: SFT → rejection-sampling FT → DPO → iterated DPO, with heavy investment in instruction data quality and AI-feedback judges.
  • Allen Institute (Tülu 3). The most thoroughly documented open recipe of 2024 (Lambert et al., 2024 — arXiv:2411.15124): SFT → DPO → RLVR. Worth reading end-to-end as a reference implementation.

Reward model variants

The "reward model" abstraction has fragmented:

  • Bradley-Terry pairwise RMs — the classical reward model, trained on (chosen, rejected) pairs with a logistic loss. Still the default for RLHF.
  • Pointwise / regression RMs — score absolute quality on a scale. Used when labels are absolute, not pairwise.
  • Generative reward models — an LLM that writes a critique and a score. Better calibrated on complex queries; slower at inference time.
  • Process Reward Models (PRMs) — score intermediate reasoning steps.
  • Verifiable reward "models" — not models at all: code executors, symbolic math checkers, theorem-prover oracles. Zero approximation error inside their domain.
  • Reward model ensembles — multiple RMs combined to reduce reward hacking, often with uncertainty estimates that gate optimization.

The frontier trend is to use verifiable rewards wherever possible (math, code, structured tasks), fall back to PRMs for step-level reasoning supervision, and use generative reward models or constitutional judges for everything subjective. The classical Bradley-Terry RM remains useful but is increasingly one signal among several rather than the load-bearing component.


Why post-training exists

A pretrained language model is trained to predict the next token on web text. It is good at producing plausible text continuations. Plausibility is not usefulness.

Concretely, a base model will:

  • Continue a question with another question (the most "likely" continuation of "How do I cook rice?" on the internet may be more questions or a list of recipes, not a direct answer).
  • Refuse nothing — including content the operator doesn't want generated.
  • Mirror the style of whatever surrounding text exists.
  • Sometimes generate the answer; sometimes not.

Post-training shapes the model into something usable: instruction-following, capable of refusal, aware of the conversational frame, calibrated about uncertainty, aligned with operator intent.

Roughly: pretraining gives capability; post-training gives interface. The compute profile is also very different — SFT and DPO often fit on a single node with mixed-precision training, while pretraining requires multi-rack clusters.

The interface matters enormously for end-user value. Most users will never see the capability of a model that has a bad interface. This is why post-training is where so much of the recent practical progress sits.


Supervised fine-tuning

The first stage of post-training is supervised fine-tuning (SFT). Same training procedure as pretraining (cross-entropy loss on next-token prediction), different data.

What the data looks like

Pairs of (prompt, response). Curated, typically by humans, sometimes by other models. Examples:

prompt: "How do I cook white rice?"
response: "1. Rinse the rice... 2. Add water in a 1:2 ratio... 3. Bring to a boil..."

The model is trained to produce the response given the prompt. After enough examples, it learns the general pattern: a prompt should be answered.

What SFT data looks like at scale

Modern SFT datasets contain hundreds of thousands to millions of examples, spanning:

  • Instruction-following ("Write an email asking for a refund")
  • Conversational turn-taking
  • Refusal templates
  • Structured outputs (JSON, code, lists)
  • Reasoning patterns (chain-of-thought traces)
  • Domain-specific styles (legal, medical, coding)

The composition of the SFT mix is one of the more closely-guarded parts of any lab's recipe. The specific mix and ordering matter substantially. A growing share of that mix is synthetic data and distillation traces generated by larger teacher models, and the resulting student is typically served behind a reasoning-model serving stack and benchmarked with dedicated eval infrastructure.

Quality matters more than quantity

The dominant finding across the literature: a smaller, higher-quality SFT dataset usually beats a larger, lower-quality one. The "LIMA: Less Is More for Alignment" paper (Zhou et al., 2023) made this concrete with ~1000 carefully-curated examples performing competitively with much larger datasets.

The practical implication: invest in data curation before chasing data volume.

What SFT can and can't do

Can: teach formats, styles, refusal patterns, basic instruction following. Cover the main use cases.

Can't: optimize against subtle quality differences a human prefers. The training signal is just "match this response," which doesn't capture why one response is better than another.

For the harder quality work, you need preference learning.

SFT hyperparameter cheat sheet

The hyperparameters that matter most in SFT, and the values that tend to work in 2026 for 7B–70B-class models:

| Hyperparameter | Typical range | Notes |
| --- | --- | --- |
| Learning rate | 1e-6 to 5e-6 (full-param), 1e-4 to 3e-4 (LoRA) | Smaller is safer at larger model sizes. Linear or cosine warmup over 3–10% of steps. |
| Batch size (tokens) | 1M–4M tokens per step | Large enough that gradient noise is dominated by data diversity, not single-example artifacts. |
| Epochs | 1–3 | More than 3 epochs overfits and degrades held-out quality. |
| Sequence length | 4K–32K | Match the deployment context. Longer sequences need long-context attention tricks. |
| Loss masking | Mask the prompt | Train only on response tokens. Otherwise the model learns to predict the user's words too. |
| Packing | Sample-packed with attention masks | Pack multiple short examples into one sequence to amortize the padding waste. |

These ranges are not universal — every base model and every data mix has its own sweet spot — but they are the right starting point for a first SFT pass, and most teams burn weeks rediscovering them from scratch.
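
The loss-masking row is the one most often gotten wrong, so a small sketch helps. It assumes token IDs are already available as plain lists; `build_sft_example` is a hypothetical helper, and the token IDs are made up. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, which is why it is the conventional mask value.

```python
IGNORE_INDEX = -100  # default ignore_index for PyTorch's cross-entropy loss


def build_sft_example(prompt_ids, response_ids):
    """Concatenate prompt and response, supervising only the response tokens."""
    input_ids = list(prompt_ids) + list(response_ids)
    # Prompt positions get IGNORE_INDEX so they contribute nothing to the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return {"input_ids": input_ids, "labels": labels}


example = build_sft_example(prompt_ids=[101, 7, 8, 9], response_ids=[42, 43, 2])
print(example["labels"])
```

The model still attends to the prompt tokens as context; masking only removes them from the loss.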

How SFT differs from continued pretraining

A subtle distinction worth being explicit about. Continued pretraining feeds long documents at the same loss and learning rate as the original pretraining run, with the goal of injecting new knowledge or shifting the model's data distribution. SFT feeds (prompt, response) pairs with the prompt masked, at a much lower learning rate, with the goal of teaching the model an interface. The two are easy to confuse because both look like "fine-tune on more text," but they do different things and require different data, learning rates, and evaluation. Most "fine-tuning failed" stories trace back to using continued-pretraining hyperparameters on SFT data, or vice versa.


The RLHF stack

Reinforcement Learning from Human Feedback is the canonical recipe that took GPT-3 to InstructGPT to ChatGPT. The original three-stage pipeline:

  1. SFT: as described above.
  2. Reward model training: collect pairs of (prompt, response_A, response_B) with human preferences (A > B or B > A). Train a reward model to predict the preference.
  3. PPO: optimize the policy (the language model) to maximize the reward model's score, regularized to stay close to the SFT model.

Stage 2 in detail

Humans look at two model responses to the same prompt and pick which is better. This is much easier than writing a perfect response from scratch — comparisons are usually more reliable than absolute ratings.

Preference data is then fed to a reward model — typically initialized from the SFT model with the language-modeling head replaced by a scalar prediction head. The reward model learns to score (prompt, response) pairs.
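
The usual training objective for that scalar head is the Bradley-Terry pairwise loss, -log sigmoid(score_chosen - score_rejected). A minimal sketch for a single pair follows, written on plain floats rather than tensors; the function name is illustrative.

```python
import math


def bradley_terry_loss(score_chosen, score_rejected):
    """Pairwise loss -log(sigmoid(s_chosen - s_rejected)), computed stably."""
    margin = score_chosen - score_rejected
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


print(bradley_terry_loss(0.0, 0.0))  # equal scores: the loss is log(2)
print(bradley_terry_loss(2.0, 0.0))  # chosen scored higher: smaller loss
```

Only the difference between the two scores enters the loss, so the absolute scale of the reward is not pinned down by training. That is one reason rewards are often normalized before being handed to an RL algorithm.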

Stage 3 in detail

The policy (initially the SFT model) generates responses. The reward model scores them. PPO updates the policy to increase expected reward, with a KL-divergence penalty that keeps the policy from drifting too far from the SFT model.

The KL penalty is crucial: without it, the policy can drift into regions where the reward model is mis-specified and collect high scores for outputs humans would not actually prefer (reward hacking).
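
One common way to apply the penalty is to fold it into the per-token reward: every token pays a cost proportional to the policy-versus-reference log-probability gap, and the reward model's score is added at the final token. The sketch below assumes that convention and a non-empty response; the function and argument names are illustrative.

```python
def kl_shaped_rewards(rm_score, policy_logprobs, ref_logprobs, kl_coef=0.05):
    """Per-token rewards: KL penalty at every token, RM score on the last one.

    policy_logprobs and ref_logprobs hold the log-probability each model
    assigned to the tokens that were actually sampled (same length, non-empty).
    """
    rewards = [
        -kl_coef * (lp_policy - lp_ref)
        for lp_policy, lp_ref in zip(policy_logprobs, ref_logprobs)
    ]
    rewards[-1] += rm_score
    return rewards


# The policy is more confident than the reference on both tokens, so it pays.
print(kl_shaped_rewards(1.0, [-1.0, -1.0], [-2.0, -2.0], kl_coef=0.1))
```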

Why this works

The policy gets feedback on the quality of its actual outputs, not just on matching reference responses. It can learn preferences too subtle to capture in SFT data (calibration, nuance, refusal precision).

Why it's hard

PPO is finicky. The reward model is approximate. The KL penalty must be tuned. The whole loop is computationally expensive — multiple forward passes per training step across policy, reward model, and reference model.

These problems are part of why DPO and its relatives emerged.

A worked PPO example

To make the moving parts concrete: a 70B PPO run with batch size 512 prompts, rollout length 1024 tokens, KL coefficient 0.05, learning rate 1e-6, clip range 0.2. Each PPO step requires (a) generating 512×1024 = 524K rollout tokens with the policy ($8 of H100 time at typical throughput), (b) scoring all 512 responses with a reward model ($0.50 if the RM is 7B), (c) computing reference-model logprobs over the same 524K tokens ($3), and (d) a training-step forward and backward pass on the policy and critic ($15). The itemized total is about $26.50, so a single step is on the order of $25–$50 in pure compute, and a full run of 10K steps lands at $250K–$500K. On these numbers the policy update is the largest line item, but the rollout is the largest inference cost and often the easiest to cut, which is why making rollouts cheap (via vLLM-style continuous batching, prefix caching across same-prompt rollouts, and speculative decoding) is the highest-leverage optimization in any production PPO stack.
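
The arithmetic is easy to re-run with your own prices. This sketch only re-adds the per-step dollar figures quoted above; the dictionary keys are illustrative labels, and the figures themselves are the article's estimates, not measurements.

```python
# Per-step dollar estimates quoted in the worked example above.
STEP_COSTS = {
    "rollout": 8.00,
    "reward_model": 0.50,
    "reference_logprobs": 3.00,
    "policy_critic_update": 15.00,
}

prompts, rollout_len, steps = 512, 1024, 10_000

tokens_per_step = prompts * rollout_len  # 524,288 rollout tokens
step_cost = sum(STEP_COSTS.values())     # 26.5 dollars per step
run_cost = step_cost * steps             # 265,000 dollars per run
rollout_share = STEP_COSTS["rollout"] / step_cost

print(tokens_per_step, step_cost, run_cost, round(rollout_share, 2))
```

Swapping in measured throughput for your own cluster is the quickest way to see which stage is worth optimizing first.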

The KL coefficient is the most important knob

If a single hyperparameter has to be tuned by hand in PPO, it is the KL coefficient β. Too low and the policy drifts off the SFT reference, exploits the reward model, and produces gibberish that scores well. Too high and the policy never moves and the run is wasted compute. The right value depends on the reward model's calibration, the rollout length, and the data distribution; published recipes use values from 0.01 to 0.2. The pragmatic approach is adaptive KL — increase β when the running KL exceeds a target, decrease it when KL is well below target — which most production stacks now implement by default.
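
Adaptive KL is a few lines of code. The sketch below follows the proportional controller described by Ziegler et al. (2019): the relative error against the target is clipped, then used to scale β multiplicatively. The class name and the default values for `target_kl` and `horizon` are illustrative assumptions, not recommendations.

```python
class AdaptiveKLController:
    """Nudge the KL coefficient toward whatever value holds KL near a target."""

    def __init__(self, init_kl_coef=0.05, target_kl=6.0, horizon=10_000):
        self.value = init_kl_coef
        self.target = target_kl
        self.horizon = horizon  # larger horizon means slower adaptation

    def update(self, observed_kl, n_steps):
        # Relative error, clipped so one noisy batch cannot swing beta wildly.
        error = max(-0.2, min(0.2, observed_kl / self.target - 1.0))
        self.value *= 1.0 + error * n_steps / self.horizon
        return self.value


controller = AdaptiveKLController(init_kl_coef=0.05, target_kl=6.0, horizon=1000)
print(controller.update(observed_kl=12.0, n_steps=100))  # KL too high: beta rises
print(controller.update(observed_kl=3.0, n_steps=100))   # KL too low: beta falls
```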


Reward models and the labeling problem

The reward model is the single most important component in RLHF, and the most failure-prone.

Reward hacking

The reward model is an imperfect proxy for human preference. The policy will find inputs where the reward model's score is high but actual human preference is not. The classic example: the policy learns to generate responses with certain stylistic markers (long, confident-sounding, well-formatted) that the reward model rewards regardless of accuracy.

Mitigations:

  • KL penalty to the reference model (constrains the policy from drifting far).
  • Reward model regularization (clip the rewards, ensemble multiple reward models).
  • Periodic re-labeling and reward-model retraining as the policy distribution shifts.

None of these fully solve it. Reward hacking is a structural problem.

Labeling cost

High-quality preference labels are expensive. Human labelers must be carefully trained, given consistent instructions, and quality-checked. Inter-rater agreement at scale is typically 70-85%, depending on task.

For frontier post-training, labeling budgets run into the millions of dollars per training run. The bottleneck is human throughput, not compute.

AI labeling

The rise of strong LLMs has made model-generated labels viable for many tasks. "RLAIF" (Reinforcement Learning from AI Feedback) replaces human labelers with another model. Quality varies; for some tasks it matches human labels, for others it doesn't.

The honest position is hybrid: humans for the highest-stakes preferences and constitutional anchors, models for the bulk volume.

Distribution mismatch

The reward model is trained on labels from one distribution of (prompt, response) pairs. The policy, once optimized, generates responses from a different distribution. The reward model's calibration on the new distribution may be poor.

This is why iterative RLHF (covered later, under iterative and online preference learning) does multiple rounds of labeling and reward-model updates.


Reward model training in 2026

The reward model has gone from a single component to a small ecosystem of complementary signals. A modern frontier RM stack typically includes several of these working in parallel.

Architectural choices

The classical RM is a transformer initialized from the SFT checkpoint with a scalar regression head replacing the language-model head. Trained with a Bradley-Terry pairwise loss on (chosen, rejected) pairs. This still works.

Variants in active use as of 2026:

  • Generative RMs (LLM-as-judge with structured output). The reward model is itself a full LLM that produces a written critique followed by a score in a structured format. Slower than a scalar head but substantially better calibrated, particularly on complex queries where a single scalar collapses too much information. Often the same base model used for the policy, fine-tuned on judgment data.
  • Multi-head RMs. A single backbone with several scalar heads — helpfulness, harmlessness, factuality, refusal-appropriateness — each trained on its own preference data. Allows downstream RL to combine signals with explicit weights.
  • Process Reward Models (PRMs). Score intermediate steps in a reasoning chain rather than the final answer. Trained on step-level labels from human or AI annotators. Lightman et al., 2023 (arXiv:2305.20050) showed PRMs substantially outperform outcome-only RMs on math reasoning.
  • Pairwise RMs vs pointwise RMs. Pairwise is more robust (annotators agree better on "A is better than B" than on absolute scores). Pointwise is more flexible (single examples can be scored without a comparison partner). Most production stacks use pairwise for training and pointwise calls at inference time.

Training data composition

A frontier RM training set typically combines:

  • Human preference pairs on representative prompts (the gold standard, smallest in volume).
  • AI-feedback preference pairs (cheap, large volume, varying quality).
  • "Constitutional" judgments from a structured judge against a written rubric.
  • Verifiable signals (test-case pass/fail, math correctness) as ground-truth labels for the domains they cover.
  • Adversarial examples specifically constructed to expose known reward-hacking patterns from prior policies.

Reward model evaluation

Just because an RM has low loss on its training data does not mean it produces a useful RL signal. The 2026 best practice is to evaluate the RM separately:

  • RewardBench-style suites — held-out preference pairs across categories (chat, reasoning, safety, code) with known correct answers.
  • Best-of-N agreement. Sample N responses per prompt with the policy, pick the RM's argmax, compare against human or verifier judgment.
  • Reward-hacking probes. Inputs known to be over-rewarded by naive RMs (excessive length, hedging, refusal templates, markdown formatting). Track whether the RM treats them sensibly.
  • Calibration on the policy distribution. Periodically re-evaluate the RM on samples from the current policy, not on the original training distribution. Drift is the leading indicator that an RM is about to start producing nonsense gradients.
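
The simplest of these checks, accuracy on held-out preference pairs, fits in a few lines. The sketch assumes the reward model is exposed as a `score_fn(prompt, response)` callable. The deliberately bad `length_biased_rm` and the two example pairs are toy stand-ins to show how a length probe surfaces a bias.

```python
def pairwise_accuracy(score_fn, pairs):
    """Fraction of (prompt, chosen, rejected) triples where chosen outscores rejected."""
    correct = sum(score_fn(p, chosen) > score_fn(p, rejected) for p, chosen, rejected in pairs)
    return correct / len(pairs)


def length_biased_rm(prompt, response):
    """Toy reward model that only rewards length (a known failure pattern)."""
    return float(len(response))


held_out = [
    # Correct answer is short, wrong answer is long and confident.
    ("What is 2 + 2?", "4", "The answer is certainly, without doubt, 5."),
    # Correct answer happens to be the longer one.
    ("Capital of France?", "Paris is the capital of France.", "Rome"),
]

print(pairwise_accuracy(length_biased_rm, held_out))
```

Reporting accuracy per category, rather than one pooled number, makes this kind of bias visible instead of averaging it away.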

Ensembling and uncertainty

Multiple independently-trained RMs disagree on some examples. That disagreement is signal. Production RL stacks use:

  • Ensemble averaging. Mean reward across an ensemble. Reduces variance.
  • Uncertainty gating. When ensemble variance is high, downweight or skip the gradient — the policy is in a region the RM doesn't reliably score.
  • Pessimistic RMs. Reward = ensemble mean minus a multiple of standard deviation. Discourages the policy from exploring regions the RM is uncertain about, the same trick that constrained-policy RL has used for years.
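
The pessimistic and gated variants can be sketched on a list of per-model scores for one response. The function names, the penalty multiplier `k`, and the `max_std` threshold are illustrative assumptions that would need tuning against a real ensemble.

```python
from statistics import mean, pstdev


def pessimistic_reward(scores, k=1.0):
    """Ensemble mean minus k standard deviations: disagreement lowers the reward."""
    return mean(scores) - k * pstdev(scores)


def gate_weight(scores, max_std=0.5):
    """Uncertainty gating: drop the sample when the ensemble disagrees too much."""
    return 1.0 if pstdev(scores) <= max_std else 0.0


agree = [1.0, 1.1, 0.9]
disagree = [0.0, 2.0]
print(pessimistic_reward(agree), gate_weight(agree))
print(pessimistic_reward(disagree), gate_weight(disagree))
```

Hard gating is the bluntest option; scaling the weight smoothly with the standard deviation is a straightforward alternative.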

The trajectory of the field: the classical Bradley-Terry reward model is becoming one signal in a multi-signal optimization rather than the single source of truth. Verifiable rewards take over where they apply; generative judges take over where calibration matters more than throughput; multi-head and ensemble RMs handle the remaining bulk.


PPO at language-model scale

Proximal Policy Optimization is the RL algorithm typically used. It alternates between collecting rollouts (the policy generates responses to prompts) and updating the policy.

Per-step infrastructure

A PPO step requires:

  • The policy to generate responses.
  • The reward model to score them.
  • The reference model (frozen SFT) to compute KL.
  • A critic (value function), often co-located with the policy.

That's 3-4 model forward passes per token of generation, plus backward passes on the policy and critic.

For a frontier-scale post-training run, this is expensive — roughly comparable to the SFT phase in compute, sometimes more.

Stability and hyperparameters

PPO is notoriously hyperparameter-sensitive. Learning rate, KL coefficient, clip range, batch size, rollout length — all matter. A misconfigured run can produce a policy that's worse than SFT.
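
The clip range is easiest to understand from the objective itself. Below is a minimal sketch of the PPO clipped surrogate (Schulman et al., 2017) for a single token, on plain floats; the function name is illustrative, and a real trainer averages this over tokens and maximizes it alongside value and entropy terms.

```python
import math


def ppo_clipped_objective(logp_new, logp_old, advantage, clip_range=0.2):
    """Clipped surrogate for one token: min(ratio * A, clip(ratio) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    # Taking the min makes the objective pessimistic: gains from moving the
    # ratio outside the clip range are ignored, losses are still counted.
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio of 2 with a positive advantage: credit is capped at 1.2 * advantage.
print(ppo_clipped_objective(math.log(2.0), 0.0, advantage=1.0))
# Ratio of 2 with a negative advantage: the full penalty is kept.
print(ppo_clipped_objective(math.log(2.0), 0.0, advantage=-1.0))
```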

Practical heuristics from the literature:

  • Start with a small KL coefficient and scale up if reward hacking appears.
  • Use a longer rollout per prompt for stability.
  • Keep the reward model from over-fitting to early rollouts (mix in old labels).

Alternatives within RL

  • GRPO (Group Relative Policy Optimization): used in DeepSeek-V3 and related work; simpler than PPO, fewer auxiliary models.
  • REINFORCE++ and other simplifications that reduce variance.
  • Online DPO (covered under iterative and online preference learning): blurs the line between RL and supervised approaches.

DPO and its relatives

Direct Preference Optimization (DPO; Rafailov et al., 2023) reformulates preference learning as a direct loss on the policy. No reward model, no PPO loop.

The DPO loss

Given a pair (prompt, chosen_response, rejected_response), the DPO loss pushes the policy to assign higher probability to the chosen response relative to the rejected one, scaled relative to a reference policy.

The result: a single forward-backward pass per training step, similar in cost to SFT. No reward model. No RL infrastructure.
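
Written out for one pair, the loss is -log sigmoid(β[(log π(chosen) - log π_ref(chosen)) - (log π(rejected) - log π_ref(rejected))]). The sketch below computes it from summed sequence log-probabilities on plain floats. The function name and return shape are illustrative, and a real implementation works on batched tensors.

```python
import math


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probs.

    Returns (loss, chosen_reward, rejected_reward); the two "implicit rewards"
    are worth logging separately during training.
    """
    chosen_reward = beta * (pi_chosen - ref_chosen)
    rejected_reward = beta * (pi_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    # Stable -log(sigmoid(margin)).
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return loss, chosen_reward, rejected_reward


# Policy identical to the reference: zero margin, loss equals log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted mass toward the chosen response: the loss drops.
print(dpo_loss(-9.0, -14.0, -10.0, -12.0))
```

The reference log-probabilities do not depend on the policy, so they can be precomputed once per dataset.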

How good is it

The literature is mixed but encouraging. DPO often matches or approaches PPO-based RLHF on standard benchmarks, at substantially lower engineering cost. It's particularly strong when:

  • Preference data is plentiful and high-quality.
  • Reward hacking would be a problem with PPO.
  • Engineering simplicity matters (open-source labs, smaller teams).

It's weaker when:

  • Iterative refinement is needed (multiple rounds of generation-and-labeling).
  • The KL constraint to the reference is the dominant signal (DPO's regularization is implicit).

DPO variants

  • IPO (Identity Preference Optimization): more conservative variant addressing some DPO overfitting failures.
  • KTO (Kahneman-Tversky Optimization): uses unpaired feedback (just "good" or "bad" responses).
  • SimPO (Simple Preference Optimization): drops the reference model term.

The practical choice

For most teams: SFT followed by DPO is the right starting point. PPO becomes worth the engineering investment when DPO plateaus or when the workload requires iterative refinement with stable training dynamics.

DPO's hidden failure modes

DPO is stable in the sense that loss curves are smooth, but it has subtle failure modes that don't show up in the loss. The most common one is margin collapse (also reported as likelihood displacement): the loss is driven down by lowering the probability of the rejected response rather than raising the probability of the chosen one. The result is a policy that knows what not to say but has no positive signal about what to say, and produces incoherent or evasive outputs at inference. The fix is to track chosen-logprob and rejected-logprob separately during training: if chosen-logprob is falling along with rejected-logprob, the run is failing even though the loss looks healthy. SimPO (Meng et al., 2024) and ORPO (Hong et al., 2024) address this with reference-free reformulations; other variants add an explicit log-likelihood term on the chosen response, as Llama 3's DPO stage does, to regularize against it directly.
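
The monitoring check can be as simple as comparing the first and last logged values of the two log-probabilities. This is a hypothetical helper: the function name, the three labels, and the rule of comparing only endpoints are assumptions chosen for brevity.

```python
def diagnose_dpo_run(chosen_logps, rejected_logps):
    """Classify a run from per-checkpoint mean log-probs of chosen and rejected responses."""
    d_chosen = chosen_logps[-1] - chosen_logps[0]
    d_rejected = rejected_logps[-1] - rejected_logps[0]
    margin_gain = d_chosen - d_rejected
    if margin_gain <= 0:
        return "no_progress"
    if d_chosen < 0:
        # The margin grew only because the rejected response fell faster.
        return "margin_from_pushing_down"
    return "healthy"


# Margin grows by 70 nats, yet the chosen response became 20 nats less likely.
status = diagnose_dpo_run([-100.0, -105.0, -120.0], [-110.0, -140.0, -200.0])
```

In this toy trace the loss would look excellent, since the margin keeps widening, while the flag shows the policy is losing probability mass on the responses it is supposed to prefer.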

A second failure mode is length bias amplification. DPO's implicit reward is a sum of per-token log-probability ratios, so it is not invariant to response length. In practice the bias follows the data: when chosen responses are systematically longer than rejected ones, as in many preference sets, DPO learns length as a proxy for quality and drifts toward verbosity; when they are shorter, it can penalize the long responses that reasoning workloads need. Length-normalized losses counter this, and several published recipes (Tülu 3 among them) use one.
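
A sketch of what length normalization changes, assuming summed sequence log-probabilities and a known token count per response. The function and its parameters are illustrative, not any library's API.

```python
def implicit_reward(pi_logp, ref_logp, n_tokens, beta=0.1, length_normalize=False):
    """DPO-style implicit reward, optionally averaged per token."""
    log_ratio = pi_logp - ref_logp
    if length_normalize:
        log_ratio /= n_tokens  # per-token average instead of a sum
    return beta * log_ratio


# Two responses with the same per-token log-ratio (0.05) but different lengths.
short_sum = implicit_reward(-95.0, -100.0, n_tokens=100)
long_sum = implicit_reward(-380.0, -400.0, n_tokens=400)
short_avg = implicit_reward(-95.0, -100.0, n_tokens=100, length_normalize=True)
long_avg = implicit_reward(-380.0, -400.0, n_tokens=400, length_normalize=True)
```

With summed log-ratios the longer response earns four times the implicit reward at identical per-token quality; after normalization the two are equal. Normalized log-ratios are much smaller in magnitude, so β generally needs retuning upward when switching.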

β, the DPO temperature

The DPO loss has a single hyperparameter, β, that controls the strength of the implicit KL constraint to the reference. Higher β keeps the policy closer to the reference; lower β allows more aggressive optimization at the cost of stability. Published recipes use β in the 0.01–0.5 range; the right value scales inversely with how confident you are in the preference data. Noisy or AI-generated preferences want higher β; clean human preferences on hard tasks want lower β. The Tülu 3 recipe uses β around 0.1, the original DPO paper used 0.1–0.5, and OpenRLHF defaults to 0.01 — the range matters and there is no universal answer.

[Figure: Post-training with RLHF and DPO in 2026. Infographic covering the alignment goal, the RLHF and DPO workflows, an RLHF vs DPO comparison table, and good practices.]
Post-training at a glance. Pretraining gives a model knowledge; post-training gives it alignment — helpful, harmless, honest, respectful. RLHF runs three stages: collect human comparisons, train a reward model, fine-tune the policy with RL (PPO) against that reward. DPO collapses the same goal into a single closed-form objective on chosen / rejected pairs — no reward model, no RL loop, simpler and more stable. RLHF wins on maximum control and complex behaviors; DPO wins on most use cases with lower compute and higher stability. Good practice: use diverse preference data, cover safety / factuality / helpfulness / style, monitor for reward hacking and over-optimization, keep a strong eval suite with human-in-the-loop, and iterate continuously.

PPO vs DPO vs GRPO — when each wins

The three algorithms cover most of the RL/preference-learning landscape in 2026. They are not interchangeable. Choosing between them is one of the higher-leverage decisions in a post-training plan.

What they actually do, in one line each

  • PPO. On-policy actor-critic RL. Generate rollouts, score them with a reward model, update the policy with a clipped policy-gradient surrogate, regularize with a KL penalty against a frozen reference. Four models live in memory: policy, critic, reward model, reference.
  • DPO. A closed-form supervised loss on (chosen, rejected) pairs that is mathematically equivalent to optimizing against an implicit reward model derived from the policy's own log-probability ratios against a frozen reference. Two models in memory: policy, reference.
  • GRPO. PPO without the critic. For each prompt, sample a group of G rollouts, score each with a reward (often a verifiable reward), and use the group-relative advantage (reward minus group mean, normalized by group std) as the policy gradient signal. KL against a reference is retained. Two models in memory: policy, reference.
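
The group-relative advantage in the GRPO bullet is small enough to write out. A minimal sketch follows; the use of the population standard deviation and the `eps` value are assumptions, and implementations differ on both.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each rollout: reward minus group mean, divided by group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption, see lead-in
    return [(r - mean) / (std + eps) for r in rewards]


mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])    # half the group succeeded
all_same = group_relative_advantages([1.0, 1.0, 1.0, 1.0])  # every rollout succeeded
```

Note the second case: when every rollout in a group gets the same reward, all advantages are zero and the prompt contributes no gradient. Prompts that are uniformly too easy or too hard for the current policy are therefore wasted rollout compute.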

Memory and throughput

PPO is the heaviest. A 70B policy implies ~140GB just for the policy weights in FP16 or BF16 (2 bytes per parameter), before gradients and optimizer state, which under mixed-precision Adam bring the trainable copy to roughly 16 bytes per parameter. Add the critic (similar size, usually), the reward model (often smaller but still substantial), and the frozen reference. Realistic frontier PPO runs require dozens of nodes and sophisticated distributed training plumbing.

DPO drops the critic and reward model. Memory profile is comparable to SFT plus one frozen copy of the policy for the reference. The reference can be sharded cheaply, and because it is frozen its log-probabilities can be computed once over the dataset and cached, which removes it from memory during training.

GRPO sits in between. No critic, no reward model in memory (reward is a verifier or an already-computed RM forward pass), but rollouts are expensive: G samples per prompt at inference cost.
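
A back-of-envelope comparison of static memory for the three methods, as a sketch under stated assumptions: 2 bytes per parameter for frozen models, 16 bytes per parameter for trainable ones (weights, gradients, and mixed-precision Adam state), and a hypothetical 8B reward model for PPO. Activations, KV cache for rollouts, and the rollout engine's own copy of the weights are all excluded.

```python
def weights_gb(params_billion, bytes_per_param=2):
    """Weight memory in GB: billions of parameters times bytes per parameter."""
    return params_billion * bytes_per_param


def static_memory_gb(trainable_b, frozen_b, trainable_bytes=16, frozen_bytes=2):
    """Static memory in GB for lists of trainable and frozen model sizes (billions)."""
    return sum(trainable_b) * trainable_bytes + sum(frozen_b) * frozen_bytes


policy_weights = weights_gb(70)                                 # 140 GB at 2 bytes/param
ppo = static_memory_gb(trainable_b=[70, 70], frozen_b=[70, 8])  # policy + critic, ref + RM
dpo = static_memory_gb(trainable_b=[70], frozen_b=[70])         # policy, ref
grpo = static_memory_gb(trainable_b=[70], frozen_b=[70])        # policy, ref, verifier reward
```

Under these assumptions PPO needs a bit under twice the static memory of DPO or GRPO. GRPO's extra cost relative to DPO does not show up here at all: it is rollout inference, not resident model state.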

Stability and sample efficiency

PPO is the most powerful and the most finicky. With good infrastructure and tuning, it produces the best results in the regime where iterative refinement against a reward model is the right objective. With bad tuning, it produces a reward-hacked policy that scores high on the RM and is useless to users.

DPO is the most stable and the easiest to ship. The closed-form loss has no rollout variance, no critic, no clip-range sensitivity. The downside: the implicit reward model is the policy itself relative to the reference, which means DPO cannot easily benefit from a separately-trained, higher-quality reward signal.

GRPO is more stable than PPO (no critic to misestimate) and more powerful than DPO when rollouts are cheap enough and rewards are reliable enough. The sweet spot: verifiable rewards on math and code, where the reward signal is exact and the policy can be trained directly on group-relative advantages.

When each wins, concretely

Use SFT alone when you are early in a project, when you do not yet have a preference dataset, or when the workload is well-specified by example responses (format, style, simple instruction-following).

Use DPO when you have a preference dataset, no infrastructure for online RL, and want a stable, cheap method that captures most of the RLHF quality gain. This is the right default for the vast majority of teams. Iterative DPO — re-collecting preferences on the trained policy and retraining — extends the ceiling substantially.

Use PPO when DPO has plateaued, when you have invested in a high-quality reward model that you trust more than the implicit DPO signal, when iterative refinement against a reward model is the bottleneck, or when you are doing safety post-training where the KL penalty's behavioral guarantees matter. Frontier labs still use PPO for subjective objectives where a careful reward model outperforms direct preference learning.

Use GRPO when your reward signal is verifiable (math, code, structured tasks) or when you have a strong reward model and can afford G rollouts per prompt. This is the dominant choice for reasoning post-training as of 2026, since it preserves PPO-style on-policy benefits while halving the memory budget and removing critic-instability failure modes.

Use a combination in any frontier-quality stack. The published Tülu 3 recipe (Lambert et al., 2024; arXiv:2411.15124) uses SFT → DPO → RLVR, with the RLVR stage run as PPO against verifiable rewards in the original paper. The published DeepSeek-R1 recipe uses a small cold-start SFT → GRPO with verifiable rewards → rejection-sampling SFT → a final RL stage for general helpfulness and safety. Llama 3's described recipe is reward modeling → rejection sampling → SFT → DPO, repeated over several rounds. The common pattern: start cheap and supervised, escalate to RL where the marginal capability is worth the engineering cost.

A practical heuristic

If you can't articulate why your reward model would outperform the implicit DPO signal, you don't need PPO. If your task has a verifier, you should be using GRPO or another verifiable-reward RL method, not chasing a learned reward model. If neither of those applies, SFT + DPO is the right answer until proven otherwise.


Iterative and online preference learning

Both PPO and DPO can be run iteratively: train, collect new preferences on the updated policy, retrain, repeat.

The reason: as the policy improves, the distribution of its outputs shifts. Old preference labels become less informative. New labels on the current policy's outputs are needed.

Iterative RLHF

  • Round 1: train reward model on initial labels, run PPO.
  • Round 2: collect new labels on the updated policy's outputs, retrain reward model, continue PPO.
  • ... and so on.

Each round costs labeling budget. Diminishing returns set in eventually. Typical production: 3-10 rounds.

Online DPO

Continuously generate new responses, label them (via humans or AI), and feed into DPO training. Tighter loop than iterative DPO.

Why iteration matters

A single round of preference learning gets you a model that does well on the initial label distribution. Multiple rounds get you a model that does well on a distribution closer to the actual deployed policy. This is much more useful in practice.


Iterated post-training: rejection sampling + SFT loop

The single most underrated recipe in modern post-training is the rejection-sampling SFT loop. It is conceptually simple, infrastructurally cheap, and surprisingly close in quality to full RL when the reward signal is good.

The loop

  1. Start from the best current model checkpoint.
  2. For each prompt in a training set, sample N candidate responses (typical N: 8 to 64).
  3. Score each candidate with a reward model, a verifier, or a panel of judges.
  4. Keep only the top-K candidates per prompt (often K=1, the best-of-N).
  5. SFT the model on the surviving (prompt, response) pairs.
  6. Go to step 1.
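
Steps 2 through 4 of the loop can be sketched in a few lines. `generate` and `score` are hypothetical callables standing in for the sampling engine and the reward model or verifier; `min_score` is an added assumption that drops prompts where even the best candidate is poor.

```python
import itertools


def rejection_sample(prompts, generate, score, n=8, k=1, min_score=None):
    """Sample n candidates per prompt, keep the top k by score as SFT pairs."""
    sft_pairs = []
    for prompt in prompts:
        scored = [(score(prompt, c), c) for c in (generate(prompt) for _ in range(n))]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        for s, candidate in scored[:k]:
            if min_score is None or s >= min_score:
                sft_pairs.append((prompt, candidate))
    return sft_pairs


# Toy stand-ins: a "model" that cycles through guesses and an exact-match verifier.
_guesses = itertools.cycle(["3", "4", "5", "4"])


def toy_generate(prompt):
    return next(_guesses)


def toy_score(prompt, response):
    return 1.0 if response == "4" else 0.0


pairs = rejection_sample(["2+2="], toy_generate, toy_score, n=4, k=1)
```

The surviving pairs feed directly into an ordinary SFT run (step 5). Everything expensive lives inside `generate`, which is why the rollout side parallelizes so well.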

This is what Meta's published Llama 3 recipe describes as rejection-sampling fine-tuning, what OpenAI has described as expert-iteration-style training, and what DeepSeek's recipes use between RL stages. It is also the dominant technique for distilling a stronger model into a weaker one within the same family — see synthetic data and distillation for the related but distinct teacher/student version.

Why it works

The reward signal selects examples the policy is capable of producing but does not yet produce reliably. Training on those examples raises their probability under the policy. Each round moves the policy's mode toward the high-reward region without ever leaving the supervised-learning regime. No critic, no rollout variance, no KL gymnastics. Just SFT on selectively-filtered data.

Why it's cheap

The inference for rollouts is parallel and embarrassingly batchable. The training step is plain SFT. No new infrastructure required beyond what every team already has.

Why it's not a replacement for RL

The catch: rejection sampling can only amplify behaviors the policy already produces with non-trivial probability. If a desired behavior is outside the policy's current support, no amount of best-of-N sampling will surface it. RL with on-policy exploration can in principle discover behaviors that SFT-on-best-of-N cannot reach.

In practice, the two compose. A typical frontier pipeline alternates: rejection-sampling SFT to consolidate gains, then a round of RL (PPO or GRPO) to push the frontier outward, then more rejection-sampling SFT to consolidate the RL gains in a stable form.

Cost-quality position

A useful intuition: rejection-sampling SFT recovers something like 70-90% of the quality gain that full RL would deliver, at 10-30% of the engineering cost. For most teams below the frontier, that is the right trade. The frontier labs continue to use it as a backbone, even though they also run full RL — because it stabilizes the pipeline and gives them a clean checkpoint to fall back to whenever an RL run goes off the rails.


Constitutional AI and AI feedback

Constitutional AI (Anthropic, Bai et al., 2022) uses the model itself, given a "constitution" of principles, to provide feedback for training.

The pipeline

  1. Supervised CAI stage: the model critiques its own responses against the constitution, then revises them. The revised responses become SFT data.
  2. RL from AI feedback (RLAIF): a model judges pairs of responses against the constitution. The preference labels train a reward model. Then standard RLHF.

The result: most of the labeling is automated. Humans write the constitution and audit the process, but don't label every preference pair.

Why this matters

  • Scale: AI labels are cheap relative to human labels. Larger preference datasets become feasible.
  • Consistency: humans disagree; a constitution-following AI labeler is more consistent (for better or worse).
  • Transparency: the constitution is explicit, auditable, and editable.

Why it's not magic

  • The constitution-following model still has biases.
  • Quality of AI labels varies by task.
  • "Constitution" is a useful abstraction but doesn't capture all the implicit preferences in human feedback.

Most production systems in 2026 use some mix of human labels (for the highest-stakes anchors) and AI labels (for bulk).


Reasoning fine-tuning and process supervision

The most active frontier in post-training is reasoning — training models to produce explicit step-by-step reasoning, often with verifiable rewards.

The basic idea

Standard RLHF rewards the final answer. For tasks with verifiable answers (math, code), you can reward the answer directly without a reward model — just run the test cases or check the math.

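This is "verifiable rewards" or "outcome supervision." It removes the learned reward model as a target for reward hacking, because the reward is checked against ground truth. What remains is gaming the verifier itself: weak test suites or lenient answer parsers.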

Process supervision

A more aggressive version: reward each step of the reasoning, not just the final answer. The reward model evaluates intermediate steps for plausibility / correctness.

Why it matters: a model can get the right answer for the wrong reasons. Process supervision pushes the reasoning itself to be valid, not just the conclusion.

Inference-time impact

Reasoning post-training also changes inference. Models trained to "think out loud" generate long chains of thought before answering. Inference-time compute becomes a tunable knob — more thinking, better answers, more cost.

This is the foundation for "test-time compute" scaling (see our reasoning model serving guide).

What's driving the 2024-2026 wave

OpenAI's o1, DeepSeek's R1, Anthropic's reasoning modes: all reflect this paradigm. Most of the exact recipes are proprietary (R1's is the main public exception), but the published work points to:

  • Process supervision via reward models trained on step-level labels.
  • Verifiable reward training on math and code.
  • Iterative bootstrapping from base models with long chains of thought.

Reasoning-specific post-training: R1-Zero, RLVR, process rewards

The single highest-leverage post-training development of 2024–2025 was the public realization that long-chain reasoning behavior can be elicited from base models with reinforcement learning against verifiable rewards alone — no preference data, no human judgments, no reward model. This is what DeepSeek-R1 (DeepSeek-AI, 2025 — arXiv:2501.12948) demonstrated publicly and what most credible reports indicate is happening inside the closed frontier labs in some form.

R1-Zero: pure RL from a base model

The R1-Zero ablation in the DeepSeek paper is the most striking result of the past two years. Starting from a base model (DeepSeek-V3-Base) with no SFT cold start, running GRPO with verifiable rewards on math and code, the model develops emergent long-chain reasoning behavior. It learns to backtrack, to verify its own intermediate steps, to spend more tokens on harder problems — none of which was directly rewarded. The reward was only "did you get the right answer."

The practical caveat: R1-Zero's outputs are sometimes hard to read (mixed languages, idiosyncratic formatting). The production R1 recipe layers a small SFT cold start and a final SFT/RLHF stage on top to clean up the format. But the central finding — that RLVR alone produces reasoning behavior from a base model — has reshaped how labs think about the role of SFT in reasoning training.

The RLVR pipeline

A typical RLVR stage in 2026 looks like:

  1. Start from a base or instruct model.
  2. Curate a large prompt set with ground-truth answers — math problems with known solutions, coding problems with test suites, structured tasks with checkers.
  3. For each prompt, generate G rollouts (typical G: 8 to 64) at high temperature.
  4. Run the verifier on each rollout. Reward = 1 if correct, 0 otherwise (with optional shaping for format compliance and length penalties).
  5. Update the policy with GRPO using group-relative advantages.
  6. Maintain a KL penalty against a frozen reference (often the base model) to prevent drift on capabilities outside the reasoning domain.
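
Step 4 is where most of the task-specific work lives. Below is a minimal sketch of a math verifier reward, assuming answers are wrapped in `\boxed{...}` and compared by exact string match. The bonus and penalty values are illustrative assumptions, and production verifiers normalize expressions far more carefully.

```python
import re


def extract_boxed(text):
    """Return the content of the last boxed answer in text, or None if absent."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def verifiable_reward(rollout, gold, n_tokens, max_tokens=4096,
                      format_bonus=0.1, length_penalty=0.5):
    """1 for a correct final answer, with small shaping for format and length."""
    answer = extract_boxed(rollout)
    reward = 1.0 if answer == gold.strip() else 0.0
    if answer is not None:
        reward += format_bonus    # a parseable answer was produced
    if n_tokens > max_tokens:
        reward -= length_penalty  # discourage runaway chains
    return reward


r_correct = verifiable_reward(r"... so the result is \boxed{42}", "42", n_tokens=300)
r_wrong = verifiable_reward(r"... so the result is \boxed{41}", "42", n_tokens=300)
r_unparsed = verifiable_reward("the result is 42", "42", n_tokens=300)
```

The third case is worth noticing: the rollout states the right number but earns nothing because the parser cannot find it. A too-strict parser punishes correct reasoning, and a too-lenient one becomes the thing the policy learns to exploit.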

Throughput is the main engineering challenge: rollouts are long (thousands of tokens of chain-of-thought) and need to run on serving infrastructure capable of high-batch inference. Many teams co-locate the rollout cluster with the training cluster and use speculative decoding or other inference optimizations to keep the rollout phase from dominating wall time.

Process rewards vs outcome rewards

The R1-style approach uses outcome rewards only — correct answer or not. Process Reward Models (Lightman et al., 2023 — arXiv:2305.20050) instead score each reasoning step. The tradeoff:

  • Outcome rewards are cheap, unambiguous, and immune to reward hacking inside the verifier's domain. But they are sparse — most reasoning chains end in the wrong answer and provide no gradient signal on which step went wrong. They also do nothing for non-verifiable tasks.
  • Process rewards are dense (every step provides signal) and apply to non-verifiable tasks. But they require step-level labels (expensive), can be reward-hacked by producing plausible-looking but vacuous intermediate steps, and the labels themselves are often noisy.

The frontier answer in 2026 is hybrid: outcome rewards as the load-bearing signal where verifiers exist, process rewards as a denser auxiliary signal, with the PRM trained on a mix of human step labels and outcome-induced labels (a step is "good" if rollouts continuing from it succeed at a higher rate).
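
Outcome-induced step labels can be estimated by Monte Carlo: continue several rollouts from each intermediate step and record how often they reach a correct answer. In the sketch below, the rule that a step is "good" when the success rate does not drop relative to the previous step is one assumed reading of the criterion; other thresholds are possible.

```python
def step_values(rollout_outcomes):
    """Success rate of rollouts continued from each step (outcomes are 0 or 1)."""
    return [sum(outcomes) / len(outcomes) for outcomes in rollout_outcomes]


def label_steps(values, baseline):
    """Label a step good if the success rate after it is at least the rate before it."""
    labels, prev = [], baseline
    for v in values:
        labels.append("good" if v >= prev else "bad")
        prev = v
    return labels


# Four continuation rollouts from each of three steps; baseline is the rate from the prompt.
values = step_values([[1, 1, 0, 1], [1, 0, 0, 0], [0, 0, 0, 0]])
labels = label_steps(values, baseline=0.5)
```

Here the second step is where the chain goes wrong: the success rate falls from 0.75 to 0.25. Labels like these cost rollouts rather than annotator time, which is what makes them practical as PRM training data at scale.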

The relationship to inference-time compute

RLVR-trained reasoning models change inference economics. They learn to spend variable amounts of test-time compute on a problem — short chains for easy problems, very long chains for hard ones. Serving them efficiently requires the patterns covered in reasoning model serving: adaptive token budgets, prefix-aware KV cache management, and routing systems that decide when to invoke the reasoning model at all.

Distillation of reasoning capability

A second R1 finding worth highlighting: once you have a strong reasoning model, you can distill its reasoning traces into a much smaller student model with surprisingly good results. The DeepSeek paper releases a family of distilled smaller models that retain a substantial fraction of the reasoning capability of the full R1 at a fraction of the inference cost. This is the same iterated-distillation pattern discussed under synthetic data and distillation, now applied to reasoning specifically.

Open questions

  • Does RLVR generalize beyond verifiable domains? Public results are most striking on math and code. The hope is that the reasoning skill transfers to soft tasks. The evidence is mixed.
  • How much SFT cold start is necessary? R1-Zero says none; production R1 uses some; OpenAI's reported recipe uses more. The right answer probably depends on the base model and the reward landscape.
  • Are PRMs strictly better than outcome rewards? Public results are inconsistent. Outcome rewards plus enough rollouts may already extract most of the signal a PRM provides.

Mixing stages and ablation discipline

A production post-training pipeline is rarely one stage. Typical:

  1. SFT on a curated dataset.
  2. DPO or RLHF using preference data.
  3. Specialized fine-tuning for capabilities (reasoning, coding, tool use).
  4. Final SFT or DPO pass to fix issues from earlier stages.

Order matters

The literature has documented that stage ordering changes outcomes. Training refusals before capabilities versus after them produces different behavior on edge cases. Recovery is possible but expensive.

Catastrophic forgetting

Later stages can erode capabilities established earlier. A model fine-tuned heavily on math may regress on writing. Mixing in earlier-stage data (replay) during later stages mitigates this.
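
Replay is usually implemented as a data-mixing step ahead of the later-stage run. A minimal sketch, assuming in-memory example lists; the 10% `replay_fraction` is an illustrative value, not a recommendation.

```python
import random


def mix_with_replay(new_examples, earlier_examples, replay_fraction=0.1, seed=0):
    """Blend earlier-stage examples into a later-stage set at a target fraction."""
    rng = random.Random(seed)
    # Number of replay examples so they make up replay_fraction of the final mix.
    n_replay = round(len(new_examples) * replay_fraction / (1.0 - replay_fraction))
    n_replay = min(n_replay, len(earlier_examples))
    mixed = list(new_examples) + rng.sample(list(earlier_examples), n_replay)
    rng.shuffle(mixed)
    return mixed


math_stage = [("math", i) for i in range(90)]
chat_stage = [("chat", i) for i in range(100)]
mixed = mix_with_replay(math_stage, chat_stage, replay_fraction=0.1)
```

Fixing the seed and recording the fraction alongside the dataset version keeps the mix reproducible, which matters when a later ablation asks whether the replay or the new data moved a metric.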

Ablation discipline

Without careful experimentation, you can't tell which stage is helping and which is hurting. The discipline that fixes this:

  • Single-axis ablations (change one thing at a time).
  • Workload-representative evals at every stage boundary.
  • Versioned datasets and reproducible pipelines.

This is mostly engineering discipline, not novel research, but it's what separates teams that ship from teams that thrash.


Infrastructure differences from pretraining

Post-training shares some infrastructure with pretraining but adds:

Inference during training

The policy must generate responses. The reward model must score. These are inference workloads embedded in a training loop. Serving infrastructure has to coexist with training infrastructure.

Preference data pipelines

Collecting preferences, validating them, deduplicating, versioning. Smaller-scale than pretraining data pipelines but with tighter quality requirements.

Human-in-the-loop tooling

For SFT and RLHF stages requiring human labels: annotation interfaces, labeler training, quality QA. A significant operational investment.

Reward model serving

During RLHF, the reward model is serving inference at high throughput. Same engineering as production inference, plus the wrinkle that the reward model itself is updated periodically. In practice this co-located rollout-and-RM stack borrows heavily from LLM serving infrastructure, with continuous batching and KV cache reuse across same-prompt rollouts being the highest-leverage optimizations.

Smaller scale

Post-training runs are typically 10-100× smaller than pretraining runs in compute, but more complex in pipeline. The infrastructure profile is different: more inference, more orchestration, more data management.


Evaluation during post-training

Discussed in depth in our eval infrastructure guide. Specific to post-training:

  • SFT eval: instruction-following benchmarks, simple capability checks.
  • Preference eval: pairwise human or model preference vs the previous checkpoint.
  • Safety eval: refusal rates, harmful content checks.
  • Capability regression eval: ensure later stages don't break earlier capabilities.

A common pattern: eval at every stage boundary, gate progression on meeting thresholds, version every checkpoint with its eval suite.
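
A stage gate of this kind reduces to a small function. The sketch below is hypothetical: metric names, thresholds, and the `max_regression` tolerance are placeholders, and all metrics are assumed to be higher-is-better.

```python
def gate_checkpoint(metrics, thresholds, previous=None, max_regression=0.01):
    """Return (passed, failures) for a checkpoint at a stage boundary."""
    failures = []
    # Absolute floors: the checkpoint must clear each threshold.
    for name, floor in thresholds.items():
        if metrics.get(name, float("-inf")) < floor:
            failures.append(f"{name}: below threshold")
    # Regression check: no metric may fall too far below the previous checkpoint.
    for name, old in (previous or {}).items():
        if name in metrics and metrics[name] < old - max_regression:
            failures.append(f"{name}: regressed beyond tolerance")
    return (not failures), failures


passed, failures = gate_checkpoint(
    metrics={"ifeval": 0.82, "gsm8k": 0.70, "xstest": 0.90},
    thresholds={"ifeval": 0.80, "xstest": 0.85},
    previous={"gsm8k": 0.75},
)
```

The example checkpoint clears both absolute floors but is blocked because a capability from an earlier stage slipped, which is exactly the case a thresholds-only gate would miss.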

The eval portfolio for a 2026 post-training run

A representative eval suite that production teams run at every checkpoint:

Eval category | Examples | What it catches
Instruction-following | IFEval, AlpacaEval 2, Arena-Hard | Whether the model follows the prompt structure and constraints
Reasoning | GPQA, MATH, AIME, GSM8K | Reasoning capability ceiling and regressions
Code | HumanEval+, MBPP+, LiveCodeBench, SWE-Bench Verified | Whether code-related post-training is working
Multilingual | MGSM, FLORES, MMLU translated | Catches monolingual collapse during SFT
Safety | XSTest, HarmBench, Do-Not-Answer | Refusal calibration on both ends of the frontier
Calibration | TriviaQA-Calib, internal calibration probes | Whether the model knows what it doesn't know
Long context | RULER, LongBench, needle-in-a-haystack | Whether attention is still healthy after fine-tuning
Held-out preference | Pairwise vs previous checkpoint | Direct measurement of preference improvement

A single number — say, MMLU — is not a sufficient signal. Most published post-training results that look surprising in either direction turn out to involve a single-axis eval and a multi-axis change to the model. The discipline of running the whole portfolio at every gate is what separates teams that ship reliable improvements from teams that ship lucky ones.

Pairwise human eval and its replacement

The gold standard for chat-quality evaluation remains a blinded pairwise human comparison: present a human with two model responses, ask which is better, repeat across hundreds of prompts, compute win rates. This is expensive ($3–$10 per pair, days of wall clock) and slow. In 2026 most teams replace 90% of pairwise human eval with a strong-judge model (typically Claude or GPT-4-class) running the same protocol, and reserve human eval for the final checkpoint of a release cycle. The judge model's agreement with humans is typically 80–90% on chat quality, lower on reasoning, lower still on safety — calibrate the substitution with periodic human spot-checks.
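
Two numbers carry most of this protocol: the win rate with its uncertainty, and judge-human agreement on the spot-check set. A sketch using the Wilson score interval follows; dropping ties before counting is an assumption of this example.

```python
import math


def win_rate_with_interval(wins, losses, z=1.96):
    """Win rate with a Wilson score interval (ties assumed dropped beforehand)."""
    n = wins + losses
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, center - half, center + half


def agreement(judge_labels, human_labels):
    """Fraction of pairs where the judge model picks the same winner as the human."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


p, low, high = win_rate_with_interval(wins=110, losses=90)
agree = agreement(["A", "B", "A", "A"], ["A", "B", "B", "A"])
```

A 55% win rate over 200 pairs has an interval that still includes 50%, so "hundreds of prompts" is a floor rather than a comfortable sample size when the expected improvement is small.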


Open problems

Reward hacking at scale. The fundamental problem of approximate reward models hasn't been solved. Methods reduce its severity; none eliminate it.

Calibration. Models trained with RLHF tend to become overconfident. The process for restoring calibration is empirical and only partly effective.

Long-horizon reasoning supervision. Process supervision works on short reasoning chains. Multi-step, multi-tool, multi-hour reasoning is harder to supervise.

Preference elicitation. Eliciting useful preferences from humans (or AI) for novel domains is open. Standard pairwise comparisons capture only some preference dimensions.

Mixing RL with self-play. Models generating their own training data. Promising but quality control is hard.

Cross-model distillation. Training a smaller model from a larger one's outputs. Works well; the limits aren't well understood.


Cost and compute budgets for post-training

The single most useful artifact a post-training plan can produce before the first GPU spins up is an honest budget. The numbers below are 2026 order-of-magnitude figures synthesized from public recipes (Llama 3, Tülu 3, DeepSeek-R1) and current cloud pricing; treat them as the right order of magnitude, not as quotes.

Compute by stage and model size

Stage | 7B model | 70B model | 400B-class model | Dominant cost
SFT (1 epoch, 1M examples) | ~200 H100-hours | ~2,000 H100-hours | ~12,000 H100-hours | Training FLOPs
DPO (1 epoch, 200K pairs) | ~150 H100-hours | ~1,800 H100-hours | ~11,000 H100-hours | Training FLOPs + reference forward
Rejection-sampling SFT (N=16, 500K prompts) | ~600 H100-hours | ~6,000 H100-hours | ~30,000 H100-hours | Rollout inference dominates
PPO RLHF (10K steps, BS=512, rollout 1K tokens) | ~2,500 H100-hours | ~25,000 H100-hours | ~150,000 H100-hours | Rollout + reward model + critic
GRPO / RLVR (verifiable rewards, G=16, 20K steps) | ~3,000 H100-hours | ~30,000 H100-hours | ~180,000 H100-hours | Rollout inference
Reward model training (300K pairs) | ~80 H100-hours | ~800 H100-hours | rarely RM-trained at this size | Training FLOPs

At public cloud rates (~$2.50/H100-hour on the spot market, ~$4/hour reserved in mid-2026), one clean 70B pass of SFT plus DPO plus a GRPO reasoning stage plus a final SFT clean-up sums to roughly 36,000 H100-hours from the table above (counting the final SFT at the same cost as the first), or about $90K–$145K. With the reruns, sweeps, and evals any real project needs, a full-recipe 70B post-training effort lands in the $150K–$400K range of pure compute. Frontier labs running iterative pipelines with multiple RL rounds, ensemble reward models, and ablation sweeps spend 10–50× that. The headline takeaway: post-training compute is one to two orders of magnitude cheaper than the underlying pretraining run, and labeling plus engineering time typically outweigh GPU spend.
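
The budget arithmetic is worth keeping as a script so it can be rerun when prices or recipes change. The sketch below sums the 70B column of the table for one pass; the `final_sft` figure (assumed equal to the SFT row) and the `overhead_multiplier` for reruns and sweeps are assumptions, not numbers from the table.

```python
H100_HOURS_70B = {
    "sft": 2_000,
    "dpo": 1_800,
    "grpo_rlvr": 30_000,
    "final_sft": 2_000,  # assumed equal to the first SFT pass
}


def compute_cost_usd(stage_hours, usd_per_hour, overhead_multiplier=1.0):
    """Total compute cost: summed GPU-hours times hourly rate times overhead."""
    return sum(stage_hours.values()) * usd_per_hour * overhead_multiplier


single_pass_spot = compute_cost_usd(H100_HOURS_70B, 2.50)
single_pass_reserved = compute_cost_usd(H100_HOURS_70B, 4.00)
with_reruns = compute_cost_usd(H100_HOURS_70B, 4.00, overhead_multiplier=2.0)
```

The GRPO stage is about 84% of the total hours, so the rollout budget (G, steps, and chain length) is the first place to look when the estimate needs to come down.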

Labeling and data costs

Human preference labels at production quality cost roughly $0.50–$3 per pairwise comparison depending on domain (general chat at the low end, code or domain-expert tasks at the high end). A 300K-pair preference dataset is therefore a $150K–$900K line item, frequently larger than the compute bill for the corresponding DPO run. AI labels via a strong judge model drop this to $0.001–$0.01 per pair at API rates, which is why hybrid stacks now dominate. The cost equation that matters: human labels for the ~5–10K highest-stakes anchors and adversarial probes, AI labels for the bulk 100K–10M range, and verifiable rewards for everything math, code, or schema-checked. See the related cost frame in AI inference cost economics for the per-token serving math that drives rollout costs.

Wall-clock budgets

A small-team SFT-plus-DPO pass on a 7–13B model fits on 8×H100 in 2–5 days, including evals and ablations. A 70B equivalent on 64×H100 runs 1–3 weeks. Frontier-scale reasoning RL with iterated rejection sampling consumes 4–12 weeks of multi-rack wall clock per major capability bump, with the rollout cluster usually being the gating resource — not the training cluster. Co-locating rollout inference with training using the same LLM serving stack is what makes the wall clock tractable.


Safety post-training and red-teaming

Safety post-training is not a final tweak; in 2026 it is a parallel pipeline that runs alongside capability post-training from the first SFT stage onward. Treating it as a last-mile filter is the most common mistake teams make, and it produces models that are either over-refusing brittle assistants or under-refusing liability nightmares.

The safety stack

A 2026 production safety stack typically includes:

  • Refusal SFT data. Curated examples of how to refuse — tone, specificity, alternative help, no moralizing lectures. Often 1–5% of the SFT mix. Quality matters enormously; bad refusal data produces models that refuse benign queries.
  • Adversarial preference data. Pairs where the rejected response is unsafe and the chosen response is a calibrated refusal or a safer alternative. Trained into the same DPO/PPO stages as helpfulness preferences, often with a separate multi-head reward signal so safety can be weighted explicitly at inference time.
  • Constitutional anchors. Written principles (Anthropic-style or in-house) that drive an AI judge for the long tail of safety judgments. The rubric is auditable and editable, which matters when policy changes.
  • External guardrails. A serving-time classifier stack — Llama Guard 3, NeMo Guardrails, or in-house classifiers — runs in front of and behind the model. Post-training and guardrails are complementary, not redundant; see production safety guardrails for the serving-time half.
  • Red-team data flywheel. Internal and external red-teamers continuously probe the model, find failure modes, and feed the failures back into both adversarial preference data and refusal SFT. This is the dominant source of long-tail safety improvements after a few months of deployment.

What safety post-training actually changes

Safety post-training shifts the policy's behavior on a narrow but high-stakes slice of the input distribution. It does not remove the underlying capability — a model that has memorized chemistry from pretraining still knows the chemistry; safety post-training just changes what it will say when asked. This is why jailbreaks remain a persistent failure mode and why no amount of post-training is a substitute for serving-time defense-in-depth.

The single most useful eval discipline here: track the helpful/harmful frontier explicitly. Refusal rates on benign queries (over-refusal) and harmful-output rates on adversarial queries (under-refusal) are both real failure modes, and most post-training changes trade one for the other. The team that ships is the one that measures the frontier and knows where it is moving each iteration.
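
Tracking the frontier means reporting two rates together and classifying how each iteration moved them. A minimal sketch, assuming per-prompt boolean outcomes from a benign set and an adversarial set; the function names and labels are hypothetical.

```python
def refusal_frontier(benign_refused, harmful_complied):
    """Return (over_refusal_rate, under_refusal_rate) from per-prompt booleans."""
    over_refusal = sum(benign_refused) / len(benign_refused)
    under_refusal = sum(harmful_complied) / len(harmful_complied)
    return over_refusal, under_refusal


def classify_move(before, after):
    """Describe how a checkpoint moved on the (over, under) refusal frontier."""
    d_over = after[0] - before[0]
    d_under = after[1] - before[1]
    if d_over <= 0 and d_under <= 0:
        return "improved" if (d_over < 0 or d_under < 0) else "unchanged"
    if d_over > 0 and d_under > 0:
        return "regressed"
    return "traded"  # one rate got better at the other's expense


point = refusal_frontier([True, False, False, False], [False, False, True, False])
move = classify_move(before=(0.10, 0.05), after=(0.04, 0.09))
```

The example move cuts over-refusal while letting more harmful outputs through. Reported as a single safety score it could look like progress or regress depending on the weighting, which is why the two rates should be tracked separately.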


Open-source tooling: TRL, OpenRLHF, verl, Axolotl

The open-source post-training stack in 2026 is mature enough that a small team can ship competitive results without writing custom infrastructure. Choosing the right framework is mostly about scale and how much of the RL loop you need to customize.

Framework | Best for | Algorithms supported | Distributed training | Notes
TRL (Hugging Face) | SFT, DPO, small-scale PPO | SFT, DPO, IPO, KTO, ORPO, PPO, GRPO (recent) | Accelerate, DeepSpeed, FSDP | Easiest entry. Pairs naturally with the HF ecosystem. Good up to ~70B with FSDP.
OpenRLHF | PPO, GRPO at 70B+ | PPO, GRPO, DPO, KTO, REINFORCE++, iterative DPO | Ray + DeepSpeed + vLLM rollouts | Co-locates rollout inference with training. The pragmatic choice for serious RL at scale.
verl (volcengine) | Production GRPO/PPO at 100B+ | PPO, GRPO, ReMax, DAPO | Ray + Megatron + vLLM/SGLang | Used by ByteDance and several frontier-adjacent labs. Best-in-class for large-scale RLVR.
Axolotl | Multi-recipe SFT/DPO with config-driven UX | SFT, DPO, ORPO, KTO, LoRA + QLoRA variants | DeepSpeed, FSDP | Config-first. Excellent for ablation sweeps and reproducible pipelines.
LLaMA-Factory | Mixed SFT/preference/PEFT workflows | SFT, DPO, PPO, ORPO, KTO + extensive PEFT | DeepSpeed, FSDP | Strong for parameter-efficient post-training and multi-method comparisons.
NeMo-Aligner (NVIDIA) | Enterprise GPU clusters | SFT, DPO, RLHF (PPO), SteerLM | Megatron + TensorRT-LLM | Tight integration with NVIDIA training stack. Good for teams already on Megatron.

When to write your own

The honest answer for most teams in 2026: don't. The open-source frameworks have absorbed the lessons of three years of public RLHF/DPO/GRPO work and now reliably reproduce published recipes. Custom infrastructure makes sense when you are running a new RL algorithm, doing something unusual with rollout inference (e.g., disaggregated rollout via disaggregated inference patterns), or operating at a scale where the framework's choices stop fitting (200B+ policies, exotic parallelism plans). Otherwise, pick TRL or OpenRLHF, follow Tülu 3's published recipe as a starting point, and put your engineering effort into data quality and eval discipline rather than reimplementing GRPO.


Common failure modes and recovery

Post-training runs fail in a small set of recognizable ways. Learning to diagnose them quickly is the difference between a team that ships and a team that re-runs the same broken pipeline for a quarter.

Mode 1: reward hacking that looks like progress

The reward model score climbs while eval scores stagnate or regress. The policy has discovered a stylistic exploit, usually excessive length, hedging language, markdown headers, refusal-template overuse, or sycophantic agreement. Diagnose with: held-out reward-hacking probes, length distributions over training, and a small panel of human or strong-judge spot checks at every checkpoint. Recover by: clipping rewards, adding length penalties, re-labeling the affected slice, or rolling back to the last clean checkpoint and increasing the KL coefficient so the policy stays closer to the reference.

Mode 2: catastrophic forgetting on the wrong axis

A reasoning-RL stage produces a model that solves math better but writes worse, or a safety stage that improves refusal accuracy but kills coding ability. The policy has drifted off the manifold of behaviors it had after SFT. Mitigate with replay (mix earlier-stage data into later stages at 5–20%), explicit capability-regression evals at every stage boundary, and a final SFT pass that re-anchors the broken capabilities. The Llama 3 recipe's iterated DPO with replay is partly motivated by this failure mode.
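
A minimal sketch of the replay mix, assuming examples are held in plain Python lists. The function name and the default fraction are illustrative; the 5-20% range comes from the text above.

```python
import random


def mix_with_replay(current_stage, earlier_stage, replay_fraction=0.1, seed=0):
    """Return a shuffled list where ~replay_fraction of examples are replayed.

    replay_fraction is the share of the *final* mix drawn from earlier stages.
    """
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_replay / (n_current + n_replay) = replay_fraction for n_replay.
    n_replay = round(len(current_stage) * replay_fraction / (1.0 - replay_fraction))
    n_replay = min(n_replay, len(earlier_stage))
    replay = rng.sample(list(earlier_stage), n_replay)
    mixed = list(current_stage) + replay
    rng.shuffle(mixed)
    return mixed
```

With 90 current-stage examples and `replay_fraction=0.1`, this draws 10 earlier-stage examples, so replay makes up 10% of the final 100.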

Mode 3: DPO drift on the reference model

DPO's implicit KL regularization is weaker than PPO's explicit penalty, and over many epochs or iterative rounds the policy can drift far from the reference in ways that don't show up in the loss. Symptom: the model becomes increasingly confident, terse, and odd. Fix: a larger β in the DPO loss (which tightens the implicit constraint toward the reference), fewer epochs per round, or switching to IPO, which addresses this directly. Iterative DPO needs reference re-anchoring every few rounds, because the original SFT reference becomes stale quickly.
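
To make the role of β concrete, here is a sketch of the standard per-pair DPO loss computed from sequence log-probabilities. The function and argument names are illustrative. Note that the loss depends only on the difference between the chosen and rejected log-ratios, which is one way drift can hide: both log-ratios can fall together without the loss changing.

```python
import math


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))."""
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) == softplus(-m), written in a numerically stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

For example, `dpo_loss(-10, -12, -10, -12)` and `dpo_loss(-20, -22, -10, -12)` return the same value even though the second policy assigns far lower probability to both responses, which is why monitoring the individual log-ratios (not just the loss) is worthwhile.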

Mode 4: rollout collapse in RLVR

Early in an RLVR run, the policy may collapse to producing the same response across all G rollouts in a group — group variance goes to zero, GRPO's advantage signal disappears, training stalls. Causes: too-low sampling temperature, too-strong KL penalty, or a reward landscape with no easy partial credit. Fix: raise temperature, lower KL coefficient, add format-shaping rewards as partial credit, or warm-start with rejection-sampling SFT before the RL phase.
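
A cheap monitor for this failure is the fraction of prompts per batch whose rollouts all received the same reward. The sketch below assumes rewards are available as one list per prompt; the function name and tolerance are illustrative.

```python
def collapsed_group_fraction(group_rewards, tol=1e-8):
    """Fraction of prompts whose G rollouts all received (nearly) the same reward.

    group_rewards: list of lists, one inner list of rewards per prompt.
    """
    if not group_rewards:
        return 0.0
    collapsed = sum(
        1 for rewards in group_rewards
        if max(rewards) - min(rewards) <= tol
    )
    return collapsed / len(group_rewards)
```

Plotting this per step makes the stall visible early: as the fraction approaches 1.0, almost no prompt in the batch contributes gradient.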

Mode 5: eval-set contamination

The most embarrassing failure: the post-training data overlaps with the eval set, scores look great, the deployed model regresses on real traffic. Defenses: strict provenance tracking on every dataset, n-gram contamination scans against eval sets before training, and a held-out "secret" eval set that never touches any training pipeline. Treat any eval improvement larger than 5 absolute points with suspicion until you have ruled out contamination.
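
A minimal sketch of an n-gram contamination scan, assuming whitespace tokenization and exact matching after lowercasing. The 13-token window is an assumed default, not something the text prescribes; real pipelines usually normalize punctuation and use hashed n-grams to scale.

```python
def ngrams(text, n=13):
    """Set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_indices(train_texts, eval_texts, n=13):
    """Indices of training examples sharing at least one n-gram with any eval example."""
    eval_grams = set()
    for text in eval_texts:
        eval_grams |= ngrams(text, n)
    return [i for i, text in enumerate(train_texts) if ngrams(text, n) & eval_grams]
```

Run it against every eval set before each training stage and drop or quarantine the flagged indices.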


GRPO deep dive: the math, the memory, the gotchas

GRPO has become the dominant RL algorithm in published reasoning recipes between 2024 and 2026, and yet most descriptions of it sit at the level of "PPO without the critic." That description is correct and not very useful when something goes wrong. This section unpacks GRPO with enough detail to debug a failing run.

The algorithm in one block

For each prompt p in a batch:

  1. Sample G rollouts r_1, ..., r_G from the current policy at non-trivial temperature (typical: T = 0.7 to 1.2; lower than free-form chat, higher than greedy).
  2. Compute a scalar reward R_i for each rollout. The reward can be (a) a verifier signal (1/0 for correct/incorrect plus optional shaping), (b) a learned reward model output, (c) a generative judge score, or (d) a weighted combination.
  3. Compute the group-relative advantage A_i = (R_i − mean(R)) / (std(R) + eps). This is the substitute for PPO's GAE-based advantage estimate.
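  4. For each token t in rollout i, compute the clipped policy-gradient surrogate, the same shape as PPO: min(ratio_t * A_i, clip(ratio_t, 1 - eps_clip, 1 + eps_clip) * A_i), where ratio_t = pi_theta(t) / pi_old(t) is the probability of token t under the current policy divided by its probability under the policy that generated the rollout, and eps_clip is the clip range (distinct from the small eps that stabilizes the division in step 3).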
  5. Add a KL penalty term against the reference model, typically applied per-token rather than per-rollout.
  6. Backprop and update the policy.

The critic is gone. The advantage is computed from rollout statistics, not a value function. That single change drops policy-plus-critic memory from roughly 2x the policy size to 1x, and removes the most failure-prone component of PPO (a poorly fit critic that produces noisy advantages).
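
Steps 4 and 5 can be sketched for a single token as below. This is an illustration, not any framework's implementation: the function name is made up, and the KL term uses the non-negative estimator pi_ref/pi_theta - log(pi_ref/pi_theta) - 1, which is one common choice and an assumption here.

```python
import math


def grpo_token_objective(logp_new, logp_old, logp_ref, advantage,
                         clip_eps=0.2, beta=0.01):
    """Per-token GRPO objective (to be maximized) for one token of one rollout.

    logp_new: log-prob under the current policy
    logp_old: log-prob under the policy that generated the rollout
    logp_ref: log-prob under the frozen reference
    advantage: the rollout's group-relative advantage, broadcast to every token
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # PPO-style pessimistic surrogate.
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    # Assumed KL estimator: r - log(r) - 1 with r = pi_ref / pi_theta (always >= 0).
    log_r = logp_ref - logp_new
    kl = math.exp(log_r) - log_r - 1.0
    return surrogate - beta * kl
```

The training loss is the negative of this quantity averaged over tokens; no value network appears anywhere in it.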

Why group-relative works

The intuition is that the absolute scale of the reward signal does not matter as long as the policy gradient pushes high-reward rollouts up and low-reward rollouts down relative to the local context. With G rollouts per prompt, the local context is the group itself, and normalizing by std handles the case where some prompts have wider reward spreads than others. For verifiable rewards with 0/1 outcomes and G = 16, a group with 4 successes and 12 failures produces advantages of about +1.7 for the successful rollouts and -0.6 for the failed ones, which is the right shape for the gradient.
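
The numbers in that example can be reproduced with a few lines. This sketch assumes the population standard deviation and a small epsilon; the function name is illustrative.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (R_i - mean(R)) / (std(R) + eps)."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# G = 16 with 4 successes and 12 failures on a 0/1 verifier.
rewards = [1.0] * 4 + [0.0] * 12
advantages = group_advantages(rewards)
success_adv, failure_adv = advantages[0], advantages[-1]
```

Here the mean is 0.25 and the std is sqrt(0.25 * 0.75) ≈ 0.433, so `success_adv` comes out near +1.73 and `failure_adv` near -0.58. A group with identical rewards yields all-zero advantages because the numerator is zero for every rollout.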

The same logic explains the main failure mode. When every rollout in a group has the same reward, the std collapses to zero and the advantage is undefined. In practice teams add a small epsilon (1e-6 to 1e-3) and either skip the group entirely or use a smoothed advantage of zero. Either way the prompt provides no gradient on that step. If most prompts in a batch collapse this way, the effective batch size has shrunk and training stalls — see "Mode 4: rollout collapse in RLVR" earlier for symptoms.

Memory and throughput, concretely

A 70B GRPO run with G = 16 rollouts per prompt, rollout length 4096 tokens, batch size 64 prompts per step:

  • Rollout cost per step: 64 prompts * 16 rollouts * 4096 tokens = 4.2M generated tokens. At a typical 70B inference throughput of 2-4K tokens per H100-second on a well-tuned vLLM stack, this is roughly 1000-2000 H100-seconds of rollout time per training step.
  • Policy forward and backward on the same 4.2M tokens: roughly 200-400 H100-seconds.
  • Reference logprob computation (frozen, no backward): roughly 100-200 H100-seconds.
  • Reward model or verifier scoring: depends. A code verifier may take 1-10 seconds per rollout (sandboxed test execution). A small scalar RM scoring 1024 rollouts is negligible.

In short, rollout dominates wall time. Engineering for GRPO is mostly engineering for high-throughput rollout: continuous batching, prefix-caching across same-prompt rollouts, FP8 inference where the policy permits it, and a separate rollout cluster that overlaps with the training step rather than blocking it.
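
The rollout arithmetic above is easy to re-run for your own configuration. The sketch below only restates the numbers in the list; the throughput figures are the text's estimates, not measurements.

```python
# Configuration from the example above.
prompts_per_step = 64
rollouts_per_prompt = 16   # G
rollout_length = 4096      # tokens per rollout (worst case: every rollout hits the cap)

generated_tokens = prompts_per_step * rollouts_per_prompt * rollout_length

# Assumed 70B decode throughput range, in tokens per H100-second.
throughput_low, throughput_high = 2000, 4000

rollout_h100_seconds_max = generated_tokens / throughput_low
rollout_h100_seconds_min = generated_tokens / throughput_high

print(f"{generated_tokens:,} tokens per step")
print(f"{rollout_h100_seconds_min:.0f}-{rollout_h100_seconds_max:.0f} H100-seconds of rollout")
```

Doubling G or the rollout cap doubles `generated_tokens`, which is why those two knobs dominate the cost of a step.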

GRPO knobs that actually matter

Hyperparameter | Typical range | Effect
Group size G | 8 to 64 | Larger G reduces advantage variance but multiplies rollout cost linearly. Sweet spot 16-32 for most teams.
Sampling temperature | 0.7 to 1.2 | Too low: group collapse. Too high: gibberish rollouts that fail the verifier for trivial reasons.
KL coefficient | 0.001 to 0.1 | Lower than PPO defaults because there's no critic instability to compound.
Clip range | 0.1 to 0.3 | Same range as PPO. 0.2 is the standard starting point.
Rollout length | 1K to 16K tokens | For reasoning workloads the long tail matters: 80th-percentile rollouts often hit the cap, which truncates the chain-of-thought signal.
Reward shaping | format bonus, length penalty, partial credit | Critical when raw rewards are too sparse. DeepSeek-R1's published recipe uses a small format-compliance bonus to bootstrap.

GRPO variants in the wild

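  • DAPO (ByteDance, 2025): Decoupled Clip and Dynamic Sampling Policy Optimization. Decouples the lower and upper clip bounds ("clip-higher"), resamples prompts whose rollouts all receive the same reward ("dynamic sampling"), averages the loss at the token level rather than per rollout, and adds reward shaping for overlong responses; verl's headline recipe.
  • RLOO (REINFORCE Leave-One-Out): the older variance-reduction sibling. Same group-relative idea, but each rollout's baseline is the mean reward of the other rollouts in its group, with no division by the group std. Performs similarly to GRPO on verifiable-reward workloads.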
  • REINFORCE++: a critic-free simplification used in some open-source pipelines (notably OpenRLHF) that keeps a PPO-style clipped update and a token-level KL penalty but normalizes advantages across the whole batch rather than within each prompt's group.
  • GRPO with token-level advantages: instead of broadcasting the per-rollout advantage to every token, weight tokens by their relative importance (often a heuristic like "tokens before the final answer get full weight, tokens after get downweighted"). Used by some labs to focus gradient on the reasoning portion.

The pragmatic stance: start with vanilla GRPO from a published recipe (DeepSeek-R1's hyperparameters are a reasonable starting point), measure rollout dynamics, and only adopt variants when a specific failure mode justifies them.


The preference-data zoo: UltraFeedback, HH-RLHF, Nectar, and friends

Preference data is the substrate every preference-learning algorithm runs on. The quality, coverage, and provenance of that data dominate the outcome of DPO, PPO, and even GRPO with an RM-based reward. A short tour of the public datasets that matter in 2026, with notes on when each is appropriate.

The major public preference datasets

  • HH-RLHF (Anthropic, 2022). Around 170K pairs of helpful-and-harmless preference judgments. Historically the standard reference dataset; still widely used as a baseline. The data is generated against an older policy and is showing its age — distribution mismatch with modern instruct models is real.
  • UltraFeedback (Cui et al., 2023). Around 60K prompts each scored across multiple responses by GPT-4 on four dimensions (instruction-following, truthfulness, honesty, helpfulness). The de facto standard for AI-feedback preference training. Most published open-recipe DPO results from 2023-2025 use UltraFeedback in some form.
  • Nectar (Berkeley, 2023). Around 180K prompts with rankings across 7 model responses from a mix of strong models. Higher diversity of source models than UltraFeedback; often used as a complement.
  • PKU-SafeRLHF (Peking University, 2023-2024). Preference pairs annotated separately for helpfulness and harmlessness, allowing multi-objective training. Roughly 30K pairs in the released version.
  • WebGPT comparisons (OpenAI, 2021). Historical interest more than current utility; small (around 20K pairs) and focused on information-seeking dialog.
  • OpenAI Summarize-from-Feedback (Stiennon et al., 2020). The dataset that started modern RLHF for language models. Around 64K pairs of summary preferences. Still useful for ablations of preference learning on a narrow task.
  • HelpSteer and HelpSteer-2 (NVIDIA, 2023-2024). Pointwise multi-attribute scores rather than pairwise preferences. Useful for training multi-head reward models.
  • Tülu 3 preference mix (Lambert et al., 2024). A composite mix of UltraFeedback, on-policy preferences generated against Tülu intermediate checkpoints, and constitutional judgments. Roughly 200K pairs. The cleanest published end-to-end open recipe.

What a frontier lab actually trains on

Public information is partial, but the rough composition is: a small (5-20K) seed of internally collected human preferences on hard or high-stakes prompts; a large (100K-10M) bulk of AI-generated preferences using a strong judge model against a written constitution; and a continuously growing slice of on-policy preferences collected against the current training checkpoint at every iterative round. The mix is the source of most of the quality difference between an open recipe and a frontier one, more than the algorithm.

Constructing your own preference data

For most teams the right answer is a layered approach:

  1. Identify the prompt distribution. What does your deployed model actually see? Pull a representative sample from production traffic if available, or construct one from your target use cases.
  2. Generate diverse candidates. For each prompt, sample 2-8 candidates from a mix of (a) the current model at varied temperatures, (b) a stronger reference model where available, (c) handcrafted "good" responses for the high-stakes anchor set.
  3. Label. Use a strong judge model (GPT-4-class or Claude-class) with a structured rubric for the bulk volume. Use human labelers for the anchor set and for spot checks. Track inter-rater agreement on a held-out slice as a quality signal.
  4. Filter. Drop pairs where the rubric scores are tied or the judge model is uncertain. Drop pairs where the chosen response is shorter and less complete than the rejected one (a common AI-feedback artifact).
  5. Version and provenance. Every pair tagged with its source policy, its judge, its rubric version, its timestamp. This is the discipline that makes ablations meaningful months later.
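
Steps 4 and 5 can be sketched as a provenance-carrying record plus a margin filter. The field names, the score scale, and the `min_margin` default are illustrative assumptions; the point is that every pair carries enough metadata to be re-filtered or ablated later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    chosen_score: float     # judge rubric score for the chosen response
    rejected_score: float   # judge rubric score for the rejected response
    source_policy: str      # which checkpoint produced the candidates
    judge: str              # which judge model or labeler pool produced the label
    rubric_version: str
    timestamp: str          # e.g. ISO 8601


def keep_pair(pair, min_margin=1.0):
    """Drop ties and near-ties: keep only pairs with a clear rubric-score margin."""
    return (pair.chosen_score - pair.rejected_score) >= min_margin


def filter_pairs(pairs, min_margin=1.0):
    return [p for p in pairs if keep_pair(p, min_margin)]
```

Because the record is frozen and versioned, changing the rubric or the margin later means re-running `filter_pairs`, not re-collecting data.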

The cost ratio of this stack: roughly $0.50-$2 per pair for human labels, $0.001-$0.02 per pair for AI labels, near-zero for verifier-derived pairs. The economics force a hybrid; the discipline is to put human labels where they have the highest leverage (anchors, adversarial probes, safety) and AI labels everywhere else.

Distribution shift across iterative rounds

A subtle gotcha: a preference dataset collected against policy v1 may give misleading gradients when applied to policy v2 after a round of DPO. The chosen and rejected responses in the dataset are drawn from v1's output distribution; v2 doesn't produce those responses anymore. The fix is to refresh the dataset at each iterative round by sampling fresh candidates against the current policy, which is what "iterative DPO" actually means under the hood and why it outperforms single-shot DPO on most workloads.


Reward hacking taxonomy and mitigations

Reward hacking is the structural failure mode of every reward-model-based pipeline. It is not one bug but a family, and recognizing which member of the family you are looking at is the first step to fixing it. The taxonomy below collects the patterns most teams encounter in production.

Length bias

The reward model rewards longer responses more, all else equal. The policy learns to be verbose. Symptom: mean response length climbs steadily across training while content quality is flat or degrading. Mitigations: explicit length penalty in the reward (subtract alpha * length); length normalization in DPO; sampling longer rejected responses on purpose in the preference data; or using a reward model trained with length-controlled labels.
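
The first mitigation, written out as a sketch. The function name and the default alpha are assumptions; alpha needs tuning against the scale of your reward model's scores.

```python
def length_penalized_reward(rm_score, n_tokens, alpha=0.001):
    """Reward-model score minus a linear length penalty: rm_score - alpha * length."""
    return rm_score - alpha * n_tokens
```

With `alpha=0.001`, a 500-token response scoring 2.0 is shaped to 1.5, while a 100-token response with the same score keeps 1.9, so extra length has to earn more than 0.001 reward per token to pay off.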

Format hacking

The reward model rewards specific formatting (markdown headers, bullet lists, code fences) regardless of whether they help. The policy learns to format aggressively. Symptom: bullet lists and headers everywhere, including in conversational responses where they make no sense. Mitigation: format-aware reward model evaluation; explicit format-neutrality probes during RM evaluation; preference data that includes format-matched (chosen, rejected) pairs so the format dimension cancels out.

Hedging and over-refusal

The reward model rewards caveats and refusals out of a safety-trained tendency. The policy learns to refuse marginally risky prompts and hedge on definitely-safe ones. Symptom: over-refusal rate climbs on benign queries, often in lockstep with under-refusal rate falling on adversarial queries. Mitigation: explicit over-refusal evals (XSTest, do-not-answer); multi-head safety reward separate from helpfulness reward; preference pairs where the chosen response is a direct helpful answer and the rejected response is an unnecessary refusal.

Sycophancy

The reward model rewards agreement with the user's stated views regardless of accuracy. The policy learns to agree. Symptom: when given an incorrect premise, the policy plays along instead of correcting. Mitigation: targeted sycophancy probes during RM evaluation (responses that politely disagree with a wrong premise should score higher); preference data including disagreement-required examples.

Confidence inflation

The reward model rewards confident-sounding responses. The policy learns to express high confidence even when uncertain. Symptom: rising overconfidence on calibration probes (TriviaQA-Calib or similar). Mitigation: explicit calibration evaluation; preference data where appropriate hedging is the chosen response on uncertain questions.

Verifier-specific hacks (in RLVR)

In verifiable-reward RL, the policy can game the verifier itself. Examples: in code RL, the policy learns to write code that passes the test cases by hardcoding the expected outputs; in math RL, the policy learns to output the answer in a form the parser scores as correct while the reasoning chain is nonsense. Mitigations: hold-out test cases the policy never sees; parser hardening; mixing in process-reward signal so the chain itself is supervised; manual review of high-reward rollouts to catch new exploits.

Stylistic markers

The reward model latches onto stylistic surface features unrelated to quality (specific phrasings, emoji use, structured sign-offs). The policy adopts them. Symptom: rising frequency of specific phrases in trained-model outputs that were not present in SFT outputs. Mitigation: corpus-level analysis of trained-model outputs vs SFT outputs; targeted preference pairs that pit the stylistic marker against substance.

Reward model exploitation via OOD inputs

The policy generates responses that fall outside the reward model's training distribution, and the RM produces unreliable scores on them. Mitigation: track the distribution of the RM's training data and detect when the policy's output distribution drifts beyond that support; uncertainty-gated rewards (pessimistic ensembling); periodic RM retraining on the current policy's distribution.
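
One simple form of pessimistic ensembling, sketched under the assumption that you have scores for the same response from several independently trained reward models. Mean minus k times the ensemble standard deviation is one common choice, not the only one; the function name and k are illustrative.

```python
from statistics import fmean, pstdev


def pessimistic_reward(ensemble_scores, k=1.0):
    """Mean ensemble score minus k * ensemble std.

    When the reward models disagree (a typical sign of an out-of-distribution
    response), the penalty grows and the effective reward drops.
    """
    return fmean(ensemble_scores) - k * pstdev(ensemble_scores)
```

A response on which every RM agrees keeps its full score, while one that a single RM loves and the others do not is discounted.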

The general antidote

No mitigation is sufficient alone. The production recipe for reducing reward hacking is the combination of: (a) KL penalty against a reference, (b) RM ensembles with uncertainty gating, (c) periodic RM retraining on current-policy outputs, (d) explicit reward-hacking probes in the eval suite, (e) verifiable rewards wherever possible to eliminate the proxy entirely, and (f) human spot-checks on high-reward rollouts at every checkpoint. The combination is what frontier labs run; missing any one of them tends to surface as a specific hack down the line.


KL coefficient tuning: worked examples

The KL coefficient is the single most consequential hyperparameter in PPO and GRPO. This section gives concrete starting values and tuning protocols by workload type.

What KL is actually measuring

The KL divergence between the policy and a frozen reference, estimated per token from the sampled rollouts. A KL of 0 means the policy has not moved at all. An average of 1 nat per token would mean the policy has substantially diverged on most tokens, so the running values quoted in this section are best read as per-response totals (per-token KL summed over the rollout) rather than per-token averages. Typical "healthy" running KL values during training: 0.5-5 for PPO, 1-10 for GRPO, 5-30 for an aggressive RLVR run that is intentionally pushing the policy hard. The right number is workload-dependent; the wrong number is one where KL grows unboundedly until the policy is producing gibberish.

A protocol for tuning beta

  1. Start with the published default for your algorithm and base model. PPO: beta = 0.05. GRPO: beta = 0.01-0.02. DPO: beta = 0.1.
  2. Run a short training segment (300-1000 steps) with reward-tracking, KL-tracking, and a small eval suite.
  3. Inspect the KL trajectory. If KL is bounded (oscillating in a stable range), proceed. If KL grows linearly with step count, beta is too low. If KL is stuck near zero, beta is too high.
  4. Adjust by 2x in the appropriate direction and repeat. Two or three iterations usually converge.

Adaptive KL

A more robust pattern that most production stacks now run by default: set a target KL value (e.g., 4), and adjust beta multiplicatively after each batch. If observed KL is above target by a factor of 2, multiply beta by 1.5. If observed KL is below target by a factor of 2, divide beta by 1.5. The result is a self-tuning beta that adapts to changes in the reward landscape across training.
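
The rule described above, as a sketch. The function name and the clamping bounds `beta_min` and `beta_max` are assumptions added for safety; the target, the factor-of-2 band, and the 1.5x adjustment come from the text.

```python
def update_beta(beta, observed_kl, target_kl=4.0, band=2.0, factor=1.5,
                beta_min=1e-4, beta_max=1.0):
    """Multiplicative adaptive-KL update, called once per batch."""
    if observed_kl > band * target_kl:
        beta *= factor          # policy is drifting too far: tighten
    elif observed_kl < target_kl / band:
        beta /= factor          # policy is barely moving: loosen
    # Clamp so a run of bad batches cannot drive beta to an extreme (assumed bounds).
    return min(max(beta, beta_min), beta_max)
```

With the defaults, an observed KL of 9.5 raises beta from 0.05 to 0.075, an observed KL of 1.0 lowers it to about 0.033, and anything between 2 and 8 leaves it unchanged.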

Worked example: a 7B PPO run

A 7B chat-policy PPO run with reward model from UltraFeedback-trained Bradley-Terry RM, rollout length 1024 tokens, batch size 256.

  • Step 0: beta = 0.05, KL = 0.
  • Step 100: KL = 0.8, reward climbing smoothly. Healthy.
  • Step 500: KL = 4.2, reward still climbing but eval scores starting to drop. Sign of incipient reward hacking; consider raising beta.
  • Step 800: KL = 9.5, reward at all-time high, eval scores back below baseline. The policy is reward-hacked.
  • Recovery: roll back to step 400, set beta = 0.1, restart. Reward will climb more slowly but eval scores hold.

The pattern is generic. Watching eval scores (not just reward) every 100-300 steps catches incipient reward hacking before it gets baked in.

Worked example: GRPO on math RLVR

A 32B GRPO run on a math problem set, G = 16, rollout length 8192 tokens, beta = 0.01.

  • Early training: high group variance, advantage signal is strong, success rate on held-out problems climbs from 12% to 28% in the first 2000 steps.
  • KL trajectory: oscillating between 5 and 15. Higher than PPO would tolerate, but stable.
  • Mid training: success rate climbs to 41%, KL stabilizes around 12.
  • Late training: success rate climbs slowly to 47%, KL slowly drifting up.
  • Decision point: if KL crosses 20 without further eval gains, halt and roll back.

The KL ranges that work in RLVR are larger than for chat-quality PPO because the policy is doing more work — producing long reasoning chains the reference model would not have produced. The right calibration target is "the policy is changing as much as it needs to, no more."


Verifiable rewards: math, code, and beyond

Verifiable rewards have moved from a curiosity to the load-bearing signal in modern reasoning post-training. The mechanics matter; this section unpacks the major verifier types and their failure modes.

Math verifiers

Two main flavors:

  1. String-match verifiers. Compare the policy's final answer (extracted from a "boxed" or "answer:" marker) against a known correct answer. Cheap, fast, fragile. Equivalent answers in different forms (3/4 vs 0.75 vs 0.75000) fail string match.
  2. Symbolic verifiers. Parse the answer with SymPy or a CAS and check mathematical equivalence with the reference. Robust to surface form, slower (10-100ms per check), occasionally fails on exotic forms.

Production stacks use both: string match as the fast path, symbolic fallback when the string match fails. The combination catches roughly 95-99% of correct answers without false-positive credit. The remaining errors are split between answers that are equivalent but neither matcher recognizes (under-credit) and answers that look correct but are not (over-credit, rare).
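
A sketch of the two-tier check. To stay dependency-free it uses `fractions.Fraction` as a stand-in for the symbolic tier, so it only recognizes equivalent rational and decimal forms; a production verifier would call SymPy or another CAS at that point. The `\boxed{...}` extraction pattern and the function names are assumptions.

```python
import re
from fractions import Fraction


def extract_boxed(text):
    """Return the contents of the last \\boxed{...} marker, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def to_fraction(s):
    """Parse '3/4', '0.75', '0.75000', ... into an exact Fraction, else None."""
    try:
        return Fraction(s.replace(" ", ""))
    except (ValueError, ZeroDivisionError):
        return None


def verify(response, reference):
    """1.0 if the final answer matches the reference, else 0.0."""
    answer = extract_boxed(response)
    if answer is None:
        return 0.0
    if answer == reference.strip():          # fast path: exact string match
        return 1.0
    a, b = to_fraction(answer), to_fraction(reference)
    return 1.0 if a is not None and a == b else 0.0   # fallback: exact numeric equality
```

This credits `\boxed{0.75}` against a reference of `3/4`, which the string tier alone would reject, and gives zero when no boxed answer is found.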

The data sources are well-established: GSM8K, MATH, AIME problems, Olympiad problems, AoPS-derived datasets, NuminaMath. NuminaMath in particular (around 860K verified problems) has become the workhorse training set for math RLVR through 2025-2026.

Code verifiers

The reward signal is "do the tests pass." Concretely:

  1. The policy produces a code response (possibly with reasoning and a final function).
  2. The verifier extracts the code, runs a static linter, then runs it in a sandboxed environment against a held-out set of test cases.
  3. Reward = fraction of tests passed (or 0/1 for all-or-nothing).

The sandbox is the engineering work. A safe sandbox prevents the policy's code from doing anything beyond running: no filesystem access, no network access. Production stacks use Firecracker microVMs, gVisor, or per-rollout containers with strict resource limits. Per-rollout sandbox start time and test execution typically dominate the verifier cost, at 1-5 seconds per rollout in well-tuned setups.
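
A sketch of the reward computation with the sandbox boundary factored out as a `runner` callable. The default `subprocess_runner` is only a placeholder for local experiments: a child Python process is not a sandbox and still has filesystem and network access. All names here are illustrative, and the linter step is omitted.

```python
import os
import subprocess
import sys
import tempfile


def subprocess_runner(code, test, timeout_s=5.0):
    """Run candidate code plus one assertion snippet in a child interpreter.

    NOT a sandbox. Swap in a microVM or container runner for real training.
    """
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "candidate.py")
        with open(path, "w", encoding="utf-8") as f:
            f.write(code + "\n\n" + test + "\n")
        try:
            proc = subprocess.run([sys.executable, "-I", path], cwd=workdir,
                                  capture_output=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return False
        return proc.returncode == 0


def code_reward(code, tests, runner=subprocess_runner, all_or_nothing=False):
    """Fraction of tests passed, or 0/1 when all_or_nothing is set."""
    if not tests:
        return 0.0
    passed = sum(1 for test in tests if runner(code, test))
    if all_or_nothing:
        return 1.0 if passed == len(tests) else 0.0
    return passed / len(tests)
```

Keeping the runner behind one function signature means the reward logic stays the same when the execution backend changes.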

Data sources: HumanEval+, MBPP+, LiveCodeBench, CodeContests, APPS, SWE-Bench Verified for full-repo tasks. SWE-Bench Verified in particular pushes verifier complexity — running the affected test suite against an entire repo state — into the multi-minute-per-rollout regime, which has direct implications for batch size and wall-clock budget.

Formal verifiers

For theorem-proving and constrained problem-solving, the verifier is a proof checker (Lean, Coq, Isabelle) that mechanically validates a proof produced by the policy. Failure mode: the policy learns to write proofs that the checker accepts but that are vacuous or incomplete in ways the checker missed. Mitigations: combine with informal-statement evaluation; track proof length distributions to spot trivializations.

Other verifier domains

  • Structured-output tasks. Reward = does the output match the required schema (JSON, regex, function signature). Cheap, sharp signal.
  • Tool-use trajectories. Reward = did the agent reach a terminal success state in the environment. Used in agent RL; covered in agent serving infrastructure for the deployment side.
  • Multilingual translation. Reward = a metric like BLEU or COMET against a reference; partial credit. Quality of the metric becomes the bottleneck.
  • Factuality. Reward = does the response match a retrieved gold passage. Approximate; sensitive to retrieval quality.
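
The structured-output case is the simplest of these to sketch. The checker below assumes the "schema" is just a mapping from required field names to Python types, which is a deliberate simplification of a real JSON Schema validator; the function name is illustrative.

```python
import json


def json_schema_reward(output, required_fields):
    """1.0 if output parses as a JSON object with every required field of the right type.

    required_fields: dict mapping field name -> expected Python type.
    Note: bool is a subclass of int in Python, so True passes an `int` check.
    """
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(obj, dict):
        return 0.0
    ok = all(
        name in obj and isinstance(obj[name], expected)
        for name, expected in required_fields.items()
    )
    return 1.0 if ok else 0.0
```

The signal is sharp because there is no partial credit: malformed JSON, a missing field, and a wrongly typed field all score zero.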

When verifiable rewards do not work

Most of the world. Anything subjective (creative writing, conversational quality, style), anything requiring long-horizon judgment (was the agent helpful over a multi-turn session), anything where the right answer depends on context the verifier doesn't have. The frontier strategy in 2026 is to use verifiable rewards where they apply and a learned reward model or generative judge everywhere else, with the verifiable portion of the training mix typically being 30-70% of total reasoning RL volume.


Process reward models: PRM800K, ProcessBench, and step-level supervision

Process reward models score intermediate reasoning steps rather than only the final answer. The argument for them is that outcome rewards are sparse and unable to distinguish correct reasoning that happens to fail from incorrect reasoning that happens to succeed. The argument against them is that step-level labels are expensive, noisy, and gameable.

PRM800K and the original results

The Lightman et al. 2023 PRM paper (arXiv:2305.20050) released PRM800K, a dataset of around 800K step-level labels on math problems. Each step was labeled "correct," "incorrect," or "neutral" by human annotators. The headline result: a process-reward model trained on PRM800K substantially outperformed an outcome-reward model on math reasoning when used to rank candidate solutions in a best-of-N setup.

The caveat: the dataset is expensive (an estimated $200K-$1M of human labeling time) and domain-specific. Reproducing it in a new domain is non-trivial.

Inducing process labels from outcome labels

A more scalable approach uses outcome labels to induce step labels. For a step in a solution, generate many continuations from that step and check what fraction reach a correct final answer. A high continuation-success rate is evidence the step is correct; a low rate is evidence it is not. Math-Shepherd and several follow-ups use this approach to bootstrap process labels at lower cost than direct annotation, with reported gains comparable to PRM800K-trained models.
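
The continuation-sampling idea can be sketched as follows. `sample_continuation` and `is_correct` are hypothetical callables standing in for the policy and the outcome verifier; the sample count and the 0.5 threshold are assumptions.

```python
def estimate_step_labels(steps, sample_continuation, is_correct,
                         n_samples=8, threshold=0.5):
    """Monte Carlo step labels from outcome labels.

    steps: list of reasoning steps for one solution.
    sample_continuation(prefix) -> final answer reached from that prefix.
    is_correct(answer) -> bool, the outcome verifier.
    Returns one (success_rate, label) tuple per step.
    """
    labels = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        hits = sum(
            1 for _ in range(n_samples)
            if is_correct(sample_continuation(prefix))
        )
        rate = hits / n_samples
        labels.append((rate, rate >= threshold))
    return labels
```

The cost is `len(steps) * n_samples` continuations per solution, which is why this is cheaper than human step annotation but still far from free.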

ProcessBench and PRM evaluation

ProcessBench (released in 2024) is a benchmark specifically for evaluating PRMs: given a step-by-step math solution, identify the earliest erroneous step, or conclude that every step is correct. PRMs that score well on this benchmark also tend to be useful as RL signals; PRMs that score well on outcome accuracy but poorly on step-level error localization tend to be process-hackable.

Using PRMs in training

Two main patterns:

  1. Best-of-N at inference. Sample N candidates from the policy, rank by PRM, pick the best. No training-time change. The simplest and least risky use of a PRM.
  2. Dense reward in RL. Use the PRM's step-level scores as a dense reward signal during GRPO or PPO. Risk: the policy learns to produce plausible-looking but vacuous steps that score well. Mitigation: combine with outcome rewards as a check.
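
Pattern 1 fits in a few lines. `score_steps` is a hypothetical stand-in for the PRM, and taking the minimum step score as the solution score is an assumed aggregation (the product of step scores is another common choice).

```python
def best_of_n(candidates, score_steps, aggregate=min):
    """Pick the candidate whose aggregated PRM step scores are highest.

    candidates: list of solutions, each a non-empty list of steps.
    score_steps(steps) -> list of per-step scores in [0, 1].
    """
    return max(candidates, key=lambda steps: aggregate(score_steps(steps)))
```

With `aggregate=min`, one clearly wrong step sinks a candidate even if the rest of the chain scores well, which matches the intent of step-level supervision.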

Should you train your own PRM?

For most teams, no. The data cost is high, the engineering complexity is real, and outcome rewards plus enough rollouts usually capture most of the signal. The exception: when you have a domain where outcome rewards are too sparse to provide gradient (long-horizon reasoning, multi-step proofs, agentic tasks) and you can afford the labeling investment. In those cases, follow ProcessBench-style evaluation discipline to make sure the PRM is actually doing what it claims.


Safety post-training in depth: Constitutional, deliberative, Llama Guard

Safety post-training has matured into a distinct sub-field with its own published recipes. Three lineages are worth knowing in detail.

Anthropic Constitutional AI

The original public Constitutional AI paper (Bai et al., 2022 — arXiv:2212.08073) describes a two-phase recipe:

  1. Supervised CAI. For each prompt, the model produces a response. The same model (with a critique prompt) critiques the response against a written constitution. The model then revises the response. The revised response becomes SFT data.
  2. RLAIF. A judge model (with the constitution as context) scores pairs of responses. The judgments train a reward model. Standard PPO follows.
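
The supervised phase can be sketched as a three-call loop. `generate` is a hypothetical text-in, text-out callable wrapping the model, and the prompt templates are invented for illustration; they are not the wording used in the paper.

```python
def supervised_cai_example(prompt, principle, generate):
    """Draft, critique against one principle, revise; return an SFT record."""
    draft = generate(prompt)
    critique = generate(
        f"Prompt: {prompt}\nResponse: {draft}\n"
        f"Critique the response against this principle: {principle}"
    )
    revision = generate(
        f"Prompt: {prompt}\nResponse: {draft}\nCritique: {critique}\n"
        "Rewrite the response so that it addresses the critique."
    )
    # The revised response is what goes into the SFT mix; the draft and critique
    # are kept alongside it for auditing.
    return {"prompt": prompt, "response": revision,
            "draft": draft, "critique": critique}
```

Because the principle is an argument, the same loop can be run once per principle or with a principle sampled per prompt.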

The constitution is a list of natural-language principles ("the response should not encourage self-harm"; "the response should treat all users with respect"). The principles are auditable, editable, and version-controlled. In practice the published constitutions have 10-30 principles; internal versions are typically longer.

The reason CAI matters beyond Anthropic-specific work: it gives you a way to scale safety preference labeling without proportionally scaling human labelers. The bulk of safety judgments are made by the judge model; humans set the constitution and audit the process.

Deliberative alignment

OpenAI's deliberative alignment (described in 2024 publications and the o1 system card) takes a different approach. The model is trained to explicitly reason about safety considerations before producing a response. The training signal includes both the visible response and the safety reasoning, with the safety reasoning evaluated against written safety policies.

The structural argument: a model that has been trained to explicitly consider safety policy is more robust to jailbreaks than one that has been trained to refuse specific patterns, because the policy reasoning generalizes to inputs the refusal training never saw.

The cost: safety reasoning consumes tokens at inference time, and the engineering for collecting and grading safety-reasoning data is non-trivial. Worth the investment at frontier scale; probably not worth it for smaller teams.

Llama 3 safety post-training

The Meta Llama 3 release described a multi-stage safety pipeline running parallel to the helpfulness pipeline:

  1. Adversarial preference data generation, with internal and external red-teamers producing pairs where the chosen response is a calibrated refusal or safe alternative.
  2. Multi-head reward modeling, with separate scalar heads for helpfulness and harmlessness, allowing explicit weighting at policy-optimization time.
  3. Safety-specific evaluation (XSTest, HarmBench, internal red-team probes) at every stage gate.
  4. A final safety SFT pass that re-anchors refusal calibration.

The release also published Llama Guard 3, a separate classifier model run at serving time as a defense-in-depth layer. The Llama Guard / model-post-training combination is the production pattern most teams now follow.

Refusal calibration is a frontier in itself

The hardest problem in safety post-training is not preventing harmful outputs; most teams can drive harmful-output rates very low. The hard problem is preventing over-refusal: the model refusing benign queries because they superficially resemble harmful ones. The published XSTest benchmark measures this explicitly, and the gap between over-refusal rate and under-refusal rate is the most useful single metric for tracking safety post-training quality across iterations. A team with a 1% over-refusal rate and a 3% under-refusal rate is in a better place than a team with 0.5% over-refusal and 8% under-refusal, even though the latter looks better if you only watch the over-refusal number.
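Both rates can be computed from one labeled eval set. The sketch below assumes each record is a pair `(is_harmful_prompt, model_refused)`; the record format and the function name are illustrative, not part of XSTest itself.

```python
from typing import Iterable, Tuple


def refusal_rates(records: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (over_refusal_rate, under_refusal_rate).

    over-refusal  = share of benign prompts the model refused
    under-refusal = share of harmful prompts the model answered
    """
    benign = benign_refused = harmful = harmful_answered = 0
    for is_harmful, refused in records:
        if is_harmful:
            harmful += 1
            harmful_answered += int(not refused)
        else:
            benign += 1
            benign_refused += int(refused)
    if benign == 0 or harmful == 0:
        raise ValueError("need both benign and harmful prompts")
    return benign_refused / benign, harmful_answered / harmful
```

Reporting the two numbers side by side per checkpoint makes it harder for an improvement on one axis to hide a regression on the other.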

Cross-references

For the serving-time defense-in-depth that complements safety post-training, see production safety guardrails. For the hallucination axis of model unreliability, which interacts with calibration training, see AI hallucinations: why they happen.


Post-training compute economics by stage

The dollar cost of each post-training stage matters for budgeting and for understanding why the recipe converges where it does. A reference set of numbers for 2026 (H100 cluster, $2.50-$4.00 per H100-hour depending on reservation and spot availability):

| Stage | 7B model | 70B model | 400B model | Dominant cost driver |
| --- | --- | --- | --- | --- |
| Data curation & cleaning | $20K-$50K | $30K-$80K | $50K-$150K | Engineering time, not compute |
| SFT v0 (1M examples) | $500-$2K | $5K-$15K | $30K-$80K | Training FLOPs |
| Rejection-sampling SFT (N=16, 500K prompts) | $2K-$5K | $15K-$40K | $80K-$200K | Rollout inference |
| Reward model training (300K pairs) | $300-$1K | $2K-$6K | rarely RM-trained at scale | RM training |
| DPO (200K pairs, 2 epochs) | $400-$1.5K | $4K-$12K | $25K-$70K | Training + reference forward |
| GRPO on math/code (20K steps, G=16) | $8K-$20K | $75K-$200K | $400K-$1M+ | Rollout inference |
| Iterative DPO (3 rounds) | $1.5K-$5K | $15K-$40K | $80K-$200K | Per-round labeling + training |
| Safety post-training (full stack) | $1K-$5K | $10K-$30K | $50K-$150K | Adversarial data + multi-head RM |
| Final SFT clean-up | $300-$1K | $3K-$8K | $15K-$40K | Training FLOPs |
| Eval & ablation budget | $1K-$3K | $8K-$20K | $40K-$100K | Multiple-checkpoint eval runs |
| Total end-to-end | $35K-$95K | $170K-$450K | $800K-$2.2M+ | Rollouts dominate at scale |

Labeling costs sit on top of this and frequently exceed compute. A 200K-pair preference dataset with human labels at $1 per pair is $200K; AI labels via a strong judge model at $0.005 per pair is $1K. The hybrid stack — humans for the highest-leverage 5-10% of pairs, AI for the rest — typically lands at $20K-$60K total for a 200K-pair set.
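The per-pair arithmetic is easy to reproduce. The sketch below uses the $1 and $0.005 prices quoted above; the function name and the idea of a single `human_fraction` knob are illustrative assumptions.

```python
def labeling_cost(n_pairs: int, human_fraction: float,
                  human_price: float = 1.00, ai_price: float = 0.005) -> float:
    """Raw labeling cost in dollars for a human/AI hybrid.

    Models only per-pair prices; audit passes, relabeling, and judge-prompt
    iteration are not included.
    """
    if not 0.0 <= human_fraction <= 1.0:
        raise ValueError("human_fraction must be in [0, 1]")
    human_pairs = n_pairs * human_fraction
    ai_pairs = n_pairs - human_pairs
    return human_pairs * human_price + ai_pairs * ai_price


human_only = labeling_cost(200_000, 1.0)   # all-human labeling
ai_only = labeling_cost(200_000, 0.0)      # all-AI labeling
hybrid_low = labeling_cost(200_000, 0.05)  # humans label 5% of pairs
hybrid_high = labeling_cost(200_000, 0.10) # humans label 10% of pairs
```

The raw per-pair figure for the 5-10% hybrid comes out at roughly $11K-$21K, below the $20K-$60K range quoted above. The sketch does not model audit, relabeling, or tooling overhead, which is one plausible source of the difference.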

For reference, frontier closed-lab post-training costs are an order of magnitude or two above the 400B numbers above, driven by larger label volumes, more iterative rounds, larger RM ensembles, and substantial red-team investment that doesn't show up in any single line item. The open-recipe tier is the one most teams can realistically run; the frontier tier is the one most teams shouldn't try to replicate.


PPO vs DPO vs GRPO vs SimPO vs ORPO: full comparison

The method zoo deserves a side-by-side comparison rather than a string of paragraphs. The table below collapses the most-used preference-learning algorithms onto a common axis.

| Algorithm | Models in memory | Rollout cost | Data needed | KL handling | Typical compute (70B, 1 epoch) | Strengths | Weaknesses |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SFT | 1 (policy) | None | Curated (prompt, response) pairs | None | ~2K H100-hours | Largest single quality jump; simple | Cannot learn from preferences |
| DPO | 2 (policy + reference) | None | (prompt, chosen, rejected) pairs | Implicit, via reference log-ratio | ~1.8K H100-hours | Cheap, stable, easy to ship | Reference drift over iterative rounds |
| IPO | 2 (policy + reference) | None | Same as DPO | Implicit, with explicit bound | ~1.8K H100-hours | More conservative than DPO; resists overfitting | Slower convergence than DPO |
| KTO | 2 (policy + reference) | None | Unpaired binary feedback | Implicit | ~1.8K H100-hours | Uses cheap thumbs-up/down data | Less informative than pairs |
| ORPO | 1 (policy) | None | Same as DPO | None (built into odds-ratio loss) | ~1.6K H100-hours | Single fine-tuning pass; no reference | Less separable from SFT |
| SimPO | 1 (policy) | None | Same as DPO | None | ~1.5K H100-hours | Reference-free, length-normalized | Higher over-training risk |
| RLOO | 2 (policy + reference) | Yes (G rollouts/prompt) | Reward signal | Explicit KL penalty | ~22K H100-hours | Simpler than PPO, no critic | Higher variance than GRPO |
| GRPO | 2 (policy + reference) | Yes (G rollouts/prompt) | Reward signal | Explicit KL penalty | ~25K H100-hours | No critic, stable, fits verifiable rewards | Group collapse failure mode |
| PPO | 4 (policy + critic + RM + ref) | Yes (1 rollout/prompt, plus value training) | (prompt, reward signal) | Explicit KL penalty + GAE | ~28K H100-hours | Most powerful when tuned | Critic instability; most hyperparameter-sensitive |

The compute numbers are order-of-magnitude estimates for a 70B model over one epoch of 200K pairs (or 200K prompts in the rollout-based algorithms with G = 16). Real numbers vary by 2-3x with infrastructure tuning.

Where the lines blur

The taxonomy is cleaner than the practice. Iterative DPO with on-policy preference collection is structurally a PPO outer loop with a supervised inner step. GRPO with a learned reward model is structurally PPO with the critic replaced by group statistics. The categorical labels matter less than the four underlying design dimensions: (1) is the reward learned or verifiable, (2) is the policy updated on-policy or off-policy, (3) is there a critic, and (4) what's the KL handling.


Mode collapse, length bias, sycophancy: failure-mode catalog

A catalog of named failure modes beyond reward hacking, with diagnostic signatures and fixes. This complements the "Common failure modes and recovery" section above with longer-tail patterns that most teams encounter eventually.

Mode collapse

The policy converges to a small number of response templates regardless of prompt. Diagnostic signature: low n-gram diversity across responses to varied prompts; declining response-variety-per-prompt metric. Causes: too-low sampling temperature during rollouts, a KL coefficient too low to keep the policy near the reference distribution (so it concentrates on a few high-reward modes), an over-trained reward model that scores one template highly. Fixes: raise temperature, raise the KL coefficient, retrain the RM with more diverse training data, mix in SFT replay.
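A cheap way to track the n-gram diversity signature is a distinct-n ratio over a batch of responses to varied prompts. The sketch below splits on whitespace for simplicity; using your tokenizer's tokens instead would be a reasonable substitution.

```python
from typing import List


def distinct_n(responses: List[str], n: int = 2) -> float:
    """Unique n-grams divided by total n-grams across all responses.

    Values near 1.0 mean varied output; a steady decline across checkpoints
    is one signal consistent with mode collapse.
    """
    total = 0
    unique = set()
    for text in responses:
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0
```

The absolute value depends heavily on response length and prompt mix, so it is most useful as a trend on a fixed prompt set rather than as a threshold.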

Length bias

Covered in detail above. The corollary that's worth flagging here: length bias compounds across iterative rounds. If round 1 introduces a slight verbosity tendency, round 2's preferences are collected on verbose responses, and the verbosity gets baked in deeper. Track mean response length per round as a leading indicator.

Sycophancy

The policy agrees with user-stated premises even when they are wrong. Documented in Perez et al. 2023 and many follow-ups. Sycophancy is partially a pretraining artifact (web text is full of agreement) and partially a post-training artifact (humans rate agreeable responses higher). Mitigation requires training data that deliberately rewards well-founded disagreement with the user; the tendency does not fade on its own.

Alignment tax on niche capabilities

A typical post-training pipeline preserves capability on broad benchmarks (MMLU, MATH) but quietly erodes niche skills (specific language pairs, esoteric in-context-learning patterns, role-play depth). The fix is to ensure the SFT mix and the replay buffer include enough of each niche; the discipline is to maintain a per-niche eval suite that catches regressions early.

Refusal cascade

A specific safety post-training failure where the model learns to refuse anything that superficially resembles a harmful pattern, including obviously benign queries (basic chemistry questions, fiction writing requests). Measured by XSTest-style over-refusal benchmarks. Caused by safety preference data that lacks "calibrated helpful" responses to ambiguously shaped prompts. Fix: include explicit positive examples of helpful responses to safety-adjacent prompts.

Inference-time drift

The policy behaves differently when served with sampling parameters other than the ones used during training. A model trained with rollouts at T = 1.0 may produce odd outputs when served at T = 0.3. Mitigation: train the policy across a range of sampling parameters, or document the recommended sampling regime as part of the model release.

Tokenizer-boundary artifacts

A rare but pernicious failure: the post-training data was tokenized with a tokenizer slightly different from the deployment tokenizer, and the model has learned token-boundary-specific patterns that misfire at serving time. Symptom: occasional truncation, doubled tokens, or repetition that wasn't present during training eval. Fix: lock the tokenizer to the deployment version from the first SFT stage forward.

KV-cache mismatch in iterative RL

A subtle infrastructure failure where the rollout server's KV-cache implementation differs slightly from the training server's. The policy is trained against logprobs computed one way and rolled out with logprobs computed another way, producing systematic but invisible bias. Fix: use the same forward-pass code path for rollout and training; verify with a logprob-consistency check at the start of every run. See KV cache inference memory math for related serving-side mechanics.
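A logprob-consistency check can be as simple as scoring the same tokens through both code paths and comparing. The sketch below assumes you have already extracted per-token logprobs from the rollout server and the trainer for an identical sequence; the function names and the `tol` default are illustrative and should be calibrated to your numeric precision.

```python
from typing import Sequence


def max_logprob_gap(rollout_logprobs: Sequence[float],
                    trainer_logprobs: Sequence[float]) -> float:
    """Largest absolute per-token difference between the two code paths."""
    if len(rollout_logprobs) != len(trainer_logprobs):
        raise ValueError("sequences must cover the same tokens")
    return max(
        (abs(a - b) for a, b in zip(rollout_logprobs, trainer_logprobs)),
        default=0.0,
    )


def logprobs_consistent(rollout_logprobs: Sequence[float],
                        trainer_logprobs: Sequence[float],
                        tol: float = 1e-2) -> bool:
    """True when the two paths agree within tol (an assumed threshold)."""
    return max_logprob_gap(rollout_logprobs, trainer_logprobs) <= tol
```

Running this on a few hundred held-out sequences at the start of a run, and failing fast when it returns False, is cheaper than diagnosing a biased policy gradient days later.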


The 2026 post-training playbook

Synthesis of the patterns in this guide as a concrete recipe. This is not the only good recipe; it is a strong default that a small team can follow without inventing anything novel.

Stages

  1. SFT v0. Start from the best base model your compute budget supports (in 2026, that means Llama 3.3 70B, a model from the Qwen 3 family, DeepSeek-V3-Base for the ambitious, or a smaller open-weight model for the budget-constrained). SFT on a curated mix biased toward your target workload. Aim for 1-3 epochs over a few hundred thousand examples. Evaluate against an IFEval / MMLU / domain-specific battery.
  2. Rejection-sampling SFT v1. Generate 8-16 candidates per prompt on the SFT v0 model. Score each with a reward model (or a strong judge). Train SFT v1 on the survivors. This step alone typically captures 60-80% of the gains a full RL pipeline would deliver.
  3. DPO v1. On a preference dataset (UltraFeedback, Nectar, or your own), run DPO with beta = 0.1 for 1-2 epochs. Track chosen-logprob and rejected-logprob separately; abort if chosen-logprob is falling.
  4. GRPO on verifiable rewards. For math, code, and structured tasks, run GRPO with G = 16, beta = 0.01, 2000-10000 steps. Use NuminaMath plus your domain-specific verifiers.
  5. Iterative DPO v2. Re-generate preferences against the GRPO-trained model using a strong judge. Run DPO again. This is the round that captures most of the on-policy preference signal.
  6. Safety post-training. Adversarial preference data, multi-head reward model, refusal SFT mix. Evaluate against XSTest and HarmBench.
  7. Final SFT clean-up. A short pass with replay from earlier stages to re-anchor any capabilities that drifted. Often 5-10% of the SFT v0 mix is enough.
  8. Eval and ablation. Full eval portfolio (instruction-following, reasoning, code, multilingual, safety, calibration, long-context, held-out preference). Document what each stage moved.
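Stage 2 (rejection-sampling SFT) is the easiest stage to sketch in code. The example below is a minimal best-of-N filter, not a production pipeline: `sample` and `score` are hypothetical callables standing in for your generation call and your reward model or judge, and `min_score` is an assumed quality floor.

```python
from typing import Callable, Dict, Iterable, List


def rejection_sample(prompts: Iterable[str],
                     sample: Callable[[str], str],
                     score: Callable[[str, str], float],
                     n: int = 8,
                     min_score: float = 0.0) -> List[Dict[str, str]]:
    """Keep the best of n sampled responses per prompt if it clears min_score."""
    kept = []
    for prompt in prompts:
        # Score every candidate once, then take the highest-scoring one.
        scored = [(score(prompt, c), c) for c in (sample(prompt) for _ in range(n))]
        best_score, best = max(scored, key=lambda pair: pair[0])
        if best_score >= min_score:
            kept.append({"prompt": prompt, "response": best})
    return kept
```

Prompts whose best candidate still falls below the floor are dropped rather than padded with weak data, which is usually the safer default for an SFT set.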

Compute budget

For a 70B model running this recipe end-to-end on a 64xH100 cluster: roughly 6-10 weeks of wall clock and $160K-$430K of pure compute (64 GPUs for 6-10 weeks at the $2.50-$4.00 per H100-hour rates quoted earlier, in line with the 70B total in the stage cost table), plus $100K-$500K of labeling depending on how much is AI-feedback vs human-feedback. Smaller models (7B-13B) drop the budget by an order of magnitude. Frontier-scale (400B+) raises it by another.

Failure-recovery discipline

Every stage gate has a measurable pass criterion. Failed gates trigger rollback, not forward progress. Versioned checkpoints, versioned data, versioned eval suites. The team that ships is the team that can roll back any stage to a known-good checkpoint in under an hour.

What this recipe does not cover

  • True frontier-scale RL (multi-rack, multi-round, ensemble RMs, online red-teaming). The recipe above is the "competitive open-weight" tier, not the "rival a frontier lab" tier.
  • Highly specialized domains (medical, legal, financial) that need their own evaluation discipline beyond the generic portfolio.
  • Multi-modal post-training. Vision-language and audio-language post-training shares the same skeleton but adds modality-specific data pipelines; see multimodal serving for the deployment side.
  • Agent post-training. Training a model to act in an environment (browsers, code interpreters, multi-tool agents) introduces trajectory-level rewards and long-horizon credit assignment problems that the chat recipe doesn't handle; see agent serving infrastructure for related serving topics.

The recipe is a default. Departures from it should be motivated by a specific failure mode or a specific opportunity, not by methodological preference.


The bottom line

The problem is the alignment tax: every stage of post-training trades a sliver of raw capability for a much larger gain in usefulness, and the discipline is to keep the trade favorable. The solution is a staged pipeline — SFT, then preference learning, then optional reasoning RL, with safety post-training layered in — evaluated on workload-specific signals rather than a single benchmark. The biggest single lever for most teams is DPO over PPO: comparable quality at an order of magnitude less compute.

  • Start with SFT. It's the largest single quality jump and the foundation every later stage builds on.
  • Default to DPO before PPO. DPO matches PPO within 0.3 MT-Bench points at 10× less compute; escalate only when DPO plateaus.
  • Treat the reward model as the bottleneck. If RM quality is poor, no amount of policy optimization rescues the run.
  • Stage your pipeline; track provenance. Six to ten stages with replay buffers and contamination scans, not one monolithic fine-tune.
  • Iteration speed beats raw FLOPs. Pretraining is one long run; post-training is a portfolio. Optimize the portfolio loop.

For the evaluation signals that gate every stage, read eval infrastructure; for the distributed-training stack underneath the optimizers, read distributed LLM training.


FAQ

Is RLHF necessary, or is SFT enough? SFT gets you most of the way. RLHF (or DPO) adds the last 10-20% of quality. For some workloads, that's worth the engineering cost; for others, not.

DPO or PPO? DPO is the right starting point for most teams. Move to PPO if DPO plateaus.

Do I need human labelers? For frontier work, yes. For most other work, AI labels (especially RLAIF or constitutional approaches) cover most needs. Humans for the highest-stakes anchors.

How much data do I need for SFT? For a single capability (e.g., a specific format), thousands of examples can suffice. For a general assistant, tens to hundreds of thousands. Quality dominates quantity.

Can I do post-training on open-weight models? Yes. The post-training literature is largely about open-weight models (Llama, Mistral, Qwen). Tooling is mature.

How long does a post-training run take? SFT: hours to days. RLHF: days to weeks. Reasoning fine-tuning: weeks. Iterative pipelines: months.

What's the right model size for SFT? The same model size you intend to deploy. SFT doesn't change the base model's capability ceiling much; it shapes how that capability is presented.

Can I post-train a model to be smarter? Capability bound is mostly set by pretraining. Post-training can elicit and shape existing capability, including making implicit reasoning explicit, but it can't add fundamentally new capability.

What's the difference between GRPO and PPO? GRPO drops the critic. Instead of estimating a value function, it normalizes rewards within a group of G rollouts per prompt and uses the group-relative advantage directly. Same clipped surrogate objective, same KL penalty against a reference, fewer moving parts. Memory and stability improve; the price is needing enough rollouts per prompt for the group-relative estimate to be useful. For verifiable-reward settings this is almost always the right trade.
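The group-relative advantage is short enough to write out. The sketch below normalizes by the population standard deviation with a small epsilon; whether an implementation uses population or sample standard deviation, and the exact epsilon, are assumptions that vary by framework.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def grpo_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Advantage of each rollout relative to its own group of G rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

A group whose rewards are all identical yields zero advantage for every rollout, so that prompt contributes no gradient. That is one reason the method needs enough rollouts per prompt, and prompts of suitable difficulty, for the estimate to be useful.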

Is RLVR the same as RLHF? No. RLHF uses a learned reward model trained on human preferences. RLVR uses a deterministic verifier (test suite, math checker, formal proof) as the reward. Within the verifier's domain, RLVR removes the failure mode of exploiting a learned reward model's errors, though an incomplete verifier (a thin test suite, a lenient answer parser) can still be gamed. It only applies where verifiers exist.

Should I use a single reward model or an ensemble? For any production frontier training: an ensemble. The variance across an ensemble is the cheapest signal you have for "the RM is uncertain here, don't take a big gradient step." For smaller teams running a single round of DPO, a single RM (or no RM at all, with DPO's implicit reward) is fine.

How much of post-training is now AI feedback vs human feedback? Empirically, the volume balance has flipped. Most preference labels generated by frontier labs in 2026 are AI-generated. Humans label the highest-stakes anchors, audit AI labels, and write constitutional rubrics. The hybrid is the norm; pure human-only RLHF at scale is no longer cost-effective.

Can a small team run RLHF/RLVR meaningfully? SFT and DPO, yes — single-node fine-tuning is well-supported by open-source tooling. PPO and GRPO at meaningful scale need a multi-node training setup co-located with serving infrastructure for rollouts. Plan for at least 8-16 H100/B200-class GPUs for a 7-13B model and substantially more for anything larger. Open-source frameworks like TRL, OpenRLHF, and verl have made the entry barrier much lower than it was in 2023, but the engineering investment is still real.

What does Constitutional AI add over plain RLAIF? A written, explicit, auditable rubric. The "constitution" is what the AI judge consults when scoring responses. Without it, RLAIF inherits whatever implicit preferences the judge model has from its own training — which may be opaque and unstable across model versions. With it, the alignment target is documented and editable.

Why does the same model behave differently after each post-training stage? Post-training shapes the policy's mode without much changing its capability ceiling. Each stage moves the mode toward whatever signal it was trained on — instructions for SFT, preferences for DPO, verifier-correct answers for RLVR. The capabilities are there throughout; what changes is which ones are surfaced by default. This is why ablation and stage ordering matter so much.

How does LoRA or QLoRA fit into post-training? For SFT and DPO, parameter-efficient methods like LoRA (Hu et al., 2021) and QLoRA (Dettmers et al., 2023) reduce GPU memory by 4–10× and let a single 80GB-class GPU fine-tune up to a 70B model. The quality penalty is small (typically 0–2 points on standard benchmarks) when rank and learning rate are tuned. For full RL, LoRA adapters work but most production stacks still prefer full-parameter training because the rollout-and-critic memory dominates and LoRA's marginal savings matter less. LoRA-based serving is its own topic — see multi-tenant LoRA serving for the deployment side.

What is iterative DPO, and is it worth the cost? Iterative DPO collects fresh preference pairs on the trained policy's outputs, retrains DPO on the combined dataset, and repeats — typically 3–5 rounds. The published Llama 3 recipe runs iterative DPO, and the gain over a single DPO round is real (typically 3–8 points on chat-quality evals, larger on reasoning). It is worth the cost when you have AI-feedback labeling that scales with rollouts; less so when every round requires fresh human labels.

Does post-training change the model's factual knowledge? Mostly no. Factual knowledge is set by pretraining; post-training shapes the interface around it. SFT can teach a model to admit uncertainty rather than confabulate, and RLHF can suppress some specific known-wrong answers, but post-training does not meaningfully add new facts. Adding factual capability requires either continued pretraining on new data or retrieval at inference time — see RAG production architecture for the retrieval path.

What is reward model contamination, and how do I detect it? The reward model is trained on preference pairs. If those pairs include responses from a strong model that overlaps with the eval distribution, the RM learns to score "responses that look like that strong model's outputs" rather than the underlying preference signal. Detection: hold out a preference set generated only by your own policy at various training stages and check that the RM's calibration on it matches the original held-out set. Drift between them is the smoking gun.

Why do reasoning models often have a "thinking" mode and a "final answer" mode? This is a structural consequence of RLVR plus a final SFT clean-up pass. The RL stage rewards the policy for producing long internal reasoning that leads to a correct answer; the SFT clean-up teaches the policy to mark which tokens are scratchpad and which are user-facing. The split also matters for inference economics: the scratchpad is often hidden or summarized before the response reaches the user even when it is still metered, and serving stacks like the one described in reasoning model serving treat the two as distinct cost classes.

How much does the SFT mix composition actually matter? Empirically: a lot. Published ablations from Tülu 3 and the Llama 3 paper show 10–20 point swings on category-specific evals from changing mix ratios, even with the same total token count. The mix is where most lab-specific tribal knowledge lives. The actionable advice: track per-category eval scores against per-category mix ratios across runs, and treat the mix as a tuned hyperparameter, not a fixed recipe.

What is "alignment tax," and is it real? Alignment tax is the observation that post-training for helpfulness/harmlessness sometimes regresses raw capability. Early RLHF papers reported 1–5 point drops on capability benchmarks after alignment. In 2026 the tax is much smaller — close to zero on most benchmarks — because (a) the SFT mix now includes capability-preserving data, (b) replay between stages prevents drift, and (c) reasoning RL has shown that some forms of post-training increase capability. The tax is no longer a structural argument against post-training; it is a debuggable failure mode.

Should reward models be the same size as the policy? For Bradley-Terry scalar RMs, smaller is fine — a 7B RM scoring a 70B policy works in practice and is much cheaper to serve during rollouts. For generative LLM-as-judge RMs, the judge's capability is the bottleneck and using a similarly-sized or stronger model produces meaningfully better calibration. The current frontier pattern: small scalar RMs for cheap dense feedback, a large generative judge for hard cases, and verifiable rewards wherever they apply.

What does an iterative post-training schedule look like in production? A representative 12-week post-training cycle for a 70B-class model: week 1–2, SFT on the latest curated mix; week 3, DPO on a refreshed preference set; week 4–6, GRPO with verifiable rewards on math/code/structured tasks; week 7, rejection-sampling SFT to consolidate; week 8, safety post-training pass with adversarial preferences; week 9–10, iterative DPO rounds with AI-feedback labels; week 11, final SFT clean-up with replay from all earlier stages; week 12, eval, red-team, ablation summary, and hand-off to serving. Each gate has measurable thresholds; failed gates trigger rollbacks, not forward progress.

How does post-training interact with quantization and serving optimizations? Most post-training happens in BF16 or FP16; serving often uses INT8, FP8, or INT4 via methods covered in quantization tradeoffs. Heavy preference-tuned models tend to be more sensitive to quantization than base models, because the post-training has pushed the policy into sharper modes that round less gracefully. Mitigations: quantization-aware fine-tuning at the end of the post-training pipeline, or running calibration on preference data rather than only pretraining data. Skipping this step is a common source of "the quantized model is dumber than the eval suggested" surprises in production.

What is RLOO and how does it compare to GRPO? RLOO (REINFORCE Leave-One-Out) uses the mean reward of the other rollouts in a group as the baseline for each rollout's advantage, instead of GRPO's mean-and-std normalization. The advantage shape is slightly different: RLOO's baseline excludes the rollout it is applied to, so the gradient estimate stays unbiased; GRPO's baseline includes that rollout, and dividing by the group standard deviation rescales the gradient per prompt, which introduces bias but often lowers variance in practice. Both work; published comparisons show similar end-of-run quality on verifiable-reward workloads. Pick whichever your framework supports natively.
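The leave-one-out baseline can be written in a few lines. This is a sketch of the advantage computation only, with an illustrative function name; it omits the policy-gradient update and any KL term.

```python
from typing import List, Sequence


def rloo_advantages(rewards: Sequence[float]) -> List[float]:
    """Each rollout's reward minus the mean reward of the other rollouts."""
    g = len(rewards)
    if g < 2:
        raise ValueError("leave-one-out needs at least two rollouts")
    total = sum(rewards)
    return [r - (total - r) / (g - 1) for r in rewards]
```

Algebraically, each value equals G/(G-1) times the plain mean-centered reward, so the difference from a mean baseline is a constant scale factor; the larger practical difference from GRPO is the absence of the standard-deviation division.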

Why does SimPO drop the reference model? SimPO (Meng et al., 2024 — arXiv:2405.14734) argues that the reference-policy term in DPO is a source of inefficiency and instability, and replaces the implicit reward with a length-normalized log-probability margin. The result is a simpler loss, lower memory (no reference model in VRAM), and a sharper handle on length bias. The cost: SimPO's regularization is weaker than DPO's KL term, so over-training risk is higher; the published recipes use fewer epochs and smaller learning rates than DPO defaults.
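The length-normalized margin can be sketched for a single preference pair as follows. Inputs are summed token log-probabilities and token counts under the policy only; the `beta` and `gamma` defaults are illustrative placeholders rather than recommended settings.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def simpo_loss(chosen_logp_sum: float, chosen_len: int,
               rejected_logp_sum: float, rejected_len: int,
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """-log sigmoid(beta * (avg chosen logp - avg rejected logp) - gamma)."""
    avg_chosen = chosen_logp_sum / chosen_len
    avg_rejected = rejected_logp_sum / rejected_len
    margin = beta * (avg_chosen - avg_rejected) - gamma
    return softplus(-margin)
```

Because both terms are per-token averages, a rejected response that is twice as long with the same per-token likelihood produces the same margin, which is the handle on length bias the paragraph above refers to. No reference-model logprobs appear anywhere in the loss.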

When does ORPO outperform DPO + SFT? ORPO (Hong et al., 2024 — arXiv:2403.07691) fuses SFT and preference learning into a single odds-ratio loss, removing the separate SFT stage and reference model. It outperforms a DPO-after-SFT pipeline most clearly when the SFT and preference datasets share prompts and when the team can only afford a single fine-tuning pass. The downside is reduced separability: you can't roll back to a known-good SFT checkpoint if the preference component is misbehaving.

What is iterative DPO's relationship to PPO? Iterative DPO is closer to PPO than vanilla DPO. Each round generates on-policy samples, labels them (with a judge or human), and re-trains. That generation-label-train loop is structurally the same as a PPO outer loop, with the inner training step replaced by a stable supervised loss instead of a clipped policy gradient. The published Llama 3 recipe describes this as their default; the practical implication is that "DPO" and "PPO" are best thought of as endpoints on a continuum.

How do I detect margin collapse in DPO? Track chosen-logprob and rejected-logprob separately during training. Healthy DPO: chosen-logprob holds steady or rises while rejected-logprob falls. Margin collapse: both fall together, with rejected falling faster, so the margin widens and the loss looks healthy while the model is silently becoming less confident in the chosen responses. The fix is a larger beta, an auxiliary SFT (negative log-likelihood) loss term on the chosen responses, or a reformulated loss that regularizes chosen-logprob explicitly.
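The reason the loss alone cannot reveal this failure is visible in the per-pair DPO loss, which depends only on the difference of log-ratios. The sketch below pairs that loss with a simple window-based check; `chosen_is_eroding` is a hypothetical helper name, and the rule it applies is an assumption about what is worth alerting on.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """Per-pair DPO loss from summed sequence log-probabilities."""
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    return softplus(-margin)  # equals -log(sigmoid(margin))


def chosen_is_eroding(chosen_logps: Sequence[float],
                      rejected_logps: Sequence[float]) -> bool:
    """True if chosen logprob fell over the window while rejected fell faster."""
    d_chosen = chosen_logps[-1] - chosen_logps[0]
    d_rejected = rejected_logps[-1] - rejected_logps[0]
    return d_chosen < 0 and d_rejected < d_chosen
```

In the eroding case the loss still decreases, because only the gap between the two log-ratios enters it. Logging the two series separately is what exposes the problem.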

Is RLAIF as good as RLHF in 2026? On most chat-quality benchmarks, yes, when the judge model is strong (GPT-4-class or better) and the rubric is well-designed. On safety, the answer is more nuanced: AI judges tend to inherit their training distribution's blind spots, and certain failure modes (cultural bias, novel jailbreak shapes) are easier for human red-teamers to find. The frontier pattern is hybrid: AI feedback for the bulk volume, human feedback for the high-stakes anchor set and adversarial probes.

What does an RM ensemble actually buy? Reduced variance and a usable uncertainty signal. With 3-5 independently-trained RMs, the disagreement across the ensemble is a proxy for "the RM is uncertain here." A typical pessimistic-RM policy: reward = ensemble mean minus alpha * ensemble std. Alpha around 0.5-2 discourages the policy from exploring regions the RM doesn't reliably score. The cost is RM training time and serving memory, which is why ensembles are typically reserved for frontier-scale runs.
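The pessimistic reward is a one-liner once each ensemble member has scored the response. The sketch below uses the population standard deviation; that choice, and the `alpha` default, are assumptions.

```python
from statistics import mean, pstdev
from typing import Sequence


def pessimistic_reward(scores: Sequence[float], alpha: float = 1.0) -> float:
    """Ensemble mean minus alpha times ensemble standard deviation."""
    return mean(scores) - alpha * pstdev(scores)
```

When every member agrees, the penalty vanishes and the reward is the shared score. When members disagree, the reward drops even if the mean is high, which steers the policy away from responses the ensemble cannot score consistently.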

Why does KTO use binary feedback instead of pairs? KTO (Ethayarajh et al., 2024 — arXiv:2402.01306) is designed for the labeling regime where labelers can mark individual responses as "good" or "bad" without pairing them against a counterpart. This is much cheaper to collect at scale (no need to surface two responses per prompt), and many real-world feedback signals (thumbs-up, regenerate-clicked, conversation-ended-early) are naturally binary. KTO bridges that data to a preference-optimization-compatible loss using Kahneman-Tversky-style asymmetric utility.

How long does a typical 70B SFT pass take in 2026? Roughly 18-60 hours on a 32xH100 cluster for 1M examples at 4K sequence length, depending on data complexity and packing efficiency. The same job on 64xH100 with FSDP2 and good attention kernels runs in 9-30 hours. Pipeline parallelism is overkill for SFT at this scale; FSDP2 with careful activation checkpointing is the dominant setup. Distributed training context: see distributed LLM training.
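The wall-clock range can be sanity-checked with the common 6 x parameters x tokens approximation for training FLOPs. Everything in the sketch is an assumption for illustration: the roughly 989 TFLOP/s dense BF16 peak for an H100, the 40% model FLOPs utilization, and every example filling the full 4K sequence.

```python
def sft_hours(params: float, tokens: float, gpus: int,
              peak_flops: float = 989e12, mfu: float = 0.4) -> float:
    """Rough wall-clock hours for one pass, using FLOPs ~= 6 * params * tokens."""
    total_flops = 6.0 * params * tokens
    cluster_flops_per_sec = gpus * peak_flops * mfu
    return total_flops / cluster_flops_per_sec / 3600.0


tokens = 1_000_000 * 4096             # 1M examples, each assumed to fill 4K tokens
hours_32 = sft_hours(70e9, tokens, 32)
hours_64 = sft_hours(70e9, tokens, 64)
```

Under these assumptions the estimate lands in the high 30s of hours on 32 GPUs and about half that on 64, inside the ranges quoted above. Lower utilization or activation recomputation pushes toward the upper end; shorter average sequences push toward the lower end.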

Does post-training benefit from FP8 compute the way pretraining does? For SFT and DPO with sufficient batch size, yes — FP8 (typically via Transformer Engine) gives the same 1.5-2x throughput improvement over BF16 that pretraining sees, with similar stability when scaling factors are tuned. For RL with rollouts, the rollout side benefits even more (FP8 inference is well-supported and much faster), while the training side typically stays in BF16 to keep the policy gradient numerically stable. The combined effect can be a 2-3x wall-clock speedup on a well-tuned stack. Background: mixed-precision training.

How do I think about replay buffers across post-training stages? A replay buffer in this context is a small fraction (typically 5-20%) of earlier-stage training data mixed into later stages to prevent catastrophic forgetting. Practical heuristics: replay SFT data into preference learning and into RL stages; replay safety data into capability-focused stages; track per-category eval scores to detect when replay is or isn't working. The Llama 3 paper's iterated DPO recipe is built around this pattern.
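Mixing replay into a stage's data is mostly bookkeeping. The sketch below sizes the replay sample so that it makes up `replay_fraction` of the final mix; the function name, the seeding scheme, and uniform sampling from the replay pool are assumptions.

```python
import random
from typing import List, Sequence


def mix_with_replay(stage_data: Sequence, replay_pool: Sequence,
                    replay_fraction: float = 0.1, seed: int = 0) -> List:
    """Return stage data plus sampled replay so replay is ~replay_fraction of the mix."""
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_replay / (len(stage_data) + n_replay) = replay_fraction.
    n_replay = round(len(stage_data) * replay_fraction / (1.0 - replay_fraction))
    n_replay = min(n_replay, len(replay_pool))
    mixed = list(stage_data) + rng.sample(list(replay_pool), n_replay)
    rng.shuffle(mixed)
    return mixed
```

For example, 90 stage examples with a 10% target pulls 10 replay examples for a 100-example mix. Stratifying the replay sample by category, rather than sampling uniformly, is a natural refinement when per-category evals show uneven forgetting.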

What's the relationship between post-training and synthetic data generation? Tight. Post-training increasingly relies on synthetic SFT data, AI-generated preference labels, and distilled reasoning traces from stronger teachers. See synthetic data and distillation for the data side. The two pipelines are usually run by the same team because the failure modes interleave: bad synthetic data produces bad post-training outcomes, and a bad post-training stage produces bad seed data for the next round of synthetic generation.

Can post-training shrink the gap to a frontier model? Yes, partially. The Tülu 3 and DeepSeek-R1 distilled families demonstrate that careful post-training on a strong open base can reach within a few points of closed frontier models on most non-frontier benchmarks. The remaining gap is largely (a) raw capability of the base model, which post-training can elicit but not create, and (b) frontier-scale data and labeling investment that smaller teams can't match. A realistic target for a well-resourced open-recipe team in 2026: 90-95% of frontier quality on most benchmarks, with the last 5-10% being structurally hard.

What is "self-rewarding" and is it stable? A pattern where the same model serves as both policy and judge: the model rates its own outputs, and those ratings train its next iteration. Yuan et al. (2024, arXiv:2401.10020) showed promising early results. Stability is the central concern: without external anchors, the model can drift toward its own preferences, amplifying biases that no human ever endorsed. Production self-rewarding stacks anchor periodically with human or external-model judgments to prevent runaway drift. Worth experimenting with at small scale; not yet a safe default at frontier scale.
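
One simple form of periodic anchoring is an agreement check: sample some preference pairs, collect both the model's own choice and an external judgment, and pause the loop when agreement falls too low. The sketch below is a hypothetical illustration of that idea; the function names and the 0.8 threshold are assumptions, not values from Yuan et al. or any production stack.

```python
def anchor_agreement(self_choices, anchor_choices):
    """Fraction of sampled pairs where self-judge and anchor pick the same winner."""
    if len(self_choices) != len(anchor_choices):
        raise ValueError("choice lists must have the same length")
    if not self_choices:
        raise ValueError("need at least one sampled pair")
    matches = sum(s == a for s, a in zip(self_choices, anchor_choices))
    return matches / len(self_choices)


def should_pause(self_choices, anchor_choices, min_agreement=0.8):
    """True when the self-judge has drifted below the assumed agreement floor."""
    return anchor_agreement(self_choices, anchor_choices) < min_agreement


if __name__ == "__main__":
    # "a" / "b" mark which response in each pair was preferred (toy data).
    self_judge = ["a", "b", "a", "a"]
    external = ["a", "b", "b", "a"]
    print(anchor_agreement(self_judge, external), should_pause(self_judge, external))
```

Tracking this agreement rate per iteration gives an early signal of drift before it shows up in downstream evals.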

How do I budget tokens for an RLVR rollout phase? For reasoning workloads, the rollout phase is usually the dominant token cost. A 70B GRPO run with G = 16, rollout length 8192, batch size 64 prompts, and 10K training steps generates up to roughly 84 billion rollout tokens (16 × 8192 × 64 × 10,000, assuming every rollout runs to the length cap), comparable to a small pretraining run. Plan rollout capacity accordingly: a dedicated rollout cluster with high-throughput inference (vLLM, SGLang, or a co-located LLM serving stack) typically uses 2-4x as many GPUs as the training cluster.
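
The budget is a straight product of the run's hyperparameters, which makes it easy to script. The sketch below is a minimal planning helper; `mean_fill` (the average fraction of the length cap that rollouts actually use) and the seven-day wall-clock target are assumptions added for illustration.

```python
def rollout_token_budget(group_size, max_len, prompts_per_step, steps,
                         mean_fill=1.0):
    """Total generated tokens for a GRPO-style run.

    mean_fill=1.0 gives the upper bound where every rollout hits max_len.
    """
    return round(group_size * max_len * prompts_per_step * steps * mean_fill)


def required_tokens_per_second(total_tokens, wall_clock_days):
    """Aggregate generation throughput needed to finish within the time budget."""
    return total_tokens / (wall_clock_days * 86_400)


if __name__ == "__main__":
    upper = rollout_token_budget(16, 8192, 64, 10_000)
    print(f"upper bound: {upper:,} tokens")
    # Hypothetical target: finish all rollouts in seven days.
    print(f"needs about {required_tokens_per_second(upper, 7):,.0f} tokens/s")
```

With the configuration from the text, the upper bound is 83,886,080,000 tokens. Dividing the required throughput by the measured per-GPU generation rate of the serving stack gives a first estimate of rollout cluster size.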

What does "alignment tax" look like in 2026 numbers? A well-engineered post-training pipeline shows essentially zero alignment tax on capability benchmarks (MMLU, GPQA, MATH) and often shows net positive movement because reasoning RL and rejection-sampling SFT actively improve capability. The historical alignment tax of 1-5 points reported in 2022-2023 RLHF papers reflected single-stage RLHF without capability-preserving data or replay. Modern multi-stage pipelines have largely engineered the tax away on standard benchmarks, though it can still appear on narrow probes (specific creative-writing styles, in-context learning ability) if the post-training mix neglects them.

Do I need a separate eval team to ship safely? At small scale, no: the training team can wear both hats. At medium scale (50K+ users), it is strongly recommended: an independent eval team that the training team can't override removes the obvious failure mode of teams gaming their own benchmarks. At frontier scale, the eval team is often larger than the post-training team because the eval portfolio is the actual product of the work. See eval infrastructure for the systems side.


Glossary

  • DPO — Direct Preference Optimization. Preference learning without a reward model.
  • GRPO — Group Relative Policy Optimization. PPO-style RL that drops the learned value function and scores each sample relative to a group of samples for the same prompt.
  • KL penalty — divergence regularizer keeping the trained policy close to the reference.
  • Outcome supervision — reward based on final answer correctness.
  • Policy — the model being trained.
  • PPO — Proximal Policy Optimization. Standard RL algorithm for RLHF.
  • Process supervision — reward based on intermediate reasoning steps.
  • Reference model — frozen SFT model used as a regularization anchor.
  • Reward hacking — policy exploits reward model imperfections.
  • Reward model — learns to predict human preference from labeled pairs.
  • RLAIF — Reinforcement Learning from AI Feedback.
  • RLHF — Reinforcement Learning from Human Feedback.
  • SFT — Supervised Fine-Tuning.

References

  • InstructGPT — Ouyang et al., 2022. "Training language models to follow instructions with human feedback." arXiv:2203.02155. The foundational RLHF paper for LLMs.
  • DPO — Rafailov et al., 2023. "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." arXiv:2305.18290.
  • Constitutional AI — Bai et al., 2022. "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073.
  • LIMA — Zhou et al., 2023. "LIMA: Less Is More for Alignment." arXiv:2305.11206. Quality over quantity in SFT.
  • PPO — Schulman et al., 2017. "Proximal Policy Optimization Algorithms." arXiv:1707.06347.
  • Process supervision — Lightman et al., 2023. "Let's Verify Step by Step." arXiv:2305.20050.
  • DeepSeek-R1 — DeepSeek-AI, 2025. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948. GRPO and verifiable rewards.
  • SimPO — Meng et al., 2024. "SimPO: Simple Preference Optimization with a Reference-Free Reward." arXiv:2405.14734.
  • KTO — Ethayarajh et al., 2024. "KTO: Model Alignment as Prospect Theoretic Optimization." arXiv:2402.01306.
  • Reward hacking — Skalse et al., 2022. "Defining and Characterizing Reward Hacking." arXiv:2209.13085.
  • IPO — Azar et al., 2023. "A General Theoretical Paradigm to Understand Learning from Human Preferences." arXiv:2310.12036.
  • ORPO — Hong et al., 2024. "ORPO: Monolithic Preference Optimization without Reference Model." arXiv:2403.07691.
  • GRPO (DeepSeekMath) — Shao et al., 2024. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." arXiv:2402.03300. Introduces GRPO.
  • RLAIF — Lee et al., 2023. "RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback." arXiv:2309.00267.
  • Self-rewarding — Yuan et al., 2024. "Self-Rewarding Language Models." arXiv:2401.10020.
  • Tülu 3 — Lambert et al., 2024. "Tülu 3: Pushing Frontiers in Open Language Model Post-Training." arXiv:2411.15124. Reference implementation of an open-recipe SFT → DPO → RLVR pipeline.