Prompt20
All posts
synthetic-datadistillationtraining-datadata-pipelinesguidemodel-collapseself-improvement

Synthetic Data and Distillation: The Complete Guide

The definitive guide to synthetic data and distillation: why the web isn't enough anymore, how labs generate billions of training examples, distillation from large to small models, and the quality-control problems that determine whether it works.

By Prompt20 Editorial · 120 min read

For the first decade of large language models, the data story was simple: scrape the web. By 2024, the web's most useful slice had been ingested by every serious lab, and the marginal value of additional web data was diminishing. The next chapter — increasingly the dominant one — is data the labs generate themselves.

The take: synthetic data went from "useful supplement" to "core infrastructure" between 2023 and 2026. The labs that win in the next generation will be the ones with the best synthetic data pipelines, not the ones with the most web-scraped tokens. Distillation is the inference-side counterpart: take a frontier model's outputs as training data for a smaller one. Both rely on the same insight — strong models can teach themselves and each other, and the bottleneck is quality control, not generation capacity.

Two shifts make this worth taking seriously now. First, the synthetic-data fraction of frontier-lab training mixes climbed from ~10–20% in 2023 to a majority share in 2025–2026 for both pretraining mid-stages and post-training. Microsoft's Phi family is the clearest public case: Phi-1 trained on synthetic "textbook-quality" data (arXiv:2306.11644), Phi-3 (arXiv:2404.14219) is synthetic-heavy by design, and the resulting small models punch far above their weight. Second, DeepSeek-R1 — the most publicly documented post-training pipeline of the past year — uses synthetic reasoning traces from a stronger checkpoint to distill smaller models that retain a striking fraction of the reasoning capability. "Generate from a strong model, filter, train" has gone from niche to load-bearing.

What synthetic data is not: rephrased web data; a capability multiplier (students still have ceilings set by parameter count); a free lunch (quality filtering is the real work). It also isn't a substitute for taste — the framing of the prompts feeding the generator is now one of the higher-leverage decisions in any training pipeline. The labs that win are the ones with verification infrastructure that confirms the generated data points where they wanted. Generation is the easy part; prompt design, quality filtering, deduplication, distribution shaping, and evaluation are where the engineering lives.

Table of contents

  1. Key takeaways
  2. Mental model: synthetic data in one minute
  3. The synthetic data landscape in 2026
  4. Why synthetic data exists
  5. Categories of synthetic data
  6. Generation pipelines
  7. Quality filtering
  8. Quality filtering of synthetic data — deeper dive
  9. The model-collapse question
  10. Distillation: knowledge transfer to smaller models
  11. Distillation methods
  12. Knowledge distillation: which signals transfer
  13. Self-improvement and bootstrapping
  14. Self-improvement loops at frontier labs
  15. Verifiable-reward generation
  16. Synthetic data for safety (red-team data generation)
  17. Infrastructure for synthetic generation
  18. Production deployments
  19. Open problems
  20. Open datasets and recipes worth studying
  21. Economics of synthetic-data pipelines
  22. Detection: how researchers spot distilled models
  23. Dataset deep dive: Alpaca through Tulu 3 and the post-training canon
  24. Pretraining synthetic datasets: Cosmopedia, Nemotron-CC, FineWeb-Edu
  25. Synthetic instruction pipelines: Evol-Instruct, Self-Instruct, Magpie, AutoIF
  26. Distillation deep dive: logit, response-only, on-policy, MiniLLM, DistillKit
  27. R1-Distill technique and model-specific distillation case studies
  28. RLHF preference data: UltraFeedback, HH-RLHF, Constitutional AI
  29. Legal landscape: copyright, fair use, NYT v. OpenAI, output license terms
  30. Distillation detection: fingerprinting models from outputs
  31. The diminishing-returns wall: what 2026 papers are saying
  32. Domain-specific synthetic data recipes
  33. Open datasets worth studying in 2026
  34. Synthetic data infrastructure: batch inference at scale
  35. Frontier lab pipelines: OpenAI, Anthropic, Google, Meta
  36. Quality filtering at scale
  37. Self-improvement loops in depth
  38. Synthetic data for multimodal training
  39. Cost crossover: generating vs labelling
  40. The bottom line
  41. FAQ
  42. Glossary
  43. References
  44. Persona-driven generation: Microsoft Persona Hub
  45. Math-specific synthetic data: OpenMathInstruct, MetaMath, MathPile
  46. Code-specific data: DeepSeek-Coder, StarCoder2, OpenCoder
  47. Contamination detection in depth: substring, MinHash, perplexity, BLEU
  48. R1-Distill model card deep dive: AIME numbers, size scaling
  49. Anthropic's Haiku distillation pipeline (what's public)
  50. Self-improvement: RFT, ReST, RLAIF
  51. Quality classifiers: fastText, cleanlab, vendor pipelines
  52. WildChat and real-conversation datasets
  53. Synthetic preference data: UltraFeedback, Nectar, AI feedback
  54. Cost per accepted example: domain-by-domain

Key takeaways

  • Web data is finite. The marginal useful token from scraping is approaching zero for the largest models.
  • Synthetic data — model-generated training examples — is now a primary training resource at frontier labs.
  • Distillation trains a smaller model on a larger model's outputs. Captures most of the capability at a fraction of the inference cost.
  • Quality control is the hard problem. Generation is cheap; filtering for high-quality examples is the bottleneck.
  • Model collapse (degraded quality from training on synthetic data) is real but largely solved by careful curation and mixing with real data.
  • Verifiable rewards (math, code) make synthetic data especially powerful — you can generate examples and check correctness automatically.
  • Recommendation: invest in a synthetic-data pipeline before chasing more web data. Treat it as a first-class engineering surface.

Quick comparison: distillation and synthetic-data techniques

Technique Teacher access needed Data scale Quality retained (vs teacher) Cost
Response distillation Text outputs only (API OK) 100K-10M samples 70-90% Low — inference only
Logit distillation Full token-level logits 1M-1B tokens 85-95% Medium — needs hidden state
Feature distillation Hidden states / attention 1M-1B tokens 90-97% High — co-located training
Self-distillation Same model, prior checkpoint Variable Marginal (smoothing only) Low
Synthetic SFT data Strong instruct teacher 100K-10M pairs Depends on filtering Low-medium
Rejection sampling Teacher + reward/verifier 10K-1M filtered Very high (best-of-N) Medium-high — many samples
Verifier-filtered (math/code) Teacher + executor 100K-10M Near-teacher on the task Medium

Mental model: synthetic data in one minute

The named problem is the data wall: the useful slice of public web text grew by single-digit percentages year over year while frontier training budgets grew by multiples. The marginal token from a fresh CommonCrawl dump is mostly duplicate, low-quality, or already in the model. Continuing to scale by scraping harder hits a ceiling that arrived in 2024. Synthetic data is the way past it.

The useful analogy is textbook generation by an expert. A senior researcher cannot read more papers per week than they already do, but they can sit down and write exercises, worked solutions, and explanations that distill what they already know into a form a student can absorb. A strong model plays the same role: it cannot ingest novel web tokens that do not exist, but it can convert what it has learned into supervised examples — with a verifier checking the answers when one is available.

Dimension Web data Synthetic data
Supply Plateauing (~hundreds of T tokens, slow growth) Generation-bounded only
Marginal $/token Rising (cleanup, dedup, licensing) Falling (inference is the cost)
Quality variance Wide, hard to control Controllable via prompt + filter
Verifiability Rare High on math/code, medium elsewhere
Diversity ceiling Set by the web Set by the generator and prompt set
Failure mode Toxicity, copyright, contamination Model collapse, mode narrowing

The production one-liner. The classic distillation training loop reduces to:

for prompt in seed_prompts:
    candidates = teacher.sample(prompt, n=N, temperature=0.7)
    kept = [c for c in candidates if verifier(prompt, c)]   # tests, math checker, judge
    sft_dataset.extend((prompt, c) for c in kept)
student.train(sft_dataset, loss="ce")    # optional short RL polish afterwards

Everything interesting is in verifier. Generation is cheap; the filter is the product.

The sticky number: DeepSeek-R1-Distill-Qwen-1.5B matches GPT-4o on AIME after pure SFT on reasoning traces from R1 — a 1.5B-parameter student catching a frontier-class API on a math benchmark, on synthetic data alone, is the existence proof that the data wall is not the capability wall.


The synthetic data landscape in 2026

The space of "synthetic data" techniques has fragmented into a dozen recognizable patterns, each with different cost profiles, quality ceilings, and failure modes. The fastest way to navigate it is by what the technique is generating and what signal supervises the student.

Response distillation

The teacher generates a response to a prompt; the student is trained to produce that response via standard SFT cross-entropy. Cheapest variant, requires only API access to the teacher. This is what Alpaca (Taori et al., 2023) did with Self-Instruct + GPT-3.5 outputs on a Llama-7B base, kicking off the entire open-instruct ecosystem. Quality retained is typically 70–90% of teacher quality on the trained tasks.

Logit distillation

The student is trained to match the teacher's full token-level probability distribution via KL divergence, rather than only the argmax. Captures more information per example. Requires running teacher inference inline with student training, which is expensive in both compute and memory. Roots go back to Hinton, Vinyals, Dean, 2015 (arXiv:1503.02531) and DistilBERT (Sanh et al., 2019 — arXiv:1910.01108). Less common at frontier LLM scale because it requires teacher hidden state or at least full logit access — closed APIs do not provide this.

On-policy synthetic data

The student model itself generates candidate responses on its current distribution, the candidates are filtered or scored, and the survivors train the next iteration of the student. Closely related to rejection-sampling fine-tuning in post-training. The data is "on policy" because it reflects what the current model would say, not what a teacher would say. Strongest signal for capability shaping; weakest for capability transfer.

Rejection sampling

Generate N candidates per prompt with a strong model (teacher or current student), filter to the best-K by a verifier, reward model, or judge, and train on the survivors. Practically the workhorse of frontier post-training in 2024–2026. Cheap to operate, composable with everything else, and produces clean SFT-shaped data.

Self-Instruct and instruction synthesis

Wang et al., 2022 (arXiv:2212.10560). Bootstrap a large instruction-following dataset from a small seed set: prompt the generator with a few seed instructions, ask for more in the same style, deduplicate, validate, repeat. The original recipe was applied to GPT-3 to produce 52K instructions used in InstructGPT-style fine-tuning. Every modern open instruct dataset (Alpaca, WizardLM, OpenHermes, the Tülu mixes) descends from this pattern.

Evol-Instruct

The WizardLM contribution (Xu et al., 2023 — arXiv:2304.12244): start with a seed instruction, then ask the generator to evolve it — add constraints, deepen the reasoning, broaden the topic, increase the complexity. Iterate. Produces a much wider difficulty distribution than vanilla Self-Instruct and helps push student capability on harder tasks.

Magpie

Xu et al., 2024 (arXiv:2406.08464). A clever trick: instead of prompting the teacher with a seed, prime it with only the assistant-turn template and let it generate both a question and an answer from scratch. The teacher's own instruction-following posterior produces diverse, high-quality (prompt, response) pairs without seed bias. Has become a standard technique for harvesting alignment data from open-instruct models.

Persona-based generation

Condition the generator on a synthetic persona (a sentence or two describing a hypothetical user) to widen the distribution of generated prompts and responses. Heavily used in 2024–2026 to manufacture diversity that the raw teacher distribution lacks. The Persona Hub work and related approaches have shown that conditioning on millions of personas can produce dataset-scale diversity from a single teacher.

Constitutional generation

Use a written constitution (Bai et al., 2022 — arXiv:2212.08073) to drive the generator and the filter. The generator critiques and revises its own outputs against the constitution; survivors become training data. Originally framed for safety alignment, now used more broadly as a way to specify generation targets without enumerating them in seed examples.

RLHF traces as data

Once a model has been through preference training, the trajectories of that training — preference pairs, rollouts, RM scores, KL paths — are themselves a dataset. Replaying selected high-value examples during later SFT stages helps lock in the preference signal in a form that is robust to subsequent fine-tuning. Many frontier pipelines store and version their RLHF rollout buffers as first-class training datasets.

What frontier labs are actually doing

Public reports cohere around a few patterns:

  • OpenAI. Heavy investment in synthetic data for both pretraining mid-stages and post-training; the GPT-4 system card (cdn.openai.com/papers/gpt-4-system-card.pdf) describes substantial synthetic generation for red-team and capability evaluation data.
  • Anthropic. Constitutional AI is the public signature; synthetic preference data and constitution-guided generation are core to the recipe.
  • Microsoft (Phi family). Most aggressive public synthetic-data strategy. Phi-3 (Abdin et al., 2024 — arXiv:2404.14219) is largely synthetic-heavy in pretraining. The "textbooks are all you need" thesis (Gunasekar et al., 2023 — arXiv:2306.11644) is now a production strategy.
  • DeepSeek. R1's distillation family is the clearest public example of using a frontier reasoning model to manufacture training data for smaller students.
  • Meta (Llama). Published Llama 3 recipe describes substantial synthetic data in post-training: rejection-sampling FT, AI-feedback judges, evol-instruct-style augmentation.

The trajectory is unambiguous: synthetic data is now load-bearing, not auxiliary.


Why synthetic data exists

Three forces drove the shift to synthetic data:

The data wall

Web text is finite. Estimates of the total useful text on the open web range from 5 to 50 trillion tokens, depending on quality filtering. Frontier models in 2024-2026 are trained on much of this.

The marginal additional web token, even with aggressive de-duplication and quality filtering, contributes diminishingly to model capability. Going from 1T tokens to 10T tokens helps; going from 10T to 100T helps much less (and there isn't 100T of high-quality, non-redundant data anyway).

Specific capabilities need specific data

The web is general. Specific capabilities — math reasoning, code generation, multi-step planning, particular domain knowledge — are underrepresented in raw web data. Targeted synthetic data fills these gaps.

Quality control

Web data is noisy. Carefully generated synthetic data can be cleaner, more diverse, and more focused than equivalent real data.

The combination: synthetic data lets labs train on data the web doesn't have, in quantities the web can't provide, with quality higher than scraping allows.


Categories of synthetic data

Self-instruct and instruction generation

Models generate (instruction, response) pairs. Used heavily in SFT — see post-training: SFT, RLHF, DPO for how these pairs feed the alignment stack.

  • Seed: a few hand-written examples.
  • Model generates more in the same style.
  • Quality filtering keeps the good ones.

The original Self-Instruct paper (Wang et al., 2022) showed this could scale to hundreds of thousands of examples with reasonable quality.

Math and reasoning data

Models generate math problems and their solutions. Or solve existing problems with detailed reasoning chains.

Key advantage: math has verifiable answers. A generated (problem, solution) pair can be filtered by checking whether the solution is correct — the same property that makes verifiable-reward training work for reasoning-model serving, and that downstream eval infrastructure relies on to score candidate outputs.

Code data

Models generate code, then unit tests or other code verify correctness.

  • Generate a coding problem.
  • Generate a solution.
  • Generate test cases.
  • Run the tests; keep examples where the solution passes.

This is one of the most reliable synthetic-data domains because correctness is fully verifiable.

Distillation traces

A large model generates reasoning chains or responses; a smaller model is trained on them. See §7.

Persona / dialogue data

Generated multi-turn dialogues for SFT and conversational training. Quality varies; less verifiable than math/code.

Domain-specific synthetic data

For training models in specialized domains (medical, legal, scientific) where licensed data is scarce. Synthetic generation by domain-expert models, then human review.


Generation pipelines

A production synthetic-data pipeline has several stages, and at frontier scale it shares the same multi-node footprint as distributed LLM training — the generator is a full training-class model running inference in parallel:

1. Seed selection

Hand-curated examples that define the target style and quality. Small (tens to hundreds), high-quality.

2. Generation

A capable model (often the lab's own frontier model) generates new examples from the seeds plus instructions about what to generate.

  • Prompt variants: explicit examples plus diverse prompts to drive variety.
  • Temperature / sampling: higher diversity vs higher quality trade-off.
  • Batching: huge inference batches to make generation cheap per token.

3. Validation

Each generated example is checked:

  • Format correctness: parseable, well-formed.
  • Verifiable correctness: tests pass, math correct, etc.
  • Quality scoring: a reward model or judge model scores each.
  • De-duplication: avoid near-duplicates of existing examples.

4. Filtering

Only examples meeting all criteria are kept. The accept rate is often 10-30% — most generated examples are discarded.

5. Diversity expansion

Ensure the kept examples cover the target distribution, not just the easy parts. Techniques: clustering, intentional diversity injection, hard-example mining.

6. Mixing

Synthetic examples are mixed with real data in training. The ratio depends on the workload — too much synthetic can cause distribution narrowing; too little wastes the investment.


Quality filtering

Quality control is the bottleneck. Generation is cheap; finding the high-quality examples is hard.

Verifiable filtering

For domains with ground-truth correctness:

  • Math: symbolic equation checking, numerical evaluation.
  • Code: test suites, compilers, static analysis.
  • Logical reasoning: formal verification (limited scope).

These give crisp accept/reject signals. Most reliable; only applicable to verifiable domains.

Model-based filtering

For non-verifiable domains:

  • Reward models: trained on human preferences.
  • Judge models: LLMs prompted to score outputs.
  • Heuristic models: classifiers for specific quality dimensions.

These are noisier but cover everything verifiable filtering can't.

Human review

For the highest-stakes data: human review of generated examples. Expensive; reserved for seed sets, calibration, and audit.

The error-error problem

If the model generating data has systematic errors, and the filter is the same model (or a similar one), the filter will miss those errors. Independent verification methods or human spot-checks mitigate this.

Diversity filtering

A subtle quality dimension: even if individual examples are good, the set may be too narrow. Techniques to ensure coverage:

  • Embedding-based de-duplication.
  • Topic/domain stratification.
  • Forced injection of underrepresented cases.

Quality filtering of synthetic data — deeper dive

The previous section sketched the categories of filters. In practice, the difference between a synthetic-data pipeline that improves a model and one that quietly degrades it lives in the details of the filter stack. A few patterns are worth understanding deeply.

The "two filters" rule

A robust filter is rarely a single signal. Production pipelines stack independent filters so that a generated example must pass several different kinds of checks. A typical stack for math reasoning data:

  1. Format filter. Does the response parse? Does it contain a final answer in the expected position?
  2. Verifier filter. Does the answer match ground truth via symbolic or numerical check?
  3. Reasoning-quality filter. Does an LLM judge consider the reasoning coherent and non-trivial?
  4. Diversity filter. Is this example near-duplicate of others in the kept set (embedding distance check)?
  5. Difficulty filter. Is this problem in the right difficulty band — not so easy the student already solves it, not so hard that even the teacher fails most of the time?

Each filter is cheap relative to generation. Composing them gives a much stronger combined signal than any single filter alone. Accept rates after the full stack are often in the 5–15% range; the discarded majority is the cost of doing it right.

Cross-validation filtering

A subtle filter that has become standard: keep an example only if multiple independent generations from the teacher (different seeds, different temperatures, sometimes different teacher models) agree on the answer. Disagreement is a strong signal that the problem is ambiguous, the teacher is uncertain, or the example is mis-formed. This is also called "majority voting" or "consistency filtering" and is one of the most effective single filters for math and code synthesis.

Length and surface-form filters

Naive filters that catch a surprising amount of garbage:

  • Length bands (too short usually means a failed generation; too long often means a degenerate loop).
  • Repetition detection (n-gram overlap within the response).
  • Format compliance (markdown structure, code-fence balance, expected sections).
  • Language detection (catches the multi-language drift R1-Zero exhibited).

These filters do not assess substance. They are cheap and they catch a lot of obvious failures cheaply, freeing the more expensive judge models to focus on substantive quality.

Difficulty calibration

A failure mode of synthetic data is generating examples the student already solves easily. The student gets no useful gradient from these; they take up budget and dilute the harder examples that actually move the needle.

The fix is to filter by the student's own current performance. Generate many candidates, have the student attempt each problem, keep only the problems where the student's pass rate is in a target band (often 20–60%). This produces a difficulty-calibrated training set that is concentrated where the gradient is largest. It is also a form of on-policy data selection — the kept set changes as the student improves.

Judge-model failure modes

LLM judges are now standard but they have well-documented failure modes. Production pipelines should be aware of:

  • Position bias. Judges prefer the first response in a pair more often than chance would predict. Mitigation: average judgments across both orderings.
  • Length bias. Longer responses are scored higher absent any quality difference. Mitigation: explicit length normalization or pairing only same-length responses.
  • Self-preference. Models judge their own outputs more favorably than other models'. Mitigation: use a different model family as judge, or use an ensemble.
  • Markdown / formatting bias. Heavy formatting boosts scores. Mitigation: strip or normalize formatting before judging substance.

Treating the judge as another component with measurable failure modes — and evaluating it on held-out human-labeled data — is what separates teams that ship from teams that deploy biased filters and don't know it.

Audit and drift detection

Filters drift. The generator changes, the student changes, the prompt distribution changes, the judge model gets updated. A pipeline that worked last quarter may quietly produce worse data this quarter. Production pipelines run continuous audits: random sampling of accepted examples, periodic human review against a fixed rubric, tracking of accept-rate distribution across categories.

The cheap monitoring signals are accept rate per category and per generator version. The expensive but indispensable signal is human evaluation of a random sample of kept examples. Skip the latter and your pipeline will quietly rot.


The model-collapse question

A widely-discussed concern: training on model-generated data degrades quality. Successive generations of synthetic data, fed into subsequent training, lead to "model collapse" — narrowing distributions, loss of rare-but-important patterns.

The empirical findings

The literature (notably Shumailov et al., 2023, "The Curse of Recursion") demonstrates collapse in controlled settings: when you train a model purely on data generated by its predecessor, quality degrades over generations.

But:

  • Real production pipelines mix synthetic with real data.
  • Quality filtering removes the worst synthetic examples.
  • Generators are often different, stronger models than the trainee.

Under these conditions, collapse is largely mitigated. Most production training that uses synthetic data does not show collapse in practice.

What still goes wrong

  • Topic narrowing: if synthetic generation systematically over-represents some topics, training inherits the bias.
  • Style narrowing: synthetic examples often share a recognizable style. Trained models inherit it.
  • Rare-pattern loss: examples that are individually low-quality but distributionally important may be filtered out.

The defenses are mostly procedural: maintain a diverse mix, periodic audits, hold-out evaluations specifically targeting tail behaviors.


Distillation: knowledge transfer to smaller models

Distillation: train a smaller "student" model to mimic a larger "teacher" model.

The original distillation idea (Hinton et al., 2015) used soft labels — the teacher's full output distribution rather than just argmax — to provide richer training signal. For LLMs, distillation typically uses the teacher's generated text as training data.

Why it works

A frontier-scale model has learned to do many things well. A smaller model trained on the teacher's outputs gets supervised by behavior, not just by the original training data. The student often achieves quality much higher than what training from scratch on the same data would produce.

Capability transfer ceiling

The student model has a capability ceiling set by its parameter count and architecture. Distillation can fill that ceiling more effectively than other training methods, but it can't exceed it.

A 7B model distilled from a 700B teacher will outperform a 7B model trained from scratch on web data, but won't approach the 700B teacher's capability.

Production deployment

Distillation is the workhorse of cost-effective inference:

  • Frontier capability is expensive per token.
  • Distilled smaller models capture most of that capability at a fraction of the cost.
  • Routing easy traffic to the small model, hard traffic to the large, optimizes the cost/quality curve.

Distillation methods

Hard distillation (sequence-level)

Teacher generates responses to prompts. Student trains on (prompt, teacher-response) pairs as standard SFT.

  • Simple.
  • Loses the teacher's full probability distribution.
  • Most production distillation is this kind.

Soft distillation

Student matches the teacher's full token distribution (KL divergence between student and teacher output distributions).

  • Captures more information per example.
  • Requires running the teacher inference during training (expensive).
  • Better quality for the same student size.

Reasoning distillation

Distillation of explicit reasoning chains. A frontier reasoning model produces long reasoning chains; the student is trained to produce them.

This is the dominant mechanism for democratizing reasoning capability — labs without frontier-model resources can train competitive smaller reasoning models from distilled traces of stronger ones.

Preference-based distillation

The teacher's preferences (which of two responses is better) train the student via DPO or RLHF. Combines distillation with preference learning.


Knowledge distillation: which signals transfer

A practical question that gets less attention than it deserves: when you distill a teacher into a student, which aspects of the teacher's capability actually transfer, and which do not? The honest empirical picture is partial and field-developing, but a few patterns are stable enough to plan around.

Style and surface form transfer almost completely

A student trained on a teacher's responses inherits the teacher's writing style, refusal patterns, formatting conventions, and conversational register with very little loss. This is the part of distillation that "just works." If your teacher is concise and your student should be concise, response distillation will give you that for almost free. If your teacher hedges a lot, your student will hedge a lot.

A corollary worth noting: stylistic fingerprints are how researchers identify when an open-weight model has been trained on closed-API outputs. The fingerprint of the teacher is preserved more strongly than most teams realize.

Instruction-following transfers well

The general capability "respond appropriately to instructions" transfers well across the parameter-count gap. Alpaca (Taori et al., 2023) demonstrated that a 7B Llama base could acquire most of GPT-3.5's instruction-following ability from just 52K Self-Instruct examples. The exact capability ceiling depends on the student's pretrained competence, but the interface transfers cleanly.

Domain knowledge transfers up to the student's capacity

Factual knowledge from the teacher's training appears in distilled examples and is partly absorbed by the student. The student does not become a perfect copy of the teacher's knowledge — its parameter count and pretraining set limit how much it can hold. But teacher-specific facts that show up in generated examples are retained at rates roughly proportional to how often they appear and how well the student's pretraining supports them.

Reasoning transfers more than you'd expect, but with caveats

The DeepSeek-R1 distillation result was striking: distilled smaller models retain a surprisingly large fraction of R1's reasoning capability. A reasonable hypothesis is that the long chain-of-thought traces themselves carry most of the signal — they make the reasoning explicit in a form that supervised learning can absorb. The capability that transfers is "produce reasoning of this shape." Whether the student can actually solve harder problems beyond its parameter-count ceiling is unclear; what is clear is that it learns to attempt problems in the teacher's style.

The caveat: distilled reasoning models are bounded by the teacher's correctness in the training data. If the teacher gets a class of problems wrong, the student inherits those errors. Filtering by verifiable correctness during the distillation step largely solves this for verifiable domains.

Calibration and uncertainty transfer poorly

A persistent finding: students inherit the teacher's confidence levels rather than the teacher's accuracy. When the teacher is wrong but confident, the student becomes wrong and confident. Calibration is one of the more brittle properties under distillation, and explicit calibration fine-tuning is often needed afterwards.

What does not transfer

  • Pretraining-bound knowledge the teacher has but the student's parameters cannot hold. There is a hard capacity ceiling.
  • Behaviors the teacher only exhibits rarely (long-tail capabilities that don't show up in the generated set).
  • Tool-use precision in many cases — students learn the form of tool calls from teacher traces but often fail at the precise arguments.
  • Multi-step planning beyond what the student's own pretraining can support. Distillation can elicit slightly more than the student's baseline, but the gap to the teacher remains large on planning-heavy tasks.

The practical implication

Plan your distillation around what transfers. Style, instruction-following, formatting, refusal patterns, and the shape of reasoning are easy wins. Pure knowledge transfer is bounded by capacity. Calibration needs an additional explicit step. And true planning capability is mostly a function of the student's own pretraining; distillation will not paper over a weak base model.


Self-improvement and bootstrapping

A particularly interesting pattern: a model improves itself by generating data, filtering it, and training on the survivors.

The basic loop

  1. Model generates examples (often using techniques like chain-of-thought or multi-sample voting).
  2. Examples are filtered by automatic verifiers (test suites, math checkers, reward models).
  3. Surviving examples train an improved model.
  4. The improved model generates better examples. Repeat.

STaR (Self-Taught Reasoner)

Zelikman et al., 2022 demonstrated this loop for math reasoning. The model generates reasoning chains; chains that lead to correct answers are kept; the model is fine-tuned on them. Performance improves over iterations.

Verifier-driven bootstrapping

For verifiable domains, the loop is robust because the verifier provides ground truth. The model can't fool itself.

Limits

  • The improvement plateaus at the verifier's quality ceiling.
  • For non-verifiable domains, bootstrapping is harder; bad examples can compound.

The DeepSeek-R1 recipe and related reasoning-model work make heavy use of this pattern with verifiable rewards.


Self-improvement loops at frontier labs

The self-improvement loop has gone from research curiosity to production strategy. The basic mechanism is the same as STaR but the engineering and the scale are different. A few patterns are visible in the public record.

The frontier loop, abstracted

  1. A strong checkpoint generates candidate responses (or reasoning chains, or judgments) for a large prompt set.
  2. A filter — verifier, judge model, or reward-model ensemble — selects the survivors.
  3. The survivors train the next checkpoint, either via SFT or via RL using the filter output as the reward.
  4. The new checkpoint becomes the generator for the next round.

This is not exotic. It is what every frontier lab now does, in various flavors, between major model releases.

Self-Rewarding Language Models

Yuan et al., 2024 (arXiv:2401.10020). A single model serves as both generator and judge. It generates responses and judges them; the resulting preferences train it via DPO. Over iterations, both its responses and its judgments improve. The signature observation: judgment ability improves alongside generation ability, which suggests the loop is not just amplifying a fixed signal — it is genuinely extracting and refining latent capability.

The caveat: the loop is bounded by what the model can in principle assess. For tasks where the judge is wrong in the same direction as the generator is wrong, self-rewarding amplifies the error.

Constitutional AI as a self-improvement loop

The Anthropic Constitutional AI recipe (Bai et al., 2022 — arXiv:2212.08073) is a self-improvement loop with an explicit alignment target. The constitution acts as an external anchor that prevents the loop from drifting into local minima. The generator critiques and revises its own outputs against the constitution; the survivors train the next round. Constitutional AI is the strongest example of how an external specification — even a paragraphs-long written rubric — can stabilize a self-improvement loop that would otherwise drift.

Reasoning self-improvement at DeepSeek

DeepSeek's R1 recipe is the most publicly documented case of a self-improvement loop combined with verifiable rewards. The R1-Zero ablation shows that a base model running pure-RL self-improvement against verifiable math and code rewards develops sophisticated reasoning behavior on its own — no human data in the inner loop. The production R1 recipe layers a small SFT stage on top to clean up the format, then runs further RL, then distills the resulting model into smaller students. Each loop iteration is itself a self-improvement pass.

Reported patterns at OpenAI and Anthropic

Less is public, but credible reports describe similar patterns. The o-series reasoning models reportedly use large-scale self-improvement loops over verifiable problem sets. Anthropic's reasoning modes appear to combine constitutional self-critique with verifier-driven loops. The exact recipes are proprietary; the structural pattern — generator → filter → train → generator — is shared across the field.

Why this works at frontier scale

The labs have three advantages that make self-improvement loops particularly powerful for them:

  • Compute to run rollouts at scale. Generating millions of candidates per loop iteration is feasible only with substantial inference infrastructure.
  • High-quality filters. Strong verifiers (for math, code), strong judge models, large RM ensembles. The quality of the filter sets the ceiling of the loop.
  • Iteration speed. Frontier labs can run many short loops in parallel, ablate which filter / generator combinations work, and pick the survivors. The loop is itself part of a larger experimentation portfolio.

The limits

Self-improvement loops are bounded by the filter. If the filter has a blind spot — a class of subtly-wrong responses it scores highly — the loop amplifies that blind spot. The defenses are external evaluation, periodic human review, diversity in filter design (ensembles, different model families as judges), and explicit anchors (constitutions, verifiable rewards) wherever they apply.

A failure mode worth naming: "mode collapse via self-improvement." A loop run too long against a single judge collapses the generator's distribution toward the judge's preferred outputs. The signal looks like accept rates rising while diversity falls. Production pipelines mix the synthetic data with real data, run multiple parallel loops with different filters, and explicitly monitor diversity to prevent this.


Verifiable-reward generation

The most reliable synthetic-data pattern: generate problems, generate solutions, verify automatically.

Math

  • Generate a math problem (varying difficulty).
  • Generate a solution with reasoning.
  • Symbolically or numerically check the solution.
  • Keep only correct ones.

Scales to millions of high-quality training examples in math reasoning.

Code

  • Generate a programming problem.
  • Generate a solution.
  • Generate test cases.
  • Run the tests in a sandbox.
  • Keep only solutions that pass.

This is how labs train models specifically on code reasoning at huge scale.

Other verifiable domains

  • Theorem proving (with formal proof assistants).
  • Game playing (with game-state evaluation).
  • SQL generation (with query execution against test databases).
  • Structured data extraction (with format validation).

The frontier of synthetic data is partly about expanding what counts as "verifiable" — bringing more domains into the regime where automatic filtering works.


Synthetic data for safety (red-team data generation)

Safety post-training has its own synthetic-data subspecialty. The same generation-and-filter pattern applies, but the objective is different: surface failure modes that the model should learn to refuse or handle correctly, without inadvertently teaching the harmful capability.

What red-team synthetic data looks like

The data consists of (prompt, target-response) pairs where:

  • The prompt is an adversarial or unsafe request — a jailbreak attempt, a request for harmful content, a deceptive framing, a tricky edge case.
  • The target response is the desired safe behavior — a refusal with explanation, a redirection, a safety-aware partial answer, or a careful handling of an ambiguous case.

Production safety pipelines generate millions of such pairs spanning the full taxonomy of safety concerns: harmful capability requests, deceptive prompts, identity-based attacks, manipulation attempts, privacy violations, and many more categories.

Generation strategies

  • Taxonomy-driven generation. Start from a written taxonomy of harm categories. For each category, prompt a generator to produce a wide variety of attempts in that category. Cover obvious cases plus creative variations.
  • Jailbreak-style synthesis. Use known jailbreak patterns (role-play framings, multi-turn manipulation, indirect requests, prompt injection) as templates, generate variations, and produce target safe responses for each.
  • Persona conditioning for adversarial diversity. Condition the generator on adversarial-user personas to widen the distribution of attack styles. The constitutional AI recipe (Bai et al., 2022 — arXiv:2212.08073) uses something similar.
  • Boundary-case generation. Many safety failures live in ambiguous regions where reasonable responses differ. Generating examples that explicitly probe the boundary — and labeling the desired response — is one of the higher-leverage uses of red-team synthesis.

The teaching-capability concern

A persistent worry: synthetic red-team data could inadvertently teach the harmful capability while teaching the refusal. The standard mitigation is to keep the harmful detail in the prompt and out of the target response, and to filter generated data for examples where the target response itself leaks unsafe content. In practice this filtering is the most labor-intensive part of safety synthesis.

The GPT-4 system card (cdn.openai.com/papers/gpt-4-system-card.pdf) is the most thorough public discussion of how a frontier lab structures this work: an explicit red-team process produces high-quality seed examples; synthetic generation amplifies them at scale; multiple layers of review filter for capability leakage.

Verifying safety-data quality

Safety data is harder to verify than math or code. The "right answer" is a judgment call, not a deterministic check. Filters typically combine:

  • A safety classifier (often a specialized smaller model) checking that the target response is actually safe.
  • A judge model evaluating whether the target response handles the prompt appropriately (refuses with explanation, redirects, clarifies, etc., as appropriate).
  • Human review of a sample, especially on edge cases.

Composition with capability training

A frontier safety post-training stage is rarely just synthetic red-team SFT. The data feeds into a multi-stage pipeline: SFT on safety pairs, DPO with safety preference data, then a final pass that mixes safety data with capability data to prevent regression. The safety stage cannot stand alone — without capability data, the model becomes excessively cautious. The mixing ratio is itself a tuning parameter.

The honest limits

Synthetic red-team data covers known categories well. It is less effective at covering unknown unknowns — failure modes that the taxonomy did not anticipate. Continuous adversarial probing (human red-teamers, automated jailbreak search, deployment monitoring) is required to find new failure modes, which then feed back into the synthetic generation step. The loop is permanent; no static synthetic safety dataset stays current for long.


Infrastructure for synthetic generation

Generating billions of synthetic examples is itself a large inference workload.

Massive batch inference

Synthetic data generation is throughput-optimized, not latency-optimized. Big batches, large prompt context, long outputs.

Common patterns:

  • Dedicated inference clusters separate from production serving.
  • High batch sizes (256+).
  • FP8 or INT4 weights for cost — see quantization tradeoffs for what's safe.
  • Possibly older hardware (since latency doesn't matter).
  • Aggressive use of speculative decoding where applicable to cut wall-clock per token.

Verification compute

Each verification step is its own workload:

  • Math checkers (CPU-bound, fast).
  • Code execution (sandboxed, slower).
  • Judge models (LLM inference, similar to generation cost).

For some pipelines, verification compute exceeds generation compute.

Storage and indexing

Trillions of generated tokens must be stored, deduplicated, indexed for retrieval. This is data-engineering at scale, with all the usual concerns: data lake architectures, embedding-based search, versioning.

Quality monitoring

The pipeline produces data continuously; quality monitoring runs alongside:

  • Accept-rate tracking.
  • Distribution drift detection.
  • Periodic random audits.

Production deployments

What real labs do:

Frontier labs (OpenAI, Anthropic, Google): synthetic data is a substantial fraction of training mix. Pipelines are proprietary but substantial — multi-team engineering investments.

Open-weights labs (Meta, Mistral, DeepSeek, Qwen): published recipes increasingly describe synthetic data pipelines. DeepSeek-R1's recipe is detailed; LLaMA's are partially documented.

Smaller labs and companies: use synthetic data heavily for domain-specific fine-tuning. Often start with hand-written seeds + frontier-model generation + filtering.

Distillation deployments: routing systems that use smaller distilled models for most traffic and larger models only when needed. Common across hosted providers.


Open problems

Synthetic data for non-verifiable tasks. Most reliable patterns are in verifiable domains. Extending the rigor to subjective domains (writing, creativity, judgment) is open.

Long-horizon synthetic data. Generating quality examples that span many steps or long contexts is harder than generating short examples.

Detecting subtle quality issues. A model trained on filter-passing synthetic data still inherits the filter's biases. Better quality-control methods are an active area.

Cross-modal synthetic data. Generating synthetic image-text or video-text data with quality matching real curated data.

Synthetic data for safety. Generating examples that improve model safety without inadvertently teaching harmful capabilities.

Distillation that exceeds the teacher. Standard distillation is bounded by teacher quality. Active research into whether students can in some sense exceed their teachers (through curriculum, multi-teacher distillation, or self-improvement).


Open datasets and recipes worth studying

The open ecosystem has accumulated enough public synthetic-data work that you can reproduce most of the recipes without lab-internal access. The datasets and reports below are the ones worth reading end-to-end before designing a pipeline.

Dataset / recipe Source Approx. size What it demonstrates
Alpaca Stanford (Taori et al., 2023) 52K instructions Self-Instruct from GPT-3.5 onto Llama-7B base. The kickoff dataset.
WizardLM (Evol-Instruct) Xu et al., 2023 ~250K instructions Difficulty evolution; covers harder instruction-following.
OpenHermes / OpenHermes-2.5 Teknium ~1M conversations Aggregated multi-source synthetic instruct data.
UltraChat Tsinghua, 2023 ~1.5M dialogues Multi-turn synthetic dialogue at scale.
UltraFeedback Cui et al., 2023 ~64K preference pairs AI-feedback preference data for DPO.
Magpie Xu et al., 2024 up to 4M pairs Template-prime trick for diverse synthesis.
OpenOrca Lian et al., 2023 ~4M examples Distillation of GPT-4 reasoning traces.
MetaMathQA Yu et al., 2023 ~395K math problems Verifier-filtered synthetic math.
Code-Alpaca / Magicoder Wei et al., 2023 up to ~110K code samples Self-Instruct for code with execution-filtering.
Tülu 3 SFT mix Lambert et al., 2024 ~1M examples Reference open post-training data mix; documented composition.
DeepSeek-R1 distillation set DeepSeek, 2025 ~800K reasoning traces Reasoning distillation from a frontier reasoning model.
Persona Hub Tencent, 2024 ~1B personas Persona-conditioned generation at web scale.

What to read in each

The headline takeaways: Alpaca shows the cheapest possible recipe and its limits. WizardLM's Evol-Instruct introduces difficulty evolution as a standalone technique that compounds with any seed-based pipeline. The Tülu 3 report (Lambert et al., 2024 — arXiv:2411.15124) is the single most useful document for understanding a modern open post-training data mix; the per-category composition tables alone repay study. DeepSeek-R1's appendix documents the rejection-sampling reasoning distillation pipeline in enough detail to reproduce on smaller scales. The Persona Hub release shows how persona-conditioning unlocks distributional diversity that single-seed pipelines cannot match.


Economics of synthetic-data pipelines

The economics of synthetic data are unusual: generation compute and verification compute are the major line items, human labor is small after pipeline setup, and the unit cost per accepted example drops by 1–2 orders of magnitude with engineering investment. Understanding this cost shape is what separates teams that scale pipelines efficiently from teams that burn budgets generating garbage.

Cost per accepted example, by domain

Approximate 2026 figures on commodity inference infrastructure:

Domain Generator cost per attempt Verifier cost per attempt Accept rate Cost per accepted example
Math (verifier-filtered) $0.005 (one teacher pass) $0.0001 (symbolic check) 30–60% ~$0.01–$0.02
Code (test-filtered) $0.01 (longer generation) $0.001 (sandbox execution) 20–50% ~$0.02–$0.06
Self-Instruct chat $0.002 (short generation) $0.001 (judge model) 40–70% ~$0.004–$0.008
Reasoning trace distillation $0.05–$0.20 (long CoT) $0.005 (verifier or judge) 10–30% ~$0.20–$2.00
Constitutional safety pairs $0.01 (critique + revise) $0.003 (safety judge) 15–30% ~$0.05–$0.10

A 1M-example synthetic math dataset thus costs $10K–$20K of pure compute to produce; a 1M-example reasoning-distillation dataset can cost $200K–$2M. The difference between these orders of magnitude is mostly trace length and the accept rate of expensive verifiers. The optimization that moves the needle most: improving accept rates via better prompts, before adding compute.

Where engineering investment pays off

The biggest single cost reduction in any synthetic-data pipeline comes from prompt engineering on the generator. A 2× improvement in accept rate halves the cost per accepted example, and prompt engineering routinely produces 2–10× accept-rate gains for a fixed engineering week. The second largest win is verifier reuse — sharing one verifier deployment across many concurrent generation streams. Generation parallelism is third; once accept rate and verifier throughput are tuned, throwing more inference compute at the problem is the lever that scales most predictably.

Compute mix: training-class vs older hardware

Synthetic generation is throughput-bound, not latency-bound. This is the right workload for older or cheaper hardware: H100s instead of B200s for the generator, A100s for the verifier model, CPU farms for symbolic checks and code execution. The generator does not need a serving SLA; it can run with very large batches, FP8 weights, aggressive speculative decoding, and overnight scheduling on spot-priced capacity. The cost gap between an optimized batch-generation cluster and a naive production-inference deployment can exceed 5×.


Detection: how researchers spot distilled models

A practical concern for anyone shipping a distilled model: how easy is it for outside researchers to detect that a model has been trained on a specific teacher's outputs? The honest answer in 2026 is: easier than most teams realize.

Surface-style fingerprints

Frontier teachers have recognizable writing patterns — specific phrasings, common refusal templates, characteristic markdown habits, signature reasoning openings ("Let me think about this step by step..."). A student trained on a teacher's outputs inherits these surface fingerprints with high fidelity. Researchers have demonstrated that simple n-gram and embedding-based detection can identify the teacher with >90% accuracy on most distilled models, especially when the distillation set is not heavily filtered or mixed with diverse other data.

Knowledge fingerprints

A teacher's specific factual errors, idiosyncratic opinions on contested questions, and characteristic ways of framing ambiguous topics show up in student outputs. The "do you know about [specific obscure topic the teacher would not know]?" probe is a classic detection technique — a student that "knows" exactly the same obscure facts as the teacher, including the teacher's misconceptions, is a strong indicator of distillation.

Behavioral fingerprints

Teachers have characteristic latency-quality tradeoffs, refusal behaviors on borderline prompts, and edge-case handling. A distilled student often inherits these even when the surface text differs. Adversarial probing — prompts designed to elicit teacher-specific behaviors — is a more reliable detection technique than surface analysis alone.

Defenses

For teams that need to distill but want to avoid attribution: heavy filtering and rewriting, mixing with diverse other data sources, paraphrasing teacher outputs through an intermediate model, and explicit anti-fingerprint fine-tuning can reduce but not eliminate the signal. The most effective defense is to use the teacher only for capability shaping and to do the bulk of post-training with a different teacher or with synthetic-from-scratch approaches like Magpie applied to a different base.

The legal angle

Most frontier API terms of service explicitly prohibit using outputs to train competitor models. Detection methods are now mature enough that pretending compliance is risky. Open-weight teachers (Llama, Qwen, DeepSeek, Mistral families) are the safer choice for commercial distillation; their licenses generally permit synthetic-data generation for downstream training.


Dataset deep dive: Alpaca through Tulu 3 and the post-training canon

A tour of the open instruction-tuning datasets that defined post-training in 2023–2026. Each had a specific role; together they're the canon serious labs work from.

Alpaca (Stanford, March 2023)

github.com/tatsu-lab/stanford_alpaca. 52k instructions generated by GPT-3.5 via Self-Instruct seeded from 175 human-written tasks. The first widely-replicated demonstration that a 7B model fine-tuned on synthetic instructions could match much larger models on chat benchmarks. License: research-only due to OpenAI ToS at the time.

Vicuna / ShareGPT (UC Berkeley, March 2023)

70k user-shared ChatGPT conversations harvested from ShareGPT. Vicuna-13B fine-tuned on this data scored ~90% of ChatGPT quality in the GPT-4-judge eval that the team also pioneered. Foundational for the open chat-model lineage. License: ambiguous (user-generated content with no clean license).

WildChat (Allen AI, 2024)

huggingface.co/datasets/allenai/WildChat-1M. 1M real ChatGPT conversations collected with explicit user consent via an alternative GPT-3.5/GPT-4 interface. Cleaner license than ShareGPT; broader coverage. Used heavily for instruction fine-tuning.

OpenAssistant Conversations (LAION, 2023)

Crowd-sourced conversations + ratings. ~10k high-quality dialog trees. Used as preference data for early open RLHF models. Apache 2.0.

UltraChat (Tsinghua, 2023)

1.4M multi-turn conversations generated by two ChatGPT instances roleplaying. Used to train Zephyr-7B (HuggingFace, Oct 2023) and many successors. Important for multi-turn fine-tuning data at scale.

UltraFeedback (Tsinghua, 2023)

Preference data: 64k prompts × 4 model responses × scores from GPT-4. Used as preference data for DPO and similar methods. The default open-weight preference dataset 2023–2024.

OpenHermes (Nous Research, 2023)

1M+ instruction-following examples curated from multiple sources. Used to train Nous-Hermes family. License: mixed but largely permissive.

OpenOrca and SlimOrca (2023)

OpenOrca reproduced Microsoft's Orca paper (synthesizing explained reasoning from GPT-4 over FLAN tasks). 4M examples. SlimOrca is a filtered subset (518k high-quality examples) — strong cost-quality tradeoff.

Nemotron-CC (NVIDIA, 2024)

research.nvidia.com/labs/adlr/Nemotron-CC. 6.3T tokens reformulated from Common Crawl using NVIDIA's Nemotron-4. Reformulation = take low-quality web text, rewrite with the model into higher-quality educational text. The Nemotron family pioneered this approach at trillion-token scale.

DCLM (Apple, MIT, July 2024)

datacomp.ai. DataComp-LM. A competition-style dataset benchmark. DCLM-Baseline is a 3.8T-token cleaned web dataset that became the new high-quality web baseline.

FineWeb / FineWeb-Edu (HuggingFace, 2024)

huggingface.co/datasets/HuggingFaceFW/fineweb. FineWeb is 15T cleaned web tokens; FineWeb-Edu (Aug 2024) is the educational subset, ~1.3T tokens, filtered by a classifier trained to predict educational value. Smaller models trained on FineWeb-Edu outperform same-size models trained on raw web — a clear illustration that quality > quantity.

Cosmopedia (HuggingFace, Feb 2024)

huggingface.co/datasets/HuggingFaceTB/cosmopedia. 25B tokens of synthetic textbook-style content generated by Mixtral-8x7B. The largest fully-open synthetic pretraining dataset at release. Demonstrated open-community reproduction of Phi-style synthetic pretraining.

MathPile, OpenMathInstruct, NuminaMath

Math-specific datasets. MathPile (2023): 9.5B tokens of curated math content. OpenMathInstruct (NVIDIA, 2024): 1.8M math problems with synthetic solutions. NuminaMath (Numina, 2024): 860k math problems with verified solutions — the dataset that won NeurIPS-2024 math olympiad.

Tulu 3 (Allen AI, November 2024)

allenai.org/tulu. A complete open recipe for post-training: 960k SFT examples + RLVR + DPO + safety tuning. Tulu-3-70B matched Llama-3.1-70B-Instruct on most benchmarks; the recipe was fully documented and reproducible. Reference recipe for serious open post-training in 2025.

DeepSeek-Coder / StarCoder data

DeepSeek-Coder data (2024): 2T+ tokens of code. StarCoder (BigCode, 2023): The Stack v2 (~3T cleaned permissive-license code). These define the open-code-data canon.

RedPajama, Dolma, Common Crawl Stuff

RedPajama-v2 (Together, 2023): 30T web tokens with quality scores. Dolma (Allen AI, 2023): 3T multi-source tokens. These plus Common Crawl form the open web-data baseline.

Summary table

Dataset Year Tokens / examples Purpose License
Alpaca 2023 52k examples SFT seed Research-only
ShareGPT 2023 ~70k convs SFT (early) Ambiguous
WildChat 2024 1M convs SFT Permissive
OpenAssistant 2023 10k dialogs SFT + preference Apache 2.0
UltraChat 2023 1.4M convs SFT multi-turn MIT
UltraFeedback 2023 64k prompts DPO preference MIT
OpenHermes 2.5 2023 1M examples SFT mix Mixed permissive
OpenOrca / SlimOrca 2023 4M / 518k SFT with reasoning MIT
Nemotron-CC 2024 6.3T tokens Pretraining (reformulated) NVIDIA terms
DCLM-Baseline 2024 3.8T tokens Pretraining (web) Various
FineWeb / FineWeb-Edu 2024 15T / 1.3T Pretraining ODC-By
Cosmopedia 2024 25B Pretraining (synthetic) Apache 2.0
Tulu 3 SFT 2024 960k SFT recipe ODC-By
NuminaMath 2024 860k Math SFT/RL Apache 2.0
The Stack v2 2024 ~3T Code pretraining Permissive licenses

Pretraining synthetic datasets: Cosmopedia, Nemotron-CC, FineWeb-Edu

The 2024–2026 shift in pretraining: less raw web, more curated and synthetic content. Three exemplars and the lessons each carries.

Phi family (Microsoft, 2023–2025)

Phi-1 (June 2023) trained 1.3B parameters on ~7B tokens of "textbook-quality" synthetic data — code-explanation textbooks generated by GPT-3.5/4. Achieved HumanEval ~50%, comparable to much larger models.

Phi-1.5, Phi-2 followed with broader synthetic content. Phi-3-mini (3.8B, April 2024) trained on 3.3T tokens, ~70% synthetic/curated, achieving MMLU ~69%. Phi-4 (14B, Dec 2024) continued the recipe.

Lessons: synthetic data quality + careful curation beats scale; small models trained on high-quality data outperform large models trained on raw web; the bottleneck is "what good educational text looks like at scale."

Nemotron-CC: rewrite the web

NVIDIA's pipeline: take a Common Crawl document, prompt a strong model ("rewrite this as a high-quality educational article"), keep the rewrite as a training example. Applied at 6.3T-token scale.

The defensible insight: most web text is structurally low-quality (boilerplate, ads, repetition) but contains useful information. Rewriting transforms quality while preserving information.

Costs: rewriting 6T tokens at frontier-API rates would cost hundreds of millions; NVIDIA used in-house Nemotron-4 340B with batch inference + custom kernels to bring effective cost to a manageable level.

FineWeb-Edu: filter ruthlessly

HuggingFace's approach: train a classifier (small model) to predict whether a document is "educational"; keep only documents scoring high. Applied to 15T-token FineWeb to yield 1.3T-token FineWeb-Edu.

Result: 1.5B models trained on FineWeb-Edu outperform same-size models trained on FineWeb (raw) by 2–4 points on MMLU. The filtering is cheap (forward pass per document); the quality lift is real.

The pretraining mix in 2026

Frontier pretraining mixes in 2026 typically use:

  • 30–60% high-quality curated/synthetic content (Cosmopedia-style, Nemotron-CC-style).
  • 30–50% high-quality filtered web (FineWeb-Edu, DCLM-Baseline).
  • 5–15% code, math, scientific papers.
  • 1–5% multilingual.
  • 1–5% reasoning traces (R1-style for reasoning capability transfer).

Each lab's exact mix is closely held; the directional shift toward synthetic-heavy is public.


Synthetic instruction pipelines: Evol-Instruct, Self-Instruct, Magpie, AutoIF

How modern instruction datasets are actually generated. Each technique has a different operating principle.

Self-Instruct (Wang et al., 2022)

Seed with ~175 human-written examples; prompt a strong model to generate similar examples; deduplicate; iterate. Used to create Alpaca. Simple, scalable, but quality varies.

Evol-Instruct (WizardLM, 2023)

Take a seed instruction; iteratively "evolve" it via two operators: deepen (add constraints, increase complexity) and broaden (change topic, generalise). Produces a diverse, increasingly-hard instruction set. WizardLM-30B trained on Evol-Instruct data was state-of-the-art for open models at release.

Magpie (UMass, 2024)

arxiv.org/abs/2406.08464. A different trick: prompt an instruction-tuned model with only the assistant-turn template (no user message); the model "imagines" the user turn it would have responded to. Generates instructions for free, scaled to millions. Quality competitive with curated datasets.

AutoIF (Alibaba, 2024)

Automatic instruction-following data: generate (instruction, response, verifying-code) triples where the code verifies the response satisfies the instruction. Yields a self-verifying training set; high-quality data for instruction-following capability.

Persona-driven generation

Generate instructions conditioned on a persona ("you are a confused beginner asking about X"). Diversifies the instruction distribution; covers user populations not in raw scrapes. Used in the WildChat curation and persona-hub datasets.

Multi-turn from web seeds

Take a web article; generate a multi-turn Q&A conversation about it. Produces grounded multi-turn data with rich contextual reasoning. Used in UltraChat and others.

Quality vs quantity

A 2024 emerging finding: 100k high-quality instructions beat 1M average-quality ones for SFT. Quality filtering (next section) matters more than raw generation throughput.


Distillation deep dive: logit, response-only, on-policy, MiniLLM, DistillKit

Distillation has multiple flavors. Each transfers different signals.

Response-only distillation (sequence-level)

Generate responses from the teacher; train the student via standard cross-entropy on the responses (treat them as gold labels). Simple, no requirement to access teacher logits. The bulk of open-community distillation (Alpaca, Vicuna, R1-Distill) is this.

Logit distillation (token-level KL)

Train the student to match the teacher's full output distribution at each token, not just the argmax. Requires access to teacher logits (expensive to store; impossible if teacher is closed). Transfers richer information per token. Used in research and lab-internal distillation; rare in open community.

On-policy distillation

Have the student generate its own outputs; have the teacher score or correct them; update the student. The student learns from its own mistakes rather than from teacher outputs it would never have generated. Used in MiniLLM (Microsoft, 2023) and similar.

Off-policy distillation

The student trains on teacher-generated outputs (the student didn't produce them). The default mode. Cheaper but the student may struggle with distribution shift.

MiniLLM (Microsoft, 2023)

arxiv.org/abs/2306.08543. Reverse-KL distillation: instead of matching teacher distribution at every token (forward KL), reverse the direction. Reduces the student's tendency to spread probability mass across teacher's low-probability tokens. State-of-the-art for small-model distillation at release.

DistillKit (Arcee AI, 2024)

github.com/arcee-ai/DistillKit. Production-grade distillation framework. Implements logit-, hidden-state-, and response-distillation. Used by Arcee's commercial small-model distillation services.

R1-Distill technique

DeepSeek's documented approach for R1-Distill: generate 800k high-quality reasoning traces from R1 671B on math/code/science prompts (verified for correctness); SFT smaller base models (Qwen, Llama at various sizes) on these traces. No RL on the smaller models. Result: small models inherit substantial reasoning capability at SFT-only cost.

Notable: R1-Distill is response-only distillation. The R1 paper documents that they tried RL on smaller models and it underperformed pure SFT-on-R1-traces — small models benefit more from imitating a strong teacher than from trying to learn reasoning from scratch.

What signals transfer

  • Format and structure: easily transferred (the student picks up the teacher's output formatting).
  • Common knowledge: transferred to the extent the student has capacity.
  • Reasoning patterns: substantially transferred (the basis of R1-Distill).
  • Tail knowledge: not transferred (the student lacks parameters to store it).
  • Calibration: poorly transferred (small models tend to be overconfident even after distillation).

Compute economics

Distillation is much cheaper than training from scratch. For a 32B target from a 671B teacher:

  • Teacher-output generation: ~$50k–$500k (depending on response length and infrastructure).
  • Student SFT: 50–500 GPU-hours per epoch on the distillation set (~$10k–$50k for a small fine-tune).
  • Total: ~$60k–$550k for a strong distilled 32B model.

Compare to training a 32B from scratch on FineWeb-Edu (~$1M+ compute). Distillation is the cheap path to strong small models when you have a teacher you can call.


R1-Distill technique and model-specific distillation case studies

Specific examples of distillation in practice with documented results.

DeepSeek-R1-Distill family

Released January 2025 alongside R1. Six models distilled from R1's reasoning traces:

Model Base AIME 2024 MATH-500 GPQA Diamond License
R1-Distill-Qwen-1.5B Qwen-2.5-Math-1.5B 28.9% 83.9% 33.8% MIT
R1-Distill-Qwen-7B Qwen-2.5-Math-7B 55.5% 92.8% 49.1% MIT
R1-Distill-Qwen-14B Qwen-2.5-14B 69.7% 93.9% 59.1% MIT
R1-Distill-Qwen-32B Qwen-2.5-32B 72.6% 94.3% 62.1% MIT
R1-Distill-Llama-8B Llama-3.1-8B 50.4% 89.1% 49.0% MIT/Llama
R1-Distill-Llama-70B Llama-3.3-70B 70.0% 94.5% 65.2% MIT/Llama

The 32B-Qwen model became the practical workhorse for self-hosted reasoning — strong, fits on one H100 at FP8, MIT-licensed.

Anthropic Haiku from Sonnet (rumored workflow)

Anthropic hasn't publicly documented its distillation pipeline but the pattern visible from model behavior suggests: Sonnet is the production teacher for Haiku's training data; Opus is the research teacher for Sonnet. The Anthropic-published Constitutional AI papers describe a similar self-improvement loop.

OpenAI training distillation (rumored)

OpenAI's o3-mini and 4o-mini families are widely understood to be distilled from larger models. Specifics: closed. The performance/size pattern strongly suggests distillation in the training pipeline.

Microsoft Phi from GPT-4

Phi-3-mini and Phi-4 used synthetic textbook content (GPT-4 generated) plus filtered web. This is distillation by another name — the small model learns from outputs of a larger model.

Cohere Command R from Command R+

Cohere's R/R+ family demonstrates a similar pattern: larger model's outputs serve as teaching signal for smaller variants.

Open-community distillations 2024–2026

The community shipped dozens of distilled models on open backbones:

  • Dolphin variants (Eric Hartford / Cognitive Computations).
  • OpenHermes successors.
  • Nous Hermes 3.
  • Various LLaVa multimodal variants distilled from frontier multimodal models.

The 2026 reality: most open small models in production are distillates of frontier models, not from-scratch training.


RLHF preference data: UltraFeedback, HH-RLHF, Constitutional AI

Preference data is the input to RLHF and DPO. Sources and methods.

Human preference datasets

  • HH-RLHF (Anthropic, 2022) — 161k pairs of helpful/harmless preferences. The first open RLHF preference dataset.
  • OpenAssistant preferences — crowd-sourced ratings.
  • WebGPT comparisons (OpenAI, 2021) — pairs from research models.

Synthetic preference data

  • UltraFeedback (Tsinghua, 2023) — GPT-4-rated preferences over 4 model responses across 64k prompts. The default open preference dataset.
  • Nectar (Berkeley, 2023) — preferences over 7 models' responses.
  • HelpSteer / HelpSteer2 (NVIDIA, 2024) — fine-grained multi-attribute ratings (helpfulness, correctness, coherence, complexity, verbosity).

Constitutional AI (Anthropic, 2022)

arxiv.org/abs/2212.08073. Generate AI feedback against a "constitution" (set of principles). The model critiques its own outputs against the principles; the critique becomes training signal. Reduces dependence on human raters for safety-relevant feedback. Foundational for Anthropic's training pipeline.

RLAIF (RL from AI Feedback)

The generalisation of Constitutional AI: use AI feedback in place of human feedback for preference data. Cheaper, more scalable. The 2024–2026 standard for most production preference data — humans review samples; AI generates the bulk.

DPO and the simplification

DPO (Direct Preference Optimization, Rafailov 2023) reformulates RLHF as a supervised loss on preference pairs. No reward model needed. The 2024 default for open-community alignment because of operational simplicity. Variants: IPO, KTO, SLiC, SimPO — each tweaks the loss for different empirical advantages.

Preference data quality

  • Diversity — covering many domains and styles.
  • Difficulty — including hard pairs where the right answer is non-obvious.
  • Calibration — the strength of preference matters (slightly better vs much better).
  • Multi-attribute — separate axes (helpful, harmless, honest) rather than monolithic "better."

The 2026 frontier in preference data: fine-grained attribute ratings + skill-specific pairs + adversarial preferences (intentionally edge-case examples).


Legal landscape: copyright, fair use, NYT v. OpenAI, output license terms

The legal questions around synthetic data and distillation are unresolved through 2026.

Training data copyright

The core question: does training a model on copyrighted text constitute infringement? The US position is contested:

  • NYT v. OpenAI (filed Dec 2023, ongoing 2025). New York Times sued OpenAI claiming GPT-4 reproduced NYT articles substantially. Discovery underway 2024–2025; resolution expected 2026 or later. The case will partially define training-data legal status.
  • Bloomberg / Concord Music / Universal Music lawsuits — similar claims for music/IP.
  • Sarah Silverman / authors' lawsuits against OpenAI, Meta — class-action by authors.
  • Various artists vs Midjourney / Stability — image-generation training claims.

The early summary judgments have varied. Many fair-use defenses have survived motions to dismiss; some haven't.

Robots.txt and access

OpenAI introduced GPTBot user-agent in August 2023 with robots.txt opt-out support. Anthropic's ClaudeBot, Google's Google-Extended followed. These are not legally binding (no statute requires honoring robots.txt) but represent industry norm.

Output license terms

The question that matters for distillation: can you train on outputs of a closed model? Per provider:

  • OpenAI Terms of Service: prohibit "using output to develop models that compete with OpenAI." Interpreted strictly, prohibits open-source distillation. Enforcement: unclear.
  • Anthropic ToS: similar prohibition on competitive model development.
  • Google Vertex AI: prohibit using output to train models.
  • Meta Llama license: permissive — outputs are usable; derivative models permitted up to 700M MAU.
  • Apache 2.0 / MIT models (DeepSeek R1, Qwen): no restrictions on output usage.

Practical implication: most open-community distillation happens from open-license teacher models (R1, Qwen, Llama). Distilling from closed APIs (OpenAI, Anthropic) is legally fraught even if technically feasible.

EU AI Act and training data

EU AI Act requires GPAI providers to publish summaries of training data. As implementation rolls out 2025–2026, more disclosure is required, surfacing dataset choices that were previously opaque.

Practical guidance

  • Use open-license teacher models for distillation when possible.
  • Document data provenance clearly (audit trail of every dataset source).
  • For commercial deployment, get legal review of your training data pipeline.
  • Track lawsuit outcomes; expect the legal landscape to keep shifting through 2027.

Distillation detection: fingerprinting models from outputs

Can you tell if a model was distilled from another? Increasingly, yes.

Stylistic fingerprinting

Models have characteristic linguistic patterns — word choice, sentence structure, common phrases. A model trained on GPT-4 outputs picks up GPT-4's distinctive style (use of "Certainly!", "It's worth noting that", "I hope this helps").

Detection: train a classifier on labelled outputs from many models; classify candidate model outputs. Accuracy on family-level detection (was this distilled from a GPT-family model?) ~85%; on specific-model detection lower.

Logit fingerprinting

If you have logit access to the candidate model, comparing logit patterns to known models reveals signatures. Used in research; impractical against closed APIs.

Self-identification probes

Ask the model "what model are you?" — distilled models often identify with the teacher ("I am ChatGPT"). Mitigated by post-distillation tuning specifically to override self-identification.

Hidden-state similarity

If you can probe internal activations, similar models produce similar activation patterns. Requires open weights or carefully designed probing experiments.

Watermarking outputs

Anthropic, Google, OpenAI have all explored output watermarking — subtly biasing the model's sampling so that outputs are statistically detectable as model-generated. Watermark survives some distillation; defeats casual fingerprinting evasion.

Why detection matters

  • License enforcement. If OpenAI ToS prohibits distillation, detection enables enforcement.
  • Academic integrity. Researchers claim "from-scratch training" but distilled from frontier — detection enables verification.
  • Provenance disclosure. EU AI Act may require disclosure of derivation; detection enables third-party verification.

Open question: how strict is detection?

Detection is probabilistic, not certain. A model can be distilled without obvious fingerprints if the distillation pipeline includes style-normalisation and tail-distribution adjustments. The 2026 state of art: distillation detection works on careless distillation; sophisticated distillation evades detection.


The diminishing-returns wall: what 2026 papers are saying

Synthetic data scaling has limits. The 2025–2026 literature is starting to characterize them.

Synthetic-data scaling laws

A 2024 trend: scaling laws specifically for synthetic data. Findings:

  • Returns to additional synthetic data diminish faster than returns to web data.
  • Quality matters more at scale; filtering aggressively beats generating more.
  • Mixing synthetic and human data has compounding benefits; pure-synthetic plateaus earlier.

Model collapse revisited

Shumailov et al. (Nature, July 2024) showed that training generations of models exclusively on synthetic data degrades. The original "model collapse" paper. Replications and extensions (2024–2025) confirmed: pure recursive synthetic training degrades; mixing real data prevents collapse.

The 2026 consensus: human data remains the anchor. Synthetic data is leverage, not replacement.

Quality-controlled bootstrap

The pattern that survives: synthetic data, aggressively filtered, mixed with real data, generates strong models. The 2026 frontier pipelines are 50–70% synthetic mixed with human/web data, with verifiable-rewards filtering wherever applicable.

The 2026 open questions

  • How far does the verifiable-rewards approach (R1, AlphaProof) generalise beyond math/code?
  • Does synthetic data for "general reasoning" (not domain-specific) keep scaling?
  • What's the equivalent of "FineWeb-Edu" for synthetic instruction data — what's the principled quality filter that keeps yielding gains?
  • Are we approaching saturation on the open instruction-tuning canon, or is the next 10× still ahead?

The honest answer through May 2026: nobody fully knows. The papers keep coming; each adds a tile to the mosaic; the full picture is still being painted.


Domain-specific synthetic data recipes

Synthetic data techniques specialised for particular domains. Each domain has its own constraints and best practices.

Math reasoning data

Pipeline: (1) seed with competition problems + textbook examples (NuminaMath, MATH train split, AMC archives). (2) Generate reasoning traces with a strong reasoning model (R1, o3 in batch). (3) Verify final answer via SymPy / numeric check / multiple-choice match. (4) Keep only traces with correct final answer. (5) Optional: have a separate model rate trace quality (clarity, no error chains); keep high-rated.

Yield: 30–70% of generated traces pass verification. The 800k-trace R1-Distill dataset reportedly came from generating 3–5M raw traces.

Code generation data

Pipeline: (1) seed with problem descriptions (LeetCode, HackerRank, BigCodeBench train). (2) Generate solutions with a strong coding model. (3) Execute against unit tests; keep passing solutions. (4) Optional: generate multiple solutions per problem; keep diverse ones (different algorithms / styles).

Yield: 40–60% pass rate on first generation; growing with model capability. The 2026 scaled production code datasets are dominated by this approach.

Long-context training data

Pipeline: take a long document; generate questions that require synthesising across the document; generate answers grounded in specific sections. Yields training data for long-context capability that the model otherwise struggles with.

Specifics: documents 100k+ tokens; multi-hop questions requiring multiple sections; answers with citations. Used in Gemini, Claude long-context training.

Multilingual data

Pipeline: take English instruction data; translate to target languages with strong translation models or native multilingual models; have native speakers review samples. Or: generate directly in target language with multilingual capable model.

Quality control: native speaker review of a 1–5% sample; back-translation check; perplexity vs reference multilingual data.

Tool-use / agentic data

Pipeline: define a tool set; generate user requests that require those tools; have a strong agent model demonstrate the tool-call sequence; verify the sequence achieves the goal. Used to train agent capability in models that didn't see agentic data in pretraining.

Safety / red-team data

Pipeline: define harmful categories; prompt a strong model to generate (refusal-worthy request, ideal refusal response) pairs; have safety experts review samples; use as SFT data to instil refusal behaviour.

Yield: most generations are usable; the bottleneck is category coverage (need diverse refusal scenarios).

RAG / retrieval-augmented data

Pipeline: take a corpus of documents; for each document, generate questions answerable from that document plus distractor documents; the training data is (question, retrieved documents including correct one, grounded answer). Trains both retrieval-aware generation and citation behaviour.


Open datasets worth studying in 2026

Beyond the canonical datasets covered earlier, the 2025–2026 open releases worth a serious look.

Tulu 3 SFT mix (Allen AI, Nov 2024)

960k examples carefully curated from many sources. Documented recipe; reproducible. Reference for open post-training in 2025.

OpenThoughts (Stanford / Sky-T1 lineage, 2025)

114k reasoning traces released alongside Sky-T1 model. Open-source reproduction of o1-style reasoning data.

OpenR1 (HuggingFace, Jan 2025)

Open-source reproduction of R1's training data pipeline. Includes synthetic math/code reasoning traces, training scripts, distillation recipe.

NuminaMath / Numina Math Olympiad

860k math problems with reasoning traces. The winner-dataset for the 2024 NeurIPS math olympiad. Excellent training data for math-capable models.

Persona Hub (Tencent, 2024)

arxiv.org/abs/2406.20094. 1B+ personas; each persona drives synthetic prompt generation. Diverse instruction-data source at scale.

SmolTalk (HuggingFace, 2024)

Curated 1M conversation dataset designed for small-model fine-tuning. Filtered for quality; permissive license.

The Stack v2 (BigCode, 2024)

3T cleaned permissive-license code, with deduplication, license metadata, and provenance. The foundation of open code-model training in 2024–2026.

Dolma v1.7 (Allen AI, 2024)

Multi-source 3T+ token pretraining corpus with explicit provenance and quality metadata. Reference for transparent open pretraining.

Watching for in 2026–2027

The 2025 community appetite is for: open verifiable-rewards reasoning datasets, open multimodal training datasets, open agentic-data datasets. Releases tracking these gaps are the ones most worth studying as they appear.


Synthetic data infrastructure: batch inference at trillion-token scale

Generating training data at trillion-token scale requires real infrastructure. The 2026 stack:

Batch inference engines

For generating training data, batch inference (high throughput, latency-insensitive) differs from production serving (low latency, predictable concurrency). Engines optimised for batch:

  • vLLM batch mode — same engine as serving but configured for max throughput.
  • TensorRT-LLM batch — NVIDIA's optimised inference engine; ~30–50% faster on batch than vLLM.
  • SGLang — Stanford's RadixAttention engine; particularly good for prefix-sharing across many similar prompts.
  • Custom CUDA / Triton — frontier labs write custom kernels for their specific generation patterns.

Throughput at batch scale (B200 GPU, 70B model, FP8, max batch):

  • Output tokens: 5,000–15,000 tokens/sec/GPU.
  • A 100k-GPU cluster: ~1 trillion tokens/day theoretical max.

Generation prompt orchestration

For diverse synthetic data, you need diverse prompts. Orchestration patterns:

  • Prompt template + parameter sweep. Templates parameterised by topic, difficulty, persona; sweep across millions of combinations.
  • Seed-grow. Start with a few thousand human-written seeds; have the generator expand each seed via Evol-Instruct or similar.
  • Web-grounded. Seed prompts from web documents; ground generation in real content.

Storage and processing

Generated outputs at trillion-token scale require petabyte storage. Stack:

  • Object storage (S3, GCS, Azure Blob) for raw outputs.
  • Apache Spark or Ray for distributed filtering and dedup.
  • Parquet format for downstream training-data consumption.

Deduplication infrastructure

MinHash on 1T-token corpus: 12–48 hours on a moderate Spark cluster. Semantic dedup: embed 1B documents at moderate dimensions, cluster with FAISS — 1–7 days on GPU cluster.

Quality classifier serving

Run quality classifier across the full generated set. Small classifier (DeBERTa-base) at FP16 on T4 or A10 GPU: ~5k docs/sec/GPU. 1B docs in 50 hours on 10 GPUs.

Cost summary

Generating 1T tokens of training data on bare-metal B200:

  • Compute: 1T / (10k tokens/sec/GPU) / 86400 = ~115 GPU-days
  • At $40/GPU-day bare-metal: ~$4,600 raw compute
  • Plus quality filtering pipeline: ~$2k
  • Plus dedup, storage, orchestration: ~$3k
  • Total: ~$10k for 1T tokens (rough) — versus the $1M+ training compute for a frontier model. Synthetic generation is much cheaper than training.

For API-based generation (no bare metal):

  • Frontier API rates: $2–$15 per M tokens output.
  • 1T tokens: $2M–$15M. Prohibitive at this scale.
  • Open-license API rates: $0.20–$2 per M tokens.
  • 1T tokens: $200k–$2M. Feasible but not cheap.

The frontier labs do this in-house with custom infrastructure; smaller labs use a mix of in-house generation for the bulk and API generation for high-quality slices.

Provenance and tracking

Every generated example should carry metadata:

  • Generator model + version.
  • Prompt template + parameters.
  • Filter pass/fail per filter step.
  • Eval scores (if scored).
  • Timestamp.

Required for reproducibility, eval contamination analysis, regulatory disclosure (EU AI Act), and debugging when a downstream model behaves badly.


Frontier lab pipelines: what we know about OpenAI, Anthropic, Google, Meta synthetic

What's publicly documented (and credibly rumored) about how the major labs produce training data.

OpenAI

Closed about pipeline specifics. Public observations: GPT-4o's training included synthetic data (acknowledged in the system card); o-series training relies heavily on verifiable-rewards synthetic data (math + code); OpenAI's Sora and image models trained on synthetic captioning at scale. The 2024 "Q*" / "Strawberry" rumors point to the reasoning-data pipeline that became o1.

Anthropic

Constitutional AI is the public-documented pipeline (arxiv.org/abs/2212.08073). RLAIF (RL from AI Feedback) — Claude critiques and revises its own outputs against a constitution. Synthetic preference data dominates Anthropic's training pipeline. Reasoning data (for thinking mode) likely synthesised by Claude Opus and distilled to Sonnet / Haiku.

Google

Documented use of TPU-scale synthetic data generation for Gemini training. The Gemini 1.5 paper documents distillation across model sizes. Deep Think training data likely uses verifiable-rewards math + code generation similar to R1's approach.

Meta

Llama 3 paper (Meta's 92-page tech report) documents substantial use of synthetic data in post-training. Includes synthetic preference data, synthetic code data, synthetic math data. Llama 4 (2025) doubled down on this approach. Meta's published recipes are among the most transparent in the industry.

DeepSeek

Most publicly transparent of all frontier labs. R1 paper documents the full reasoning synthetic pipeline: SFT cold-start, RL with verifiable rewards, distillation to smaller models. V3 paper documents the MoE training pipeline including synthetic data fractions.

Microsoft (Phi family)

Phi-1/2/3/4 papers explicitly document synthetic-data-dominant training. The Phi recipe — textbook-quality synthetic content + careful curation — has been replicated by the open community via Cosmopedia and is one of the most influential public pipelines.

Mistral, Cohere, xAI

Less public documentation. Mistral's papers occasionally reference synthetic data. Cohere has emphasized synthetic data in marketing but not detailed pipelines. xAI ships frequently with minimal documentation.

Open-community proxies

When frontier labs don't disclose, the open community provides proxies via Tulu 3 (Allen AI), Nemotron-CC (NVIDIA), Cosmopedia (HuggingFace), OpenAssistant. These pipelines collectively document what serious synthetic data engineering looks like.

Common patterns across labs

  • Strong frontier model as generator + filter.
  • Verifiable rewards where applicable (math, code).
  • Multi-stage filtering (programmatic, then classifier, then LLM-as-judge).
  • Mixing with curated real data to prevent collapse.
  • Iterative bootstrapping (last gen's model produces this gen's data).
  • Heavy investment in pipeline tooling (custom Spark, custom inference for batch generation).

Quality filtering at scale: classifiers, perplexity, MinHash, semantic dedup

A working synthetic-data pipeline spends more on filtering than on generation. Methods that scale.

Substring and n-gram dedup

Detect exact and near-exact duplicates via hashing. 13-gram MinHash is the standard. Run before any downstream filtering — dedup typically reduces dataset size by 20–50%.

Implementation: spark + hashing (DCLM pipeline, FineWeb pipeline). At billion-document scale, expect 4–24 hours on a moderate Spark cluster.

Classifier-based quality filtering

Train a small classifier (DeBERTa-base, distilled BERT) to predict whether a document is "high quality." Training signal: hand-label 5–20k documents; let the classifier extrapolate.

FineWeb-Edu's classifier was Llama-3-70B labelling 500k examples for "educational value 0–5", then training a small classifier to mimic. Cheap to apply at scale; meaningful quality lift.

Perplexity filtering

Score documents under a small reference model. Very low perplexity = repetitive/boilerplate; very high perplexity = garbled. Keep the middle band. Simple, scalable, captures structural quality issues.

Semantic dedup

Embed documents with a sentence-transformer; cluster by similarity; keep one representative per cluster. Catches paraphrased duplicates that n-gram dedup misses.

Cost: embedding 1B documents at modest dimensions is feasible on a moderate GPU fleet (~$10k–$50k compute). Critical for synthetic pipelines where the generator produces semantically-near-duplicate outputs.

Programmatic verification

For verifiable domains (math, code, multiple-choice), check correctness directly. Drop incorrect generations. Often produces 30–60% rejection rate on first-pass generation; the kept set is gold.

LLM-as-judge filtering

For non-verifiable quality dimensions (educational value, style fit, instruction adherence), use an LLM judge. Expensive at scale; usually applied after cheaper filters as the final pass.

Pipeline ordering

Typical order, cheapest-first:

  1. Exact dedup (hashing).
  2. N-gram MinHash dedup.
  3. Length filters (drop too-short, too-long).
  4. Language filter (drop non-target languages).
  5. Perplexity filter.
  6. Classifier-based quality filter.
  7. Semantic dedup.
  8. Programmatic verification (if applicable).
  9. LLM-as-judge final pass (sampled or full).

A 1B-document raw pipeline might produce 50–200M post-filter examples. The yield ratio depends on the generation quality and the filter strictness.


Self-improvement loops: bootstrapping, STaR, iterative DPO

Models that improve themselves are the 2024–2026 frontier. Specific patterns.

STaR (Self-Taught Reasoner, Zelikman et al., 2022)

Generate reasoning traces; keep traces leading to correct answers; fine-tune on kept traces; iterate. The grandparent of modern reasoning bootstrapping.

Self-Rewarding LLMs (Yuan et al., 2024)

Same model generates responses and judges them. Two heads on the same backbone; iteratively trained with the model's own preference data. Bypasses the need for a separate reward model.

Iterative DPO

After a DPO round, generate fresh preference data using the improved model; re-run DPO. Each iteration narrows the gap to optimal preferences. Common pattern in production post-training.

Constitutional AI loop

Anthropic's documented pattern: model generates a response; critiques it against a constitution; revises; the (response, critique, revision) triples become training data. The model "argues with itself" toward better outputs.

AlphaProof / AlphaGeometry (DeepMind, 2024)

Specialised self-improvement: a model generates math proof attempts; a formal verifier (Lean) checks correctness; the model trains on verified proofs. Achieved IMO silver-medal performance. The verifier-in-the-loop pattern at its purest.

Why self-improvement works

Two underlying mechanisms:

  1. Verification is easier than generation. The model can recognize a good output even when it doesn't reliably generate one. Use that gap to filter.
  2. Diverse sampling explores capability. Sample many candidates; the best of N is better than the median; train on the best of N; the median improves; repeat.

Where self-improvement plateaus

  • Calibration. Self-judges have biases; without external grounding the model can become confident in wrong patterns.
  • Distribution shift. Pure self-improvement narrows the data distribution. Mix in external data.
  • Verifier brittleness. A bad verifier teaches bad lessons. The verifier must be more reliable than the generator on the dimension you're optimising.

The 2026 production pattern

Iterative self-improvement is now standard in frontier post-training. Cycle: generate, filter, train, evaluate, repeat. Each cycle typically yields 1–3 percentage point improvements on target benchmarks; diminishing returns hit after 3–5 cycles for most pipelines.


Synthetic data for multimodal training

Synthetic data extended beyond text to image, audio, video.

Vision: synthetic image captions

LLaVA, BLIP, etc. trained on synthetic image-caption pairs. Pipeline: take an image; generate a caption with a strong VLM; train a student to mimic. The dominant paradigm for open multimodal models.

Vision: GPT-4V annotation

Captioning data at scale: prompt GPT-4V or Claude vision on millions of images; collect detailed captions; train new VLMs on the captions. The 2024 default for open-community VLM data.

Audio: synthetic ASR data

Generate text → TTS to audio → train ASR on the (audio, text) pair. Used to bootstrap ASR for low-resource languages.

Audio: synthetic dialog audio

Generate dialog text → render via diverse TTS voices → train multimodal models on (audio, text). Used in Whisper-successor training.

Video: synthetic captions and segments

Strong video VLMs (Gemini, GPT-4o) annotate millions of clips with structured descriptions. The output trains smaller open VLMs.

Cross-modal synthetic alignment

The frontier 2026 challenge: synthesise data that aligns multiple modalities (image + audio + text describing the same event). Used for Sora-style video models and multimodal reasoning.

Quality control specifics

  • Vision: check generated captions against image content (CLIP similarity).
  • Audio: spot-check generated audio for naturalness; train classifier to detect TTS artifacts.
  • Cross-modal: ensure modalities actually align (text describes what's in image, etc.).

The cost crossover: when does generating beat buying labels?

Concrete math on synthetic vs human labelling.

Per-label cost benchmarks

  • Crowdsource general labels (Amazon MTurk, Scale crowd): $0.10–$2 per label.
  • Crowdsource quality labels (curated workforce): $1–$10 per label.
  • Domain expert labels (lawyers, doctors, finance pros): $20–$200 per label.
  • Synthetic generation (strong model, no human review): $0.001–$0.01 per example.
  • Synthetic generation + human spot-check (10% sampled review): $0.05–$0.20 per example effective.

When synthetic wins

  • General instruction-tuning, conversational data, creative writing examples: synthetic clearly wins.
  • Math, code with programmatic verification: synthetic dominates (verification is cheap).
  • Domain-specific where domain experts cost $100+/label: synthetic + expert review on 5–10% saves >90% vs full human labelling.

When human wins

  • Highly subjective tasks (creative quality, cultural appropriateness): humans add value beyond what model judges capture.
  • Adversarial / safety labels: humans surface attack patterns models miss.
  • High-stakes / regulated domains (medical, legal): human review required for compliance.
  • Initial label sets for tasks the model hasn't seen — humans must define the gold standard before synthetics can amplify.

The hybrid pattern

Most production pipelines combine: humans define the rubric and label 1–5% as gold standard; synthetic generates the bulk; humans spot-check a sample of synthetic; humans review difficult cases identified by the filter pipeline.

This pattern scales 100× cheaper than full-human while maintaining quality.


The bottom line

The data wall is real, and the lab that wins the next generation will not be the one with the largest web crawl. It will be the one with the best generator-plus-verifier pipeline. The biggest lever is the filter: a mediocre generator behind a strong verifier produces excellent training data, while a strong generator behind a weak filter produces fluent slop that quietly degrades the student. Treat synthetic data as a controlled experiment with three knobs — prompt diversity, generator quality, filter strictness — and budget engineering time accordingly.

Five takeaways to leave with:

  • Generation is cheap and getting cheaper. Filtering, deduplication, and distribution shaping are where the engineering value lives.
  • Verifiable domains (math, code, structured outputs) are where synthetic data is essentially solved; verifier-free domains still require human-in-the-loop calibration.
  • Model collapse is real but is a curation failure, not a fundamental ceiling. Mix in real data, monitor for distribution narrowing, and re-evaluate often.
  • Distillation captures 70–95% of teacher quality at a fraction of inference cost; for the production tier below frontier, it is almost always the right move.
  • Synthetic data is not a capability multiplier — students are still bounded by parameter count and architecture. It is a capability transfer mechanism.

For neighboring topics: reasoning model serving is the demand side that makes distillation economically urgent, and post-training (RLHF, DPO) is where the synthetic-data and RL-polish stages compose.


FAQ

Is synthetic data legal? Generally yes, but model-output terms-of-service for frontier APIs often restrict using their outputs to train competitor models. Read your contracts.

Will model collapse happen? Not if you mix synthetic with real data and use quality filtering. The headline papers describe degenerate setups that production doesn't replicate.

Can I use a small open model to generate data? Yes, but quality is bounded by the generator. For specialized domains where small models do well, this works. For frontier-quality training, you need a frontier-quality generator.

Is synthetic data cheaper than human-labeled data? Much cheaper per example. Pipeline engineering is expensive upfront but amortizes over billions of examples.

Does synthetic data make models worse at "real-world" tasks? Only if the synthetic distribution diverges from the real one. Quality pipelines control for this. The risk is real; the solution is mixing and audit.

Should I distill or train from scratch? For smaller deployment models: distill from a stronger teacher. Almost always wins on quality-per-compute. From-scratch is only better when the teacher's biases are unacceptable.

Can I distill a reasoning model into a non-reasoning architecture? You can train a non-reasoning model on reasoning traces. The student learns to produce reasoning-style outputs but may not match the teacher's depth. See reasoning model serving guide.

How much synthetic data is too much? Workload-dependent. Common production mixes range from 10% synthetic (conservative) to 70%+ (aggressive). Watch for distribution drift in evals.

What's the difference between response distillation and logit distillation? Response distillation trains the student on the teacher's emitted text via standard SFT cross-entropy. Logit distillation trains the student to match the teacher's full token-level probability distribution via KL divergence. Logit distillation captures more information per example but requires full logit access (closed APIs don't provide this) and inline teacher inference during training (expensive). Most production LLM distillation in 2026 is response distillation; logit distillation remains common for encoder-style models like DistilBERT (Sanh et al., 2019 — arXiv:1910.01108).

Is Magpie better than Self-Instruct? For harvesting alignment data from an instruct-tuned teacher, yes — Magpie's trick of priming with only the assistant template lets the teacher generate both the prompt and the response from its own posterior, producing more diverse and less seed-biased data than Self-Instruct's seed-evolution approach. For domain-specific generation where the seeds carry important constraints, Self-Instruct-style approaches remain useful.

What's "on-policy" synthetic data and why does it matter? On-policy data is generated by the same model being trained (or a very close checkpoint). The signal it provides is shaped by what the current model actually says, which makes it ideal for capability shaping — the gradients land on the policy's actual outputs, not on a foreign distribution. Off-policy data (from a teacher) is better for capability transfer. Most production pipelines mix both.

Are there legal risks with distilling from a closed API? Yes. Most frontier API terms of service explicitly prohibit using outputs to train competitor models. Some labs are more permissive than others. Several public open-source models have included such data and have faced both legal and reputational consequences. The safer alternative for commercial distillation is using open-weight teachers (Llama, Qwen, DeepSeek, Mistral families) whose licenses permit synthetic-data generation.

Can synthetic data improve a base model's pretraining, not just post-training? Yes — this is the Phi family's central thesis (Gunasekar et al., 2023 — arXiv:2306.11644; Abdin et al., 2024 — arXiv:2404.14219). Synthetic "textbook-quality" data during pretraining can substitute for a substantial fraction of raw web tokens with better per-token capability gains. Most frontier labs now use synthetic data in mid-pretraining stages, not just post-training.

How does synthetic data interact with post-training and RLHF? Synthetic data is the substrate for most modern post-training. SFT data is increasingly synthetic. Preference pairs are increasingly AI-generated. Rejection-sampling fine-tuning is itself a synthetic-data loop. RLVR uses verifier-filtered synthetic rollouts as training signal. The two areas are not separable; the post-training stack is largely a synthetic-data pipeline with an RL outer loop attached.

Should I worry about copyright in synthetic-data outputs? Less than in raw web data, but it depends on the teacher's training. If the teacher memorizes copyrighted text and reproduces it, the synthetic outputs can contain that text. Filtering for near-verbatim matches against known copyrighted corpora is a standard step in pipelines that care about this. Generation prompts that explicitly request copyrighted content should be rejected upstream.

Why is the Phi family's synthetic-data approach controversial? The Phi papers report dramatic capability-per-parameter improvements from synthetic "textbook" data, but follow-up evaluations show Phi models often underperform their headline benchmarks on out-of-distribution tasks. The community read is that heavy synthetic-data pretraining can produce a model that excels at benchmark-shaped questions while being narrower than the parameter count suggests. The lesson is not that synthetic data is bad; it is that benchmark composition has to be guarded carefully when synthetic data dominates the training mix.

How does rejection sampling compare to full RL for synthetic-data generation? Rejection sampling — sample N, keep the best — recovers 70–90% of full RL's quality gain at 10–30% of the engineering cost. It's the workhorse of frontier post-training data generation. Full RL (PPO, GRPO) extends the ceiling by allowing the policy to discover behaviors outside its current support, but the marginal gain over rejection sampling is usually small once the rejection-sampling budget is well-tuned. Most production pipelines run rejection sampling continuously and reserve full RL for capability frontiers. See post-training: RLHF, DPO for the algorithm side.

What's the difference between distillation and knowledge distillation? In the LLM literature, the terms are used roughly interchangeably. Hinton et al.'s original "knowledge distillation" referred specifically to soft-label / logit distillation. Modern LLM "distillation" usually means response distillation (training on teacher outputs as SFT data). When a paper refers to "knowledge distillation" in the strict Hinton sense, expect KL-divergence loss against full teacher logits.

How should I think about synthetic data for RAG systems? RAG-specific synthetic data is a fast-growing subspecialty. Pipelines generate (query, retrieved-documents, answer) triples by sampling questions a strong model would plausibly produce against a corpus, retrieving with a baseline retriever, and having a teacher produce the grounded answer. The student is then trained to use retrieved context correctly. Both query generation and answer generation can be synthetic; filtering for groundedness (no hallucinated facts) is the main quality control.

Does synthetic data help small models more than large ones? Empirically yes, for capability-shaping tasks. Smaller models have more room to grow on benchmark-shaped tasks and benefit more from focused synthetic data per parameter. For frontier-scale pretraining, synthetic data helps but doesn't change the trajectory as dramatically. The Phi family demonstrates the small-model case; the frontier labs' continued investment in synthetic data demonstrates the large-model case.

What's the right ratio of synthetic to real data? There is no universal answer. Common 2026 production ratios: 20–40% synthetic in pretraining mid-stages, 50–80% synthetic in post-training, 95%+ synthetic in capability-specific fine-tuning (math, code). The ratio that works depends on the diversity of your real data, the quality of your synthetic pipeline, and the workload. Audit with held-out evaluations that include out-of-synthetic-distribution prompts.

How is synthetic data related to eval infrastructure? Closely. The same generation-filter-validate machinery used to produce training data is used to produce evaluation data — adversarial probes, capability-specific benchmarks, calibration sets. The two pipelines often share infrastructure but should never share data; eval contamination is the most damaging failure mode of conflating them. Strict provenance tracking keeps them separate.

What's the cheapest way to generate 100k high-quality math reasoning traces? Use DeepSeek R1 or QwQ-32B via API (both are MIT/Apache licensed and explicitly distillation-friendly). At ~$0.55/$2.19 per M tokens for R1 hosted, 100k traces × 5k tokens each = 500M tokens ≈ $1,100. Self-host R1-Distill-32B if you want to amortize over millions of traces: ~$0.10/M output tokens at scale. Quality filter ruthlessly afterward — drop traces with wrong final answer (programmatic check), drop too-short, drop language-mixed.

Can I distill a closed-source frontier model legally? Technically possible; legally fraught. OpenAI, Anthropic, Google ToS prohibit "developing competing models" using their outputs. Enforcement is unclear (no public lawsuit specifically on this) but the risk is real for commercial deployment. Use open-license teachers (R1, Qwen, Llama) for legally clean distillation. If you must distill from a closed API, get legal review.

How big should my SFT dataset be? For a strong post-training fine-tune of a 7B–70B model: 50k–500k diverse, high-quality examples is the modern sweet spot. Tulu 3 used 960k; many strong fine-tunes use 100k–200k. Past 1M, diminishing returns dominate unless the data is exceptional. Quality filtering is the lever — 100k filtered beats 1M unfiltered.

Is the "scale by 10x" assumption still valid for synthetic data? Less so than for web data. Synthetic data has diminishing returns to scale faster than web data. The 2026 emphasis has shifted to quality and verification: 10x more verified-correct math traces helps more than 10x more raw generated math traces. The right scaling axis is "verified correct examples," not "tokens."

How do I know if my synthetic pipeline is degrading the model? Hold out a real-data eval set the model has never seen (and the generator has never seen). Train with increasing synthetic fractions; measure on the held-out set. If quality drops at higher synthetic ratios, you're hitting model-collapse territory. Mix in real data to recover; refine your synthetic quality filters.

What's "verifiable-rewards" data, and why is it special? Data where correctness can be checked programmatically: math problems with numeric answers, code with passing tests, multiple-choice with verified-correct labels. Special because you can scale generation without scaling human annotation — generate millions of candidates, keep only the verified-correct ones. R1's training relied heavily on this. Limits: only applies to verifiable domains (math, code, structured Q&A); much of human knowledge isn't verifiable this way.

Are there safety risks specific to synthetic data? Yes. Synthetic data can amplify biases present in the generator (a slightly-biased teacher distills into a more-biased student). Synthetic data can encode harmful patterns if the generation pipeline isn't safety-checked. Mitigation: include safety filtering as a step in every synthetic pipeline; periodically eval the resulting model on safety benchmarks; don't trust the generator's safety alone.

How does synthetic data affect multilingual capability? Substantially. Most synthetic-data pipelines are English-biased (GPT-4, Claude generate higher-quality English than other languages). Training heavily on English synthetic data without correction degrades multilingual performance. Solution: generate synthetic data in target languages, often via translation + native-speaker review.

Can I use a small distilled model as the teacher for an even smaller one? Yes, but with caveats. Distillation chains lose information at each step. A distilled 32B teaching a 7B student is fine if the 32B is strong; the 7B inherits much of the 32B's capability. A 7B teaching a 1B inherits less. The general rule: each distillation step adds noise; chain ~2 steps maximum before generating from the original frontier teacher again.

What's the relationship between distillation and quantization? Complementary. Distillation reduces parameter count; quantization reduces precision per parameter. A distilled small model + quantization is the standard production stack for cheap inference. Order: distill first (controls model capability), then quantize (controls compute cost). See quantization tradeoffs.

Is there a "synthetic data hygiene" checklist? Yes. (1) Deduplicate aggressively (MinHash + semantic). (2) Filter for quality (classifier-based or programmatic). (3) Detect and remove contamination from your eval sets. (4) Mix in real/human data (10–50% prevents collapse). (5) Audit for bias amplification. (6) Track provenance — what generator produced what example. (7) Hold out a never-seen real-data set for sanity checks. (8) Refresh quarterly; static synthetic data ages.

Will the legal pressure (NYT v. OpenAI, etc.) limit synthetic data? Possibly. If courts rule that training on copyrighted text without license is infringement, every frontier lab faces costs. Mitigations include licensing data (Apple, Adobe, Reddit have struck deals), synthetic data (no copyright on AI-generated text), and opting out via robots.txt (no legal weight but industry norm). The 2026–2027 trajectory will be partly shaped by litigation outcomes.

What replaces the web when we run out of useful web data? Three sources: (1) synthetic-and-verified (the R1 / Phi approach), (2) high-value licensed data (publishers, professional content, code), (3) interaction data (consenting user conversations, demonstrations). The "running out" framing is somewhat overstated — useful web data is still being created, just at a slower rate than model demand. The mix will shift toward (1) and (2) through 2026–2028.

How does Constitutional AI relate to synthetic preference data? Constitutional AI (Bai et al., 2022) is one of the earliest large-scale synthetic preference pipelines. The model generates a response, critiques it against a written "constitution" of principles, revises, and the revised version becomes the preferred response. The (original, revised) pairs form preference data used to train a reward model and policy. Anthropic uses this extensively; the technique scales without human labellers.

What is Magpie's "header-only" trick? Magpie (Xu et al., arXiv:2406.08464) discovered that prompting an instruction-tuned model with only the assistant template header (no user input) causes the model to generate a plausible user query followed by its own response. The resulting (query, response) pairs are higher diversity than Self-Instruct and cheaper. Magpie data trains 7–8B open-weight models competitive with much larger closed models on instruction following.

Can I distill Anthropic Claude legally? The Anthropic API terms prohibit using outputs to develop competing models. Distilling Claude into a model you'll sell as a competing chatbot is likely a ToS violation; using Claude for research or internal use cases is generally fine. Get legal review before commercial distillation.

What's the deal with the "model collapse" papers? Shumailov et al. (arXiv:2305.17493) showed that recursive training on AI-generated data degrades quality over generations in idealised settings. Subsequent work (Gerstgrasser et al., arXiv:2404.01413) showed that mixing synthetic with real data avoids collapse. In production, frontier labs use synthetic data heavily without collapse because they (a) mix with real data, (b) quality-filter aggressively, (c) use diverse generators. Model collapse is a real risk in unconstrained setups; a non-issue in disciplined ones.

Is there a standard data card for synthetic datasets? Hugging Face introduced data cards as a standard documentation format. For synthetic datasets specifically, useful fields: generator model and version, prompt template, filtering steps, acceptance rate, contamination check methodology, known limitations. The Tulu 3 and OpenMathInstruct-2 data cards are good public examples.

How do I generate synthetic data for low-resource languages? Three patterns: (1) translate English synthetic data with strong MT (NLLB, Madlad-400, GPT-5) and post-edit; (2) generate directly in target language using a model with strong support (Llama 4, GPT-5, Qwen3 have decent Vietnamese / Tagalog / Swahili); (3) human-bootstrap a seed dataset, then synthesise scaled-up with persona conditioning. Quality is usually substantially below English; budget for native-speaker review.

What's the role of dedup in synthetic-data pipelines? Critical. Synthetic data has high duplicate rates because LLM generators have favourite phrasings and concepts. Without dedup, you train on a narrower distribution than you think. Standard practice: exact match dedup, MinHash near-duplicate dedup, then semantic dedup (cluster embeddings, sample one per cluster). Aggressive dedup typically removes 30–60% of raw generated data.

Can I distill multimodal models? Yes. Multimodal distillation typically transfers vision-language capabilities through paired (image, caption / Q&A) data generated by a strong teacher. Llava family is partly built this way. The encoder is often kept frozen; only the projector and LLM portion distil. Quality follows similar patterns to text distillation.

What does Tulu 3 do differently? Tulu 3 (Lambert et al., AI2, 2024) is AI2's fully-open post-training recipe. It combines a 940k-example SFT mix (including synthetic from GPT-4o, GPT-4-turbo, Claude), DPO with synthetic preferences, and RLVR (RL with verifiable rewards) on math/code. The full pipeline, data, and weights are released — the closest public match to a frontier post-training recipe. Replicate it as a baseline.

Are there datasets specifically for tool-use training? Yes. ToolBench, API-Bank, ToolLLaMA, Glaive-function-calling, NexusRaven datasets all provide synthetic tool-use traces. Quality varies; ToolBench is large but noisy, Glaive-function-calling is cleaner. For agent training, ReAct-style trajectories from frontier models (GPT-4o, Claude with tool use) on real APIs produce the highest quality but cost the most.

What's RFT (Rejection-Sampling Fine-Tuning) in detail? RFT: for each prompt, sample N candidate completions from a model; filter to correct ones using a verifier; SFT on the filtered set. Iterate. Llama 2 and Llama 3 used RFT extensively. It's the simplest self-improvement loop; cheap to implement; works well when verifiers are reliable. Modern variants (ReST, GRPO) add reward modelling on top.

Does data quality matter more than quantity? For post-training: yes, by a wide margin. 50k carefully curated examples beat 5M unfiltered for SFT. For pretraining: less so — scale still matters, but the slope flattened and quality became the lever. The 2026 consensus: quality dominates after about 1T tokens of competent pretraining data; before that, quantity still wins.

Is contamination my biggest worry with synthetic data? For benchmark scores yes; for production quality less so. Contamination inflates benchmark scores but doesn't directly hurt user experience. Production quality is hurt by distribution narrowing, generator bias amplification, and verifier failures. Audit both: contamination via MinHash on benchmark texts, production quality via held-out real-data evals.

What's the typical compute cost of training a 7B distilled model? Llama-3.1-8B-class training compute is roughly 1.4 × 10^23 FLOPs. At 50% MFU on H100s, that's ~3,000 H100-days = 72k H100-hours = ~$200k at on-demand cloud or ~$80k on owned hardware. Distillation (SFT only on 100–500k examples) is far cheaper — typically $5–50k total compute. Post-training is much smaller than pretraining; most teams distil on existing pretrained bases.

Can I create synthetic data with smaller open models like Mistral 7B? Yes for some tasks; quality is bounded by the generator's own capability. Small models work well for: simple reformatting, structured extraction, basic translation, classification. They fail for: complex reasoning, deep factual content, multi-step verification. The right size of generator depends on what's being generated; sometimes Mistral 7B is fine, sometimes you need GPT-5.

What's the future of synthetic data — 2027 and beyond? Three trajectories: (1) verifiable rewards expand beyond math/code into more domains (science, planning) via richer verifiers; (2) self-improvement loops mature, with multi-step bootstrap producing models that surpass their initial teachers on verifiable tasks; (3) hybrid synthetic-real pipelines become standard — synthetic data for volume and capability shaping, real data for distribution coverage. Pretraining mixes will continue to shift toward synthetic; post-training is already mostly synthetic.

Is open-source synthetic data catching up to closed-lab quality? Partly. For verifiable domains (math, code), open synthetic data (OpenMathInstruct-2, OpenCoder data, R1 traces) matches or exceeds what closed labs use publicly. For open-ended quality (creative writing, nuanced helpfulness), closed labs maintain an edge due to proprietary human-feedback data and longer-running RLAIF loops. The gap is narrowing.


Glossary

  • Bootstrap — iterative self-improvement loop where the model generates training data for its successor.
  • Distillation — training a smaller student model to mimic a larger teacher.
  • Hard distillation — training on teacher-generated text as SFT data.
  • Model collapse — degradation from training on increasingly synthetic data.
  • Persona data — synthetic multi-turn dialogues.
  • Self-instruct — generating instruction-following examples from seed examples plus an LLM.
  • Soft distillation — matching teacher's output distribution via KL divergence.
  • STaR — Self-Taught Reasoner; bootstrap loop for reasoning capability.
  • Synthetic data — model-generated training examples.
  • Verifier — automatic check for example correctness (test runner, math checker, etc.).

References

  • Self-Instruct — Wang et al., 2022. "Self-Instruct: Aligning Language Models with Self-Generated Instructions." arXiv:2212.10560.
  • Distilling the Knowledge in a Neural Network — Hinton, Vinyals, Dean, 2015. arXiv:1503.02531. The foundational distillation paper.
  • STaR — Zelikman et al., 2022. "STaR: Bootstrapping Reasoning With Reasoning." arXiv:2203.14465.
  • The Curse of Recursion — Shumailov et al., 2023. "The Curse of Recursion: Training on Generated Data Makes Models Forget." arXiv:2305.17493. The model-collapse paper.
  • Textbooks Are All You Need — Gunasekar et al., 2023. arXiv:2306.11644. Microsoft's phi-1 paper; demonstrates synthetic-textbook approach.
  • Alpaca — Taori et al., 2023. Stanford project demonstrating Self-Instruct on Llama. crfm.stanford.edu/2023/03/13/alpaca.html.
  • Orca — Mukherjee et al., 2023. "Orca: Progressive Learning from Complex Explanation Traces of GPT-4." arXiv:2306.02707. Reasoning distillation.
  • DeepSeek-R1 — DeepSeek-AI, 2025. arXiv:2501.12948. Distillation traces from R1 to smaller models.
  • WizardLM Evol-Instruct — Xu et al., 2023. "WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions." arXiv:2304.12244. Instruction-evolution approach.
  • TinyStories — Eldan, Li, 2023. arXiv:2305.07759. Demonstrates synthetic-data training for very small models.
  • DistilBERT — Sanh et al., 2019. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv:1910.01108. Foundational logit-distillation for transformers.
  • Magpie — Xu et al., 2024. "Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing." arXiv:2406.08464.
  • Phi-3 — Abdin et al., 2024. "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone." arXiv:2404.14219. Synthetic-data-heavy small models.
  • Constitutional AI — Bai et al., 2022. "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073.
  • Self-Rewarding Language Models — Yuan et al., 2024. arXiv:2401.10020.
  • GPT-4 System Card — OpenAI, 2023. cdn.openai.com/papers/gpt-4-system-card.pdf. Public discussion of synthetic red-team data and safety pipeline.

Persona-driven generation: Microsoft Persona Hub

Microsoft Persona Hub (Chan et al., arXiv:2406.20094) is a persona-driven approach to scaling synthetic data. The idea: maintain a library of ~1B distinct personas (each a short description of a person with attributes, interests, profession, expertise), then prompt a generator LLM with a persona + a task. The persona conditions the output, producing diverse data points that wouldn't emerge from naive sampling.

Why personas matter

Without persona conditioning, an LLM generating instructions for "math questions" produces a narrow distribution centred on its training data's modal math questions. With persona conditioning ("a high-school physics teacher", "a financial analyst at a hedge fund", "a 9-year-old curious about space"), the same generator produces wildly different questions that cover much more of the input space.

Persona Hub demonstrates diversity gains across:

  • Math instruction generation (MATH benchmarks)
  • Knowledge-intensive QA
  • Tool-use instruction generation
  • Creative writing prompts
  • Game and puzzle generation

Building a persona library

A persona library can be created in two ways:

  1. Mining: extract personas from web text using a small classifier ("does this text describe a specific person's situation?"). Yields broad, naturally occurring personas.
  2. Synthesising: prompt a generator LLM to invent personas with structured attributes (occupation, expertise, hobbies, communication style). Yields cleaner, more diverse personas; risks artificial sterility.

Microsoft's release includes 200k seed personas; full pipeline scaling to 1B+ is described in the paper.

Practical use

For teams building synthetic-data pipelines: a few thousand personas yield most of the diversity benefit. Sample a persona per generation; condition the prompt on it; filter for novelty. Quality lift over flat sampling is 10–25% on diverse downstream evals.

Limitations

Personas don't fix verifier-free tasks. If the generated example needs to be correct (math, code), persona diversity doesn't help — only verification does. Personas help most for open-ended tasks where the target is "a representative sample of all possible such tasks."


Math-specific synthetic data: OpenMathInstruct, MetaMath, MathPile

Math is the canonical synthetic-data success story because answers verify automatically and the gap between web-data and synthetic-data quality is largest.

OpenMathInstruct-1 and OpenMathInstruct-2

OpenMathInstruct (Toshniwal et al., NVIDIA, 2024) — a 1.8M example dataset of math problems with code-augmented chain-of-thought solutions, generated by Mixtral-8x7B and filtered for correctness against ground truth.

OpenMathInstruct-2 (Toshniwal et al., 2024) scales to 14M examples generated by Llama 3.1 405B. Used to train OpenMath-Llama models that approach frontier math performance.

MetaMath

MetaMath (Yu et al., arXiv:2309.12284) bootstraps math problems by question-rewriting (forward/backward reasoning, augmented variants) on GSM8K and MATH seeds. ~395k examples; modest by 2026 standards but pioneered the rewrite-as-augmentation approach.

MathPile

MathPile (Wang et al., arXiv:2312.17120) — 9.5B-token corpus of mathematical content scraped from textbooks, papers, Stack Exchange, and forums; not synthetic, but the curated foundation many synthetic pipelines build on.

NuminaMath

NuminaMath (2024) — competition-grade math problems and reasoning traces. NuminaMath 1.5 includes 860k Olympiad-style problems; used in DeepSeek-Prover-V2 and several open math models.

Recipe

A competitive open math model recipe in 2026:

  1. Pretrain on MathPile + general web data.
  2. Mid-train on OpenMathInstruct-2 + NuminaMath synthetic CoT data.
  3. RL with verifiable rewards on competition problems (GRPO or PPO).
  4. Final SFT on a small high-quality eval-like dataset.

End result: 7B-class models scoring 70%+ on MATH and 50%+ on AIME. Reasoning models (R1-Distill-Qwen-7B) score even higher.


Code-specific data: DeepSeek-Coder, StarCoder2, OpenCoder

Code data follows a similar pattern: web-scraped code as a base, synthetic augmentation for instruction following and chain-of-thought.

DeepSeek-Coder corpus

DeepSeek-Coder (Guo et al., 2024) trained on 2T tokens of code from 87 languages. Synthetic augmentation includes generated unit tests, generated docstrings, and synthetic instruction-tuning data derived from GitHub commits + LLM-generated descriptions.

StarCoder2 and The Stack v2

StarCoder2 (BigCode, 2024) trained on The Stack v2 — 4T tokens of permissively-licensed code from 658 languages. Open data, open weights, used in many code-specialised models.

OpenCoder

OpenCoder (Inf-tech, 2024) — fully open code model with documented data pipeline including synthetic instruction data generated by Qwen2-72B and DeepSeek-Coder-V2.

Synthetic code instruction pipelines

The canonical recipe:

  1. Sample a function or short program from a code corpus.
  2. Generate a natural-language description of what the code does (LLM).
  3. Generate one or more variant prompts that would lead to that code.
  4. Generate unit tests; verify the code passes (code execution).
  5. Keep only verified (prompt, code, tests) triples.

Variants: WizardCoder uses Evol-Instruct to evolve code prompts; Code-Alpaca uses Self-Instruct seeded with code tasks; OpenCodeInterpreter generates multi-turn debug traces.

Repo-level data

For agentic coding (Claude Code, Devin, Cursor agent mode), repository-level training data — not just function-level — matters. SWE-bench, SWE-Gym, and similar provide hundreds of thousands of real GitHub issues with PRs as training data for agent behaviour.


Contamination detection in depth: substring, MinHash, perplexity, BLEU

Test-set contamination is the single biggest threat to reported benchmark scores. Detection techniques:

Exact substring matching

Search the training corpus for verbatim test-set strings. Cheap, catches the dumb case. Defeated by minor rewordings.

MinHash and LSH

MinHash (Broder, 1997) compares document similarity via hashed k-shingles. Effective at finding near-duplicates with small edit distances. LSH (Locality-Sensitive Hashing) scales MinHash to billions of documents. The standard contamination check in most labs.

Perplexity anomaly

If a model has memorised a test example, its perplexity on that example is anomalously low. Compare per-example perplexity on the test set vs a clean holdout from the same distribution. Statistically significant low-perplexity outliers indicate contamination.

BLEU and ROUGE matching

For text-generation tasks, BLEU or ROUGE between generated outputs and reference outputs over the test set. Suspiciously high scores on specific examples suggest memorisation.

Model fingerprinting

Embed sentinel phrases ("benchmark canary tokens") in test sets at creation time; check if models output them. Used by the BIG-bench team and several benchmark organisations.

Cross-benchmark consistency check

A model trained without contamination should perform similarly on a benchmark and a paraphrased version. Large gaps indicate the original benchmark text was in training data.

What to do if you find contamination

  • Re-train without the contaminated data and re-evaluate (expensive but cleanest).
  • Hold out a fresh paraphrased test set and report on it.
  • Note the contamination in the model card.
  • Adopt continuous-benchmarks approach (LiveBench, AIME competitions held after model release) to side-step the problem.

Contamination rates in the wild

A 2024 analysis (Bordt et al., arXiv:2402.11814) found contamination of standard benchmarks in major open and closed models ranging from 1–30% depending on the benchmark and the model. Treat headline scores on heavily-published benchmarks (MMLU, HellaSwag, GSM8K, HumanEval) with skepticism; weight LiveBench and dynamic benchmarks higher.


R1-Distill model card deep dive: AIME numbers, size scaling

DeepSeek-R1's distillation produced a family of smaller reasoning models. Public numbers from the R1 paper:

Model AIME 2024 (pass@1) MATH-500 GPQA Diamond LiveCodeBench
DeepSeek-R1 (671B MoE) 79.8 97.3 71.5 65.9
R1-Distill-Qwen-32B 72.6 94.3 62.1 57.2
R1-Distill-Qwen-14B 69.7 93.9 59.1 53.1
R1-Distill-Llama-70B 70.0 94.5 65.2 57.5
R1-Distill-Qwen-7B 55.5 92.8 49.1 37.6
R1-Distill-Llama-8B 50.4 89.1 49.0 39.6
R1-Distill-Qwen-1.5B 28.9 83.9 33.8 16.9

(Numbers from the DeepSeek-R1 technical report; verify on the original paper for exact reproduction settings.)

What R1-Distill demonstrates

  • Reasoning capability transfers via SFT on long chain-of-thought traces.
  • Quality scales with parameter count but smaller models retain much of the capability — Qwen-32B distilled gets 91% of R1's MATH-500 and 91% of R1's AIME.
  • The technique is reproducible: third-party R1-style distillations (e.g., based on QwQ-32B or open R1 reproductions) appeared within months of the paper.

Distillation method

R1-Distill is response-only SFT: collect long reasoning traces from R1 on math, code, and reasoning problems; SFT the student on those traces. No RL on the student. Simple to implement.

Limitations

  • Distillation passes the teacher's blind spots to the student.
  • Distillation passes the teacher's hallucinations to the student.
  • The student often produces traces that look correct but contain subtle errors masked by the response style.

For production deployment, R1-Distill models are excellent cost-quality trade-offs for math and reasoning workloads where frontier reasoning quality is unaffordable.


Anthropic's Haiku distillation pipeline (what's public)

Anthropic has not published a Haiku training paper. From public statements, blog posts, and reasonable inference:

  • Haiku models are trained with Opus and Sonnet outputs in the post-training mix.
  • The pipeline emphasises Constitutional AI-style alignment carried over from larger models.
  • Quality preservation focuses on instruction-following, safety behaviour, and the most common chat-style tasks.
  • Haiku 4.5 (October 2025) markedly closes the gap with Sonnet on standard benchmarks.

Inferred recipe

Based on industry-standard practice in 2025–2026:

  1. Pre-train Haiku at small parameter count.
  2. SFT on a mix of human-labelled and Sonnet/Opus-generated examples.
  3. RLHF or RLAIF from preferences derived from Sonnet/Opus comparison data.
  4. Constitutional AI training on synthetic critiques.

Anthropic publishes neither the parameter count nor the data mix. Treat the above as informed speculation.

What's public

Anthropic's published research on Constitutional AI (Bai et al., 2022) and the Responsible Scaling Policy form the basis of the alignment side of the pipeline. The model card publishes evaluation results but not training recipe.

Why Haiku matters for the field

Haiku 4.5 demonstrates that a small model with the right post-training mix can match much larger models on most tasks at a fraction of the inference cost. The recipe — even at the rough sketch level — is influential across the industry.


Self-improvement: RFT, ReST, RLAIF

Self-improvement loops train a model on its own filtered outputs. Three named variants.

Rejection-sampling Fine-Tuning (RFT)

The simplest self-improvement loop:

  1. Generate N candidate outputs per prompt.
  2. Filter to the correct ones using a verifier.
  3. SFT on the filtered correct outputs.

Iterate. RFT raises capability on verifiable tasks (math, code) at the cost of more inference per training example. Used in Llama 2, Llama 3, DeepSeek's pipelines, and many open recipes.

ReST (Reinforced Self-Training, Google DeepMind)

ReST (Gulcehre et al., arXiv:2308.08998) alternates between a "Grow" step (generate candidates) and an "Improve" step (filter and SFT). Adds explicit ranking and reward modelling between iterations.

ReST^EM (ReST with Expectation-Maximisation) extends to settings where the verifier is a reward model, not a binary correctness check.

RLAIF (RL from AI Feedback)

RLAIF (Lee et al., arXiv:2309.00267) replaces human preference labels with LLM-generated preferences. A judge LLM (often the same or a slightly stronger model) compares two outputs; the preferences train a reward model; the reward model trains the policy via PPO or DPO.

RLAIF demonstrates near-RLHF quality at a fraction of the human-labelling cost. Constitutional AI (Bai et al., 2022) is the canonical RLAIF instantiation: a critique-and-revise loop generates AI-revised completions used as preference data.

Self-Rewarding Language Models

Self-Rewarding LMs (Yuan et al., arXiv:2401.10020) train one model to act as both policy and judge, iteratively improving both via DPO on self-generated preference data. Shows quality gains over 3 iterations.

Limits

Self-improvement loops are bounded by:

  • The verifier's quality (garbage verifier → garbage data).
  • The base model's reasoning frontier (you can't bootstrap above what your model can ever generate correctly).
  • Diversity collapse (the loop converges on a narrow output distribution).

Practical advice: use self-improvement for verifiable domains (math, code, structured reasoning); use human preferences for taste-driven domains (writing quality, safety nuance, multi-turn dynamics).


Quality classifiers: fastText, cleanlab, vendor pipelines

Quality filtering at scale uses lightweight classifiers, not the generator LLM itself.

fastText classifiers

fastText (Joulin et al., 2016) is a CPU-friendly classifier widely used for data filtering. Train on a small set of labelled high-quality vs low-quality documents; apply to the full corpus.

FineWeb-Edu (HuggingFace) uses a fastText educational-quality classifier to filter pretraining web data. The classifier is trained on Llama-3-70B-Instruct labels of educational quality.

cleanlab

cleanlab (Northcutt et al., 2017+) is an open-source data-quality library that finds mislabelled examples via predicted-probability analysis. Used in supervised datasets to flag suspect labels.

Perplexity filtering

Score each document with a small reference LM; filter out documents with perplexity above or below thresholds. High-perplexity documents are often noise; very-low-perplexity documents may be memorised duplicates of training data.

Embedding-based filtering

Embed all documents; cluster; identify low-quality clusters by sampling. Used in the Cosmopedia and Nemotron-CC pipelines.

Vendor pipelines

  • NVIDIA NeMo Data Curator: end-to-end deduplication, quality scoring, contamination check.
  • Hugging Face DataTrove: open-source data processing for large-scale pre-training data.
  • AWS Glue / Databricks: general-purpose data pipelines used as the substrate for filtering.

Quality lift from filtering

Aggressive quality filtering typically removes 50–90% of web-scraped data. The remaining 10–50% trains models that match or exceed quality of unfiltered training at a fraction of the compute. The lift comes from concentration of high-quality signal.


WildChat and real-conversation datasets

Real user conversations are scarce because they're private. Two datasets that crack this open:

WildChat

WildChat (Zhao et al., AI2, 2024) — 1M+ real conversations between users and GPT-3.5/GPT-4, captured via the WildChat playground with user consent. Diverse, multilingual, includes edge cases that synthetic data rarely surfaces.

LMSYS-Chat-1M

LMSYS-Chat-1M (Zheng et al., 2023) — 1M conversations with 25 different LLMs from the Chatbot Arena. Captures comparative behaviour across models and real user queries.

ShareGPT

ShareGPT (community-shared ChatGPT conversations) — used to train Vicuna and many open chatbots. Quality varies; older and skewed toward power-user prompts.

OASST and OpenAssistant Conversations

OpenAssistant Conversations (OASST) (Köpf et al., arXiv:2304.07327) — human-generated conversations and preferences released openly. ~600k messages. Foundation for many open RLHF datasets.

Practical use

Real-conversation data complements synthetic data: synthetic data dominates on volume and verifiability; real-conversation data covers edge cases and distribution that synthetic generators miss. Production open-weight post-training mixes commonly include 10–30% real-conversation data alongside synthetic.


Synthetic preference data: UltraFeedback, Nectar, AI feedback

RLHF and DPO require preference data (chosen vs rejected pairs). Generating preferences synthetically has become the norm for open recipes.

UltraFeedback

UltraFeedback (Cui et al., arXiv:2310.01377) — 64k prompts each scored by GPT-4 across multiple completions from 17 different models. The most-used open preference dataset for DPO training of open-weight models.

Nectar

Nectar (Berkeley, 2024) — 183k prompts with 7 responses each, ranked by GPT-4. Even larger preference dataset; used to train Starling models.

HH-RLHF (Anthropic)

HH-RLHF (Bai et al., 2022) — 170k human-written preferences on helpfulness and harmlessness. Anthropic's foundational RLHF dataset.

Constitutional AI preferences

Synthetic preferences derived from Constitutional AI's critique-and-revise loop: the model critiques its own output against a principle, revises, and the revised version is preferred. Constitutional preferences scale without human labelling; Anthropic uses extensive synthetic preference data.

Quality controls

  • Use a strong judge model; weak judges produce noisy preferences.
  • Use multiple judges and majority-vote on disagreements.
  • Calibrate the judge against a held-out human-labelled set; require >75% agreement.
  • Filter out ties and ambiguous cases.

Cost

Generating 100k synthetic preferences via GPT-4 / Claude judge: ~$2k–10k depending on prompt length. Versus human preferences: $1–3 per labelled pair = $100–300k. Synthetic preferences are 20–100× cheaper.


Cost per accepted example: domain-by-domain

The unit economics of synthetic data depend heavily on domain. A worked table:

Domain Generator Acceptance rate Cost per gen Cost per accepted example
General instruction following GPT-5 ~80% $0.005 $0.0063
Math problems with verifier Claude Sonnet 4.6 ~30% $0.003 $0.010
Code with unit tests DeepSeek-Coder ~40% $0.001 $0.0025
Multi-turn conversations GPT-5 ~70% $0.020 $0.029
Persona-driven creative writing Claude Opus 4.x ~85% $0.05 $0.059
Long chain-of-thought reasoning o3 / R1 ~50% $0.05 $0.10
Safety / red-team prompts GPT-5 ~60% $0.01 $0.017

Versus human labelling

Task Cost per human-labelled example
Simple classification $0.05–$0.20
Instruction-response pair (mid-quality) $0.50–$2
Complex reasoning trace $5–$50
Expert-domain (medical, legal) $20–$200

Crossover point: synthetic generation is 10–100× cheaper than human labelling for most tasks. The exception: expert-domain data where neither humans nor LLMs are reliable; quality requires both.

Filtering economics

Filter rejection rates dominate cost. A 30% acceptance rate triples cost per accepted example vs 90%. Investing in better prompts (raising acceptance) often pays back faster than investing in cheaper generators.

When to generate vs label vs scrape

  • Verifiable (math, code): generate + verify; cheapest.
  • Open-ended structured (instruction following, creative writing): generate, persona-condition, filter.
  • Expert-domain factual: human-label by experts; do not synthesise.
  • High-frequency edge cases: real-conversation data (WildChat, ShareGPT-style).
  • Distribution coverage gaps: targeted synthetic generation with persona conditioning.