๐Ÿƒ From Human-Written to Machine-Generated Pretraining Data

A comprehensive survey published by Epoch AI on May 20, 2026, covering 14 major AI laboratories, finds that synthetic data now accounts for 78% of all tokens used in frontier LLM pretraining, up from 32% in 2024 and under 5% in 2022. Three converging trends drive the shift: the exhaustion of high-quality human-written internet text, the development of reliable synthetic generation pipelines built on reward models and verifiers, and demonstrated gains on reasoning benchmarks when models are trained on synthetic chain-of-thought corpora.

DeepSeek's technical report for V4 Pro explicitly states that 84% of its 14.8 trillion training tokens were synthetically generated using DeepSeek V3 with RL-verified math and code outputs. Meta's Llama 4 team reports a similar 76% synthetic ratio, with human-written data concentrated in the fine-tuning stage. Anthropic's Claude 4 system card describes a "Constitutional Generation" pipeline in which a suite of LLMs generates training data constrained by explicit rules for style, factual accuracy, helpfulness, and harmlessness, and outputs that fail automated verification are rejected.

📋 Generation Methods: Self-Play, Verifiers, and Constitutions

The most impactful technique is iterative self-play with verifiable rewards: a model generates multiple solutions to a math problem or coding task, a verifier (unit tests for code, symbolic solvers for math) scores each solution, and only correct solutions with high-quality reasoning chains enter the training corpus. This approach has been shown to improve MATH benchmark scores by 10-15 percentage points over training on static human-written problem-solution pairs.
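The filtering step described above can be sketched in a few lines of Python. This is a toy illustration, not any lab's pipeline: `propose_solutions` is a hypothetical stand-in for a model that samples several answers to an addition problem, and `verify` plays the role of the unit tests or symbolic solver. The reasoning-chain quality filter mentioned above is left out for brevity.

```python
import random
from typing import Dict, List, Tuple


def propose_solutions(a: int, b: int, k: int, rng: random.Random) -> List[int]:
    """Hypothetical stand-in for a model: k candidate answers to a + b, some wrong."""
    return [a + b + rng.choice([0, 0, 1, -1]) for _ in range(k)]


def verify(a: int, b: int, answer: int) -> bool:
    """Verifier: recomputes the ground truth, like unit tests or a symbolic solver."""
    return answer == a + b


def build_corpus(problems: List[Tuple[int, int]], k: int, seed: int = 0) -> List[Dict]:
    """Sample k candidates per problem and keep only verified ones."""
    rng = random.Random(seed)
    corpus = []
    for a, b in problems:
        candidates = propose_solutions(a, b, k, rng)
        verified = [c for c in candidates if verify(a, b, c)]
        if verified:
            # Keep one verified sample per problem to avoid duplicates.
            corpus.append({"prompt": f"{a} + {b} = ?", "answer": verified[0]})
    return corpus


if __name__ == "__main__":
    for row in build_corpus([(1, 2), (10, 5), (7, 7)], k=8):
        print(row)
```

Problems for which no candidate passes verification simply contribute nothing, so the corpus can be smaller than the problem set but never contains an unverified answer.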

For open-ended domains like creative writing, summarization, and dialogue, labs have converged on constitutional generation frameworks. A "generator" model produces candidate text; a "critic" model (often the same architecture with a different system prompt) evaluates for style, factual accuracy, and policy compliance; and only approved outputs enter the training pool. Google DeepMind's Gemma 3 technical report describes a recursive variant where critic feedback is used to refine generator outputs over multiple rounds, producing training data of higher quality than any single model could generate directly.
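A minimal sketch of the generator-critic loop with feedback-driven refinement might look like the following. Everything here is an assumption made for illustration: `toy_generator` and `toy_critic` are hypothetical rule-based stand-ins for LLM calls, and the two formatting rules stand in for a real constitution covering style, accuracy, and policy.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Critique:
    approved: bool
    feedback: str


def toy_generator(prompt: str, draft: Optional[str], feedback: str) -> str:
    """Hypothetical generator: sloppy first draft, then revisions guided by feedback."""
    if draft is None:
        return prompt.lower()
    if "capital" in feedback:
        draft = draft[:1].upper() + draft[1:]
    if "period" in feedback:
        draft = draft + "."
    return draft


def toy_critic(text: str) -> Critique:
    """Hypothetical critic: checks two simple rules standing in for a constitution."""
    problems = []
    if not text or not text[0].isupper():
        problems.append("start with a capital letter")
    if not text.endswith("."):
        problems.append("end with a period")
    return Critique(approved=not problems, feedback="; ".join(problems))


def refine(
    prompt: str,
    generate: Callable[[str, Optional[str], str], str],
    critique: Callable[[str], Critique],
    max_rounds: int = 3,
) -> Optional[str]:
    """Return an approved draft, or None if none is approved within max_rounds."""
    draft: Optional[str] = None
    feedback = ""
    for _ in range(max_rounds):
        draft = generate(prompt, draft, feedback)
        verdict = critique(draft)
        if verdict.approved:
            return draft
        feedback = verdict.feedback  # feed the critique into the next round
    return None


if __name__ == "__main__":
    print(refine("Synthetic data needs checks", toy_generator, toy_critic))
```

Returning `None` when the round budget runs out mirrors the rule that only approved outputs enter the training pool; rejected drafts are dropped rather than kept in a degraded form.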

🧠 Model Collapse and Contamination Risks

Critics including UC Berkeley's Hany Farid and Oxford's Yarin Gal have raised concerns about "model collapse," the phenomenon where models trained on synthetic data from previous-generation models lose diversity and amplify subtle errors over successive generations. A 2025 Nature paper demonstrated that after five generations of synthetic-data training, language models lost the ability to generate text about tail-distribution topics and exhibited increased hallucination rates.
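The tail-loss mechanism can be seen in a toy simulation. This is not the experiment from the paper cited above; it is a sketch that assumes a "model" is nothing more than a frequency table over topics, refit each generation on a finite sample drawn from the previous generation. Once a rare topic draws zero samples, its probability becomes zero and it can never reappear.

```python
import random
from collections import Counter
from typing import Dict, List


def next_generation(probs: Dict[str, float], n_samples: int, rng: random.Random) -> Dict[str, float]:
    """Sample from the current model, then refit by empirical frequency."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    sample = rng.choices(tokens, weights=weights, k=n_samples)
    counts = Counter(sample)
    # Topics that were never sampled are absent from the new model.
    return {t: c / n_samples for t, c in counts.items()}


def simulate(probs: Dict[str, float], generations: int, n_samples: int, seed: int = 0) -> List[int]:
    """Return the number of topics with nonzero probability after each generation."""
    rng = random.Random(seed)
    support_sizes = [len(probs)]
    for _ in range(generations):
        probs = next_generation(probs, n_samples, rng)
        support_sizes.append(len(probs))
    return support_sizes


if __name__ == "__main__":
    # Zipf-like starting distribution over 50 hypothetical topics.
    raw = {f"topic_{i}": 1.0 / i for i in range(1, 51)}
    total = sum(raw.values())
    start = {t: w / total for t, w in raw.items()}
    print(simulate(start, generations=5, n_samples=200))
```

The printed list is the count of surviving topics per generation. It can only stay flat or shrink, which is the point of the illustration: without fresh data, lost tail topics are not recovered.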

Labs have responded with "freshness" requirements: keeping human-verified data at 15-20% or more of the training mix, and continuously generating synthetic data from multiple diverse model families rather than training on their own outputs. But as human text on the open web becomes increasingly interleaved with LLM-generated content, maintaining a clean training signal grows harder.
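As a rough sketch of how such a freshness rule could be enforced when assembling a batch, the helper below reserves a minimum share for human-verified documents and spreads the remainder evenly across several generator families. The function name, the integer-percent parameter, and the family labels are all hypothetical and chosen for illustration.

```python
from typing import Dict, List


def mixture_counts(total: int, min_human_percent: int, families: List[str]) -> Dict[str, int]:
    """Split `total` documents into a human floor plus an even spread over families."""
    if not 0 <= min_human_percent <= 100:
        raise ValueError("min_human_percent must be between 0 and 100")
    if not families:
        raise ValueError("at least one synthetic model family is required")

    # Ceiling division in integers so the human floor is never undershot.
    n_human = -(-total * min_human_percent // 100)
    n_synthetic = total - n_human

    base, extra = divmod(n_synthetic, len(families))
    counts = {"human": n_human}
    for i, family in enumerate(families):
        # The first `extra` families absorb the remainder, one document each.
        counts[family] = base + (1 if i < extra else 0)
    return counts


if __name__ == "__main__":
    print(mixture_counts(1000, 15, ["family_a", "family_b", "family_c"]))
```

Working in integer percentages avoids floating-point rounding at the boundary, so a 15% floor on 1,000 documents yields exactly 150 human-verified documents, with the other 850 split across the three families.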