Synthetic data with LLMs is one of those “obvious in retrospect” capabilities. Done right, it accelerates eval, fine-tuning, and edge case discovery. Done wrong, you get model collapse, biased data, and wasted training compute. This post is the working playbook.
When synthetic data helps
- Eval set generation: 500 diverse cases for testing.
- Fine-tune data augmentation: small starter set → expanded.
- Edge case enumeration: “what could go wrong?”
- Test data for development: realistic but synthetic users / orders / events.
- Distribution shifting: rare scenarios for training.
When it hurts
- Pure synthetic training loops: model collapse, where the data converges to the model’s biases.
- Foundational pre-training: original sources irreplaceable.
- Creative / open-ended training: synthetic data lacks diversity.
- Without human-in-the-loop validation: errors compound.
Eval set generation
import random
# `llm` and `format_examples` are assumed to be defined elsewhere.
async def generate_eval_cases(seed_examples, n=100):
    cases = []
    for _ in range(n):
        # Sample up to 3 seeds per call so each prompt sees a different mix.
        seeds = random.sample(seed_examples, min(3, len(seed_examples)))
        prompt = f"""
        Given these seed examples, generate a similar but distinct test case:
        {format_examples(seeds)}
        Output a single test case in the same format. Make it diverse: different topic, edge case, or phrasing.
        """
        case = await llm.generate(prompt, temperature=0.8)
        cases.append(case)
    return cases
Then do a human review. Expect to reject 20-30% as off-target and keep the rest.
The payoff: 500 hand-curated cases take about a week of work, while 500 LLM-generated and reviewed cases take about a day.
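The rejection rate also determines how many raw cases to generate up front. Here is a small sketch of that arithmetic, assuming the 20-30% rejection range holds for your data. The function name `cases_to_generate` is illustrative and not part of any library.

```python
def cases_to_generate(target: int, reject_pct: int) -> int:
    """Return how many raw cases to generate so roughly `target` survive review.

    `reject_pct` is the expected share of cases rejected by reviewers, as a
    whole-number percentage (an assumption you should measure on a pilot batch).
    """
    if target < 0:
        raise ValueError("target must be non-negative")
    if not 0 <= reject_pct < 100:
        raise ValueError("reject_pct must be in [0, 100)")
    keep_pct = 100 - reject_pct
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-target * 100 // keep_pct)


if __name__ == "__main__":
    for pct in (20, 30):
        print(pct, cases_to_generate(500, pct))
```

For a target of 500 reviewed cases this gives 625 raw cases at a 20% rejection rate and 715 at 30%.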
Fine-tune data augmentation
async def augment(seed_pair, n=5):
    """Generate n variations of a seed (input, output) pair."""
    prompt = f"""
    Given this example:
    Input: {seed_pair['input']}
    Output: {seed_pair['output']}
    Generate {n} variations with different wordings but the same intent.
    """
    # `parse_examples` is assumed to be an async helper that splits the raw response into pairs.
    return await parse_examples(await llm.generate(prompt))
50 hand-written examples plus 5 variations each gives 300 total (50 originals + 250 variations). Validate quality, then use the set for fine-tuning. See LLM Fine-Tuning.
Edge case generation
async def find_edge_cases(prompt_template, n=20):
    cases = []
    instructions = [
        "Test boundary conditions (empty, max length, unicode).",
        "Test ambiguity / multiple valid interpretations.",
        "Test prompt injection attempts.",
        "Test out-of-distribution inputs.",
        "Test culturally sensitive scenarios.",
    ]
    # Split the requested total evenly across the instruction types.
    per_instruction = max(1, n // len(instructions))
    for inst in instructions:
        # `llm.generate_cases` is assumed to return a list of cases.
        cases.extend(await llm.generate_cases(prompt_template, inst, n=per_instruction))
    return cases
Models are good at imagining edge cases. Use them.
Test data for dev
async def gen_users(n=100):
    return await llm.batch_generate(
        prompt="Generate a realistic-looking user record with fields name, email, age, country, signup_date.",
        n=n,
    )
You avoid copying PII from production, get diversity for free, and can synthesize relationships between entities:
users = await gen_users(100)
orders = await gen_orders(users, 1000) # references users
For dev / staging seed data: brilliant.
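Generated relationships are only useful if the references resolve, and an LLM can invent IDs that do not exist. Here is a minimal sketch of a referential-integrity check. It assumes each user dict has an `id` key and each order dict has a `user_id` key; both field names are hypothetical, since the prompt above does not define them.

```python
def find_orphan_orders(users: list[dict], orders: list[dict]) -> list[dict]:
    """Return the orders whose `user_id` does not match any generated user.

    The `id` / `user_id` field names are assumptions for illustration.
    """
    user_ids = {user["id"] for user in users}
    # Orders with a missing or unknown user_id are both treated as orphans.
    return [order for order in orders if order.get("user_id") not in user_ids]


if __name__ == "__main__":
    users = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
    orders = [{"order_id": 10, "user_id": 1}, {"order_id": 11, "user_id": 99}]
    print(find_orphan_orders(users, orders))
```

Running a check like this before loading seed data lets you drop or regenerate the orphaned rows instead of discovering broken foreign keys in staging.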
Quality control
Synthetic data quality varies. Always:
- Human spot-check: 10% of generated examples reviewed.
- Diversity metrics: are generated cases similar (bad) or distinct (good)?
- Bias check: does the LLM over-generate certain patterns?
- Filtering: remove duplicates, off-topic, low-quality.
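The filtering and diversity checks can be approximated without another model call. The sketch below removes near-exact duplicates after normalization and reports mean pairwise Jaccard similarity over word sets. Jaccard is a stand-in chosen for simplicity rather than a method this post prescribes, and all function names are illustrative.

```python
import re
from itertools import combinations


def normalize(text: str) -> str:
    """Lowercase and keep only alphanumeric tokens, joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def dedupe(cases: list[str]) -> list[str]:
    """Drop empty cases and cases identical to an earlier one after normalization."""
    seen: set[str] = set()
    unique: list[str] = []
    for case in cases:
        key = normalize(case)
        if key and key not in seen:
            seen.add(key)
            unique.append(case)
    return unique


def jaccard(a: str, b: str) -> float:
    """Word-set overlap in [0, 1]; 1.0 means identical vocabularies."""
    tokens_a, tokens_b = set(normalize(a).split()), set(normalize(b).split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mean_pairwise_similarity(cases: list[str]) -> float:
    """Average Jaccard similarity over all pairs; lower suggests more diversity."""
    pairs = list(combinations(cases, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

The pairwise loop is quadratic, so on large sets it is cheaper to score a random sample. A high mean similarity is a hint to vary the prompts, not proof that the data is bad.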
Model collapse
If you train a model on its own outputs, then train again, then again — quality degrades.
Model v1 → trained on real data → high quality, diverse.
Model v2 → trained on v1's outputs → narrower, biased toward v1's mistakes.
Model v3 → trained on v2's outputs → much narrower.
Mitigations:
- Always include real data in the training mix.
- Multiple source models: use different LLMs to generate.
- Human-in-the-loop validation.
- Diversity-prompting: explicitly ask for varied outputs.
For fine-tuning narrow tasks with synthetic-augmented data + real validation: usually fine. For pre-training: strict caution.
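One way to enforce the "always include real data" rule is to cap the synthetic share when assembling the training set. This is a sketch only: the default of one synthetic example per real example is an arbitrary assumption, not a recommendation from this post, and `build_training_mix` is a hypothetical helper.

```python
import random


def build_training_mix(real, synthetic, max_synthetic_per_real=1.0, seed=0):
    """Combine all real examples with a capped random sample of synthetic ones.

    `max_synthetic_per_real` is an assumed knob: 1.0 allows at most as many
    synthetic examples as real ones.
    """
    if not real:
        raise ValueError("refusing to build a purely synthetic training mix")
    if max_synthetic_per_real < 0:
        raise ValueError("max_synthetic_per_real must be non-negative")
    rng = random.Random(seed)  # seeded so the mix is reproducible
    cap = min(len(synthetic), int(len(real) * max_synthetic_per_real))
    mix = list(real) + rng.sample(list(synthetic), cap)
    rng.shuffle(mix)
    return mix
```

Raising on an empty `real` list makes the pure-synthetic loop an explicit error instead of a silent default.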
Distillation
Use a strong model to generate training data; fine-tune a small model on it.
# Teacher: Sonnet (`sonnet`, `dataset`, and `fine_tune` are assumed to be defined elsewhere)
teacher_responses = []
for prompt in dataset:
    resp = await sonnet.complete(prompt)  # must run inside an async function
    teacher_responses.append({"input": prompt, "output": resp})
# Student: Llama 3.1 8B fine-tuned on teacher's data
fine_tune(base="llama-3.1-8b", data=teacher_responses)
The student approaches the teacher’s quality on the narrow task at a fraction of the cost. See LLM Cost Optimization.
This is a common cost-optimization pattern in 2026.
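Before fine-tuning, teacher outputs usually need to be filtered and serialized. Here is a minimal sketch that drops empty responses and emits one JSON object per line. The `input` / `output` keys mirror the snippet above, but the JSONL layout is an assumption; check the format your fine-tuning tool expects.

```python
import json


def to_jsonl_lines(records: list[dict], min_output_chars: int = 1) -> list[str]:
    """Serialize teacher records to JSONL lines, skipping too-short outputs."""
    lines = []
    for record in records:
        output = str(record.get("output", "")).strip()
        if len(output) < min_output_chars:
            continue  # an empty or truncated teacher response teaches nothing
        row = {"input": record["input"], "output": output}
        lines.append(json.dumps(row, ensure_ascii=False))
    return lines


if __name__ == "__main__":
    demo = [
        {"input": "Classify: refund request", "output": "billing"},
        {"input": "Classify: ???", "output": "   "},
    ]
    print("\n".join(to_jsonl_lines(demo)))
```

Writing the result to disk is then a matter of joining the lines with newlines. Keeping serialization separate from file I/O makes the filter easy to unit test.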
Rephrasing for diversity
async def diversify_prompt(template, n=10):
    # Use n in the prompt so the argument actually controls the count.
    return await llm.generate(f"""
    Rephrase this prompt {n} times, keeping the intent identical but varying:
    - Politeness / tone
    - Length
    - Word choice
    - Grammar (formal vs casual)
    Original: {template}
    """)
Use the rephrasings for robustness training: the model sees the same intent in many forms.
Adversarial generation
async def find_breaking_inputs(model_endpoint):
    """Have an LLM generate inputs that might break the model."""
    # `model_endpoint` is not called here; send the returned messages to it in a separate step.
    return await attacker_llm.generate("""
    You are testing a customer support classifier. Generate 20 user messages designed to confuse it:
    - Mixed languages
    - Sarcasm
    - Multiple intents
    - Ambiguous references
    Output JSON list.
    """)
Use the output as a red-team eval set. See LLM Guardrails.
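Asking for a "JSON list" does not guarantee one: models often wrap the list in prose or return malformed JSON. Here is a defensive parsing sketch, assuming the attacker returns a list of strings. `parse_json_list` is an illustrative helper, not part of any client library.

```python
import json
import re


def parse_json_list(raw: str) -> list[str]:
    """Extract a list of non-empty strings from an LLM response.

    Returns an empty list when no valid JSON array is found, so the caller
    can retry instead of crashing.
    """
    # Greedy match from the first '[' to the last ']' tolerates surrounding prose.
    match = re.search(r"\[.*\]", raw, flags=re.DOTALL)
    if match is None:
        return []
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only non-blank strings; drop numbers, nulls, and nested objects.
    return [item.strip() for item in data if isinstance(item, str) and item.strip()]
```

An empty result is a signal to regenerate, and logging the raw response alongside it makes prompt fixes easier.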
Synthetic vs real
| | Real | Synthetic |
|---|---|---|
| Cost | Annotation $$$$ | LLM $ |
| Distribution | True | Approximated |
| Diversity | Genuine | Limited |
| PII | Yes | None by construction (still scan the output) |
| Volume | Bounded | Unlimited |
| Quality control | Established | Evolving |
Best of both: synthetic-augmented real data. Real ground truth + synthetic variations.
Common mistakes
1. Pure synthetic training loops
Model degrades. Always mix in real data.
2. No validation
If you trust LLM-generated data blindly and train on it, quality regressions ship.
3. One-temperature generation
temperature=0 → repetitive. temperature=1 → noisy. Sweep for diversity.
4. Over-generating from one prompt template
Variance comes from variant prompts, not just temperature.
5. Synthesizing PII patterns
LLMs sometimes generate plausible real names / SSNs. Use clearly synthetic values for tests, such as email addresses on a reserved domain like example.com.
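For the PII mistake, a cheap regex pass over generated records catches the obvious cases. The sketch below flags SSN-shaped numbers and email addresses outside the reserved example domains (example.com, example.net, example.org). It is a rough heuristic under those assumptions, not a complete PII detector, and the names are illustrative.

```python
import re

SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL = re.compile(r"\b[\w.+-]+@([\w-]+(?:\.[\w-]+)+)\b")
# Second-level domains reserved for documentation and examples.
RESERVED_DOMAINS = ("example.com", "example.net", "example.org")


def is_reserved(domain: str) -> bool:
    """True for a reserved example domain or any of its subdomains."""
    return any(domain == d or domain.endswith("." + d) for d in RESERVED_DOMAINS)


def flag_pii_like(record: str) -> list[str]:
    """Return a list of reasons this record looks like it may contain real PII."""
    flags = []
    if SSN_LIKE.search(record):
        flags.append("ssn_like")
    for match in EMAIL.finditer(record):
        domain = match.group(1).lower()
        if not is_reserved(domain):
            flags.append(f"email:{domain}")
    return flags
```

Records with any flag can be regenerated or rewritten to use a reserved domain before they reach a test fixture.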
What I’d ship today
For LLM-augmented data work:
- Eval generation with human review → CI test set.
- Fine-tune augmentation for narrow tasks.
- Distillation for cost reduction.
- Diversity prompting + temperature sweeps.
- Quality filters post-generation.
- Real-data anchor in any training mix.
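The "diversity prompting + temperature sweeps" item can be made concrete as a generation plan that crosses prompt variants with temperatures, so variety comes from both axes. This is a sketch: the temperature values and `build_generation_plan` are illustrative assumptions, and each plan entry would be passed to whatever LLM client you use.

```python
from itertools import product


def build_generation_plan(templates, temperatures, per_combo=1):
    """Return one job per (template, temperature) pair, repeated `per_combo` times."""
    if per_combo < 1:
        raise ValueError("per_combo must be at least 1")
    return [
        {"template": template, "temperature": temperature}
        for template, temperature in product(templates, temperatures)
        for _ in range(per_combo)
    ]


if __name__ == "__main__":
    plan = build_generation_plan(
        templates=["Write a refund request.", "Write an angry refund request."],
        temperatures=[0.3, 0.7, 1.0],  # assumed sweep values, tune per task
        per_combo=2,
    )
    print(len(plan))
```

Tagging every generated example with the template and temperature that produced it also lets you see later which combinations the reviewers rejected most.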
Read this next
- LLM Fine-Tuning LoRA / QLoRA
- LLM Evaluation Frameworks 2026
- LLM Cost Optimization 2026
- LLM Guardrails 2026
If you want my synthetic data + eval generation pipeline, it’s at rajpoot.dev.