Synthetic data with LLMs is one of those “obvious in retrospect” capabilities. Done right, it accelerates eval, fine-tuning, and edge case discovery. Done wrong, you get model collapse, biased data, and wasted training compute. This post is the working playbook.

When synthetic data helps

  • Eval set generation: 500 diverse cases for testing.
  • Fine-tune data augmentation: small starter set → expanded.
  • Edge case enumeration: “what could go wrong?”
  • Test data for development: realistic but synthetic users / orders / events.
  • Distribution shifting: rare scenarios for training.

When it hurts

  • Pure synthetic training loops: model collapse, where the data converges to the model’s biases.
  • Foundational pre-training: original sources are irreplaceable.
  • Creative / open-ended training: synthetic data lacks diversity.
  • Without human-in-the-loop validation: errors compound.

Eval set generation

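import random

# `llm` and `format_examples` are assumed to be defined elsewhere.
async def generate_eval_cases(seed_examples, n=100):
    cases = []
    k = min(3, len(seed_examples))  # random.sample raises if k > population size
    for _ in range(n):
        # Sample fresh seeds each round so the few-shot context varies.
        prompt = f"""
Given these seed examples, generate a similar but distinct test case:

{format_examples(random.sample(seed_examples, k))}

Output a single test case in the same format. Make it diverse: different topic, edge case, or phrasing.
"""
        case = await llm.generate(prompt, temperature=0.8)
        cases.append(case)
    return cases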

Then do a human review. Expect to reject roughly 20-30% as off-target and keep the rest.

The payoff: 500 hand-curated cases take about a week of work; 500 LLM-generated and reviewed cases take about a day.

Fine-tune data augmentation

async def augment(seed_pair, n=5):
    """Generate variations of a seed (input, output) pair."""
    prompt = f"""
Given this example:
Input: {seed_pair['input']}
Output: {seed_pair['output']}

Generate {n} variations with different wordings but the same intent.
"""
    return await parse_examples(await llm.generate(prompt))

50 hand-written examples + 5 variations each = 300 total (50 originals plus 250 variations). Validate quality, then use the set for fine-tuning. See LLM Fine-Tuning.
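The "validate quality" step can start with cheap mechanical checks before any human looks at the data. Here is a minimal sketch, assuming each variation is a dict with `input` and `output` keys like the seed pair. `filter_variations` is a hypothetical helper name, not part of any library.

```python
def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so trivial rewrites compare equal."""
    return " ".join(text.lower().split())


def filter_variations(seed_pair: dict, variations: list[dict]) -> list[dict]:
    """Drop variations that are empty, copy the seed input, or repeat each other."""
    seen = {_norm(seed_pair["input"])}
    kept = []
    for var in variations:
        key = _norm(var.get("input", ""))
        if not key or not var.get("output", "").strip():
            continue  # empty input or empty output
        if key in seen:
            continue  # same as the seed or an earlier variation
        seen.add(key)
        kept.append(var)
    return kept
```

This only catches mechanical failures. Whether a variation really keeps "the same intent" still needs a human or a separate check.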

Edge case generation

async def find_edge_cases(prompt_template, n=20):
    cases = []
    instructions = [
        "Test boundary conditions (empty, max length, unicode).",
        "Test ambiguity / multiple valid interpretations.",
        "Test prompt injection attempts.",
        "Test out-of-distribution inputs.",
        "Test culturally sensitive scenarios.",
    ]
    # Split the budget evenly across instruction types (n=20 gives 4 each).
    per_instruction = max(1, n // len(instructions))
    for inst in instructions:
        cases.extend(await llm.generate_cases(prompt_template, inst, n=per_instruction))
    return cases

Models are good at imagining edge cases. Use them.

Test data for dev

async def gen_users(n=100):
    return await llm.batch_generate(
        prompt="Generate a realistic-looking user record with fields name, email, age, country, signup_date.",
        n=n,
    )

Avoid PII from production; get diversity for free; synthesize relationships:

users = await gen_users(100)
orders = await gen_orders(users, 1000)  # references users

For dev / staging seed data: brilliant.
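When generated records reference each other, it is worth checking that the references actually resolve, since an LLM can invent IDs. Here is a minimal sketch, assuming users carry an `id` field and orders a `user_id` field. Both field names are assumptions; the record spec above does not name them.

```python
def find_orphan_orders(users: list[dict], orders: list[dict]) -> list[dict]:
    """Return orders whose user_id does not match any generated user."""
    known_ids = {user["id"] for user in users}
    # A missing user_id yields None, which is never a known id, so it is flagged too.
    return [order for order in orders if order.get("user_id") not in known_ids]
```

Running this right after generation and regenerating the orphans is usually cheaper than debugging foreign-key failures when the seed data is loaded.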

Quality control

Synthetic data quality varies. Always:

  1. Human spot-check: 10% of generated examples reviewed.
  2. Diversity metrics: are generated cases similar (bad) or distinct (good)?
  3. Bias check: does the LLM over-generate certain patterns?
  4. Filtering: remove duplicates, off-topic, low-quality.
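Items 2 and 4 can be roughed out with the standard library alone. The sketch below uses token-set Jaccard overlap as a crude stand-in for semantic similarity. That choice, the function names, and the 0.8 threshold are all illustrative assumptions, and an embedding-based measure would likely do better on paraphrases.

```python
from itertools import combinations


def _tokens(text: str) -> frozenset[str]:
    return frozenset(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Token-set overlap: 1.0 means identical word sets, 0.0 means disjoint."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mean_pairwise_similarity(cases: list[str]) -> float:
    """Average Jaccard similarity over all pairs; lower suggests more diversity."""
    pairs = list(combinations(cases, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def drop_near_duplicates(cases: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a case only if it stays below the threshold against everything kept so far."""
    kept: list[str] = []
    for case in cases:
        if all(jaccard(case, other) < threshold for other in kept):
            kept.append(case)
    return kept
```

Both functions compare every pair, so they are quadratic in the number of cases. That is fine for a few thousand examples and too slow well beyond that.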

Model collapse

If you train a model on its own outputs, then train again, then again — quality degrades.

Model v1 → trained on real data → high quality, diverse.
Model v2 → trained on v1's outputs → narrower, biased toward v1's mistakes.
Model v3 → trained on v2's outputs → much narrower.
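The narrowing can be illustrated with a toy calculation. This is not a simulation of real training. It assumes, purely for illustration, that each generation slightly over-weights whatever the previous one already favoured. The `sharpen` exponent and the starting distribution are made up.

```python
import math


def entropy(probs: list[float]) -> float:
    """Shannon entropy in bits; lower means a narrower distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def next_generation(probs: list[float], sharpen: float = 1.5) -> list[float]:
    """Toy 'retrain on own outputs' step: over-weight already-likely outputs."""
    powered = [p ** sharpen for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def simulate(probs: list[float], generations: int = 3) -> list[float]:
    """Return the entropy of each generation, starting with the original."""
    history = [entropy(probs)]
    for _ in range(generations):
        probs = next_generation(probs)
        history.append(entropy(probs))
    return history


if __name__ == "__main__":
    for gen, bits in enumerate(simulate([0.4, 0.3, 0.2, 0.1]), start=1):
        print(f"v{gen}: {bits:.3f} bits")
```

Under that assumption the entropy falls with every generation, and the rarest outputs lose probability mass fastest. This mirrors the v1 → v2 → v3 picture above.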

Mitigations:

  • Always include real data in the training mix.
  • Multiple source models: use different LLMs to generate.
  • Human-in-the-loop validation.
  • Diversity-prompting: explicitly ask for varied outputs.

For fine-tuning narrow tasks with synthetic-augmented data + real validation: usually fine. For pre-training: strict caution.
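One way to enforce the "always include real data" rule is to build the mix in code, so that a purely synthetic set cannot be produced by accident. Here is a minimal sketch. The 0.5 default cap is an arbitrary placeholder rather than a recommendation, and `build_training_mix` is a hypothetical helper.

```python
import random


def build_training_mix(real, synthetic, max_synthetic_fraction=0.5, seed=0):
    """Combine real and synthetic examples, capping the synthetic share."""
    if not real:
        raise ValueError("refusing to build a purely synthetic training set")
    if not 0 <= max_synthetic_fraction < 1:
        raise ValueError("max_synthetic_fraction must be in [0, 1)")
    # Largest synthetic count s such that s / (len(real) + s) <= the cap.
    cap = int(len(real) * max_synthetic_fraction / (1 - max_synthetic_fraction))
    rng = random.Random(seed)
    chosen = rng.sample(synthetic, min(cap, len(synthetic)))
    # Tag provenance so later analysis can separate the two sources.
    mix = [{"example": ex, "source": "real"} for ex in real]
    mix += [{"example": ex, "source": "synthetic"} for ex in chosen]
    rng.shuffle(mix)
    return mix
```

Keeping the `source` tag on every example also makes it easy to hold out a real-only validation set.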

Distillation

Use a strong model to generate training data; fine-tune a small model on it.

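import asyncio

# Teacher: Sonnet (`sonnet`, `dataset`, and `fine_tune` are assumed to be defined elsewhere)
async def collect_teacher_responses(dataset):
    teacher_responses = []
    for prompt in dataset:
        resp = await sonnet.complete(prompt)
        teacher_responses.append({"input": prompt, "output": resp})
    return teacher_responses

# `await` is only valid inside a coroutine, so drive it with asyncio.run.
teacher_responses = asyncio.run(collect_teacher_responses(dataset))

# Student: Llama 3.1 8B fine-tuned on teacher's data
fine_tune(base="llama-3.1-8b", data=teacher_responses)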

The student approaches the teacher’s quality on the narrow task at a fraction of the cost. See LLM Cost Optimization.

Common pattern in 2026 for cost optimization.
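Before fine-tuning, the teacher's records usually need to be written to disk. The sketch below writes one JSON object per line (JSONL) and skips empty responses. JSONL is an assumption here: the format your fine-tuning tool expects may differ, and `write_jsonl` is a hypothetical helper.

```python
import json
from pathlib import Path


def write_jsonl(records: list[dict], path: str) -> int:
    """Write non-empty teacher records, one JSON object per line; return the count."""
    written = 0
    with Path(path).open("w", encoding="utf-8") as fh:
        for rec in records:
            if not rec.get("output", "").strip():
                continue  # skip empty teacher responses
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")
            written += 1
    return written
```

Comparing the returned count with `len(records)` gives a quick read on how often the teacher returned nothing usable.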

Rephrasing for diversity

async def diversify_prompt(template, n=10):
    return await llm.generate(f"""
Rephrase this prompt {n} times, keeping the intent identical but varying:
- Politeness / tone
- Length
- Word choice
- Grammar (formal vs casual)

Original: {template}
""")

Robustness training: model sees the same intent in many forms.

Adversarial generation

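async def find_breaking_inputs(model_endpoint):
    """Have an LLM generate inputs that might break the model."""
    # model_endpoint is not used for generation; replay the returned
    # messages against it afterwards to see which ones it mishandles.
    return await attacker_llm.generate("""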
You are testing a customer support classifier. Generate 20 user messages designed to confuse it:
- Mixed languages
- Sarcasm
- Multiple intents
- Ambiguous references

Output JSON list.
""")

Use the results as a red-team eval. See LLM Guardrails.
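The prompt asks for a JSON list, but model output often arrives wrapped in extra prose, so it helps to parse defensively before feeding the messages into an eval. Here is a minimal sketch, assuming the list holds plain strings. `parse_json_list` is a hypothetical helper.

```python
import json


def parse_json_list(raw: str) -> list[str]:
    """Extract a JSON list of strings from model output, tolerating extra text."""
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        data = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only non-empty strings; drop anything else the model slipped in.
    return [item.strip() for item in data if isinstance(item, str) and item.strip()]
```

An empty list means parsing failed or the model returned nothing useful. Retrying the generation is usually simpler than trying to repair the output.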

Synthetic vs real

                   Real             Synthetic
Cost               Annotation $$$$  LLM $
Distribution       True             Approximated
Diversity          Genuine          Limited
PII                Yes              None
Volume             Bounded          Unlimited
Quality control    Established      Evolving

Best of both: synthetic-augmented real data. Real ground truth + synthetic variations.

Common mistakes

1. Pure synthetic training loops

Model degrades. Always mix in real data.

2. No validation

If you trust LLM-generated data blindly and train on it, quality regressions ship.

3. One-temperature generation

temperature=0 → repetitive. temperature=1 → noisy. Sweep for diversity.

4. Over-generating from one prompt template

Variance comes from variant prompts, not just temperature.
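Mistakes 3 and 4 can be handled together by spreading generation over a grid of prompt templates and temperatures, then measuring how many outputs are actually distinct. Here is a minimal sketch. The function names, the `per_cell` parameter, and the grid values in the usage note are all illustrative assumptions.

```python
from itertools import product


def build_generation_plan(templates, temperatures, per_cell=2):
    """Spread generation over every (template, temperature) combination."""
    return [
        {"template": tpl, "temperature": temp}
        for tpl, temp in product(templates, temperatures)
        for _ in range(per_cell)
    ]


def distinct_ratio(outputs):
    """Share of outputs that are unique after normalisation (1.0 = all distinct)."""
    if not outputs:
        return 0.0
    normalised = {" ".join(o.lower().split()) for o in outputs}
    return len(normalised) / len(outputs)
```

For example, `build_generation_plan(templates, [0.3, 0.7, 1.0])` yields one request spec per template, temperature, and repeat. Comparing `distinct_ratio` per temperature shows where extra randomness stops buying diversity.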

5. Synthesizing PII patterns

LLMs sometimes generate plausible real names / SSNs. Use clearly synthetic data (for example, email addresses on the reserved example.com domain) for tests.
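A cheap guard is to scan generated records for values that look real. The sketch below flags SSN-shaped numbers and any email address outside the reserved example domains. The patterns are deliberately simple and will miss plenty, so treat it as a first filter under those assumptions, not as a PII detector.

```python
import re

SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL = re.compile(r"\b[\w.+-]+@([\w-]+(?:\.[\w-]+)+)\b")
# Domains reserved for documentation and testing (RFC 2606).
SAFE_DOMAINS = {"example.com", "example.org", "example.net"}


def flag_pii_like(text: str) -> list[str]:
    """Return a list of suspicious values found in one generated record."""
    findings = [f"ssn-like: {m.group(0)}" for m in SSN_LIKE.finditer(text)]
    for m in EMAIL.finditer(text):
        if m.group(1).lower() not in SAFE_DOMAINS:
            findings.append(f"email: {m.group(0)}")
    return findings
```

Records with findings can be regenerated or rewritten to use placeholder values before they reach a test fixture.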

What I’d ship today

For LLM-augmented data work:

  • Eval generation with human review → CI test set.
  • Fine-tune augmentation for narrow tasks.
  • Distillation for cost reduction.
  • Diversity prompting + temperature sweeps.
  • Quality filters post-generation.
  • Real-data anchor in any training mix.

Read this next

If you want my synthetic data + eval generation pipeline, it’s at rajpoot.dev.

