Fine-tuning got cheap enough by 2026 that small teams ship fine-tuned models. The mechanics are well-trodden; the success patterns are clear. This post is the working playbook.

When to fine-tune

Don’t, until you’ve tried:

  1. Prompting + good examples.
  2. Structured output via tool calling.
  3. RAG for fresh facts.
  4. Routing (send only the hard requests to a bigger model).

After that, fine-tune when:

  • A narrow task runs at high volume; you want a smaller cheaper model that matches the big one.
  • The model resists a specific style / format that prompting can’t fix.
  • You have a domain (legal, medical) where vocabulary matters.

See Fine-Tuning vs RAG vs Prompting.

LoRA / QLoRA

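LoRA (Low-Rank Adaptation) trains a small set of adapter weights on top of a frozen base. The adapters are typically ~0.1–1% of the base model's parameters, which sharply cuts gradient and optimizer memory; quality is often comparable to a full fine-tune on narrow tasks.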
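To see where the small adapter size comes from, here is a rough parameter count for a single adapted weight matrix. The 4096 x 4096 shape is an illustrative assumption, not a figure from any specific model card; rank 16 matches the config later in this post.

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # LoRA learns an update B @ A, with B of shape (d_out, rank)
    # and A of shape (rank, d_in). The original matrix stays frozen.
    return rank * (d_out + d_in)


# Illustrative numbers only: a square 4096 x 4096 projection, rank 16.
d = 4096
full = d * d
adapter = lora_param_count(d, d, rank=16)
fraction = adapter / full

print(f"full: {full:,}  adapter: {adapter:,}  fraction: {fraction:.2%}")
```

This is per adapted matrix. Across a whole model the fraction is usually lower still, because only some projections get adapters.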

QLoRA quantizes the frozen base model to 4-bit and trains the adapters on top of it in higher precision. ~70% memory reduction; enough to train a 70B model on a single H100.
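A back-of-the-envelope check on the memory claim, counting weights only. This sketch deliberately ignores quantization constants, adapter weights, optimizer state, and activations, so treat the result as a floor on real usage rather than a measurement.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the model weights alone, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 70e9  # a 70B-parameter base model

fp16_gb = weight_memory_gb(n_params, 16)  # 16-bit weights
nf4_gb = weight_memory_gb(n_params, 4)    # 4-bit quantized weights
reduction = 1 - nf4_gb / fp16_gb

print(f"16-bit: {fp16_gb:.0f} GB, 4-bit: {nf4_gb:.0f} GB, reduction: {reduction:.0%}")
```

The 4-bit weights leave headroom on an 80 GB card for the parts this estimate skips; the 16-bit weights alone would not fit.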

For 2026, QLoRA on a 7B–14B base is the sweet spot for most teams.

Tools

| Tool | Strengths |
|---|---|
| Axolotl | Popular open-source; YAML config |
| Unsloth | Faster training; good ergonomics |
| HuggingFace TRL | Library; build-your-own-trainer |
| Modal Labs | GPU-as-a-service |
| Together AI / Anyscale | Hosted fine-tuning |

For self-hosted: Axolotl + Unsloth + Modal.

Training data

The rule that matters: data quality > data quantity. 500 hand-curated examples beat 50,000 noisy ones.

Format example for instruction tuning:

{"instruction": "Classify this support ticket into a category.",
 "input": "I was charged twice for May.",
 "output": "billing"}

Or for chat-style:

{"messages": [
  {"role": "system", "content": "You triage support tickets..."},
  {"role": "user", "content": "I was charged twice for May."},
  {"role": "assistant", "content": "billing"}
]}
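If your data is already in the instruction format, converting it to the chat format is mechanical. The sketch below is one way to do it; folding `instruction` and `input` into a single user turn is my assumption for illustration, and you may prefer to move the instruction into the system prompt instead, as the chat example above does.

```python
import json


def instruction_to_chat(record: dict, system_prompt: str) -> dict:
    """Convert one instruction-style record into a chat-style record."""
    # Assumption: instruction and input are joined into one user message.
    user_content = record["instruction"]
    if record.get("input"):
        user_content += "\n\n" + record["input"]
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": record["output"]},
        ]
    }


record = {
    "instruction": "Classify this support ticket into a category.",
    "input": "I was charged twice for May.",
    "output": "billing",
}

chat = instruction_to_chat(record, "You triage support tickets...")
line = json.dumps(chat, ensure_ascii=False)  # one JSONL line per example
print(line)
```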

Match the format the base model expects (Llama / Qwen / Mistral all have specific chat templates).

Axolotl config

base_model: meta-llama/Llama-3.1-8B-Instruct
load_in_4bit: true                   # 4-bit base weights
adapter: qlora                       # Axolotl's QLoRA examples pair qlora with load_in_4bit
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - o_proj

datasets:
  - path: ./data/triage.jsonl
    type: chat_template

micro_batch_size: 4
num_epochs: 3
learning_rate: 2e-4
optimizer: adamw_bnb_8bit
warmup_ratio: 0.1

output_dir: ./ckpt

Then launch training:

accelerate launch -m axolotl.cli.train config.yaml

An 8B model + 1k examples + 3 epochs = ~30 min on a single A100. Cost: ~$5.
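To sanity-check a run like this before you launch it, count the optimizer steps. The sketch assumes a single GPU, no gradient accumulation, and no sample packing, none of which the config above states explicitly; the exact warmup rounding a trainer applies may also differ.

```python
import math


def optimizer_steps(n_examples: int, micro_batch_size: int, epochs: int,
                    grad_accum: int = 1, n_gpus: int = 1) -> int:
    """Total optimizer steps for a run (assumes no sample packing)."""
    effective_batch = micro_batch_size * grad_accum * n_gpus
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return steps_per_epoch * epochs


total_steps = optimizer_steps(n_examples=1000, micro_batch_size=4, epochs=3)
warmup_steps = int(total_steps * 0.1)       # warmup_ratio: 0.1
seconds_per_step = 30 * 60 / total_steps    # if the run takes ~30 min

print(total_steps, warmup_steps, round(seconds_per_step, 2))
```

If your measured seconds per step is far from this, the wall-clock estimate will be off by the same factor.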

Evaluation

Always:

  1. Hold out 20% of training data; never train on it.
  2. Score the held-out set with an eval framework.
  3. Compare to base model (without fine-tune).
  4. Test on real production samples (not just held-out).

If the fine-tuned model doesn’t beat the base model on the held-out set, stop. Fix the data or the training run before you touch deployment.
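The checklist above can be sketched in a few lines. Everything here is a stand-in: the toy tickets, the `shipping` label, and the two `*_predict` functions are hypothetical placeholders for your real data and real model calls.

```python
import random


def split_holdout(examples: list, holdout_frac: float = 0.2, seed: int = 0):
    """Shuffle deterministically, then split into (train, holdout)."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_frac)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def accuracy(predict, examples: list) -> float:
    """Fraction of examples where predict(input) equals the label."""
    correct = sum(predict(x["input"]) == x["output"] for x in examples)
    return correct / len(examples)


# Toy data: even-numbered tickets are "billing", odd ones "shipping".
examples = [
    {"input": f"ticket {i}", "output": "billing" if i % 2 == 0 else "shipping"}
    for i in range(100)
]
train, holdout = split_holdout(examples)


def base_predict(text: str) -> str:
    return "billing"  # stand-in for the base model: always guesses one label


def tuned_predict(text: str) -> str:
    n = int(text.split()[1])  # stand-in for the fine-tuned model
    return "billing" if n % 2 == 0 else "shipping"


base_acc = accuracy(base_predict, holdout)
tuned_acc = accuracy(tuned_predict, holdout)
print(f"base={base_acc:.2f} tuned={tuned_acc:.2f} ship={tuned_acc > base_acc}")
```

Run the same `accuracy` call on a sample of real production inputs too; the held-out split only tells you about the distribution you curated.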

Serving

Two options:

Inference server

Deploy via vLLM with the LoRA adapter:

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules my-adapter=/path/to/adapter

Switch adapters per request. One base; many adapters; cheap multi-task serving.
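On the client side, picking an adapter means putting its name in the `model` field of an OpenAI-compatible request. A minimal sketch using only the standard library; the `localhost:8000` address, `max_tokens`, and `temperature` values are assumptions about your deployment, and the network call is left commented out so the snippet runs without a server.

```python
import json
import urllib.request


def build_chat_request(base_url: str, adapter_name: str, user_text: str):
    """Build a chat completion request that targets one LoRA adapter."""
    payload = {
        "model": adapter_name,  # the name given to --lora-modules
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": 8,        # assumption: short label output
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000", "my-adapter", "I was charged twice for May."
)
print(req.full_url)

# To actually send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```

Passing the base model's name in `model` instead should route the request to the unadapted base, which is handy for side-by-side comparisons.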

Hosted

Together AI / Modal / Anyscale serve fine-tuned models with adapters per-request. Pay per token; no GPU ops.

Cost math

Training cost: typically $5–500 for LoRA on a small base. Cheap.

Inference cost (running): roughly one-half to one-tenth of frontier-model pricing. The savings scale with volume.

A team I worked with replaced Sonnet ($3/MTok input) on a high-volume classifier with a fine-tuned Llama 8B served via vLLM ($0.30/MTok equivalent). Saved ~$8k/month on a single feature. Training cost was $50.
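The numbers in that example imply a token volume you can back out yourself. This sketch assumes all of the savings come from input tokens at the two quoted prices and ignores output tokens, so it is a rough reading of the figures, not the team's actual accounting.

```python
frontier_price = 3.00    # $/MTok input, from the example above
tuned_price = 0.30       # $/MTok equivalent, from the example above
monthly_savings = 8000.0
training_cost = 50.0

# Savings per million tokens moved to the fine-tuned model.
savings_per_mtok = frontier_price - tuned_price

# Volume that would produce the quoted savings (input tokens only).
implied_mtok_per_month = monthly_savings / savings_per_mtok

# How quickly the training run pays for itself, assuming a 30-day month.
payback_days = training_cost / (monthly_savings / 30)

print(f"implied volume: {implied_mtok_per_month:,.0f} MTok/month")
print(f"training cost paid back in {payback_days:.2f} days")
```

Plug in your own volume first. At a few million tokens a month the same price gap saves only a few dollars, and the engineering time dominates.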

Common mistakes

1. Fine-tuning when prompting would do

Always exhaust prompting + RAG + routing first. Fine-tuning is the expensive last resort.

2. Bad eval set

The held-out set includes examples leaked from training. Scores look great; production fails.
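A cheap first check for leakage is exact-match overlap after light normalization. The sketch below uses made-up tickets and labels; it only catches duplicates that differ in case or whitespace, so paraphrased leaks need a fuzzier comparison.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def find_leaks(train: list, holdout: list) -> list:
    """Return held-out examples whose input also appears in train."""
    seen = {normalize(x["input"]) for x in train}
    return [x for x in holdout if normalize(x["input"]) in seen]


# Hypothetical examples for illustration.
train = [
    {"input": "I was charged twice for May.", "output": "billing"},
    {"input": "Where is my package?", "output": "shipping"},
]
holdout = [
    {"input": "  i was charged twice  for may. ", "output": "billing"},
    {"input": "Cancel my subscription.", "output": "account"},
]

leaks = find_leaks(train, holdout)
print(f"{len(leaks)} leaked example(s) out of {len(holdout)}")
```

Drop any leaked examples from the held-out set before scoring, then re-run the comparison against the base model.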

3. No comparison to base

Maybe the base model with better prompting is just as good. Always compare.

4. Over-fitting on edge cases

Training set has 5 examples of an edge case. Model memorizes them. Real production has different edge cases. Distribution mismatch.

5. Forgetting the base model updates

Llama 3.3 ships; your fine-tune is on 3.1. You don’t get the upgrade. Plan re-fine-tune cadence.

What I’d ship today

For a team considering fine-tuning:

  1. Verify prompting + RAG + routing exhausted.
  2. Curate 500–2000 quality examples of the target task.
  3. QLoRA on Llama 3.1 8B or Qwen 2.5 7B via Axolotl.
  4. Eval on held-out + production samples.
  5. Serve with vLLM + adapter.
  6. Compare cost vs Sonnet on the same volume.
  7. If it wins on cost AND quality: ship.

Read this next

If you want my Axolotl config + eval harness, it’s at rajpoot.dev.

