Fine-tuning got cheap enough by 2026 that small teams ship fine-tuned models. The mechanics are well-trodden; the success patterns are clear. This post is the working playbook.
When to fine-tune
Don’t, until you’ve tried:
- Prompting + good examples.
- Structured output via tool calling.
- RAG for fresh facts.
- Routing (send only the hard requests to a bigger model).
After that, fine-tune when:
- A narrow task runs at high volume, and you want a smaller, cheaper model that matches the big one.
- The model resists a specific style / format that prompting can’t fix.
- You have a domain (legal, medical) where vocabulary matters.
See Fine-Tuning vs RAG vs Prompting.
LoRA / QLoRA
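LoRA (Low-Rank Adaptation) trains a small set of adapter weights on top of a frozen base. The adapters are typically ~0.1–1% of the base model’s parameters, so gradient and optimizer memory shrink accordingly; quality is comparable on most narrow tasks.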
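To put a number on how small the adapter is, here is a rough parameter count. It is a sketch under assumptions: the layer shapes are Llama-3.1-8B-style values (hidden size 4096, 32 layers, a 1024-wide `v_proj` output), and the rank and target modules are illustrative picks. Check your model's own config before relying on the figures.

```python
# Rough LoRA adapter size for one model shape.
# ASSUMPTION: Llama-3.1-8B-style shapes; verify against your model's config.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds two matrices per adapted layer: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

RANK = 16          # illustrative rank
NUM_LAYERS = 32    # assumed number of transformer blocks
# (d_in, d_out) for each adapted projection inside one block (assumed shapes)
TARGETS = {
    "q_proj": (4096, 4096),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
BASE_PARAMS = 8_000_000_000  # round figure for an "8B" model

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in TARGETS.values())
total = per_layer * NUM_LAYERS
fraction = total / BASE_PARAMS

print(f"adapter params: {total:,}")
print(f"share of base:  {fraction:.3%}")
```

Under these assumptions the adapter comes to about 11 million trainable parameters, roughly 0.14% of the base model. Adding more target modules or raising the rank moves it toward the upper end of the range.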
QLoRA quantizes the frozen base model to 4-bit and trains the adapters on top of it. ~70% memory reduction; enough to train a 70B model on a single H100.
For 2026, QLoRA on a 7B–14B base is the sweet spot for most teams.
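The memory claim can be sanity-checked with back-of-envelope arithmetic. This sketch counts the base weights only; it deliberately ignores activations, adapter weights, optimizer state, and quantization metadata, all of which add to the real footprint.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

for name, n in [("8B", 8e9), ("70B", 70e9)]:
    fp16 = weight_memory_gb(n, 16)   # 16-bit baseline
    int4 = weight_memory_gb(n, 4)    # 4-bit quantized base
    saving = 1 - int4 / fp16
    print(f"{name}: 16-bit {fp16:.0f} GB -> 4-bit {int4:.0f} GB ({saving:.0%} smaller)")
```

On weights alone, 4-bit is a 75% cut: about 35 GB for a 70B model instead of 140 GB. The overheads left out here are why the end-to-end saving lands nearer 70%, and why a small base leaves far more headroom for batch size and sequence length.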
Tools
| Tool | Strengths |
|---|---|
| Axolotl | Popular open-source; YAML config |
| Unsloth | Faster training; good ergonomics |
| HuggingFace TRL | Library; build-your-own-trainer |
| Modal Labs | GPU-as-a-service |
| Together AI / Anyscale | Hosted fine-tuning |
For self-hosted: Axolotl + Unsloth + Modal.
Training data
The rule that matters: data quality > data quantity. 500 hand-curated examples beat 50,000 noisy ones.
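Two cheap checks catch much of the noise that hurts small datasets: exact duplicates, and identical inputs carrying different labels. The sketch below assumes records with `input` and `output` fields; the `audit` helper and its normalization rule are illustrative, not a standard tool.

```python
from collections import defaultdict

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def audit(examples: list[dict]) -> dict:
    """Report duplicate (input, label) pairs and inputs with more than one label."""
    labels_by_input = defaultdict(set)
    seen = set()
    duplicates = 0
    for ex in examples:
        key = normalize(ex["input"])
        pair = (key, ex["output"])
        if pair in seen:
            duplicates += 1
        seen.add(pair)
        labels_by_input[key].add(ex["output"])
    conflicts = {k: sorted(v) for k, v in labels_by_input.items() if len(v) > 1}
    return {"duplicates": duplicates, "conflicts": conflicts}

sample = [
    {"input": "I was charged twice for May.", "output": "billing"},
    {"input": "i was charged  twice for may.", "output": "billing"},
    {"input": "I was charged twice for May.", "output": "account"},
    {"input": "App crashes on login.", "output": "bug"},
]
report = audit(sample)
print(report)
```

Conflicting labels are usually the more damaging of the two: the model is being taught two answers for the same question, so resolve those by hand before training.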
Format example for instruction tuning:
{"instruction": "Classify this support ticket into a category.",
"input": "I was charged twice for May.",
"output": "billing"}
Or for chat-style:
{"messages": [
{"role": "system", "content": "You triage support tickets..."},
{"role": "user", "content": "I was charged twice for May."},
{"role": "assistant", "content": "billing"}
]}
Match the format the base model expects (Llama / Qwen / Mistral all have specific chat templates).
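If your data starts in the instruction format, converting it to chat-style records is mechanical. The sketch below makes one assumption worth flagging: it maps `instruction` to the system message, which suits a fixed task description but is a choice, not a rule. The helper names are hypothetical. The model-specific chat template is still applied later by the training framework or tokenizer; this only shapes the records.

```python
import json

def to_chat(record: dict) -> dict:
    """Map an instruction/input/output record onto a messages list."""
    return {
        "messages": [
            {"role": "system", "content": record["instruction"]},
            {"role": "user", "content": record["input"]},
            {"role": "assistant", "content": record["output"]},
        ]
    }

def check_chat(example: dict) -> None:
    """Raise ValueError if the example is not a usable training conversation."""
    messages = example.get("messages")
    if not messages:
        raise ValueError("missing or empty 'messages'")
    for m in messages:
        if m.get("role") not in {"system", "user", "assistant"}:
            raise ValueError(f"unexpected role: {m.get('role')!r}")
        if not isinstance(m.get("content"), str) or not m["content"].strip():
            raise ValueError("empty content")
    if messages[-1]["role"] != "assistant":
        raise ValueError("last message must be the assistant target")

record = {
    "instruction": "Classify this support ticket into a category.",
    "input": "I was charged twice for May.",
    "output": "billing",
}
chat = to_chat(record)
check_chat(chat)
line = json.dumps(chat, ensure_ascii=False)  # one JSON object per line in a .jsonl file
print(line)
```

Running `check_chat` over every line before training is cheap insurance: a single record with a missing assistant turn can otherwise surface only as a confusing loader error.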
Axolotl config
base_model: meta-llama/Llama-3.1-8B-Instruct
load_in_4bit: true # QLoRA
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - o_proj
datasets:
  - path: ./data/triage.jsonl
    type: chat_template
micro_batch_size: 4
num_epochs: 3
learning_rate: 2e-4
optimizer: adamw_bnb_8bit
warmup_ratio: 0.1
output_dir: ./ckpt
accelerate launch -m axolotl.cli.train config.yaml
An 8B model + 1k examples + 3 epochs = ~30 min on a single A100. Cost: ~$5.
Evaluation
Always:
- Hold out 20% of training data; never train on it.
- Score the held-out set with an eval framework.
- Compare to base model (without fine-tune).
- Test on real production samples (not just held-out).
If the fine-tuned model doesn’t beat the base model on the held-out set: stop. Fix the training run (data, hyperparameters) before you spend any time on deployment.
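The hold-out and base-vs-fine-tuned comparison can be sketched in a few lines. Everything model-related here is a stand-in: `base_predict` and `tuned_predict` are hypothetical keyword rules in place of real model calls, and exact-match accuracy is an assumed metric that fits classification but not free-form output.

```python
import hashlib

def is_held_out(text: str, fraction: float = 0.2) -> bool:
    """Assign an example to the held-out set by hashing its normalized input.

    The same input always lands on the same side, so a duplicate cannot
    end up in both train and held-out.
    """
    key = " ".join(text.lower().split())
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < fraction

def accuracy(predict, examples) -> float:
    """Exact-match accuracy of predict(input) against the expected output."""
    if not examples:
        return 0.0
    hits = sum(1 for ex in examples if predict(ex["input"]) == ex["output"])
    return hits / len(examples)

# Toy held-out set; in practice: [ex for ex in data if is_held_out(ex["input"])]
held_out = [
    {"input": "I was charged twice for May.", "output": "billing"},
    {"input": "Refund my last invoice.", "output": "billing"},
    {"input": "App crashes on login.", "output": "bug"},
    {"input": "How do I export my data?", "output": "how-to"},
]

# HYPOTHETICAL stand-ins for calls to the base and fine-tuned models.
def base_predict(text: str) -> str:
    return "billing" if "charged" in text.lower() else "bug"

def tuned_predict(text: str) -> str:
    lowered = text.lower()
    if "charged" in lowered or "invoice" in lowered:
        return "billing"
    if "crash" in lowered:
        return "bug"
    return "how-to"

base_acc = accuracy(base_predict, held_out)
tuned_acc = accuracy(tuned_predict, held_out)
print(f"base {base_acc:.2f} vs fine-tuned {tuned_acc:.2f}")
if tuned_acc <= base_acc:
    print("stop: fix the training run before deploying")
```

Hashing the input, rather than shuffling and slicing, keeps the split stable as the dataset grows: an example that was held out last month is still held out after you add more data.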
Serving
Two options:
Inference server
Deploy via vLLM with the LoRA adapter:
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enable-lora \
--lora-modules my-adapter=/path/to/adapter
Switch adapters per request. One base; many adapters; cheap multi-task serving.
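On the client side, picking an adapter is a matter of naming it in the request. This is a minimal sketch that assumes vLLM's OpenAI-compatible server on its default `http://localhost:8000/v1` and the adapter name `my-adapter` from the command above; `build_request` and `classify` are hypothetical helpers, and the `max_tokens` and `temperature` values are illustrative.

```python
import json
import urllib.request

def build_request(adapter: str, ticket: str,
                  base_url: str = "http://localhost:8000/v1") -> urllib.request.Request:
    """Build a chat-completions request that targets one LoRA adapter by name."""
    payload = {
        "model": adapter,  # the name registered via --lora-modules
        "messages": [{"role": "user", "content": ticket}],
        "max_tokens": 8,       # a category label is short
        "temperature": 0,      # deterministic output for classification
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def classify(adapter: str, ticket: str) -> str:
    """Send the request to a running server and return the model's text."""
    with urllib.request.urlopen(build_request(adapter, ticket), timeout=30) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"].strip()

req = build_request("my-adapter", "I was charged twice for May.")
print(req.full_url)
```

Passing the base model's name in `model` instead of the adapter's should route the same request to the un-tuned base, which makes side-by-side comparisons easy on one server.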
Hosted
Together AI / Modal / Anyscale serve fine-tuned models with adapters per-request. Pay per token; no GPU ops.
Cost math
Training cost: typically $5–500 for LoRA on a small base. Cheap.
Inference cost (running): roughly one-half to one-tenth of frontier-model pricing. The savings compound with volume.
A team I worked with replaced Sonnet ($3/MTok input) on a high-volume classifier with a fine-tuned Llama 8B served via vLLM ($0.30/MTok equivalent). Saved ~$8k/month on a single feature. Training cost was $50.
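Those numbers imply a traffic level, and it is worth working out before assuming the same saving applies to your feature. The sketch below uses only the figures quoted above and assumes input-token pricing dominates and a 30-day month; output tokens and idle GPU time are ignored.

```python
def implied_volume_mtok(monthly_saving: float, old_price: float, new_price: float) -> float:
    """Monthly volume (millions of tokens) needed to produce a given saving."""
    return monthly_saving / (old_price - new_price)

SONNET_PER_MTOK = 3.00   # $/MTok input, from the example above
TUNED_PER_MTOK = 0.30    # $/MTok equivalent, from the example above
MONTHLY_SAVING = 8_000   # $, from the example above
TRAINING_COST = 50       # $, from the example above

volume = implied_volume_mtok(MONTHLY_SAVING, SONNET_PER_MTOK, TUNED_PER_MTOK)
payback_days = TRAINING_COST / (MONTHLY_SAVING / 30)  # assumes a 30-day month

print(f"implied volume: {volume:,.0f} MTok/month")
print(f"training cost recovered in {payback_days:.2f} days")
```

That works out to about 3 billion input tokens a month, with the training run paid back in under a day. At a hundredth of that volume the saving is about $80 a month, and the engineering time spent on data and evals matters far more than the per-token price.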
Common mistakes
1. Fine-tuning when prompting would do
Always exhaust prompting + RAG + routing first. Fine-tuning is the expensive last resort.
2. Bad eval set
The held-out set includes examples leaked from training. Scores look great; production fails.
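A basic leak check is easy to automate. This sketch flags held-out examples whose normalized input also appears in the training set; it assumes records with an `input` field, and it only catches exact matches after normalization, so paraphrased near-duplicates would need a fuzzier comparison.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def find_leaks(train: list[dict], held_out: list[dict]) -> list[dict]:
    """Return held-out examples whose input also occurs in the training set."""
    train_keys = {normalize(ex["input"]) for ex in train}
    return [ex for ex in held_out if normalize(ex["input"]) in train_keys]

train = [
    {"input": "I was charged twice for May.", "output": "billing"},
    {"input": "App crashes on login.", "output": "bug"},
]
held_out = [
    {"input": "i was charged twice  for may.", "output": "billing"},  # leaked
    {"input": "How do I export my data?", "output": "how-to"},
]

leaks = find_leaks(train, held_out)
print(f"{len(leaks)} of {len(held_out)} held-out examples also appear in training")
```

Run it every time the dataset changes, not just once: leaks tend to creep in when new examples are appended after the split was made.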
3. No comparison to base
Maybe the base model with better prompting is just as good. Always compare.
4. Over-fitting on edge cases
Training set has 5 examples of an edge case. Model memorizes them. Real production has different edge cases. Distribution mismatch.
5. Forgetting the base model updates
Llama 3.3 ships; your fine-tune is on 3.1. You don’t get the upgrade. Plan a re-fine-tuning cadence.
What I’d ship today
For a team considering fine-tuning:
- Verify prompting + RAG + routing exhausted.
- Curate 500–2000 quality examples of the target task.
- QLoRA on Llama 3.1 8B or Qwen 2.5 7B via Axolotl.
- Eval on held-out + production samples.
- Serve with vLLM + adapter.
- Compare cost vs Sonnet on the same volume.
- If it wins on cost AND quality: ship.
Read this next
- Fine-Tuning vs RAG vs Prompting in 2026
- Self-Hosted LLMs in 2026
- LLM Evaluation Frameworks
- LLM Routing in 2026
If you want my Axolotl config + eval harness, it’s at rajpoot.dev.
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.