Three knobs for an LLM application: prompting, retrieval, and fine-tuning. Most teams pick wrong on day one — fine-tune when prompting would do, RAG when fine-tuning would dominate, prompt when the data they need is in a database three feet away. This post is the working decision guide.
The three approaches
| What it does | Changes weights | Deploy speed | |
|---|---|---|---|
| Prompting | Give the model instructions/examples | No | Instant |
| RAG | Inject relevant data into the prompt | No | Fast |
| Fine-tuning | Adjust model weights to bias behavior | Yes | Slow (hours/days) |
They’re not exclusive. Most production AI systems use a combination.
Default: prompting
Start with the system prompt. Always. Most people skip past this and don’t realize how much you can do with a well-structured prompt:
- Role-led instructions (“You are a triage assistant…”)
- Examples (few-shot)
- Output format constraints (JSON via tool calling, see Prompt Engineering Patterns )
- Chain-of-thought or thinking parameter
- Tagged inputs to prevent injection
Modern models — Claude 4.7 Opus, GPT-5, Gemini 2.5 — are remarkably steerable. Many tasks that previously required fine-tuning are now solved with a 200-line system prompt and 5 examples.
Use prompting when:
- The task is well-described and bounded.
- Examples can carry the behavior.
- Volume is moderate.
- The data the model needs already fits in context.
Add RAG when facts change
Prompting tells the model how to behave. RAG tells it what to know.
Use RAG when:
- The data changes faster than fine-tuning cycles. Yesterday’s docs vs today’s docs.
- The data is too large to fit in context. Even a 1M-context model can’t carry your 50k-document corpus on every request.
- You need citations (RAG gives you per-chunk provenance).
- The data must be updatable in real time (a new product launches → ask about it the same minute).
The canonical pattern: chunk → embed → store in pgvector → retrieve → prompt. See Build a Production RAG App with pgvector and FastAPI for a complete, working example.
RAG covers ~80% of “knowledge-grounded LLM” use cases. Fine-tuning is rarely a substitute for RAG; usually a complement.
Add fine-tuning when
Fine-tuning earns its cost when:
1. Style / format the model resists
The model “gets it right” 80% of the time but you need 99%. Examples:
- Strict JSON-only output (no preambles, no apologies, no markdown).
- Domain-specific terminology you want consistent.
- A specific writing voice (legal contracts, medical reports, brand voice).
Five-shot prompting may get you 95%. Fine-tuning gets you 99%+.
2. High-volume narrow tasks
You have one task that runs 10M times/day. Fine-tuning a smaller model:
- 7B fine-tune of Llama 3 vs 70B Llama prompting → similar quality on the narrow task.
- 10× cheaper inference, 5× lower latency.
This is the pattern at companies serving classification, extraction, or routing at scale.
3. Domain knowledge that’s too pervasive for RAG
Some knowledge is every-token knowledge — like medical terminology in a clinical assistant, or legal phrasing in a contract reviewer. RAG can supply concepts, but the model needs to think in the domain. Fine-tuning embeds the domain into the model’s vocabulary.
4. Function-calling / structured output reliability
For very tight schemas, fine-tuning can dramatically improve adherence beyond what prompt engineering achieves. Especially with smaller models.
When NOT to fine-tune
Skip fine-tuning if:
- Data changes weekly. You’ll be retraining constantly.
- You don’t have the data. Fine-tuning needs hundreds to thousands of high-quality examples. Bad data → bad fine-tune.
- You need broad reasoning. Fine-tuning narrows. A Llama fine-tuned for legal contracts is worse at writing code.
- Cost of error is high and you have no eval set. A fine-tuned model that’s subtly wrong is harder to fix than a prompt change.
The hybrid pattern (most production systems)
In 2026, the production shape that ships looks like this:
┌────────────────────┐
│ Fine-tuned model │ ← style, format, domain voice
│ (small, fast) │
└─────────┬──────────┘
│
┌───────▼────────┐
│ RAG context │ ← facts, fresh data
└───────┬────────┘
│
┌───────▼────────┐
│ System prompt │ ← rules, guardrails
└────────────────┘
The fine-tune handles “how”; RAG handles “what”; the prompt handles “the line you don’t cross.”
Cost math
Rough numbers for 2026 (will move):
| Setup cost | Per-request cost | Time-to-deploy | |
|---|---|---|---|
| Prompting (frontier) | ~$0 | High ($0.001–0.05) | Minutes |
| RAG | $1k–$10k (embed corpus) | Mid ($0.001–0.01) | Days |
| Fine-tuning (LoRA, 7B) | $50–$500 | Low ($0.0001–0.001) | Day or two |
| Fine-tuning (full, 70B) | $5k–$50k | Mid ($0.0005–0.005) | Week |
If you’re running 10M req/month at $0.005 each = $50k/month. A $1k fine-tune that drops cost to $0.0005 saves $45k/month. The math closes the case fast.
If you’re running 10k req/month, the same fine-tune savings is $45/month. Not worth the effort.
Decide by volume. Below ~1M req/month on a narrow task, prompting + RAG beats fine-tuning on pure economics.
LoRA — the practical fine-tune in 2026
You’re rarely doing full fine-tuning. LoRA (Low-Rank Adaptation) trains a small set of adapter weights on top of a frozen base model:
- 0.1–1% of full-tune compute.
- Adapter weights are tiny (~MB), shipped separately from the base model.
- One base model, many LoRAs — you can hot-swap adapters per request for multi-task setups.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
config = LoraConfig(
r=16, # rank
lora_alpha=32,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(base, config)
# train as usual, save adapter only
For most production fine-tunes in 2026, start with LoRA on a 7–14B base. Full fine-tunes are reserved for cases where you need every bit of quality.
Self-hosted vs API fine-tuning
| Self-hosted | API | |
|---|---|---|
| Provider | DIY on GPUs / Together / Anyscale | OpenAI, Anthropic (limited), Google, Cohere |
| Control | Full | Limited |
| Cost (per training) | $50–$500 | $50–$5000 |
| Cost (per inference) | Low | Higher |
| Privacy | In your VPC | Provider’s |
| Time to first inference | Hours | Hours |
Self-hosted via vLLM once you have volume. API fine-tunes (notably OpenAI’s) are right for low-volume, no-ops cases.
Data quality — the invisible variable
A fine-tune is only as good as its data. Three rules:
- Quality over quantity. 500 hand-curated examples beat 50,000 noisy ones. Always.
- Match the production distribution. If your fine-tune set is “easy” examples but production has hard ones, the fine-tune will fail in production.
- Hold out an evaluation set. Same eval set across base, prompted, RAG’d, and fine-tuned models. The eval set is the source of truth. See LLM Evaluations .
If you don’t have eval data, build it before you build a fine-tune. Otherwise you’re flying blind.
A real decision tree
For each LLM use case in your app:
- Try the frontier model with a clean prompt. Score on your eval set.
- If quality < target: add few-shot examples, output anchors, role separation. Re-score.
- If you need fresh facts: add RAG. Re-score.
- If quality still < target on >5% of cases: consider fine-tuning. But also consider whether your eval is calibrated correctly.
- If cost > budget at acceptable quality: consider fine-tuning a smaller model. Compare end-to-end cost (training + inference) on a 90-day horizon.
- If latency > budget: consider fine-tuning a smaller model AND prompt-caching the frontier baseline.
- If everything’s fine but you’re proud of fine-tuning: don’t. The boring stack is the right stack.
Common mistakes
1. Fine-tuning to fix RAG bugs
Your RAG retrieves the wrong chunks. The model answers from the wrong chunks. You fine-tune. Now the model answers from the wrong chunks with conviction. Fix the retrieval, not the model.
2. No eval before fine-tune
You fine-tune. You feel like quality went up. You ship. Two months later you discover quality went down on a category you didn’t think to test. Always eval, always.
3. Overfitting to training data
Models will happily memorize your 500 training examples and parrot them. Hold out 20% as eval. Make sure performance on the held-out set matches the trained set.
4. Underestimating maintenance
A fine-tune isn’t free forever. The base model updates. Your domain shifts. Plan to re-fine-tune every 6–12 months.
5. Conflating “the model is bad” with “my prompt is bad”
I’ve seen teams fine-tune to fix what was a 5-character prompt change. Always sanity-check with a strong prompt before you reach for fine-tuning.
A grounding example
A team I worked with had a triage classifier:
- v0 (prompt): Frontier model + 3-shot examples. 88% accuracy. $300/day. p95 latency 800ms.
- v1 (better prompt): Tagged inputs, structured tool output, 5-shot. 93% accuracy. Same cost/latency.
- v2 (RAG): Add similar-past-tickets via RAG. 95% accuracy. Cost +20%. Latency +200ms.
- v3 (fine-tune): LoRA on Llama 3.1 8B with 2k labeled examples. 96% accuracy. $30/day. p95 latency 200ms.
v3 ships. v0 was the right starting point. The journey was correct because they measured at every step.
Read this next
- Build a Production RAG App with pgvector and FastAPI
- LLM Evaluations — Test Prompts and Agents
- Prompt Engineering Patterns That Survive Production
- Self-Hosted LLMs in 2026 — Ollama, vLLM
If you want a worked-out LoRA fine-tune training pipeline for Llama 3 with eval harness, it’s at rajpoot.dev .
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .