Should I fine-tune or use RAG for my use case?

Default to RAG. Fine-tuning earns its cost when you need a specific style/format the model can't follow, when you have many similar tasks at high volume, or when you need a smaller model to match a larger one's quality on a narrow domain. Otherwise RAG is faster, cheaper, and more updatable.

Can I combine fine-tuning and RAG?

Yes — and this is increasingly the production pattern. Fine-tune for style, format, and domain vocabulary; RAG for facts that change. The fine-tune teaches the model how to behave; RAG provides what it should know.

Is fine-tuning worth it for cost reduction?

Often yes for high-volume, narrow tasks — fine-tuning a 7–14B model can match a 70B+ model's quality on your specific task at a fraction of the inference cost. Run the math on your token volumes before committing.

Fine-Tuning vs RAG vs Prompting in 2026 — How to Pick the Right Approach

Three knobs for an LLM application: prompting, retrieval, and fine-tuning. Most teams pick wrong on day one — fine-tune when prompting would do, RAG when fine-tuning would dominate, prompt when the data they need is in a database three feet away. This post is the working decision guide.

The three approaches

	What it does	Changes weights	Deploy speed
Prompting	Give the model instructions/examples	No	Instant
RAG	Inject relevant data into the prompt	No	Fast
Fine-tuning	Adjust model weights to bias behavior	Yes	Slow (hours/days)

They’re not exclusive. Most production AI systems use a combination.

Default: prompting

Start with the system prompt. Always. Most people skip past this and don’t realize how much you can do with a well-structured prompt:

Role-led instructions (“You are a triage assistant…”)
Examples (few-shot)
Output format constraints (JSON via tool calling, see Prompt Engineering Patterns )
Chain-of-thought or thinking parameter
Tagged inputs to prevent injection

Modern models — Claude 4.7 Opus, GPT-5, Gemini 2.5 — are remarkably steerable. Many tasks that previously required fine-tuning are now solved with a 200-line system prompt and 5 examples.

Use prompting when:

The task is well-described and bounded.
Examples can carry the behavior.
Volume is moderate.
The data the model needs already fits in context.

Add RAG when facts change

Prompting tells the model how to behave. RAG tells it what to know.

Use RAG when:

The data changes faster than fine-tuning cycles. Yesterday’s docs vs today’s docs.
The data is too large to fit in context. Even a 1M-context model can’t carry your 50k-document corpus on every request.
You need citations (RAG gives you per-chunk provenance).
The data must be updatable in real time (a new product launches → ask about it the same minute).

The canonical pattern: chunk → embed → store in pgvector → retrieve → prompt. See Build a Production RAG App with pgvector and FastAPI for a complete, working example.

RAG covers ~80% of “knowledge-grounded LLM” use cases. Fine-tuning is rarely a substitute for RAG; usually a complement.

Add fine-tuning when

Fine-tuning earns its cost when:

1. Style / format the model resists

The model “gets it right” 80% of the time but you need 99%. Examples:

Strict JSON-only output (no preambles, no apologies, no markdown).
Domain-specific terminology you want consistent.
A specific writing voice (legal contracts, medical reports, brand voice).

Five-shot prompting may get you 95%. Fine-tuning gets you 99%+.

2. High-volume narrow tasks

You have one task that runs 10M times/day. Fine-tuning a smaller model:

7B fine-tune of Llama 3 vs 70B Llama prompting → similar quality on the narrow task.
10× cheaper inference, 5× lower latency.

This is the pattern at companies serving classification, extraction, or routing at scale.

3. Domain knowledge that’s too pervasive for RAG

Some knowledge is every-token knowledge — like medical terminology in a clinical assistant, or legal phrasing in a contract reviewer. RAG can supply concepts, but the model needs to think in the domain. Fine-tuning embeds the domain into the model’s vocabulary.

4. Function-calling / structured output reliability

For very tight schemas, fine-tuning can dramatically improve adherence beyond what prompt engineering achieves. Especially with smaller models.

When NOT to fine-tune

Skip fine-tuning if:

Data changes weekly. You’ll be retraining constantly.
You don’t have the data. Fine-tuning needs hundreds to thousands of high-quality examples. Bad data → bad fine-tune.
You need broad reasoning. Fine-tuning narrows. A Llama fine-tuned for legal contracts is worse at writing code.
Cost of error is high and you have no eval set. A fine-tuned model that’s subtly wrong is harder to fix than a prompt change.

The hybrid pattern (most production systems)

In 2026, the production shape that ships looks like this:

                     ┌────────────────────┐
                     │  Fine-tuned model  │  ← style, format, domain voice
                     │  (small, fast)     │
                     └─────────┬──────────┘
                               │
                       ┌───────▼────────┐
                       │   RAG context  │  ← facts, fresh data
                       └───────┬────────┘
                               │
                       ┌───────▼────────┐
                       │  System prompt │  ← rules, guardrails
                       └────────────────┘

The fine-tune handles “how”; RAG handles “what”; the prompt handles “the line you don’t cross.”

Cost math

Rough numbers for 2026 (will move):

	Setup cost	Per-request cost	Time-to-deploy
Prompting (frontier)	~$0	High ($0.001–0.05)	Minutes
RAG	$1k–$10k (embed corpus)	Mid ($0.001–0.01)	Days
Fine-tuning (LoRA, 7B)	$50–$500	Low ($0.0001–0.001)	Day or two
Fine-tuning (full, 70B)	$5k–$50k	Mid ($0.0005–0.005)	Week

If you’re running 10M req/month at $0.005 each = $50k/month. A $1k fine-tune that drops cost to $0.0005 saves $45k/month. The math closes the case fast.

If you’re running 10k req/month, the same fine-tune savings is $45/month. Not worth the effort.

Decide by volume. Below ~1M req/month on a narrow task, prompting + RAG beats fine-tuning on pure economics.

LoRA — the practical fine-tune in 2026

You’re rarely doing full fine-tuning. LoRA (Low-Rank Adaptation) trains a small set of adapter weights on top of a frozen base model:

0.1–1% of full-tune compute.
Adapter weights are tiny (~MB), shipped separately from the base model.
One base model, many LoRAs — you can hot-swap adapters per request for multi-task setups.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
config = LoraConfig(
    r=16,                            # rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(base, config)
# train as usual, save adapter only

For most production fine-tunes in 2026, start with LoRA on a 7–14B base. Full fine-tunes are reserved for cases where you need every bit of quality.

Self-hosted vs API fine-tuning

	Self-hosted	API
Provider	DIY on GPUs / Together / Anyscale	OpenAI, Anthropic (limited), Google, Cohere
Control	Full	Limited
Cost (per training)	$50–$500	$50–$5000
Cost (per inference)	Low	Higher
Privacy	In your VPC	Provider’s
Time to first inference	Hours	Hours

Self-hosted via vLLM once you have volume. API fine-tunes (notably OpenAI’s) are right for low-volume, no-ops cases.

Data quality — the invisible variable

A fine-tune is only as good as its data. Three rules:

Quality over quantity. 500 hand-curated examples beat 50,000 noisy ones. Always.
Match the production distribution. If your fine-tune set is “easy” examples but production has hard ones, the fine-tune will fail in production.
Hold out an evaluation set. Same eval set across base, prompted, RAG’d, and fine-tuned models. The eval set is the source of truth. See LLM Evaluations .

If you don’t have eval data, build it before you build a fine-tune. Otherwise you’re flying blind.

A real decision tree

For each LLM use case in your app:

Try the frontier model with a clean prompt. Score on your eval set.
If quality < target: add few-shot examples, output anchors, role separation. Re-score.
If you need fresh facts: add RAG. Re-score.
If quality still < target on >5% of cases: consider fine-tuning. But also consider whether your eval is calibrated correctly.
If cost > budget at acceptable quality: consider fine-tuning a smaller model. Compare end-to-end cost (training + inference) on a 90-day horizon.
If latency > budget: consider fine-tuning a smaller model AND prompt-caching the frontier baseline.
If everything’s fine but you’re proud of fine-tuning: don’t. The boring stack is the right stack.

Common mistakes

1. Fine-tuning to fix RAG bugs

Your RAG retrieves the wrong chunks. The model answers from the wrong chunks. You fine-tune. Now the model answers from the wrong chunks with conviction. Fix the retrieval, not the model.

2. No eval before fine-tune

You fine-tune. You feel like quality went up. You ship. Two months later you discover quality went down on a category you didn’t think to test. Always eval, always.

3. Overfitting to training data

Models will happily memorize your 500 training examples and parrot them. Hold out 20% as eval. Make sure performance on the held-out set matches the trained set.

4. Underestimating maintenance

A fine-tune isn’t free forever. The base model updates. Your domain shifts. Plan to re-fine-tune every 6–12 months.

5. Conflating “the model is bad” with “my prompt is bad”

I’ve seen teams fine-tune to fix what was a 5-character prompt change. Always sanity-check with a strong prompt before you reach for fine-tuning.

A grounding example

A team I worked with had a triage classifier:

v0 (prompt): Frontier model + 3-shot examples. 88% accuracy. $300/day. p95 latency 800ms.
v1 (better prompt): Tagged inputs, structured tool output, 5-shot. 93% accuracy. Same cost/latency.
v2 (RAG): Add similar-past-tickets via RAG. 95% accuracy. Cost +20%. Latency +200ms.
v3 (fine-tune): LoRA on Llama 3.1 8B with 2k labeled examples. 96% accuracy. $30/day. p95 latency 200ms.

v3 ships. v0 was the right starting point. The journey was correct because they measured at every step.

Read this next

If you want a worked-out LoRA fine-tune training pipeline for Llama 3 with eval harness, it’s at rajpoot.dev .

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .

The three approaches#

Default: prompting#

Add RAG when facts change#

Add fine-tuning when#

1. Style / format the model resists#

2. High-volume narrow tasks#

3. Domain knowledge that’s too pervasive for RAG#

4. Function-calling / structured output reliability#

When NOT to fine-tune#

The hybrid pattern (most production systems)#

Cost math#

LoRA — the practical fine-tune in 2026#

Self-hosted vs API fine-tuning#

Data quality — the invisible variable#

A real decision tree#

Common mistakes#

1. Fine-tuning to fix RAG bugs#

2. No eval before fine-tune#

3. Overfitting to training data#

4. Underestimating maintenance#

5. Conflating “the model is bad” with “my prompt is bad”#

A grounding example#

Read this next#