Three knobs for an LLM application: prompting, retrieval, and fine-tuning. Most teams pick wrong on day one — fine-tune when prompting would do, RAG when fine-tuning would dominate, prompt when the data they need is in a database three feet away. This post is the working decision guide.

The three approaches

What it doesChanges weightsDeploy speed
PromptingGive the model instructions/examplesNoInstant
RAGInject relevant data into the promptNoFast
Fine-tuningAdjust model weights to bias behaviorYesSlow (hours/days)

They’re not exclusive. Most production AI systems use a combination.

Default: prompting

Start with the system prompt. Always. Most people skip past this and don’t realize how much you can do with a well-structured prompt:

  • Role-led instructions (“You are a triage assistant…”)
  • Examples (few-shot)
  • Output format constraints (JSON via tool calling, see Prompt Engineering Patterns )
  • Chain-of-thought or thinking parameter
  • Tagged inputs to prevent injection

Modern models — Claude 4.7 Opus, GPT-5, Gemini 2.5 — are remarkably steerable. Many tasks that previously required fine-tuning are now solved with a 200-line system prompt and 5 examples.

Use prompting when:

  • The task is well-described and bounded.
  • Examples can carry the behavior.
  • Volume is moderate.
  • The data the model needs already fits in context.

Add RAG when facts change

Prompting tells the model how to behave. RAG tells it what to know.

Use RAG when:

  • The data changes faster than fine-tuning cycles. Yesterday’s docs vs today’s docs.
  • The data is too large to fit in context. Even a 1M-context model can’t carry your 50k-document corpus on every request.
  • You need citations (RAG gives you per-chunk provenance).
  • The data must be updatable in real time (a new product launches → ask about it the same minute).

The canonical pattern: chunk → embed → store in pgvector → retrieve → prompt. See Build a Production RAG App with pgvector and FastAPI for a complete, working example.

RAG covers ~80% of “knowledge-grounded LLM” use cases. Fine-tuning is rarely a substitute for RAG; usually a complement.

Add fine-tuning when

Fine-tuning earns its cost when:

1. Style / format the model resists

The model “gets it right” 80% of the time but you need 99%. Examples:

  • Strict JSON-only output (no preambles, no apologies, no markdown).
  • Domain-specific terminology you want consistent.
  • A specific writing voice (legal contracts, medical reports, brand voice).

Five-shot prompting may get you 95%. Fine-tuning gets you 99%+.

2. High-volume narrow tasks

You have one task that runs 10M times/day. Fine-tuning a smaller model:

  • 7B fine-tune of Llama 3 vs 70B Llama prompting → similar quality on the narrow task.
  • 10× cheaper inference, 5× lower latency.

This is the pattern at companies serving classification, extraction, or routing at scale.

3. Domain knowledge that’s too pervasive for RAG

Some knowledge is every-token knowledge — like medical terminology in a clinical assistant, or legal phrasing in a contract reviewer. RAG can supply concepts, but the model needs to think in the domain. Fine-tuning embeds the domain into the model’s vocabulary.

4. Function-calling / structured output reliability

For very tight schemas, fine-tuning can dramatically improve adherence beyond what prompt engineering achieves. Especially with smaller models.

When NOT to fine-tune

Skip fine-tuning if:

  • Data changes weekly. You’ll be retraining constantly.
  • You don’t have the data. Fine-tuning needs hundreds to thousands of high-quality examples. Bad data → bad fine-tune.
  • You need broad reasoning. Fine-tuning narrows. A Llama fine-tuned for legal contracts is worse at writing code.
  • Cost of error is high and you have no eval set. A fine-tuned model that’s subtly wrong is harder to fix than a prompt change.

The hybrid pattern (most production systems)

In 2026, the production shape that ships looks like this:

                     ┌────────────────────┐
                     │  Fine-tuned model  │  ← style, format, domain voice
                     │  (small, fast)     │
                     └─────────┬──────────┘
                       ┌───────▼────────┐
                       │   RAG context  │  ← facts, fresh data
                       └───────┬────────┘
                       ┌───────▼────────┐
                       │  System prompt │  ← rules, guardrails
                       └────────────────┘

The fine-tune handles “how”; RAG handles “what”; the prompt handles “the line you don’t cross.”

Cost math

Rough numbers for 2026 (will move):

Setup costPer-request costTime-to-deploy
Prompting (frontier)~$0High ($0.001–0.05)Minutes
RAG$1k–$10k (embed corpus)Mid ($0.001–0.01)Days
Fine-tuning (LoRA, 7B)$50–$500Low ($0.0001–0.001)Day or two
Fine-tuning (full, 70B)$5k–$50kMid ($0.0005–0.005)Week

If you’re running 10M req/month at $0.005 each = $50k/month. A $1k fine-tune that drops cost to $0.0005 saves $45k/month. The math closes the case fast.

If you’re running 10k req/month, the same fine-tune savings is $45/month. Not worth the effort.

Decide by volume. Below ~1M req/month on a narrow task, prompting + RAG beats fine-tuning on pure economics.

LoRA — the practical fine-tune in 2026

You’re rarely doing full fine-tuning. LoRA (Low-Rank Adaptation) trains a small set of adapter weights on top of a frozen base model:

  • 0.1–1% of full-tune compute.
  • Adapter weights are tiny (~MB), shipped separately from the base model.
  • One base model, many LoRAs — you can hot-swap adapters per request for multi-task setups.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
config = LoraConfig(
    r=16,                            # rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(base, config)
# train as usual, save adapter only

For most production fine-tunes in 2026, start with LoRA on a 7–14B base. Full fine-tunes are reserved for cases where you need every bit of quality.

Self-hosted vs API fine-tuning

Self-hostedAPI
ProviderDIY on GPUs / Together / AnyscaleOpenAI, Anthropic (limited), Google, Cohere
ControlFullLimited
Cost (per training)$50–$500$50–$5000
Cost (per inference)LowHigher
PrivacyIn your VPCProvider’s
Time to first inferenceHoursHours

Self-hosted via vLLM once you have volume. API fine-tunes (notably OpenAI’s) are right for low-volume, no-ops cases.

Data quality — the invisible variable

A fine-tune is only as good as its data. Three rules:

  1. Quality over quantity. 500 hand-curated examples beat 50,000 noisy ones. Always.
  2. Match the production distribution. If your fine-tune set is “easy” examples but production has hard ones, the fine-tune will fail in production.
  3. Hold out an evaluation set. Same eval set across base, prompted, RAG’d, and fine-tuned models. The eval set is the source of truth. See LLM Evaluations .

If you don’t have eval data, build it before you build a fine-tune. Otherwise you’re flying blind.

A real decision tree

For each LLM use case in your app:

  1. Try the frontier model with a clean prompt. Score on your eval set.
  2. If quality < target: add few-shot examples, output anchors, role separation. Re-score.
  3. If you need fresh facts: add RAG. Re-score.
  4. If quality still < target on >5% of cases: consider fine-tuning. But also consider whether your eval is calibrated correctly.
  5. If cost > budget at acceptable quality: consider fine-tuning a smaller model. Compare end-to-end cost (training + inference) on a 90-day horizon.
  6. If latency > budget: consider fine-tuning a smaller model AND prompt-caching the frontier baseline.
  7. If everything’s fine but you’re proud of fine-tuning: don’t. The boring stack is the right stack.

Common mistakes

1. Fine-tuning to fix RAG bugs

Your RAG retrieves the wrong chunks. The model answers from the wrong chunks. You fine-tune. Now the model answers from the wrong chunks with conviction. Fix the retrieval, not the model.

2. No eval before fine-tune

You fine-tune. You feel like quality went up. You ship. Two months later you discover quality went down on a category you didn’t think to test. Always eval, always.

3. Overfitting to training data

Models will happily memorize your 500 training examples and parrot them. Hold out 20% as eval. Make sure performance on the held-out set matches the trained set.

4. Underestimating maintenance

A fine-tune isn’t free forever. The base model updates. Your domain shifts. Plan to re-fine-tune every 6–12 months.

5. Conflating “the model is bad” with “my prompt is bad”

I’ve seen teams fine-tune to fix what was a 5-character prompt change. Always sanity-check with a strong prompt before you reach for fine-tuning.

A grounding example

A team I worked with had a triage classifier:

  • v0 (prompt): Frontier model + 3-shot examples. 88% accuracy. $300/day. p95 latency 800ms.
  • v1 (better prompt): Tagged inputs, structured tool output, 5-shot. 93% accuracy. Same cost/latency.
  • v2 (RAG): Add similar-past-tickets via RAG. 95% accuracy. Cost +20%. Latency +200ms.
  • v3 (fine-tune): LoRA on Llama 3.1 8B with 2k labeled examples. 96% accuracy. $30/day. p95 latency 200ms.

v3 ships. v0 was the right starting point. The journey was correct because they measured at every step.

Read this next

If you want a worked-out LoRA fine-tune training pipeline for Llama 3 with eval harness, it’s at rajpoot.dev .


Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .