What's the most important RAG metric?

Faithfulness — does the answer match the retrieved context, without hallucinating? An accurate answer based on stale/irrelevant context is still a problem; a hallucinated answer is the bigger one.

Manual eval or automated?

Both. Use manual review on a small golden set (50–200 cases) for ground truth, and automated checks (LLM-as-judge or rule-based) on production traces for scale. Re-run the manual pass whenever the model changes.

Evaluating RAG Systems in 2026 — Retrieval Quality, Faithfulness, and the Metrics That Matter

RAG evaluation is where most teams skip steps and pay for it later. Without evals, you can’t tell if a “fix” actually helps. This post is the working playbook.

What to measure

RAG has two halves; eval them separately.

Retrieval:

Recall@k: of the documents that should be retrieved, what fraction were?
Precision@k: of the documents retrieved, what fraction were relevant?
MRR (Mean Reciprocal Rank): how high did the first right doc rank? Average 1/rank of the first relevant doc across queries (0 if it is missing from the top-k).
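Written out, one common formulation of these three metrics looks like the following. The symbols are introduced here for illustration: $R_q$ is the set of documents that should be retrieved for query $q$, $T_q^k$ is the set of top-$k$ documents actually retrieved, $\mathrm{rank}_q$ is the position of the first relevant document, and $1/\mathrm{rank}_q$ is taken as 0 when no relevant document appears in the top-$k$.

```latex
\mathrm{Recall@}k = \frac{|R_q \cap T_q^k|}{|R_q|}, \qquad
\mathrm{Precision@}k = \frac{|R_q \cap T_q^k|}{|T_q^k|}, \qquad
\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}
```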

Generation:

Faithfulness: does the answer contain only facts from the context?
Answer relevance: does the answer address the question?
Context precision: was the retrieved context useful?
Helpfulness (subjective): would a user be satisfied?

Golden dataset

Hand-curate 100–500 queries with:

Question.
Expected source documents (retrieval ground truth).
Expected answer (or expected substring / facts).

golden = [
    {
        "question": "What's our refund policy?",
        "must_retrieve": ["refund-policy.md"],
        "must_contain": ["30 days", "no questions asked"],
        "must_not_contain": ["20 days", "case by case"],
    },
    # ...
]

Real production queries beat synthetic. Use last week’s user queries as a starting point.

Retrieval eval

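async def eval_retrieval(case):
    retrieved = await rag.retrieve(case["question"], k=5)
    ranked_ids = [d.id for d in retrieved]  # keep rank order; a set would lose it
    retrieved_ids = set(ranked_ids)
    expected = set(case["must_retrieve"])
    hits = expected & retrieved_ids

    recall = len(hits) / len(expected) if expected else 1.0
    precision = len(hits) / len(retrieved_ids) if retrieved_ids else 0.0
    # Reciprocal rank of the first relevant doc (0.0 if none was retrieved).
    # Averaging this value across the golden set gives MRR.
    mrr = next(
        (1.0 / rank for rank, doc_id in enumerate(ranked_ids, start=1) if doc_id in expected),
        0.0,
    )

    return {"recall": recall, "precision": precision, "mrr": mrr}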

Aggregate across the golden set and track the averages over time. See Embeddings & Semantic Search.
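A minimal sketch of that aggregation step, assuming each case produces a dict shaped like the one eval_retrieval returns. The `aggregate` helper and the sample values are hypothetical, for illustration only.

```python
from statistics import mean


def aggregate(results):
    """Average each metric across a list of per-case result dicts."""
    if not results:
        return {}
    keys = results[0].keys()
    return {key: mean(r[key] for r in results) for key in keys}


# Made-up per-case results, shaped like eval_retrieval's return value.
per_case = [
    {"recall": 1.0, "precision": 0.2, "mrr": 1.0},
    {"recall": 0.5, "precision": 0.2, "mrr": 0.5},
]
summary = aggregate(per_case)
print(summary)
```

Storing one such summary per run (with a commit hash or date) is enough to plot the trend later.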

Generation eval

Substring check (cheap, fast)

def eval_substring(case, answer):
    contains = all(s in answer for s in case["must_contain"])
    forbidden = any(s in answer for s in case.get("must_not_contain", []))
    return contains and not forbidden

Good for factual answers (“the policy is 30 days”).
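Exact substring matching is brittle against casing and spacing differences. One possible variant, sketched here as an assumption rather than part of the original check, normalizes both sides before comparing; `normalize` and `eval_substring_normalized` are hypothetical names.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so formatting differences don't cause misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def eval_substring_normalized(case, answer):
    """Same contract as eval_substring, but case- and whitespace-insensitive."""
    norm = normalize(answer)
    contains = all(normalize(s) in norm for s in case["must_contain"])
    forbidden = any(normalize(s) in norm for s in case.get("must_not_contain", []))
    return contains and not forbidden


case = {"must_contain": ["30 days"], "must_not_contain": ["20 days"]}
ok = eval_substring_normalized(case, "Refunds are accepted within 30  Days.")
bad = eval_substring_normalized(case, "Refunds within 20 days, sometimes 30 days.")
```

It still matches raw substrings, so "30 days" would also match inside "130 days"; word-boundary matching is a further refinement if that matters for your data.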

LLM-as-judge

JUDGE_PROMPT = """
You are evaluating whether an AI-generated answer is faithful to the provided context.

Context: {context}
Question: {question}
Answer: {answer}

Score:
- 1 if every claim in the answer is supported by the context.
- 0 if the answer contains claims not in the context (hallucination).
"""

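# Assumes `client = anthropic.AsyncAnthropic()` is created elsewhere.
async def eval_faithfulness(context, question, answer):
    resp = await client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=256,  # required by the Messages API
        tools=[{
            "name": "score",
            "description": "Record the faithfulness score and a short reason.",
            "input_schema": {
                "type": "object",
                "properties": {"score": {"type": "number"}, "reason": {"type": "string"}},
                "required": ["score", "reason"],
            },
        }],
        tool_choice={"type": "tool", "name": "score"},
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            context=context, question=question, answer=answer)}],
    )
    # The forced tool call comes back as a tool_use block whose input matches the schema.
    return next(block.input for block in resp.content if block.type == "tool_use")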

A cheaper model is fine as the judge; aggregate its scores across cases.

Ragas

The popular OSS RAG eval framework:

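from datasets import Dataset  # Hugging Face datasets
from ragas import evaluate
from ragas.metrics import (
    faithfulness, answer_relevancy,
    context_precision, context_recall,
)

# Column names are the ones this post uses; check them against your installed Ragas version.
dataset = Dataset.from_dict({
    "question": questions,
    "answer": answers,
    "contexts": contexts_per_question,  # list[list[str]]: retrieved chunks per question
    "ground_truth": expected_answers,
})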

result = evaluate(dataset, metrics=[
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
])

Calls LLM judges under the hood; outputs scores per metric. Good starting point.

Automated regression

Every PR / model change runs the eval set:

# CI
async def ci_eval():
    results = []
    for case in golden:
        retrieved = await rag.retrieve(case["question"])
        answer = await rag.answer(case["question"], retrieved)
        score = await eval_case(case, retrieved, answer)
        results.append(score)
    
    avg = sum(r["faithfulness"] for r in results) / len(results)
    if avg < 0.85:
        raise SystemExit(f"Faithfulness regression: {avg}")

Set a threshold and block the merge on regression. See LLM Evaluation.
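If you gate on more than one metric, a small helper keeps the failure message readable. This is a sketch under assumptions: `check_thresholds` and `gate` are hypothetical names, the 0.85 faithfulness floor comes from the CI example, and the 0.80 recall floor is only an illustrative value.

```python
# Minimum acceptable averages per metric (the recall value is illustrative).
THRESHOLDS = {"faithfulness": 0.85, "recall": 0.80}


def check_thresholds(averages, thresholds=THRESHOLDS):
    """Return one human-readable message per metric that is missing or too low."""
    failures = []
    for metric, minimum in thresholds.items():
        value = averages.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from results")
        elif value < minimum:
            failures.append(f"{metric}: {value:.3f} < {minimum:.3f}")
    return failures


def gate(averages):
    """Exit non-zero (which fails the CI job) if any metric regressed."""
    failures = check_thresholds(averages)
    if failures:
        raise SystemExit("Eval regression:\n" + "\n".join(failures))
```

Calling `gate({"faithfulness": 0.91, "recall": 0.72})` would exit with a message naming recall as the failing metric.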

Production sampling

Sample real queries; evaluate offline:

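from datetime import date

async def daily_eval():
    sample = await traces.sample(n=200, where="feature='rag'")
    scores = [(await eval_trace(trace)).faithfulness for trace in sample]
    if not scores:
        return
    # Emit one daily average instead of overwriting the gauge once per trace.
    avg = sum(scores) / len(scores)
    await metrics.gauge("rag.faithfulness", avg, tags={"date": date.today().isoformat()})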

Trends matter more than absolute scores: a downward slope is how you catch regressions in production.
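One way to turn "watch the slope" into a check is a least-squares fit over the last few daily averages. This is a rough sketch: the function names, the 7-day window, and the 0.005-per-day tolerance are all assumptions to tune against your own noise level.

```python
def slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def is_regressing(daily_scores, window=7, max_drop_per_day=0.005):
    """True if the recent trend falls faster than the allowed drop per day."""
    recent = daily_scores[-window:]
    return slope(recent) < -max_drop_per_day


# Made-up daily faithfulness averages, dropping 0.01 per day.
falling = [0.90, 0.89, 0.88, 0.87, 0.86, 0.85, 0.84]
steady = [0.90] * 7
```

Here `is_regressing(falling)` is true and `is_regressing(steady)` is false, even though every value in `falling` might still clear a fixed 0.80 floor.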

Common failure modes to watch

Retrieval misses: the right doc isn’t in top-k. Recall@k too low.
Hallucination on missing context: model invents facts when retrieval fails.
Answer drift: subtle changes to wording over time.
Confidence calibration: model says “I don’t know” too rarely or too often.
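Refusal calibration is the easiest of these to quantify if the golden set marks which questions the corpus can actually answer. The sketch below assumes such an `answerable` flag exists (the golden-set example in this post does not include one) and detects refusals with a simple phrase match; both the markers and the function names are hypothetical.

```python
# Phrases treated as a refusal (illustrative; match them to your own prompt).
REFUSAL_MARKERS = ("i don't have that information", "i don't know")


def is_refusal(answer):
    """Phrase-match refusal detection, tolerant of curly apostrophes."""
    text = answer.lower().replace("\u2019", "'")
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rates(cases):
    """cases: list of (answerable, answer) pairs."""
    answerable = [a for ok, a in cases if ok]
    unanswerable = [a for ok, a in cases if not ok]
    over = sum(is_refusal(a) for a in answerable) / len(answerable) if answerable else 0.0
    under = (
        sum(not is_refusal(a) for a in unanswerable) / len(unanswerable)
        if unanswerable else 0.0
    )
    # over_refusal: refused although the answer was available.
    # under_refusal: answered although the context could not support it.
    return {"over_refusal": over, "under_refusal": under}


cases = [
    (True, "Refunds are accepted within 30 days."),
    (True, "I don't know."),
    (False, "I don\u2019t have that information."),
    (False, "The warranty lasts 5 years."),
]
rates = refusal_rates(cases)
```

A high under-refusal rate usually lines up with the "hallucination on missing context" failure mode above.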

Improving each axis

Retrieval:

Larger k.
Hybrid search (BM25 + vector).
Reranker (cross-encoder).
Better chunking.
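For hybrid search, the post does not say how to merge the BM25 and vector result lists. One widely used option is reciprocal rank fusion (RRF), sketched here as an assumption rather than the author's method; `k=60` is the conventional constant, and the doc ids are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into a single list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every doc it returned.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


bm25_hits = ["a", "b", "c"]     # hypothetical keyword results, best first
vector_hits = ["b", "d", "a"]   # hypothetical embedding results, best first
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # ['b', 'a', 'd', 'c']
```

RRF only needs ranks, not scores, which avoids having to normalize BM25 scores against cosine similarities.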

Generation:

Stronger model.
“Cite the source” pattern.
Refuse-on-low-confidence pattern: “If the context doesn’t contain the answer, say ‘I don’t have that information.’”
Smaller / focused context (fewer chunks → less distraction).
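The cite-the-source and refuse-on-low-confidence patterns can live in the same prompt. A minimal sketch, assuming chunks arrive as (source_id, text) pairs; `build_prompt` and its wording are illustrative, not a tested prompt.

```python
def build_prompt(question, chunks):
    """Build a grounded prompt. chunks: list of (source_id, text) pairs."""
    context = "\n\n".join(f"[{source_id}] {text}" for source_id, text in chunks)
    return (
        "Answer the question using only the context below.\n"
        "Cite the source id in square brackets after each claim.\n"
        "If the context doesn't contain the answer, say "
        "'I don't have that information.'\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    "What's our refund policy?",
    [("refund-policy.md", "Refunds are accepted within 30 days.")],
)
```

Tagging each chunk with its id also makes faithfulness review easier, since a reviewer can check each cited claim against a single chunk.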

See RAG Patterns.

Common mistakes

1. Eval set built from training data

Leakage. Scores look great; production fails. Build eval from real production queries.

2. Only LLM-as-judge, no human review

LLM judges have blind spots. Periodically sample and review by hand.

3. One metric

Optimizing faithfulness alone → answers become “I don’t know.” Use multiple metrics; weigh them.
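To see why a single metric misleads, consider a system that refuses everything: it is perfectly faithful and useless. A weighted composite makes that visible. The weights and scores below are illustrative assumptions, and `composite_score` is a hypothetical helper.

```python
def composite_score(metrics, weights):
    """Weighted average of the metrics named in weights."""
    total = sum(weights.values())
    return sum(metrics[name] * weight for name, weight in weights.items()) / total


WEIGHTS = {"faithfulness": 0.5, "answer_relevance": 0.5}  # illustrative weights

# Made-up scores: a system that always refuses vs. one that answers.
always_refuses = {"faithfulness": 1.0, "answer_relevance": 0.1}
balanced = {"faithfulness": 0.9, "answer_relevance": 0.9}

refuse_score = composite_score(always_refuses, WEIGHTS)
balanced_score = composite_score(balanced, WEIGHTS)
```

On faithfulness alone the refusing system wins (1.0 vs 0.9); on the composite it loses clearly (about 0.55 vs 0.9).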

4. No eval pipeline in CI

You change retrieval; quality drops; nobody notices for weeks.

5. Tiny eval set

20 cases is noise. Aim for 100–500 representative cases.
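A quick way to see the noise is the margin of error on a pass rate. The sketch below uses the normal approximation for a binomial proportion, which is rough at small n but good enough for intuition; the 0.85 pass rate is just an example value.

```python
import math


def margin_of_error(pass_rate, n, z=1.96):
    """Approximate 95% margin of error for a pass rate measured on n cases."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)


moe_20 = margin_of_error(0.85, 20)
moe_200 = margin_of_error(0.85, 200)
print(f"n=20:  +/- {moe_20:.3f}")
print(f"n=200: +/- {moe_200:.3f}")
```

With 20 cases the margin is roughly ±0.16, so a measured 0.85 is compatible with anything from about 0.69 to 1.0; with 200 cases it narrows to roughly ±0.05.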

What I’d ship today

For a new RAG app:

Build a golden set of 100–200 hand-curated cases.
Ragas + LLM judge for automated metrics.
CI gate on key thresholds.
Production sampling daily.
Quarterly human review of borderline cases.
Track over time; detect regressions on slope.

Read this next

If you want my Ragas + golden-set starter, it’s at rajpoot.dev.

