RAG evaluation is where most teams skip steps and pay for it later. Without evals, you can’t tell if a “fix” actually helps. This post is the working playbook.
What to measure
RAG has two halves; eval them separately.
Retrieval:
- Recall@k: of the documents that should be retrieved, what fraction were?
- Precision@k: of the documents retrieved, what fraction were relevant?
- MRR (Mean Reciprocal Rank): how high in the top-k did the first relevant doc appear? Score 1/rank per query, then average over queries.
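To make the three definitions concrete, here is a minimal sketch that computes them for a single query. The function name `retrieval_metrics` and the document IDs are illustrative assumptions, not part of any library.

```python
def retrieval_metrics(ranked_ids, relevant_ids, k=5):
    """Recall@k, Precision@k, and reciprocal rank for one query.

    ranked_ids: retrieved document IDs, best match first.
    relevant_ids: IDs that should have been retrieved (ground truth).
    """
    top_k = ranked_ids[:k]
    relevant = set(relevant_ids)
    hits = [doc_id for doc_id in top_k if doc_id in relevant]

    recall = len(set(hits)) / len(relevant) if relevant else 1.0
    precision = len(hits) / len(top_k) if top_k else 0.0
    # Reciprocal rank: 1 / position of the first relevant document, 0 if none.
    rr = next(
        (1.0 / rank for rank, doc_id in enumerate(top_k, start=1) if doc_id in relevant),
        0.0,
    )
    return {"recall": recall, "precision": precision, "reciprocal_rank": rr}


# Hypothetical example: two relevant docs, one of them found at rank 2.
example = retrieval_metrics(
    ranked_ids=["faq.md", "refund-policy.md", "shipping.md", "tos.md"],
    relevant_ids=["refund-policy.md", "returns.md"],
    k=3,
)
```

Here recall is 1/2 (one of two relevant docs found), precision is 1/3 (one of three retrieved docs is relevant), and the reciprocal rank is 1/2. MRR is the mean of `reciprocal_rank` over all queries.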
Generation:
- Faithfulness: does the answer contain only facts from the context?
- Answer relevance: does the answer address the question?
- Context precision: was the retrieved context useful?
- Helpfulness (subjective): would a user be satisfied?
Golden dataset
Hand-curate 100–500 queries with:
- Question.
- Expected source documents (retrieval ground truth).
- Expected answer (or expected substring / facts).
golden = [
    {
        "question": "What's our refund policy?",
        "must_retrieve": ["refund-policy.md"],
        "must_contain": ["30 days", "no questions asked"],
        "must_not_contain": ["20 days", "case by case"],
    },
    # ...
]
Real production queries beat synthetic. Use last week’s user queries as a starting point.
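One way to turn a week of logs into labeling candidates is to deduplicate and sample. This is only a sketch: `sample_candidate_queries` is a hypothetical helper, and it assumes you can export logged queries as a list of strings. The sampled queries still need hand-labeling before they belong in the golden set.

```python
import random


def sample_candidate_queries(logged_queries, n=200, seed=0):
    """Pick up to n distinct queries from a log to hand-label for the golden set."""
    seen = set()
    unique = []
    for query in logged_queries:
        # Normalize case and whitespace so near-identical queries collapse.
        key = " ".join(query.lower().split())
        if key and key not in seen:
            seen.add(key)
            unique.append(query.strip())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(unique, min(n, len(unique)))


# Hypothetical log lines standing in for last week's traffic.
log = [
    "What's our refund policy?",
    "what's our  refund policy?",
    "How do I reset my password?",
    "  ",
]
candidates = sample_candidate_queries(log, n=10)
```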
Retrieval eval
async def eval_retrieval(case, k=5):
    retrieved = await rag.retrieve(case["question"], k=k)
    ranked_ids = [d.id for d in retrieved]  # keep rank order; a set would lose it
    retrieved_ids = set(ranked_ids)
    expected = set(case["must_retrieve"])
    hits = expected & retrieved_ids
    recall = len(hits) / len(expected) if expected else 1.0
    precision = len(hits) / len(retrieved_ids) if retrieved_ids else 0.0
    # Reciprocal rank of the first expected doc; the mean across cases is MRR.
    rr = next((1.0 / i for i, doc_id in enumerate(ranked_ids, 1) if doc_id in expected), 0.0)
    return {"recall": recall, "precision": precision, "mrr": rr}
Aggregate across the golden set and track the averages over time. See Embeddings & Semantic Search.
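Aggregation here just means averaging each metric over the per-case results. A minimal sketch, assuming every case produced a dict with the same keys that `eval_retrieval` returns; the `aggregate` helper and the sample numbers are made up for illustration.

```python
from statistics import mean


def aggregate(per_case_scores):
    """Average each metric across cases; per_case_scores is a list of dicts."""
    if not per_case_scores:
        return {}
    metrics = per_case_scores[0].keys()
    return {name: mean(case[name] for case in per_case_scores) for name in metrics}


# Hypothetical per-case results in the shape eval_retrieval returns.
scores = [
    {"recall": 1.0, "precision": 0.2, "mrr": 1.0},
    {"recall": 0.0, "precision": 0.0, "mrr": 0.0},
    {"recall": 1.0, "precision": 0.4, "mrr": 0.5},
]
summary = aggregate(scores)
```

Store `summary` with a timestamp and a commit hash so you can compare runs later.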
Generation eval
Substring check (cheap, fast)
def eval_substring(case, answer):
    contains = all(s in answer for s in case["must_contain"])
    forbidden = any(s in answer for s in case.get("must_not_contain", []))
    return contains and not forbidden
Good for factual answers (“the policy is 30 days”).
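Exact substring matching is brittle around casing and whitespace, and a bare pass/fail hides which expectation failed. A possible variant, sketched under the assumption that case-insensitive matching is acceptable for your facts; `substring_report` is a hypothetical name.

```python
def substring_report(case, answer):
    """Case-insensitive substring check that reports which expectations failed."""
    text = " ".join(answer.lower().split())
    missing = [s for s in case["must_contain"] if s.lower() not in text]
    forbidden = [s for s in case.get("must_not_contain", []) if s.lower() in text]
    return {
        "passed": not missing and not forbidden,
        "missing": missing,
        "forbidden": forbidden,
    }


case = {
    "must_contain": ["30 days", "no questions asked"],
    "must_not_contain": ["20 days", "case by case"],
}
report = substring_report(case, "Refunds are accepted within 30 Days, No Questions Asked.")
```

Substring checks still have false positives ("30 days" also matches "130 days"), so treat them as a cheap first filter rather than a final verdict.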
LLM-as-judge
JUDGE_PROMPT = """
You are evaluating whether an AI-generated answer is faithful to the provided context.
Context: {context}
Question: {question}
Answer: {answer}
Score:
- 1 if every claim in the answer is supported by the context.
- 0 if the answer contains claims not in the context (hallucination).
"""
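# `client` is assumed to be an anthropic.AsyncAnthropic() instance.
async def eval_faithfulness(context, question, answer):
    resp = await client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=256,  # required by the Messages API
        tools=[{"name": "score", "description": "Record the faithfulness score.", "input_schema": {
            "type": "object",
            "properties": {"score": {"type": "number"}, "reason": {"type": "string"}},
            "required": ["score", "reason"],
        }}],
        tool_choice={"type": "tool", "name": "score"},  # force a structured reply
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            context=context, question=question, answer=answer)}],
    )
    # The forced tool call carries the judge's {"score": ..., "reason": ...} payload.
    return next(block.input for block in resp.content if block.type == "tool_use")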
A cheaper model is usually enough for judging; aggregate its scores across cases.
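Aggregating judge output can be as simple as a pass rate plus a list of failures to read by hand. A minimal sketch, assuming each judgment is a dict carrying the question alongside the judge's `score` and `reason` fields; `summarize_judgments` and the sample data are hypothetical.

```python
def summarize_judgments(judgments):
    """judgments: list of dicts with "question", "score" (0 or 1), and "reason"."""
    valid = [j for j in judgments if j.get("score") in (0, 1)]
    failures = [j for j in valid if j["score"] == 0]
    return {
        "faithfulness": sum(j["score"] for j in valid) / len(valid) if valid else None,
        "invalid": len(judgments) - len(valid),  # judge returned something off-scale
        "failures": [(j["question"], j["reason"]) for j in failures],
    }


# Made-up judge outputs for illustration.
judgments = [
    {"question": "q1", "score": 1, "reason": "All claims supported."},
    {"question": "q2", "score": 0, "reason": "Says 20 days; context says 30."},
    {"question": "q3", "score": 0.5, "reason": "Unsure."},
]
judge_summary = summarize_judgments(judgments)
```

Counting off-scale scores separately keeps a misbehaving judge from silently skewing the average.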
Ragas
The popular OSS RAG eval framework:
from datasets import Dataset  # Hugging Face datasets
from ragas import evaluate
from ragas.metrics import (
    faithfulness, answer_relevancy,
    context_precision, context_recall,
)

dataset = Dataset.from_dict({
    "question": questions,
    "answer": answers,
    # i.e. a list of lists of strings: one inner list of retrieved texts per question
    "contexts": [list_of_retrieved_docs_per_question],
    "ground_truth": expected_answers,
})

result = evaluate(dataset, metrics=[
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
])
Calls LLM judges under the hood; outputs scores per metric. Good starting point.
Automated regression
Every PR / model change runs the eval set:
# CI
async def ci_eval():
    results = []
    for case in golden:
        retrieved = await rag.retrieve(case["question"])
        answer = await rag.answer(case["question"], retrieved)
        score = await eval_case(case, retrieved, answer)
        results.append(score)
    # Average over the whole golden set, then gate on it.
    avg = sum(r["faithfulness"] for r in results) / len(results)
    if avg < 0.85:
        # A non-zero exit fails the CI job.
        raise SystemExit(f"Faithfulness regression: {avg:.3f}")
Set a threshold and block the merge on regression. See LLM Evaluation.
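Gating on several metrics at once is a small extension of the same idea. A sketch with made-up thresholds; `failed_gates` and the numbers are assumptions to tune against your own baseline.

```python
def failed_gates(averages, thresholds):
    """Return a message for each metric whose average is below its minimum."""
    failures = []
    for metric, minimum in thresholds.items():
        value = averages.get(metric)
        if value is None or value < minimum:
            failures.append(f"{metric}: {value} < {minimum}")
    return failures


# Hypothetical thresholds; tune them against your own baseline.
THRESHOLDS = {"faithfulness": 0.85, "recall": 0.80}
problems = failed_gates({"faithfulness": 0.91, "recall": 0.72}, THRESHOLDS)
# In CI: if problems: raise SystemExit("\n".join(problems))
```

A missing metric counts as a failure here, so a broken eval step cannot pass the gate by producing nothing.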
Production sampling
Sample real queries; evaluate offline:
from datetime import date

async def daily_eval():
    today = date.today().isoformat()
    sample = await traces.sample(n=200, where="feature='rag'")
    for trace in sample:
        score = await eval_trace(trace)
        await metrics.gauge("rag.faithfulness", score.faithfulness, tags={"date": today})
Trends matter more than absolute scores. A sustained downward slope is what catches regressions in production.
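One simple way to put a number on the trend is a least-squares slope over the daily averages. This is a sketch, not a full anomaly detector; `trend_slope`, the sample week, and the alert threshold are all illustrative assumptions.

```python
def trend_slope(daily_scores):
    """Least-squares slope of a daily metric, in score units per day."""
    n = len(daily_scores)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(daily_scores) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(daily_scores))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


# Made-up week of daily faithfulness averages, oldest first.
week = [0.91, 0.90, 0.90, 0.88, 0.87, 0.86, 0.84]
slope = trend_slope(week)

ALERT_BELOW = -0.005  # assumed threshold: losing half a point per day
should_alert = slope < ALERT_BELOW
```

For this sample week the slope is about -0.011 per day, so the alert fires even though no single day looks alarming on its own.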
Common failure modes to watch
- Retrieval misses: the right doc isn’t in top-k. Recall@k too low.
- Hallucination on missing context: model invents facts when retrieval fails.
- Answer drift: subtle changes to wording over time.
- Confidence calibration: model says “I don’t know” too rarely or too often.
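Calibration is the easiest of these to quantify if the golden set includes questions the corpus cannot answer. A sketch under that assumption: each result carries a hypothetical `answerable` flag, and refusals are detected by assumed marker phrases that you should adapt to your own prompt.

```python
REFUSAL_MARKERS = ("i don't know", "i don't have that information")  # assumed phrasing


def refusal_rates(results):
    """results: list of (answerable: bool, answer: str) pairs.

    Returns the share of refusals among answerable and unanswerable cases.
    """
    def is_refusal(answer):
        text = answer.lower().replace("\u2019", "'")  # normalize curly apostrophes
        return any(marker in text for marker in REFUSAL_MARKERS)

    def rate(flag):
        group = [answer for answerable, answer in results if answerable is flag]
        return sum(is_refusal(a) for a in group) / len(group) if group else None

    return {"over_refusal": rate(True), "correct_refusal": rate(False)}


# Made-up results for illustration.
results = [
    (True, "The refund window is 30 days."),
    (True, "I don't know."),
    (False, "I don\u2019t have that information."),
    (False, "Refunds are processed on the moon."),
]
rates = refusal_rates(results)
```

A high `over_refusal` means the model says "I don't know" too often; a low `correct_refusal` means it answers when it should decline.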
Improving each axis
Retrieval:
- Larger k.
- Hybrid search (BM25 + vector).
- Reranker (cross-encoder).
- Better chunking.
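For hybrid search, one common way to merge the BM25 and vector result lists is reciprocal rank fusion (RRF), which needs only ranks, not comparable scores. The sketch below is one option among several, not the only way to combine them; the document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked ID lists; k=60 is the commonly used constant."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical top-3 results from each retriever.
bm25_hits = ["refund-policy.md", "tos.md", "faq.md"]
vector_hits = ["faq.md", "refund-policy.md", "shipping.md"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Documents that both retrievers rank highly rise to the top; rerun the retrieval eval on `fused` to check whether recall actually improved.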
Generation:
- Stronger model.
- “Cite the source” pattern.
- Refuse-on-low-confidence pattern: “If the context doesn’t contain the answer, say ‘I don’t have that information.’”
- Smaller / focused context (fewer chunks → less distraction).
See RAG Patterns.
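The generation-side patterns mostly live in the prompt. A minimal sketch that combines citing sources, refusing when the context lacks the answer, and capping the number of chunks; `build_prompt` and its wording are assumptions to adapt, not a tested template.

```python
REFUSAL = "I don't have that information."


def build_prompt(question, chunks, max_chunks=4):
    """chunks: list of (source_id, text) pairs, best match first."""
    kept = chunks[:max_chunks]  # fewer, more focused chunks
    context = "\n\n".join(f"[{source_id}] {text}" for source_id, text in kept)
    return (
        "Answer using only the context below. "
        "Cite the source ID in square brackets after each claim. "
        f"If the context doesn't contain the answer, say \"{REFUSAL}\"\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


prompt = build_prompt(
    "What's our refund policy?",
    [("refund-policy.md", "Refunds are accepted within 30 days.")],
)
```

Keeping the refusal string in one constant also lets the eval code detect refusals by exact match.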
Common mistakes
1. Eval set built from training data
Leakage. Scores look great; production fails. Build eval from real production queries.
2. Only LLM-as-judge, no human review
LLM judges have blind spots. Periodically sample and review by hand.
3. One metric
Optimizing faithfulness alone → answers become “I don’t know.” Use multiple metrics; weigh them.
4. No eval pipeline in CI
You change retrieval; quality drops; nobody notices for weeks.
5. Tiny eval set
20 cases is noise. Aim for 100–500 representative cases.
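The "20 cases is noise" claim can be checked with a quick standard-error calculation. This sketch treats each case as an independent pass/fail trial, which is a simplifying assumption, and uses 0.85 as an example pass rate.

```python
import math


def standard_error(pass_rate, n_cases):
    """Standard error of a pass rate measured on n_cases independent cases."""
    return math.sqrt(pass_rate * (1 - pass_rate) / n_cases)


for n in (20, 100, 500):
    se = standard_error(0.85, n)
    print(f"n={n}: 0.85 +/- {1.96 * se:.3f} (approx. 95% interval)")
```

At 20 cases the interval is roughly ±0.16, so a drop from 0.85 to 0.75 is indistinguishable from chance. At 100 cases it narrows to about ±0.07, and at 500 to about ±0.03.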
What I’d ship today
For a new RAG app:
- Build a golden set of 100–200 hand-curated cases.
- Ragas + LLM judge for automated metrics.
- CI gate on key thresholds.
- Production sampling daily.
- Quarterly human review of borderline cases.
- Track over time; detect regressions on slope.
Read this next
- RAG Patterns 2026
- Embeddings & Semantic Search 2026
- LLM Evaluation Frameworks 2026
- LLM Observability 2026
If you want my Ragas + golden-set starter, it’s at rajpoot.dev.
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.