LLM evaluation.
Why eval
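LLM outputs are non-deterministic: the same prompt can produce different answers across runs, model versions, and prompt edits. Without an eval suite, you can't tell whether a change improved quality or caused a regression.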
Golden dataset
Curate 50-500 input → expected output pairs:
dataset = [
    {"input": "What is X?", "expected": "X is Y."},
    ...
]
Run the model on each input and score its output against the expected answer.
Scoring methods
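- Exact match: rigid; only suits short, canonical answers (labels, numbers, IDs).
- Embedding similarity: cosine similarity between embeddings of the output and the reference.
- String similarity: edit distance, ROUGE, BLEU.
- LLM-as-judge: another LLM scores the output.
- Human eval: the gold standard, but slow and expensive.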
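As a rough sketch of the cheaper scorers above, using only the standard library. The function names are made up for illustration, `difflib.SequenceMatcher` is a stand-in for a true edit-distance metric, and the cosine function assumes you already have embedding vectors from whatever embedding model you use.

```python
import math
from difflib import SequenceMatcher


def exact_match(expected: str, actual: str) -> float:
    """1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(expected.strip().lower() == actual.strip().lower())


def string_similarity(expected: str, actual: str) -> float:
    """Character-level similarity ratio in [0, 1] (a proxy for edit distance)."""
    return SequenceMatcher(None, expected, actual).ratio()


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

All three return a float, so they can be swapped in as the scoring function of an eval loop without changing the rest of the harness.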
LLM-as-judge
prompt = f"""
Question: {q}
Reference answer: {expected}
Candidate answer: {actual}
Rate the candidate 1-5 on accuracy and helpfulness.
Output JSON: {{"accuracy": N, "helpfulness": N, "reason": "..."}}
"""
scores = judge_llm(prompt)
Use a stronger model than the one under test as the judge.
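Judges don't always return clean JSON, so it helps to parse and validate the reply before trusting the scores. A minimal sketch, assuming the judge returns its reply as a plain string; `parse_judge_output` is a hypothetical helper, not part of any library.

```python
import json
import re


def parse_judge_output(raw: str) -> dict:
    """Extract the JSON object from a judge reply and validate the 1-5 scores."""
    # Judges often wrap the JSON in extra prose, so grab the outermost braces.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge output")
    data = json.loads(match.group(0))
    for key in ("accuracy", "helpfulness"):
        value = data.get(key)
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not 1 <= value <= 5:
            raise ValueError(f"{key} must be an integer from 1 to 5, got {value!r}")
    return data
```

Counting how often parsing fails is itself a useful signal: a judge that frequently breaks format is probably unreliable on the scores too.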
Frameworks
Ragas (RAG)
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall])
Promptfoo
prompts: [prompt1.txt, prompt2.txt]
providers: [openai:gpt-5, anthropic:claude-opus-4-7]
tests:
  - vars: { topic: "AI" }
    assert:
      - type: contains
        value: "artificial intelligence"
      - type: llm-rubric
        value: "Response is accurate and concise"
promptfoo eval
DeepEval (pytest-style)
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_relevance():
    # `actual` is the output of the model under test for this input
    test_case = LLMTestCase(input="What is X?", actual_output=actual)
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
Custom eval loop
def evaluate():
    results = []
    for case in dataset:
        actual = run_model(case["input"])
        score = judge(case["expected"], actual)
        results.append({"input": case["input"], "score": score, "actual": actual})
    # Mean score across all cases (assumes a non-empty dataset)
    avg = sum(r["score"] for r in results) / len(results)
    return avg, results
Metrics by use case
- Q&A: exact match (factual answers), LLM-judge (open-ended answers).
- Summarization: ROUGE, LLM-judge.
- Translation: BLEU + LLM-judge.
- RAG: faithfulness, context recall.
- Code gen: pass rate on a test suite.
- Classification: accuracy, F1.
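For the classification row, accuracy and F1 are simple enough to compute by hand. A small sketch for the single-positive-class case; the function names are illustrative, and for real projects a library such as scikit-learn is the usual choice.

```python
def accuracy(expected: list[str], predicted: list[str]) -> float:
    """Fraction of predictions that equal the expected label."""
    correct = sum(e == p for e, p in zip(expected, predicted))
    return correct / len(expected)


def f1_score(expected: list[str], predicted: list[str], positive: str) -> float:
    """Harmonic mean of precision and recall for the `positive` label."""
    pairs = list(zip(expected, predicted))
    tp = sum(e == positive and p == positive for e, p in pairs)
    fp = sum(e != positive and p == positive for e, p in pairs)
    fn = sum(e == positive and p != positive for e, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

F1 matters when classes are imbalanced: a model that always predicts the majority label can score high accuracy while its F1 on the minority class is 0.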
Track over time
Store evals with model + prompt version:
{
  "run_id": "...",
  "model": "gpt-5",
  "prompt_version": "v2",
  "avg_score": 0.87,
  "per_case": [...]
}
Plot the scores on a dashboard and alert on regressions.
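One lightweight way to store runs is an append-only JSONL file, one record per run. This is a sketch under that assumption; the file layout, the helper names, and the 0.02 tolerance are all made up for illustration.

```python
import json
from pathlib import Path


def append_run(path: Path, record: dict) -> None:
    """Append one eval run as a single JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_runs(path: Path) -> list[dict]:
    """Load all runs in the order they were written; empty list if no file yet."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def regressed(runs: list[dict], tolerance: float = 0.02) -> bool:
    """True if the latest avg_score fell more than `tolerance` below the previous run."""
    if len(runs) < 2:
        return False
    return runs[-2]["avg_score"] - runs[-1]["avg_score"] > tolerance
```

A dashboard or alert job can then read the same file, so the eval script stays the single writer.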
A/B testing prompts
def ab(prompt_a, prompt_b, dataset):
    # Run every case through both prompts; pair each output with its expected answer
    results_a = [(run(c["input"], prompt_a), c["expected"]) for c in dataset]
    results_b = [(run(c["input"], prompt_b), c["expected"]) for c in dataset]
    return judge_compare(results_a, results_b)
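What the comparison step returns is up to you. One possible shape, assuming each output has already been scored per case, is a win/tie breakdown. `win_rates` is a hypothetical helper and not how `judge_compare` is necessarily implemented.

```python
def win_rates(scores_a: list[float], scores_b: list[float]) -> dict[str, float]:
    """Share of cases where prompt A scored higher, prompt B scored higher, or they tied."""
    if not scores_a or len(scores_a) != len(scores_b):
        raise ValueError("need two non-empty score lists of equal length")
    n = len(scores_a)
    a_wins = sum(a > b for a, b in zip(scores_a, scores_b))
    b_wins = sum(b > a for a, b in zip(scores_a, scores_b))
    return {"a": a_wins / n, "b": b_wins / n, "tie": (n - a_wins - b_wins) / n}
```

Comparing per case rather than by average shows whether one prompt wins broadly or just spikes on a few inputs.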
CI integration
- name: Eval
  run: |
    python eval.py --baseline baseline.json --output current.json
    python compare.py baseline.json current.json --threshold 0.05
Fail the PR if the score regresses by more than 5%.
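A sketch of what a `compare.py` like the one in the workflow could look like. It assumes both JSON files carry a top-level `avg_score` field and treats the threshold as a relative drop; both are assumptions, not something the workflow itself dictates.

```python
import argparse
import json


def relative_drop(baseline: float, current: float) -> float:
    """Fractional drop from baseline; negative when the score improved."""
    if baseline <= 0:
        return 0.0
    return (baseline - current) / baseline


def load_score(path: str) -> float:
    """Read the assumed top-level `avg_score` field from a results file."""
    with open(path, encoding="utf-8") as f:
        return float(json.load(f)["avg_score"])


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Fail on eval regressions")
    parser.add_argument("baseline")
    parser.add_argument("current")
    parser.add_argument("--threshold", type=float, default=0.05)
    args = parser.parse_args(argv)

    baseline = load_score(args.baseline)
    current = load_score(args.current)
    drop = relative_drop(baseline, current)
    print(f"baseline={baseline:.3f} current={current:.3f} drop={drop:.1%}")
    # A non-zero return value becomes the exit code that fails the CI step.
    return 1 if drop > args.threshold else 0
```

To use it as a script, end the file with `raise SystemExit(main())` so the return value becomes the process exit code.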
Production monitoring
- Log inputs + outputs (sample if heavy).
- User feedback (thumbs up/down).
- Automated checks (length, format, refusals).
- Drift: distribution of outputs over time.
Tools: LangSmith, Helicone, OpenLLMetry, Phoenix (Arize).
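The automated checks can start as a few cheap string tests run on every logged output. A minimal sketch; the refusal markers and the length limit are illustrative guesses and should be tuned to your own traffic.

```python
import json

# Hypothetical phrases that often signal a refusal; extend for your model.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "as an ai")


def check_output(text: str, max_chars: int = 2000, expect_json: bool = False) -> dict[str, bool]:
    """Run cheap sanity checks on one model output; True means the check passed."""
    lowered = text.lower()
    checks = {
        "non_empty": bool(text.strip()),
        "within_length": len(text) <= max_chars,
        "not_refusal": not any(marker in lowered for marker in REFUSAL_MARKERS),
    }
    if expect_json:
        try:
            json.loads(text)
            checks["valid_json"] = True
        except json.JSONDecodeError:
            checks["valid_json"] = False
    return checks
```

Tracking the failure rate of each check over time doubles as a crude drift signal.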
Synthetic data
Generate eval cases with an LLM, then manually validate a sample:
synth = llm(f"""
Generate 5 Q&A pairs about Python:
{output_format}
""")
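Before anyone reviews synthetic cases, it's worth filtering out malformed and duplicate ones. A sketch that assumes the requested output format is a JSON list of objects with `input` and `expected` keys; that format and both helper names are assumptions.

```python
import json
import random


def clean_pairs(raw: str) -> list[dict]:
    """Parse generated pairs, dropping malformed items and duplicate questions."""
    seen = set()
    pairs = []
    for item in json.loads(raw):
        if not isinstance(item, dict):
            continue
        question = str(item.get("input") or "").strip()
        answer = str(item.get("expected") or "").strip()
        key = question.lower()
        if not question or not answer or key in seen:
            continue
        seen.add(key)
        pairs.append({"input": question, "expected": answer})
    return pairs


def review_sample(pairs: list[dict], k: int = 20, seed: int = 0) -> list[dict]:
    """Pick a reproducible random sample of pairs for manual review."""
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))
```

Fixing the seed keeps the review sample stable, so two reviewers look at the same cases.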
Adversarial testing
- Prompt injection attempts.
- PII leaks (give fake PII; check output).
- Jailbreaks.
- Out-of-domain inputs.
Run these as part of the regular eval suite.
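The PII check can be automated by planting obviously fake values in the prompt and scanning the output for them. A sketch with made-up canary values; the normalization catches simple reformatting (spaces instead of dashes) but can also produce false positives on short values.

```python
import re

# Fake values planted in the test prompt; none of these belong to a real person.
FAKE_PII = {"email": "jane.doe@example.com", "ssn": "000-12-3456"}


def _normalize(text: str) -> str:
    """Lowercase and strip everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def leaked_pii(output: str, planted: dict[str, str]) -> list[str]:
    """Return the names of planted values that show up in the model output."""
    flat = _normalize(output)
    return sorted(name for name, value in planted.items() if _normalize(value) in flat)
```

An empty list means no planted value came back; anything else is a failing case worth logging with the full prompt.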
Common mistakes
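- No eval at all (most teams).
- Single-metric eval: one score hides trade-offs, so capture multiple aspects.
- Dataset too small (n=10): scores are too noisy to compare runs.
- Judge is the same model as the candidate: models tend to favor their own outputs.
- Treating LLM-as-judge as ground truth: spot-check judge scores against human ratings.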
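To see why a tiny dataset is noisy, compare the uncertainty on a pass rate at different sizes. A rough sketch using the normal approximation, which is itself crude at small n, so treat the numbers as indicative only; `pass_rate_interval` is an illustrative name.

```python
import math


def pass_rate_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval for a pass rate (normal approximation, clipped to [0, 1])."""
    p = passes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


small = pass_rate_interval(8, 10)     # 80% pass rate on 10 cases
large = pass_rate_interval(400, 500)  # 80% pass rate on 500 cases
```

With an 80% pass rate, the half-width is roughly 25 points at n=10 and about 3.5 points at n=500, so a 5-point change between runs is invisible on the small set.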
Read this next
If you want my eval harness, it's at rajpoot.dev.