LLM evaluation.

Why eval

LLMs are non-deterministic: the same prompt can produce different outputs across runs. Without an eval, you can't tell whether a prompt or model change is an improvement or a regression.

Golden dataset

Curate 50-500 input → expected output pairs:

dataset = [
    {"input": "What is X?", "expected": "X is Y."},
    ...
]

Run the model on each input, then score its output against the expected one.

Scoring methods

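  • Exact match: strict; only suits short, canonical answers.
  • Embedding similarity: cosine similarity between the embeddings of the output and the reference.
  • String similarity: edit distance, ROUGE, BLEU.
  • LLM-as-judge: another LLM scores the output.
  • Human eval: gold standard, but slow and expensive.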
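As a rough illustration of the cheaper scoring methods above, here is a minimal sketch using only the standard library. The function names are my own, the normalization in exact_match (trim and lowercase) is an assumption, and difflib's ratio is a stand-in for a true edit-distance metric, not ROUGE or BLEU. The cosine function expects embedding vectors you have already computed elsewhere.

```python
import math
from difflib import SequenceMatcher


def exact_match(expected: str, actual: str) -> float:
    """Return 1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(expected.strip().lower() == actual.strip().lower())


def string_similarity(expected: str, actual: str) -> float:
    """Character-level similarity ratio in [0, 1] (difflib, not true edit distance)."""
    return SequenceMatcher(None, expected, actual).ratio()


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as "no similarity" instead of dividing by zero.
    return dot / (norm_a * norm_b)
```

Any of these can be dropped in as the scoring function of an eval loop; each takes the expected value first and the model output second.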

LLM-as-judge

prompt = f"""
Question: {q}
Reference answer: {expected}
Candidate answer: {actual}

Rate the candidate 1-5 on accuracy and helpfulness.
Output JSON: {{"accuracy": N, "helpfulness": N, "reason": "..."}}
"""

scores = judge_llm(prompt)

Use a stronger model as the judge than the model being evaluated.
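Judges do not always return clean JSON, so it helps to parse and validate the reply before trusting the numbers. A minimal sketch, assuming the judge returns its reply as raw text in the format requested by the prompt above; parse_judge_scores is a hypothetical helper, not part of any library.

```python
import json
import re


def parse_judge_scores(raw: str) -> dict:
    """Extract and validate the judge's JSON verdict from its raw text reply."""
    # Judges often wrap JSON in prose or code fences, so take the span
    # from the first "{" to the last "}".
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge output")
    verdict = json.loads(match.group(0))
    for key in ("accuracy", "helpfulness"):
        value = verdict.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{key} must be an integer from 1 to 5, got {value!r}")
    return verdict
```

Failing loudly on a malformed verdict is a design choice: a silently skipped case would bias the average without anyone noticing.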

Frameworks

Ragas (RAG)

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall])

Promptfoo

prompts: [prompt1.txt, prompt2.txt]
providers: [openai:gpt-5, anthropic:claude-opus-4-7]
tests:
  - vars: { topic: "AI" }
    assert:
      - type: contains
        value: "artificial intelligence"
      - type: llm-rubric
        value: "Response is accurate and concise"

Then run:

promptfoo eval

DeepEval (pytest-style)

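from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_relevance():
    question = "What is X?"
    # run_model is your own function that calls the model under test.
    actual = run_model(question)
    test_case = LLMTestCase(input=question, actual_output=actual)
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])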

Custom eval loop

def evaluate():
    results = []
    for case in dataset:
        actual = run_model(case["input"])
        score = judge(case["expected"], actual)
        results.append({"input": case["input"], "score": score, "actual": actual})
    
    avg = sum(r["score"] for r in results) / len(results)
    return avg, results

Metrics by use case

  • Q&A: exact match (factual), LLM-judge (open).
  • Summarization: ROUGE, LLM-judge.
  • Translation: BLEU + LLM-judge.
  • RAG: faithfulness, recall.
  • Code gen: pass rate on test suite.
  • Classification: accuracy, F1.
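For the classification row, accuracy and F1 are simple enough to compute by hand. A minimal sketch in plain Python; the function names and the binary, single-positive-class framing are my assumptions, and a library such as scikit-learn would be the usual choice for anything multi-class.

```python
def accuracy(expected: list[str], predicted: list[str]) -> float:
    """Fraction of predictions that equal the expected label."""
    correct = sum(e == p for e, p in zip(expected, predicted))
    return correct / len(expected)


def f1(expected: list[str], predicted: list[str], positive: str) -> float:
    """F1 score for one positive class: harmonic mean of precision and recall."""
    pairs = list(zip(expected, predicted))
    tp = sum(e == positive and p == positive for e, p in pairs)
    fp = sum(e != positive and p == positive for e, p in pairs)
    fn = sum(e == positive and p != positive for e, p in pairs)
    if tp == 0:
        return 0.0  # No true positives means precision or recall is zero.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

F1 matters most when classes are imbalanced, where a model can reach high accuracy by always predicting the majority label.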

Track over time

Store evals with model + prompt version:

{
  "run_id": "...",
  "model": "gpt-5",
  "prompt_version": "v2",
  "avg_score": 0.87,
  "per_case": [...]
}

Chart the scores on a dashboard and alert on regressions.
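One lightweight way to keep this history is an append-only JSON Lines file. The sketch below assumes each record has the avg_score field shown above; the file layout, function names, and the 0.02 tolerance are illustrative choices of mine, not a fixed convention.

```python
import json
from pathlib import Path


def append_run(history_path: Path, record: dict) -> None:
    """Append one eval run as a single JSON line."""
    with history_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_runs(history_path: Path) -> list[dict]:
    """Load all recorded runs, oldest first."""
    if not history_path.exists():
        return []
    with history_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def regressed(runs: list[dict], tolerance: float = 0.02) -> bool:
    """True if the latest avg_score fell more than `tolerance` below the previous run."""
    if len(runs) < 2:
        return False
    return runs[-2]["avg_score"] - runs[-1]["avg_score"] > tolerance
```

An alerting job can call regressed(load_runs(path)) after each run and notify whoever owns the prompt.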

A/B testing prompts

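def ab(prompt_a, prompt_b, dataset):
    # Run every case through both prompts, keeping the expected answer next to each output.
    results_a = [(run(c["input"], prompt_a), c["expected"]) for c in dataset]
    results_b = [(run(c["input"], prompt_b), c["expected"]) for c in dataset]
    return judge_compare(results_a, results_b)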
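What judge_compare does is left open above. One possible shape is a paired win count: score both outputs for the same case and tally which prompt did better. This sketch is an assumption on my part, including the extra score parameter (any function taking expected and actual and returning a number).

```python
from typing import Callable

Pair = tuple[str, str]  # (model output, expected answer)


def judge_compare(
    results_a: list[Pair],
    results_b: list[Pair],
    score: Callable[[str, str], float],
) -> dict:
    """Count per-case wins for prompt A, prompt B, and ties."""
    wins_a = wins_b = ties = 0
    # Both lists come from the same dataset in the same order, so zip pairs them by case.
    for (out_a, expected), (out_b, _) in zip(results_a, results_b):
        score_a = score(expected, out_a)
        score_b = score(expected, out_b)
        if score_a > score_b:
            wins_a += 1
        elif score_b > score_a:
            wins_b += 1
        else:
            ties += 1
    return {"wins_a": wins_a, "wins_b": wins_b, "ties": ties}
```

Paired comparison on the same cases tends to be less noisy than comparing two averages, because case difficulty cancels out.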

CI integration

- name: Eval
  run: |
    python eval.py --baseline baseline.json --output current.json
    python compare.py baseline.json current.json --threshold 0.05

Fail the PR if the score regresses by more than 5%.
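A compare.py along these lines could back that CI step. This is a sketch under assumptions: both JSON files carry a top-level avg_score field, and the 5% threshold is read as a relative drop from the baseline (an absolute drop would be an equally valid reading).

```python
import argparse
import json


def relative_drop(baseline: float, current: float) -> float:
    """Fraction by which current fell below baseline (negative if it improved)."""
    if baseline == 0:
        return 0.0  # Nothing to regress from.
    return (baseline - current) / baseline


def main(argv: list[str]) -> int:
    """Return exit code 1 if the score dropped by more than the threshold, else 0."""
    parser = argparse.ArgumentParser(description="Compare two eval runs.")
    parser.add_argument("baseline")
    parser.add_argument("current")
    parser.add_argument("--threshold", type=float, default=0.05)
    args = parser.parse_args(argv)

    with open(args.baseline, encoding="utf-8") as f:
        baseline = json.load(f)["avg_score"]
    with open(args.current, encoding="utf-8") as f:
        current = json.load(f)["avg_score"]

    drop = relative_drop(baseline, current)
    print(f"baseline={baseline:.3f} current={current:.3f} drop={drop:.1%}")
    return 1 if drop > args.threshold else 0

# In compare.py itself, finish with: sys.exit(main(sys.argv[1:]))
```

A non-zero exit code is what makes the CI step, and therefore the PR check, fail.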

Production monitoring

  • Log inputs + outputs (sample if heavy).
  • User feedback (thumbs up/down).
  • Automated checks (length, format, refusals).
  • Drift: distribution of outputs over time.

Tools: LangSmith, Helicone, OpenLLMetry, Phoenix (Arize).
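The automated checks in the list above are cheap enough to run on every logged response. A minimal sketch; the refusal phrases, the 2000-character limit, and the function name are placeholders I picked, so tune them to your own traffic.

```python
import json

# Assumed refusal phrases; extend with whatever your model actually says.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to", "as an ai")


def check_output(text: str, max_chars: int = 2000, expect_json: bool = False) -> dict:
    """Run cheap sanity checks on one model output; True means the check passed."""
    lowered = text.lower()
    checks = {
        "non_empty": bool(text.strip()),
        "within_length": len(text) <= max_chars,
        "not_refusal": not any(marker in lowered for marker in REFUSAL_MARKERS),
    }
    if expect_json:
        try:
            json.loads(text)
            checks["valid_json"] = True
        except json.JSONDecodeError:
            checks["valid_json"] = False
    return checks
```

Tracking the pass rate of each check over time also gives a crude drift signal, for example a sudden rise in refusals after a model update.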

Synthetic data

Generate eval cases with an LLM, then validate a sample by hand:

synth = llm(f"""
Generate 5 Q&A pairs about Python:
{output_format}
""")
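Before the manual review, it is worth filtering out malformed or duplicate pairs and drawing a random sample to read. The sketch below assumes the requested output format is a JSON array of objects with input and expected keys, matching the golden dataset shape; both helper names are hypothetical.

```python
import json
import random


def parse_pairs(raw: str) -> list[dict]:
    """Parse generated pairs, dropping malformed entries and duplicate questions."""
    seen: set[str] = set()
    clean = []
    for pair in json.loads(raw):
        if not isinstance(pair, dict):
            continue
        question = str(pair.get("input", "")).strip()
        answer = str(pair.get("expected", "")).strip()
        if not question or not answer or question.lower() in seen:
            continue
        seen.add(question.lower())
        clean.append({"input": question, "expected": answer})
    return clean


def review_sample(pairs: list[dict], k: int = 20, seed: int = 0) -> list[dict]:
    """Pick up to k pairs at random for manual validation (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))
```

If the hand-checked sample has a high error rate, regenerate or fix the prompt before adding the batch to the eval set.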

Adversarial testing

  • Prompt injection attempts.
  • PII leaks (give fake PII; check output).
  • Jailbreaks.
  • Out-of-domain inputs.

Run these as part of the eval suite.
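The PII check is the easiest of these to automate: plant fake values in the context, then search the output for them. A minimal sketch; the canary values and the prompt wording are made up for illustration, and a plain substring match will miss reformatted or partially leaked values.

```python
# Fake, clearly invalid canary values planted in the context.
FAKE_PII = {
    "email": "jane.doe@example.com",
    "phone": "555-0100",
    "ssn": "000-12-3456",
}


def build_pii_prompt(question: str) -> str:
    """Embed the canary values in a context the model should not repeat."""
    record = "\n".join(f"{name}: {value}" for name, value in FAKE_PII.items())
    return f"Customer record (confidential):\n{record}\n\nUser question: {question}"


def leaked_pii(output: str) -> list[str]:
    """Return the names of any canary values that appear verbatim in the output."""
    return [name for name, value in FAKE_PII.items() if value in output]
```

A test case passes when leaked_pii returns an empty list for the model's answer to the built prompt.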

Common mistakes

  • No eval at all (most teams).
  • Single-metric eval: capture multiple aspects instead.
  • Dataset too small (n=10) → noisy.
  • Using the same model as generator and judge: it tends to favor its own outputs.
  • Treating LLM-as-judge as ground truth.
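To see why a tiny dataset is noisy, put an error bar on the average score. This sketch uses a normal approximation for a 95% interval, which is itself rough at small n; the function name is mine.

```python
import math
import statistics


def mean_with_ci(scores: list[float], z: float = 1.96) -> tuple[float, float]:
    """Return (mean, half-width of an approximate 95% confidence interval)."""
    mean = statistics.fmean(scores)
    if len(scores) < 2:
        return mean, float("inf")  # One sample says nothing about spread.
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, z * stderr


# Same 80% pass rate, very different certainty.
small = [1.0] * 8 + [0.0] * 2
large = [1.0] * 400 + [0.0] * 100
print(mean_with_ci(small))
print(mean_with_ci(large))
```

With pass/fail scores and an 80% pass rate, the interval is roughly ±0.26 at n=10 and ±0.04 at n=500, so a 5-point change is invisible in the small set.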

Read this next

If you want my eval harness, it’s at rajpoot.dev.

