AI/LLM Cheatsheet 10 — Evaluation

LLM evaluation.

Why eval

LLMs are non-deterministic. Without eval, you can't tell whether a prompt or model change is an improvement or a regression.

Golden dataset

Curate 50-500 input → expected output pairs:

dataset = [
    {"input": "What is X?", "expected": "X is Y."},
    ...
]

Run model, score outputs.

Scoring methods

Exact match: rigid.
Embedding similarity: cosine.
String similarity: edit distance, n-gram overlap (ROUGE, BLEU).
LLM-as-judge: another LLM scores.
Human eval: gold standard.
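The three cheapest scorers above can be sketched with the standard library alone. This is a minimal illustration, not a reference implementation: the function names are made up for this example, `difflib`'s ratio stands in for a true edit distance, and the embedding vectors are assumed to come from whatever embedding model you already use.

```python
import math
from difflib import SequenceMatcher


def exact_match(expected: str, actual: str) -> float:
    """1.0 if the strings match after trimming and case-folding, else 0.0."""
    return float(expected.strip().casefold() == actual.strip().casefold())


def string_similarity(expected: str, actual: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for edit distance."""
    return SequenceMatcher(None, expected, actual).ratio()


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Each function returns a score where higher is better, so they can be swapped into the same eval loop.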

LLM-as-judge

prompt = f"""
Question: {q}
Reference answer: {expected}
Candidate answer: {actual}

Rate the candidate 1-5 on accuracy and helpfulness.
Output JSON: {{"accuracy": N, "helpfulness": N, "reason": "..."}}
"""

scores = judge_llm(prompt)

Use a stronger model as the judge than the model being evaluated.
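Judge models do not always return clean JSON, so it helps to validate the reply before trusting the scores. A minimal sketch, assuming the judge returns its raw text reply and uses the JSON shape requested in the prompt above; `parse_judge_output` is a hypothetical helper name.

```python
import json
import re


def parse_judge_output(raw: str) -> dict:
    """Extract and validate the JSON verdict from a judge model's reply."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # tolerate prose around the JSON
    if match is None:
        raise ValueError("no JSON object in judge output")
    verdict = json.loads(match.group(0))
    for key in ("accuracy", "helpfulness"):
        score = verdict.get(key)
        # Reject missing keys, non-integers, booleans, and out-of-range values.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{key} must be an integer from 1 to 5, got {score!r}")
    return verdict
```

Cases that fail validation are worth counting separately rather than silently scoring as zero.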

Frameworks

Ragas (RAG)

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall])

Promptfoo

prompts: [prompt1.txt, prompt2.txt]
providers: [openai:gpt-5, anthropic:claude-opus-4-7]
tests:
  - vars: { topic: "AI" }
    assert:
      - type: contains
        value: "artificial intelligence"
      - type: llm-rubric
        value: "Response is accurate and concise"

promptfoo eval

DeepEval (pytest-style)

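from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_relevance():
    actual = run_model("What is X?")  # run_model: your own call to the model under test
    test_case = LLMTestCase(input="What is X?", actual_output=actual)
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])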

Custom eval loop

def evaluate():
    results = []
    for case in dataset:
        actual = run_model(case["input"])
        score = judge(case["expected"], actual)
        results.append({"input": case["input"], "score": score, "actual": actual})
    
    avg = sum(r["score"] for r in results) / len(results)
    return avg, results

Metrics by use case

Q&A: exact match (factual), LLM-judge (open).
Summarization: ROUGE, LLM-judge.
Translation: BLEU + LLM-judge.
RAG: faithfulness, context recall.
Code gen: pass rate on test suite.
Classification: accuracy, F1.
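For the classification row, accuracy and F1 are simple enough to compute by hand. A minimal sketch for the single-positive-class case; the function names and the `positive` parameter are chosen for this example, and a library such as scikit-learn would normally be used instead.

```python
def accuracy(y_true: list, y_pred: list) -> float:
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true: list, y_pred: list, positive) -> float:
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

F1 matters when classes are imbalanced: a model that always predicts the majority class can have high accuracy and an F1 of zero on the minority class.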

Track over time

Store evals with model + prompt version:

{
  "run_id": "...",
  "model": "gpt-5",
  "prompt_version": "v2",
  "avg_score": 0.87,
  "per_case": [...]
}

Chart scores on a dashboard and alert on regressions.
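One low-tech way to store runs is an append-only JSON Lines file, which is easy to diff and to load into a dashboard later. A minimal sketch, assuming each record carries the `avg_score` field shown above; the function names, the file layout, and the `tolerance` default are illustrative choices.

```python
import json
from pathlib import Path


def record_run(path: Path, run: dict) -> None:
    """Append one eval run as a single JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(run) + "\n")


def latest_regression(path: Path, tolerance: float = 0.02) -> bool:
    """True if the newest run's avg_score is more than `tolerance` below the previous run's."""
    lines = path.read_text(encoding="utf-8").splitlines()
    runs = [json.loads(line) for line in lines if line.strip()]
    if len(runs) < 2:
        return False  # nothing to compare against yet
    return runs[-2]["avg_score"] - runs[-1]["avg_score"] > tolerance
```

Comparing only the last two runs is the simplest rule; comparing against a pinned baseline run avoids slow drift going unnoticed.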

A/B testing prompts

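def ab(prompt_a, prompt_b, dataset):
    # Run every case through both prompts, keeping the expected answer alongside each output
    results_a = [(run(c["input"], prompt_a), c["expected"]) for c in dataset]
    results_b = [(run(c["input"], prompt_b), c["expected"]) for c in dataset]
    return judge_compare(results_a, results_b)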
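`judge_compare` is left undefined above. One possible shape is a paired comparison: score both outputs for each case and count wins. This is a sketch under that assumption; the `score` parameter and its exact-match default are additions for illustration, and in practice it would be an LLM judge or a task metric.

```python
def exact(actual: str, expected: str) -> float:
    """Default scorer for the sketch: 1.0 on an exact string match, else 0.0."""
    return float(actual == expected)


def judge_compare(results_a, results_b, score=exact) -> dict:
    """Count per-case wins for A and B, given lists of (actual, expected) pairs."""
    tally = {"a_wins": 0, "b_wins": 0, "ties": 0}
    for (out_a, expected), (out_b, _) in zip(results_a, results_b):
        score_a, score_b = score(out_a, expected), score(out_b, expected)
        if score_a > score_b:
            tally["a_wins"] += 1
        elif score_b > score_a:
            tally["b_wins"] += 1
        else:
            tally["ties"] += 1
    return tally
```

Pairing by case keeps the comparison fair: both prompts are judged on the same inputs, so differences in case difficulty cancel out.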

CI integration

- name: Eval
  run: |
    python eval.py --baseline baseline.json --output current.json
    python compare.py baseline.json current.json --threshold 0.05

Fail the PR if the score regresses by more than 5%.
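The `compare.py` script is not shown. A minimal sketch of what it could contain, assuming both JSON files hold an `avg_score` field and that the threshold means a relative drop (both are assumptions, not something the workflow above specifies):

```python
import argparse
import json


def relative_drop(baseline: float, current: float) -> float:
    """Fraction by which `current` falls below `baseline` (negative if it improved)."""
    if baseline == 0:
        return 0.0
    return (baseline - current) / baseline


def main(argv: list[str]) -> int:
    """Return exit code 1 if the score dropped by more than the threshold, else 0."""
    parser = argparse.ArgumentParser(description="Fail on eval regression")
    parser.add_argument("baseline")
    parser.add_argument("current")
    parser.add_argument("--threshold", type=float, default=0.05)
    args = parser.parse_args(argv)

    with open(args.baseline, encoding="utf-8") as f:
        baseline = json.load(f)["avg_score"]  # assumed key
    with open(args.current, encoding="utf-8") as f:
        current = json.load(f)["avg_score"]  # assumed key

    drop = relative_drop(baseline, current)
    print(f"baseline={baseline:.3f} current={current:.3f} drop={drop:.1%}")
    return 1 if drop > args.threshold else 0
```

To use it as a script, end the file with `import sys` and `sys.exit(main(sys.argv[1:]))`; the non-zero exit code is what makes the CI step fail.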

Production monitoring

Log inputs + outputs (sample if heavy).
User feedback (thumbs up/down).
Automated checks (length, format, refusals).
Drift: distribution of outputs over time.

Tools: LangSmith, Helicone, OpenLLMetry, Phoenix (Arize).
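The automated checks (length, format, refusals) can start as plain string tests run on every logged output. A minimal sketch; the refusal phrases, the `max_chars` default, and the function name are illustrative assumptions and would need tuning for a real product.

```python
import json

# Illustrative phrases only; real refusal wording varies by model.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i am unable")


def check_output(text: str, max_chars: int = 2000, expect_json: bool = False) -> list[str]:
    """Return a list of problem tags for one model output (empty list means OK)."""
    problems = []
    if not text.strip():
        problems.append("empty")
    if len(text) > max_chars:
        problems.append("too_long")
    if expect_json:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            problems.append("invalid_json")
    lowered = text.casefold().replace("\u2019", "'")  # normalize curly apostrophes
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        problems.append("refusal")
    return problems
```

Tracking the rate of each tag over time gives a cheap drift signal before any judge model is involved.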

Synthetic data

Generate eval cases with an LLM, then manually validate a sample:

synth = llm(f"""
Generate 5 Q&A pairs about Python:
{output_format}
""")

Adversarial testing

Prompt injection attempts.
PII leaks (give fake PII; check output).
Jailbreaks.
Out-of-domain inputs.

Run as part of eval suite.
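The PII check can be sketched as a canary test: plant fake values in the context, then search the output for them. The values below are made up for this example, and `leaked_pii` is a hypothetical helper name.

```python
# Fake values planted in the prompt or retrieved context; never use real PII here.
FAKE_PII = {
    "email": "jane.doe@example.com",
    "phone": "555-0100",
}


def leaked_pii(output: str, planted: dict[str, str]) -> list[str]:
    """Return the names of planted values that appear verbatim in the output."""
    lowered = output.casefold()
    return [name for name, value in planted.items() if value.casefold() in lowered]
```

A verbatim match only catches direct leaks; reformatted or partially masked values need looser matching.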

Common mistakes

No eval at all (most teams).
Single-metric eval: one number hides trade-offs, so capture multiple aspects.
Dataset too small (n=10) → noisy.
Judge is the same model as the one under test: self-preference bias.
Treating LLM-as-judge as ground truth: spot-check it against human eval.
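To see why n=10 is noisy, estimate the margin of error on a pass rate. A rough sketch using the normal approximation to the binomial, which is itself shaky at very small n; the function name is chosen for this example.

```python
import math


def pass_rate_margin(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% interval for a pass rate measured on n cases."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)


small = pass_rate_margin(0.8, 10)   # roughly 0.25
large = pass_rate_margin(0.8, 500)  # roughly 0.035
```

With 10 cases, an observed 80% pass rate is consistent with anything from about 55% upward, so a 5-point change between runs says very little.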
