LLM evals matter more than model choice for production quality. The framework landscape has settled in 2026, and picking the right one saves months. This post compares the options.
The landscape
| Framework | Best for | Strengths |
|---|---|---|
| Braintrust | Eval-driven workflows | Ship-eval-with-code; great compare UI |
| LangSmith | LangChain shops | Tight integration; trace + eval combined |
| Ragas | RAG-specific | Pre-built RAG metrics |
| DeepEval | pytest-style | Familiar to Python testers |
| Phoenix (Arize) | Trace + eval combined | Open-source; flexible |
| Promptfoo | CLI-style | Easy CI integration |
| Custom (Python) | Fewer than 30 cases | Trivial to start |
For most teams in 2026, the default is Braintrust for product evals plus Ragas for RAG.
Braintrust
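from braintrust import Eval

# load_cases, triage, AccuracyScore and CategoryF1Score are your own code:
# a dataset loader, the function under test, and two custom scorers.
Eval(
    name="triage_eval",
    data=lambda: load_cases(),
    task=triage,
    scores=[
        AccuracyScore,
        CategoryF1Score,
    ],
)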
Run it from the CLI and the results land in the dashboard, where you can A/B compare any two runs. It is designed for an “evals are code” workflow.
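The two scorers in the snippet are left undefined, so here is a minimal sketch of what scorers like them might compute, in plain Python with no Braintrust dependency. The names `accuracy` and `macro_f1` and the sample labels are assumptions for illustration, not part of any SDK.

```python
from collections import Counter


def accuracy(expected: list[str], predicted: list[str]) -> float:
    """Fraction of cases where the predicted label equals the expected one."""
    if not expected:
        return 0.0
    hits = sum(e == p for e, p in zip(expected, predicted))
    return hits / len(expected)


def macro_f1(expected: list[str], predicted: list[str]) -> float:
    """Unweighted mean of per-category F1 over the categories in `expected`."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for e, p in zip(expected, predicted):
        if e == p:
            tp[e] += 1
        else:
            fp[p] += 1  # predicted category gets a false positive
            fn[e] += 1  # expected category gets a false negative
    scores = []
    for label in set(expected):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical triage labels, for illustration only.
expected = ["billing", "billing", "bug", "account"]
predicted = ["billing", "bug", "bug", "account"]

print(accuracy(expected, predicted))            # 0.75
print(round(macro_f1(expected, predicted), 3))  # 0.778
```

Macro F1 is worth tracking next to accuracy because a rare category can fail completely while overall accuracy barely moves.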
LangSmith
If you’re on LangChain / LangGraph, LangSmith is right there:
from langsmith import Client

# cases, triage and accuracy_evaluator are your own code.
client = Client()
dataset = client.create_dataset("triage")
for case in cases:
    # inputs and outputs must both be dicts
    client.create_example(inputs=case.input, outputs={"expected": case.expected}, dataset_id=dataset.id)

# Run + score
client.evaluate(triage, data=dataset.name, evaluators=[accuracy_evaluator])
It shares a UI with your tracing, so you run a single tool. See LLM Observability.
Ragas
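from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# rag_eval_dataset is your own dataset of questions, answers, retrieved contexts and references.
results = evaluate(
    dataset=rag_eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)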
The four canonical RAG metrics:
- Faithfulness: did the answer follow from the context?
- Answer relevancy: did it answer the question?
- Context precision: was retrieved context useful?
- Context recall: was all needed context retrieved?
For RAG specifically, Ragas is the right starting point.
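To make the two context metrics concrete, here is a simplified sketch that scores retrieval against hand-labeled chunk IDs. This is not how Ragas computes them (Ragas uses an LLM to judge the text, and its context precision is rank-aware); the functions `chunk_precision` and `chunk_recall` and the chunk IDs are assumptions for illustration.

```python
def chunk_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that were actually useful."""
    if not retrieved:
        return 0.0
    return sum(chunk in relevant for chunk in retrieved) / len(retrieved)


def chunk_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of the needed chunks that were retrieved."""
    if not relevant:
        return 1.0  # nothing was needed, so nothing was missed
    return len(relevant & set(retrieved)) / len(relevant)


# Hypothetical chunk IDs: the retriever returned four chunks,
# and a human marked three chunks as needed to answer the question.
retrieved = ["c1", "c4", "c7", "c9"]
relevant = {"c1", "c7", "c8"}

print(chunk_precision(retrieved, relevant))         # 0.5
print(round(chunk_recall(retrieved, relevant), 3))  # 0.667
```

Low precision means the prompt is padded with noise; low recall means the answer cannot be fully grounded no matter how good the generator is.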
DeepEval
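from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def test_triage():
    # triage is your own function under test; "..." is a placeholder input.
    test_case = LLMTestCase(input="...", actual_output=triage("..."))
    metric = GEval(
        name="Concise and correct",
        criteria="response is concise and correct",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    )
    assert_test(test_case, [metric])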
Looks like pytest. Familiar to teams that already write tests. Runs in CI.
Phoenix
Open-source from Arize. Combines tracing + evaluation. Self-hostable. Solid choice for teams that don’t want yet-another-SaaS.
Promptfoo
CLI-first; YAML config:
prompts:
  - "Classify: {{input}}"
providers:
  - anthropic:messages:claude-sonnet-4-6
tests:
  - vars: { input: "I was charged twice" }
    assert:
      - type: contains
        value: "billing"
Easy to wire into CI, but visualization is limited compared with the SaaS options.
What to evaluate
Per the LLM Evaluations post:
- Reference-based (classification, extraction): exact match.
- LLM-as-judge: free-form output (helpfulness, staying on topic, tone).
- RAG-specific: faithfulness, relevance, context quality.
- Behavioral: refusal of disallowed; injection-resistance.
Each framework supports these; the difference is ergonomics.
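The two categories that need no model call, reference-based and behavioral, can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions: the `normalize` rule and the `REFUSAL_MARKERS` phrases are hypothetical choices, and a keyword check is only a rough proxy for refusal detection.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so formatting noise does not fail a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def exact_match(expected: str, actual: str) -> bool:
    """Reference-based check for classification or extraction outputs."""
    return normalize(expected) == normalize(actual)


# Hypothetical phrases; tune them against your own model's refusals.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i won't", "i'm not able to")


def looks_like_refusal(response: str) -> bool:
    """Behavioral check: did the model decline a disallowed request?"""
    text = normalize(response)
    return any(marker in text for marker in REFUSAL_MARKERS)


print(exact_match("billing", "  Billing\n"))                 # True
print(looks_like_refusal("Sorry, I can't help with that."))  # True
```

The free-form and RAG categories need a judge model, which is where the frameworks above start to earn their keep.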
What I’d ship today
For a new LLM product:
- Plain Python + 30-case eval set on day one.
- Move to Braintrust or LangSmith once several people on the team are running evals weekly.
- Add Ragas if you ship RAG.
- CI integration that fails any PR that drops accuracy by more than 2 percentage points.
The framework matters less than the discipline of running evals on every change.
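As a rough sketch of the day-one setup, the plain-Python eval and the CI gate fit in one file. Everything named here is an assumption for illustration: `keyword_triage` stands in for the real model call, `CASES` is a four-case stand-in for a 30-case set, and the baseline of 0.75 would normally come from the last run on the main branch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    input: str
    expected: str


def run_eval(task: Callable[[str], str], cases: list[Case]) -> float:
    """Return the accuracy of `task` over `cases` using exact label match."""
    if not cases:
        raise ValueError("eval set is empty")
    hits = sum(task(case.input) == case.expected for case in cases)
    return hits / len(cases)


def gate(current: float, baseline: float, max_drop_pp: float = 2.0) -> bool:
    """True when accuracy dropped by at most `max_drop_pp` percentage points."""
    drop_pp = (baseline - current) * 100
    return drop_pp <= max_drop_pp + 1e-9  # tolerance for float rounding


def keyword_triage(text: str) -> str:
    # Stand-in for the real model call (assumption for this sketch).
    return "billing" if "charged" in text.lower() else "other"


CASES = [
    Case("I was charged twice", "billing"),
    Case("The app crashes on login", "other"),
    Case("Refund my last invoice", "billing"),
    Case("How do I export data?", "other"),
]


def main(baseline: float = 0.75) -> int:
    """Return a process exit code: 0 to pass the PR, 1 to fail it."""
    acc = run_eval(keyword_triage, CASES)
    print(f"accuracy={acc:.2%} baseline={baseline:.2%}")
    return 0 if gate(acc, baseline) else 1


exit_code = main()
```

In CI, pass the value returned by `main()` to `sys.exit` so a regression fails the job.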
Read this next
- LLM Evaluations — Test Prompts and Agents
- LLM Observability in 2026
- Build a RAG App with pgvector
- Rerankers in RAG
If you want a Braintrust + Ragas + CI starter, it’s at rajpoot.dev.