LLM evals matter more than model choice for production quality. The framework landscape has settled in 2026, and picking the right tool early saves months. This post compares the main options.

The landscape

| Framework | Best for | Strengths |
| --- | --- | --- |
| Braintrust | Eval-driven workflows | Ship-eval-with-code; great compare UI |
| LangSmith | LangChain shops | Tight integration; trace + eval combined |
| Ragas | RAG-specific | Pre-built RAG metrics |
| DeepEval | pytest-style | Familiar to Python testers |
| Phoenix (Arize) | Trace + eval combined | Open-source; flexible |
| Promptfoo | CLI-style | Easy CI integration |
| Custom (Python) | Below 30 cases | Trivial start |

For most teams in 2026: Braintrust for product evals + Ragas for RAG.

Braintrust

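from braintrust import Eval

# load_cases, triage, AccuracyScore, and CategoryF1Score are your own code (not shown here).
Eval(
    name="triage_eval",
    data=lambda: load_cases(),  # returns the eval cases (input + expected)
    task=triage,                # the function under test
    scores=[
        AccuracyScore,
        CategoryF1Score,
    ],
)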

Run it from the CLI and the results land in the dashboard, where you can A/B compare any two runs. It is designed for an “evals are code” workflow.
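The two scorers in the snippet above are not defined there. Here is a minimal sketch of what they might look like, assuming the scorer is a plain function that receives `input`, `output`, and `expected` and returns a number between 0 and 1. The names `accuracy_score` and `macro_f1` are hypothetical. Macro-F1 only makes sense over a whole run rather than a single case, so it is shown as a standalone helper.

```python
from collections import Counter


def accuracy_score(input, output, expected):
    """Per-case scorer: 1.0 when the predicted label matches, else 0.0."""
    return 1.0 if output == expected else 0.0


def macro_f1(predictions, labels):
    """Macro-averaged F1 over every category seen in predictions or labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for pred, gold in zip(predictions, labels):
        if pred == gold:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted this category wrongly
            fn[gold] += 1  # missed the true category
    categories = set(predictions) | set(labels)
    f1s = []
    for c in categories:
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1s.append(2 * tp[c] / denom if denom else 0.0)
    return sum(f1s) / len(f1s) if f1s else 0.0
```

Accuracy hides per-category failures when one category dominates the eval set, which is the usual reason to track macro-F1 alongside it.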

LangSmith

If you’re on LangChain / LangGraph, LangSmith is right there:

from langsmith import Client

client = Client()

dataset = client.create_dataset("triage")
for case in cases:
    # inputs and outputs must both be dicts
    client.create_example(
        inputs=case.input,
        outputs={"expected": case.expected},
        dataset_id=dataset.id,
    )

# Run + score
client.evaluate(triage, data=dataset.name, evaluators=[accuracy_evaluator])

Same UI as your tracing. Single tool. See LLM Observability.
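`accuracy_evaluator` is not defined in the snippet. One possible shape is sketched here, assuming a recent LangSmith SDK that passes `outputs` and `reference_outputs` dicts to custom evaluators (older versions pass `run` and `example` objects instead). It also assumes `triage` returns a dict with a hypothetical `category` key.

```python
def accuracy_evaluator(outputs: dict, reference_outputs: dict) -> dict:
    """Exact-match evaluator; 'category' is an assumed key in the target's output."""
    predicted = str(outputs.get("category", "")).strip().lower()
    expected = str(reference_outputs.get("expected", "")).strip().lower()
    # A missing expected label never counts as a match.
    score = int(bool(expected) and predicted == expected)
    return {"key": "accuracy", "score": score}
```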

Ragas

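from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# rag_eval_dataset is built elsewhere (questions, answers, retrieved contexts, references)
results = evaluate(
    dataset=rag_eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)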

The four canonical RAG metrics:

  • Faithfulness: did the answer follow from the context?
  • Answer relevancy: did it answer the question?
  • Context precision: was retrieved context useful?
  • Context recall: was all needed context retrieved?

For RAG specifically, Ragas is the right starting point.
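To make the two context metrics concrete, here is a toy calculation using hand-labeled relevant chunk IDs. This is a simplification for illustration, not how Ragas computes them: Ragas derives these scores with an LLM, and its context precision is rank-aware. The function names and chunk IDs are hypothetical.

```python
def toy_context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that were actually useful."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)


def toy_context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the needed chunks that the retriever found."""
    if not relevant:
        return 1.0
    return len(set(retrieved) & relevant) / len(relevant)


retrieved = ["doc-1", "doc-4", "doc-7", "doc-9"]
relevant = {"doc-1", "doc-2", "doc-7"}
precision = toy_context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
recall = toy_context_recall(retrieved, relevant)        # 2 of 3 relevant were retrieved
```

Low precision with high recall usually points at retrieving too many chunks; the reverse points at a retriever that misses needed material.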

DeepEval

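from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def test_triage():
    ticket = "..."
    metric = GEval(
        name="Concise and correct",
        criteria="response is concise and correct",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    )
    # triage is the function under test
    test_case = LLMTestCase(input=ticket, actual_output=triage(ticket))
    assert_test(test_case, [metric])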

Looks like pytest. Familiar to teams that already write tests. Runs in CI.

Phoenix

Open-source from Arize. Combines tracing + evaluation. Self-hostable. Solid choice for teams that don’t want yet-another-SaaS.

Promptfoo

CLI-first; YAML config:

prompts:
  - "Classify: {{input}}"
providers:
  # provider id format: anthropic:messages:<model>
  - anthropic:messages:claude-sonnet-4-6
tests:
  - vars: { input: "I was charged twice" }
    assert:
      - type: contains
        value: "billing"
Easy CI. Limited visualization vs the SaaS options.

What to evaluate

Per the LLM Evaluations post:

  • Reference-based (classification, extraction): exact match.
  • LLM-as-judge for free-form (helpful, on-topic, tone).
  • RAG-specific: faithfulness, relevance, context quality.
  • Behavioral: refusal of disallowed requests; resistance to prompt injection.

Each framework supports these; the difference is ergonomics.
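A framework-free sketch of three of these check types follows. Everything in it is illustrative: `exact_match`, `looks_like_refusal`, and `judge_score` are hypothetical helpers, the refusal markers are a crude keyword heuristic, and `judge` stands in for whatever function calls your judge model and returns its raw text.

```python
import re
from typing import Callable


def exact_match(output: str, expected: str) -> bool:
    """Reference-based check for classification or extraction outputs."""
    return output.strip().lower() == expected.strip().lower()


# Assumed phrases; a real check would need a broader list or a classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")


def looks_like_refusal(output: str) -> bool:
    """Behavioral check: did the model decline a disallowed request?"""
    text = output.lower().replace("’", "'")  # normalize curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)


def judge_score(judge: Callable[[str], str], question: str, answer: str) -> int:
    """LLM-as-judge: ask for a 1-5 rating and parse the first rating digit in the reply."""
    prompt = (
        "Rate the answer from 1 (poor) to 5 (excellent) for helpfulness, "
        "staying on-topic, and tone. Reply with the number only.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    match = re.search(r"[1-5]", judge(prompt))
    if match is None:
        raise ValueError("judge reply contained no rating")
    return int(match.group())
```

Passing the judge in as a callable keeps the scoring logic testable with a stub, independent of which model or SDK does the judging.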

What I’d ship today

For a new LLM product:

  • Plain Python + 30-case eval set on day one.
  • Move to Braintrust or LangSmith once several people on the team are running evals weekly.
  • Add Ragas if you ship RAG.
  • CI integration that fails any PR that drops accuracy by more than 2 percentage points.
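A minimal sketch of the first and last bullets together: a plain-Python eval run plus a gate that a CI job could call. The names (`run_eval`, `gate`, the stand-in `triage` function) and the baseline value are assumptions for illustration. In practice the baseline would come from the main branch's most recent run.

```python
from typing import Callable


def run_eval(task: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Return the accuracy of `task` over (input, expected) pairs."""
    if not cases:
        raise ValueError("eval set is empty")
    correct = sum(1 for text, expected in cases if task(text) == expected)
    return correct / len(cases)


def gate(accuracy: float, baseline: float, max_drop_pp: float = 2.0) -> bool:
    """True when accuracy has not dropped more than `max_drop_pp` percentage points."""
    # Small epsilon so a drop of exactly 2pp is not rejected by float rounding.
    return (baseline - accuracy) * 100 <= max_drop_pp + 1e-9


# Stand-in task and a tiny eval set; a real set would hold around 30 labeled cases.
def triage(text: str) -> str:
    return "billing" if "charged" in text.lower() else "other"


cases = [
    ("I was charged twice", "billing"),
    ("The app crashes on login", "other"),
    ("Why was I charged a fee?", "billing"),
    ("Refund my last invoice", "billing"),
]
accuracy = run_eval(triage, cases)
passed = gate(accuracy, baseline=0.80)
```

In CI, the script would exit non-zero when `gate` returns False, for example with `sys.exit(0 if passed else 1)`, so the PR check fails.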

The framework matters less than the discipline of running evals on every change.

Read this next

If you want a Braintrust + Ragas + CI starter, it’s at rajpoot.dev.

