Do I need a framework, or should I roll my own?

Roll your own for the first 30 cases — plain Python + JSON. Move to a framework when you have multiple judges, want side-by-side compares, or need persistent eval history. Frameworks earn their cost above 100 cases.
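As a rough illustration of the "plain Python + JSON" starting point, here is a minimal sketch. The case format, the `triage` stub, and the `run_eval` helper are all assumptions made up for this example; swap the stub for your real model call.

```python
import json

# Hypothetical eval set; in practice this would live in a cases.json file.
CASES_JSON = """
[
  {"input": "I was charged twice", "expected": "billing"},
  {"input": "The app crashes on login", "expected": "bug"},
  {"input": "How do I export my data?", "expected": "how_to"}
]
"""


def triage(text: str) -> str:
    """Stand-in for the real model call; replace with your LLM client."""
    lowered = text.lower()
    if "charged" in lowered or "refund" in lowered:
        return "billing"
    if "crash" in lowered or "error" in lowered:
        return "bug"
    return "other"


def run_eval(cases: list[dict]) -> dict:
    """Run every case and collect accuracy plus the failing cases."""
    failures = []
    for case in cases:
        actual = triage(case["input"])
        if actual != case["expected"]:
            failures.append({**case, "actual": actual})
    passed = len(cases) - len(failures)
    return {"accuracy": passed / len(cases), "failures": failures}


report = run_eval(json.loads(CASES_JSON))
print(json.dumps(report, indent=2))
```

Keeping the failing cases in the report, not just the accuracy number, is what makes a harness this small useful: the failures are what you read after each change.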

Which is best for RAG specifically?

Ragas. It ships faithfulness, answer relevancy, context precision, and context recall metrics out of the box. Other frameworks support these too, but Ragas is opinionated for RAG.

LLM Evaluation Frameworks in 2026 — Braintrust, LangSmith, Ragas, DeepEval

LLM evals matter more than the model choice for production quality. The 2026 framework landscape has clarified, and picking the right one saves months. This post is the comparison.

The landscape

Framework	Best for	Strengths
Braintrust	Eval-driven workflows	Ship-eval-with-code; great compare UI
LangSmith	LangChain shops	Tight integration; trace + eval combined
Ragas	RAG-specific	Pre-built RAG metrics
DeepEval	pytest-style	Familiar to Python testers
Phoenix (Arize)	Trace + eval combined	Open-source; flexible
Promptfoo	CLI-style	Easy CI integration
Custom (Python)	Below 30 cases	Trivial start

For most teams in 2026: Braintrust for product evals + Ragas for RAG.

Braintrust

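from braintrust import Eval

# load_cases, triage, AccuracyScore, and CategoryF1Score are your own code,
# defined or imported elsewhere in this file.
Eval(
    name="triage_eval",
    data=load_cases,  # callable that returns the eval cases
    task=triage,
    scores=[
        AccuracyScore,
        CategoryF1Score,
    ],
)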

Run it from the CLI; results land in the dashboard, where you can A/B compare any two runs. It is designed for an “evals are code” workflow.
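To make "compare any two runs" concrete, here is a small plain-Python sketch of what such a comparison boils down to. It does not use Braintrust's API; the `compare_runs` helper and the case-id-to-score dictionaries are hypothetical.

```python
def compare_runs(baseline: dict[str, float], candidate: dict[str, float]) -> dict:
    """Diff two eval runs keyed by case id; scores are assumed to be in [0, 1]."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the two runs have no cases in common")

    def mean(run: dict[str, float]) -> float:
        return sum(run[case_id] for case_id in shared) / len(shared)

    return {
        # Cases that got worse are usually more interesting than the average.
        "regressions": [c for c in shared if candidate[c] < baseline[c]],
        "improvements": [c for c in shared if candidate[c] > baseline[c]],
        "mean_delta": mean(candidate) - mean(baseline),
    }


baseline_run = {"case-1": 1.0, "case-2": 0.0, "case-3": 1.0}
candidate_run = {"case-1": 1.0, "case-2": 1.0, "case-3": 0.0}
diff = compare_runs(baseline_run, candidate_run)
print(diff)
```

In this example the average does not move, yet one case regressed and another improved, which is the kind of change a per-case compare view surfaces and a single accuracy number hides.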

LangSmith

If you’re on LangChain / LangGraph, LangSmith is right there:

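from langsmith import Client

client = Client()

# cases, triage, and accuracy_evaluator are your own code.
dataset = client.create_dataset("triage")
for case in cases:
    client.create_example(
        inputs={"input": case.input},  # inputs and outputs are dicts
        outputs={"expected": case.expected},
        dataset_id=dataset.id,
    )

# Run + score; triage receives each example's inputs dict
client.evaluate(triage, data=dataset.name, evaluators=[accuracy_evaluator])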

Same UI as your tracing. Single tool. See LLM Observability.

Ragas

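from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# rag_eval_dataset is your own dataset of questions, answers, and contexts.
results = evaluate(
    dataset=rag_eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)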

The four canonical RAG metrics:

Faithfulness: did the answer follow from the context?
Answer relevancy: did it answer the question?
Context precision: was retrieved context useful?
Context recall: was all needed context retrieved?

For RAG specifically Ragas is the right starting point.
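To build intuition for the two context metrics, here is a deliberately simplified sketch. It is not how Ragas computes them: Ragas relies on LLM judgments, and its context precision also takes ranking into account. This version assumes you have hand-labeled which chunk IDs are relevant and uses plain set arithmetic. The function names and chunk IDs are made up for the example.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are relevant (was the context useful?)."""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    return hits / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of relevant chunks that were retrieved (was anything missed?)."""
    if not relevant:
        return 1.0  # nothing was needed, so nothing was missed
    return len(set(retrieved) & relevant) / len(relevant)


retrieved_chunks = ["doc1", "doc4", "doc2", "doc9"]
relevant_chunks = {"doc1", "doc2", "doc3"}

precision = context_precision(retrieved_chunks, relevant_chunks)
recall = context_recall(retrieved_chunks, relevant_chunks)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Two of the four retrieved chunks are relevant, and two of the three relevant chunks were found, so low precision points at noisy retrieval while low recall points at missing documents.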

DeepEval

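from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def test_triage():
    ticket = "..."
    # triage is your own function under test
    test_case = LLMTestCase(input=ticket, actual_output=triage(ticket))
    metric = GEval(
        name="Concise and correct",
        criteria="response is concise and correct",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    )
    assert_test(test_case, [metric])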

Looks like pytest. Familiar to teams that already write tests. Runs in CI.

Phoenix

Open-source from Arize. Combines tracing + evaluation. Self-hostable. Solid choice for teams that don’t want yet-another-SaaS.

Promptfoo

CLI-first; YAML config:

prompts:
  - "Classify: {{input}}"
providers:
  - anthropic:messages:claude-sonnet-4-6
tests:
  - vars: { input: "I was charged twice" }
    assert:
      - type: contains
        value: "billing"
Easy CI. Limited visualization vs the SaaS options.

What to evaluate

Per the LLM Evaluations post:

Reference-based (classification, extraction): exact match against the expected output.
LLM-as-judge for free-form output (helpfulness, staying on topic, tone).
RAG-specific: faithfulness, relevance, context quality.
Behavioral: refusal of disallowed requests; resistance to prompt injection.

Each framework supports these; the difference is ergonomics.
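Framework aside, most of these categories reduce to small scoring functions. The sketch below shows one possible shape for three of them. Everything here is hypothetical: the prompt wording, the refusal markers, and the `ask_judge` callable, which stands in for whatever client you use to call the judge model.

```python
from typing import Callable

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply PASS if the answer is helpful and on topic, otherwise reply FAIL."
)


def exact_match(output: str, expected: str) -> float:
    """Reference-based score for classification or extraction outputs."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def judge_score(question: str, answer: str, ask_judge: Callable[[str], str]) -> float:
    """LLM-as-judge score; ask_judge sends a prompt to the judge model."""
    verdict = ask_judge(JUDGE_PROMPT.format(question=question, answer=answer))
    return 1.0 if verdict.strip().upper().startswith("PASS") else 0.0


def refused(output: str) -> bool:
    """Crude behavioral check: did the model decline a disallowed request?"""
    markers = ("i can't", "i cannot", "i won't", "i'm unable")
    return any(marker in output.lower() for marker in markers)


# A fake judge keeps the example runnable without an API key.
fake_judge = lambda prompt: "PASS"
print(exact_match(" Billing ", "billing"))
print(judge_score("Why was I charged twice?", "Likely a duplicate hold.", fake_judge))
print(refused("I can't help with that."))
```

Passing the judge in as a callable keeps the scorer testable offline and makes it easy to swap judge models later. A keyword-based refusal check is fragile; it is here only to show where a behavioral check plugs in.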

What I’d ship today

For a new LLM product:

Plain Python + 30-case eval set on day one.
Move to Braintrust or LangSmith when multiple people on the team are running evals weekly.
Add Ragas if you ship RAG.
CI integration that fails any PR that drops accuracy by more than 2 percentage points.
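The CI gate in the last item can be a few lines of Python. This is a sketch under assumptions: the accuracies are hard-coded here, whereas a real pipeline would read the baseline from the main branch's last eval run, and the `accuracy_gate` and `ci_exit_code` names are invented for the example.

```python
MAX_DROP_PP = 2.0  # allowed accuracy drop, in percentage points


def accuracy_gate(baseline_acc: float, candidate_acc: float,
                  max_drop_pp: float = MAX_DROP_PP) -> bool:
    """Return True if the candidate stays within the allowed accuracy drop."""
    drop_pp = (baseline_acc - candidate_acc) * 100
    # Small tolerance so a drop of exactly 2pp is not failed by float rounding.
    return drop_pp <= max_drop_pp + 1e-9


def ci_exit_code(baseline_acc: float, candidate_acc: float) -> int:
    """Map the gate result to a process exit code (non-zero fails the CI job)."""
    if accuracy_gate(baseline_acc, candidate_acc):
        print(f"OK: accuracy {candidate_acc:.1%} vs baseline {baseline_acc:.1%}")
        return 0
    print(f"FAIL: accuracy fell from {baseline_acc:.1%} to {candidate_acc:.1%}")
    return 1


exit_code = ci_exit_code(baseline_acc=0.90, candidate_acc=0.87)
# In a CI script, finish with: sys.exit(exit_code)
```

With only 30 cases, a single case is worth more than 3 percentage points, so a 2pp threshold only becomes meaningful once the eval set has grown.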

The framework matters less than the discipline of running evals on every change.

Read this next

If you want a Braintrust + Ragas + CI starter, it’s at rajpoot.dev.

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.
