If your RAG system returns “okay” answers and you’re wondering what to upgrade next, the answer is almost always: add a reranker. One extra API call per query. 10–25% quality improvement on most workloads. Best ROI on the entire RAG stack.

Why rerankers work

Vector search is fast but coarse. The embedding model encodes a document and a query into separate vectors and measures distance. It can’t reason about how this specific query relates to this specific document.

A reranker (cross-encoder) takes both the query and the document together as input, runs a transformer that attends across them, and outputs a relevance score. It’s slow per pair, but you only run it on the top 100 candidates from your retriever, not the millions in your corpus.

Query → Embedder → top 100 candidates (fast)
                        ↓
            Reranker (slower, only on 100)
                        ↓
            Top 8 reranked → Prompt

The two-stage architecture is fast where it needs to be (initial retrieval over millions) and accurate where it matters (final ranking).
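
The shape of that two-stage flow can be sketched without any model at all. In the sketch below, `two_stage_retrieve`, `cheap_score`, and `pair_score` are illustrative names, not part of any library, and the toy word-overlap scorers only stand in for an embedding similarity and a cross-encoder so the example runs on its own.

```python
from typing import Callable, Sequence

Scorer = Callable[[str, str], float]


def two_stage_retrieve(
    query: str,
    corpus: Sequence[str],
    cheap_score: Scorer,
    pair_score: Scorer,
    n_candidates: int = 100,
    k: int = 8,
) -> list[str]:
    # Stage 1: cheap score over the whole corpus, keep a shortlist.
    shortlist = sorted(
        corpus, key=lambda doc: cheap_score(query, doc), reverse=True
    )[:n_candidates]
    # Stage 2: expensive pairwise score, but only on the shortlist.
    return sorted(
        shortlist, key=lambda doc: pair_score(query, doc), reverse=True
    )[:k]


# Toy stand-ins (assumptions for illustration only).
def overlap(query: str, doc: str) -> float:
    # Number of distinct words shared by query and document.
    return float(len(set(query.split()) & set(doc.split())))


def overlap_ratio(query: str, doc: str) -> float:
    # Shared words normalized by document length.
    words = doc.split()
    return overlap(query, doc) / len(words) if words else 0.0


corpus = ["cats purr", "dogs bark loudly", "cats and dogs", "fish swim"]
top = two_stage_retrieve(
    "cats dogs", corpus, overlap, overlap_ratio, n_candidates=3, k=2
)
```

In a real pipeline the first scorer would be replaced by an index lookup (you never score every document in Python), and the second by a reranker call.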

The contenders in 2026

| Reranker | Quality | Latency | Cost | Self-host |
| --- | --- | --- | --- | --- |
| Cohere Rerank-3 | Top-tier | ~50ms / 100 docs | $1 / 1k searches | No |
| Voyage rerank-2 | Top-tier | ~80ms / 100 docs | $0.05 / 1k searches | No |
| BGE-Reranker-v2-M3 | Excellent | ~100ms / 100 docs | self-hosted | Yes |
| Jina Reranker v2 | Strong | ~80ms / 100 docs | API or self-host | Partial |
| Custom cross-encoder (sentence-transformers) | Variable | ~200ms / 100 docs | self-hosted | Yes |

For 2026 production:

  • Hosted, easy: Cohere Rerank-3.
  • Hosted, cheap: Voyage rerank-2 (often 20× cheaper).
  • Self-hosted: BGE-Reranker-v2-M3 on a single GPU.

Wiring one in

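import os

import cohere

# Async client, so the rerank call does not block the event loop.
co = cohere.AsyncClient(api_key=os.environ["COHERE_API_KEY"])

async def retrieve(query: str, k: int = 8) -> list[Doc]:
    # Doc and pgvector_search come from your own application code.
    # Stage 1: vector search returns top 100
    candidates = await pgvector_search(query, k=100)

    # Stage 2: rerank
    docs_text = [c.content for c in candidates]
    response = await co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=docs_text,
        top_n=k,
    )

    # Results arrive sorted by relevance; r.index points back into docs_text.
    return [candidates[r.index] for r in response.results]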

Three lines of business logic. The reranker takes 100 candidates and gives you the top k=8.

For self-hosted with BGE-Reranker:

from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

def rerank(query: str, candidates: list[str], k: int = 8) -> list[int]:
    pairs = [[query, c] for c in candidates]
    scores = reranker.compute_score(pairs)
    indexed = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    return [i for i, _ in indexed[:k]]

Run it on a single L4 GPU. At the ~100ms per 100 docs figure above, that is roughly 1k query-document pairs scored per second.

Hybrid + rerank — the production pattern

The full 2026 RAG retrieval pattern:

Query
  ├──▶ Vector search (top 50)
  └──▶ BM25 / lexical (top 50)
          ▼
       RRF fusion (up to 100 unique)
          ▼
       Reranker (top 8)
          ▼
         LLM

Hybrid retrieval provides candidate diversity. The reranker picks the best of them. Together: quality that prompt engineering alone can’t match.
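
Reciprocal rank fusion itself is only a few lines. A minimal sketch, assuming each retrieval leg returns document IDs ordered best-first; `rrf_fuse` is an illustrative name, and `k=60` is the conventional smoothing constant rather than something tuned for this pipeline.

```python
from collections import defaultdict


def rrf_fuse(
    rankings: list[list[str]], k: int = 60, top_n: int = 100
) -> list[str]:
    # Each document earns 1 / (k + rank) from every list it appears in.
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_n]


vector_hits = ["a", "b", "c"]
bm25_hits = ["c", "a", "d"]
fused = rrf_fuse([vector_hits, bm25_hits])
```

Because RRF uses only ranks, it never has to reconcile cosine distances with BM25 scores; documents that both legs agree on rise to the top.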

Latency budget

| Stage | Latency |
| --- | --- |
| Vector search (pgvector HNSW) | 5–20ms |
| BM25 (Postgres FTS) | 5–20ms |
| RRF fusion (in-process) | <1ms |
| Reranker (Cohere) | 50–100ms |
| LLM generate | 500–2000ms |
| Total | ~600–2200ms |

The reranker adds ~100ms. If your end-to-end budget is 2 seconds, that’s 5%. For the quality lift, it’s free.
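
The budget arithmetic can be checked in a few lines. This sketch uses the stage ranges from the table, treats "<1ms" as 0 to 1ms, and assumes (the table does not say) that the vector and BM25 legs may be issued concurrently; `total_latency_ms` is an illustrative helper.

```python
# (low, high) latency per stage in milliseconds, taken from the table.
STAGES_MS = {
    "vector_search": (5, 20),
    "bm25": (5, 20),
    "rrf_fusion": (0, 1),
    "reranker": (50, 100),
    "llm_generate": (500, 2000),
}


def total_latency_ms(
    stages: dict[str, tuple[int, int]], parallel: tuple[str, ...] = ()
) -> tuple[int, int]:
    # Stages named in `parallel` overlap, so only the slowest one counts.
    par = [stages[name] for name in parallel]
    par_lo = max((lo for lo, _ in par), default=0)
    par_hi = max((hi for _, hi in par), default=0)
    # Everything else runs back to back.
    seq_lo = sum(lo for name, (lo, _) in stages.items() if name not in parallel)
    seq_hi = sum(hi for name, (_, hi) in stages.items() if name not in parallel)
    return par_lo + seq_lo, par_hi + seq_hi


serial = total_latency_ms(STAGES_MS)
overlapped = total_latency_ms(STAGES_MS, parallel=("vector_search", "bm25"))
# Worst-case reranker latency as a share of a 2-second budget.
reranker_share = STAGES_MS["reranker"][1] / 2000
```

Either way the LLM call dominates; running the two retrieval legs concurrently saves at most 20ms.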

Quality numbers

Internal benchmarks I’ve run on a domain-specific RAG corpus:

| Setup | Recall@8 | Answer quality (LLM-judged) |
| --- | --- | --- |
| Vector only | 0.71 | 6.8 / 10 |
| Hybrid (vector + BM25) | 0.79 | 7.5 / 10 |
| Hybrid + reranker | 0.91 | 8.7 / 10 |
| Hybrid + reranker + better embedder | 0.93 | 8.9 / 10 |

The reranker delivers more lift than upgrading the embedding model. The two stack, but if you can only do one, add the reranker.

For the eval mechanics, see LLM Evaluations.
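
For reference, a common way to compute Recall@8 is sketched below. It assumes each eval query comes with a hand-labeled set of relevant document IDs; the exact definition behind the numbers in the table may differ, and the function names are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 8) -> float:
    # Fraction of the relevant documents that appear in the top k results.
    if not relevant:
        raise ValueError("each eval query needs at least one relevant document")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(
    runs: list[tuple[list[str], set[str]]], k: int = 8
) -> float:
    # Average over queries; each run is (retrieved IDs, relevant IDs).
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


runs = [
    (["a", "b", "c"], {"a", "z"}),  # finds 1 of 2 relevant docs
    (["x", "y"], {"x", "y"}),       # finds both
]
score = mean_recall_at_k(runs, k=8)
```

Running the same labeled query set through each setup (vector only, hybrid, hybrid plus reranker) gives directly comparable numbers.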

Cost

For 1M RAG queries / month with 100 candidates each:

  • Cohere Rerank-3: ~$1k/month.
  • Voyage rerank-2: ~$50/month.
  • Self-hosted BGE-Reranker on 1× L4: ~$200/month compute.

Compare that to the LLM call cost on the same volume (typically $5k+). At $50–$200 a month, the cheaper options are a rounding error on the bill, and even Cohere is a fraction of it.
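
The monthly figures follow directly from the per-1k-search prices in the comparison table. A quick sketch of the arithmetic, using the prices as listed in this post (check current vendor pricing before budgeting); `monthly_cost` is an illustrative helper.

```python
def monthly_cost(queries_per_month: int, price_per_1k_searches: float) -> float:
    # One rerank call ("search") per query.
    return queries_per_month / 1000 * price_per_1k_searches


QUERIES = 1_000_000
cohere_cost = monthly_cost(QUERIES, 1.00)   # $1 per 1k searches
voyage_cost = monthly_cost(QUERIES, 0.05)   # $0.05 per 1k searches
self_hosted_cost = 200.0                    # flat GPU cost, independent of volume
price_ratio = cohere_cost / voyage_cost
```

The self-hosted figure is flat, so it gets relatively cheaper as query volume grows, while the hosted options scale linearly.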

When NOT to use a reranker

  • Sub-10ms latency budget total (very rare for RAG).
  • Top-1 retrieval where ranking among the top doesn’t matter.
  • Workloads where the embedder is already 99%+ accurate (almost never true at scale).

For 95% of RAG systems, add a reranker.

Common mistakes

1. Reranking too few candidates

If you only retrieve 8 and rerank 8, the reranker has nothing to fix. Retrieve 30–100, rerank to 8. Give it room.

2. Reranking after the LLM

Reranking exists to inform the LLM. Doing it after defeats the purpose.

3. No batching

Most reranker APIs accept up to 1000 documents per call. Batch — don’t make 100 single-doc requests.
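
If a candidate list ever exceeds the per-call limit, split it into the fewest calls possible rather than one call per document. A minimal sketch: `score_batch` is a hypothetical callable wrapping whichever reranker you use, assumed to return one score per document in input order.

```python
from typing import Callable

BatchScorer = Callable[[str, list[str]], list[float]]


def score_in_batches(
    query: str,
    docs: list[str],
    score_batch: BatchScorer,
    max_docs: int = 1000,
) -> list[float]:
    # One request per slice of at most max_docs documents.
    scores: list[float] = []
    for start in range(0, len(docs), max_docs):
        scores.extend(score_batch(query, docs[start:start + max_docs]))
    return scores


# Fake scorer for illustration: records batch sizes, scores by length.
calls: list[int] = []


def fake_scorer(query: str, batch: list[str]) -> list[float]:
    calls.append(len(batch))
    return [float(len(doc)) for doc in batch]


scores = score_in_batches("q", ["a", "bb", "ccc"], fake_scorer, max_docs=2)
```

With the usual 100 candidates this is a single request; the loop only matters for unusually large candidate sets.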

4. Forgetting that the reranker is a separate model

The reranker has a context limit (typically 512 tokens per pair). If your chunks are larger, truncate or use a longer-context reranker.
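
A rough sketch of the truncation option. It counts whitespace-separated words as a stand-in for tokens, which undercounts real subword tokens, so in practice you would measure with the reranker's own tokenizer; `truncate_to_budget` and the `reserved` allowance for special tokens are illustrative assumptions.

```python
def truncate_to_budget(
    query: str, chunk: str, max_tokens: int = 512, reserved: int = 4
) -> str:
    # The query and the chunk share one input, so the chunk only gets
    # what is left after the query and a few special tokens.
    budget = max_tokens - reserved - len(query.split())
    words = chunk.split()
    return " ".join(words[:max(budget, 0)])


chunk = "one two three four five six seven eight"
trimmed = truncate_to_budget("what is", chunk, max_tokens=10, reserved=3)
```

Truncating from the end assumes the most useful text sits at the start of the chunk; if that does not hold for your data, smaller chunks or a longer-context reranker is the safer choice.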

5. Mixing reranker scores with retrieval scores

Reranker scores and vector similarity scores are on different scales. Don’t average them. The reranker’s order is what you want; that’s the whole signal.

What I’d build today

For a new RAG project in 2026:

  • OpenAI text-embedding-3-small for embeddings.
  • pgvector with HNSW for the index.
  • Postgres FTS for the lexical leg.
  • RRF fusion of the two.
  • Cohere Rerank-3 OR Voyage rerank-2 as the final stage.
  • Top 8 to the LLM.

That’s a SOTA-2026 RAG pipeline: roughly 80% of the quality of any closed-source product, all assembled from boring infrastructure.

Read this next

If you want a hybrid + rerank pipeline I’ve shipped, with an eval harness, it’s at rajpoot.dev.

