What is a reranker and why does it matter?

A reranker scores each (query, document) pair with a more powerful model than vector similarity, then reorders the top candidates from your retriever. Adding one to a RAG pipeline typically improves answer quality by 10–25% for one extra API call per query, which makes it the highest-ROI upgrade you can make.

Cohere Rerank or open-source BGE-Reranker?

Cohere Rerank for hosted simplicity and best-in-class quality. BGE-Reranker for self-hosted privacy / cost. Jina Reranker is the third serious option. All three are within 5% of each other on retrieval benchmarks.

Does a reranker replace better embeddings?

It complements them. A weak embedder + reranker often beats a strong embedder alone. Both layers earn their cost; embeddings find the candidate pool, the reranker picks the best of it.

Rerankers in RAG — The Underrated Quality Multiplier in 2026

If your RAG system returns “okay” answers and you’re wondering what to upgrade next, the answer is almost always: add a reranker. One extra API call per query. 10–25% quality improvement on most workloads. Best ROI on the entire RAG stack.

Why rerankers work

Vector search is fast but coarse. The embedding model encodes a document and a query into separate vectors and measures distance. It can’t reason about how this specific query relates to this specific document.

A reranker (cross-encoder) takes both the query and the document together as input, runs a transformer that attends across them, and outputs a relevance score. It’s slow per pair, but you only run it on the top 100 or so candidates from your retriever, not the millions in your corpus.

Query → Embedder → top 100 candidates (fast)
                          │
                          ▼
                     Reranker (slower, only on 100)
                          │
                          ▼
                    Top 8 reranked → Prompt

The two-stage architecture is fast where it needs to be (initial retrieval over millions) and accurate where it matters (final ranking).
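To make the shape of that concrete, here is a minimal sketch of the two-stage pattern in plain Python. The names `two_stage_retrieve`, `word_overlap`, and `phrase_aware` are made up for illustration, and the two toy scorers only stand in for an embedding similarity and a cross-encoder. A real stage 1 would query an ANN index rather than scan the corpus.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def two_stage_retrieve(
    query: str,
    corpus: list[str],
    cheap_score: Scorer,
    precise_score: Scorer,
    n_candidates: int = 100,
    k: int = 8,
) -> list[str]:
    # Stage 1: cheap score over everything, keep the best n_candidates.
    # (A full scan here; a real system would use an ANN index instead.)
    candidates = sorted(
        corpus, key=lambda d: cheap_score(query, d), reverse=True
    )[:n_candidates]
    # Stage 2: expensive score over the survivors only, keep the best k.
    return sorted(
        candidates, key=lambda d: precise_score(query, d), reverse=True
    )[:k]


# Hypothetical toy scorers, not real models.
def word_overlap(query: str, doc: str) -> float:
    # Counts shared words; ignores word order, like a coarse first stage.
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_aware(query: str, doc: str) -> float:
    # Looks at query and document together; rewards an exact phrase match.
    bonus = 10.0 if query.lower() in doc.lower() else 0.0
    return bonus + word_overlap(query, doc)


corpus = [
    "reset your password from the account page",
    "password reset emails can take a few minutes",
    "how to reset a router to factory settings",
    "billing and invoices overview",
]
top = two_stage_retrieve(
    "password reset", corpus, word_overlap, phrase_aware, n_candidates=3, k=1
)
```

The first two documents tie on word overlap, so stage 1 alone cannot separate them. The second scorer sees the phrase in context and promotes the right one, which is the job a reranker does at larger scale.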

The contenders in 2026

Reranker	Quality	Latency	Cost	Self-host
Cohere Rerank-3	Top-tier	~50ms / 100 docs	$1 / 1k searches	No
Voyage rerank-2	Top-tier	~80ms / 100 docs	$0.05 / 1k searches	No
BGE-Reranker-v2-M3	Excellent	~100ms / 100 docs	self-hosted	Yes
Jina Reranker v2	Strong	~80ms / 100 docs	API or self-host	Partial
Custom cross-encoder (sentence-transformers)	Variable	~200ms / 100 docs	self-hosted	Yes

For 2026 production:

Hosted, easy: Cohere Rerank-3.
Hosted, cheap: Voyage rerank-2 (often 20× cheaper).
Self-hosted: BGE-Reranker-v2-M3 on a single GPU.

Wiring one in

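import os

import cohere

# AsyncClient so the rerank call does not block the event loop.
co = cohere.AsyncClient(api_key=os.environ["COHERE_API_KEY"])

async def retrieve(query: str, k: int = 8) -> list[Doc]:
    # Doc and pgvector_search are your own application code.
    # Stage 1: vector search returns top 100
    candidates = await pgvector_search(query, k=100)

    # Stage 2: rerank the candidate texts against the query
    docs_text = [c.content for c in candidates]
    response = await co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=docs_text,
        top_n=k,
    )

    # Results come back sorted by relevance; r.index points into candidates.
    return [candidates[r.index] for r in response.results]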

Three lines of business logic. The reranker takes 100 candidates and gives you the top k=8.

For self-hosted with BGE-Reranker:

from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

def rerank(query: str, candidates: list[str], k: int = 8) -> list[int]:
    pairs = [[query, c] for c in candidates]
    scores = reranker.compute_score(pairs)
    indexed = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    return [i for i, _ in indexed[:k]]

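Run it on a single L4 GPU. At the ~100ms per 100 documents quoted in the table above, that is on the order of 1k (query, document) pairs scored per second, so budget throughput in pairs rather than in full 100-document rerank calls.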

Hybrid + rerank — the production pattern

The full 2026 RAG retrieval pattern:

Query
  │
  ├──▶ Vector search (top 50)
  │
  └──▶ BM25 / lexical (top 50)
              │
              ▼
        RRF fusion (up to 100 unique)
              │
              ▼
         Reranker (top 8)
              │
              ▼
            LLM

Hybrid retrieval provides candidate diversity. The reranker picks the best of them. Together: quality that prompt engineering alone can’t match.
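The fusion step is small enough to write by hand. Below is a minimal sketch of reciprocal rank fusion, where each document earns 1 / (k + rank) from every list it appears in. The function name `rrf_fuse`, the document ids, and the default constant k=60 (a commonly used value) are assumptions for illustration, not something this pipeline prescribes.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, limit: int = 100) -> list[str]:
    """Fuse several ranked lists of document ids with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each appearance contributes 1 / (k + rank); ranks start at 1.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:limit]


vector_hits = ["d3", "d1", "d7"]  # hypothetical ids from the vector leg
bm25_hits = ["d1", "d9", "d3"]    # hypothetical ids from the lexical leg
fused = rrf_fuse([vector_hits, bm25_hits])
# fused == ["d1", "d3", "d9", "d7"]
```

Documents that show up in both lists rise to the top, and RRF only needs ranks, so you never have to reconcile cosine distances with BM25 scores. The fused list is what goes to the reranker.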

Latency budget

Stage	Latency
Vector search (pgvector HNSW)	5–20ms
BM25 (Postgres FTS)	5–20ms
RRF fusion (in-process)	<1ms
Reranker (Cohere)	50–100ms
LLM generate	500–2000ms
Total	~600–2200ms

The reranker adds ~100ms. If your end-to-end budget is 2 seconds, that’s 5%. For the quality lift, it’s free.
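If you want to check the arithmetic against your own numbers, a few lines are enough. This sketch uses the ranges from the table, treats the sub-millisecond RRF step as 0–1ms (my assumption), and compares running the two retrieval legs one after the other with running them in parallel.

```python
# (low, high) latency in milliseconds per stage, taken from the table above.
stages_ms = {
    "vector": (5, 20),
    "bm25": (5, 20),
    "rrf": (0, 1),  # assumption: "<1ms" modelled as 0 to 1
    "rerank": (50, 100),
    "llm": (500, 2000),
}


def total(bound: int, parallel_retrieval: bool) -> int:
    """Sum the pipeline at the low (0) or high (1) end of each range."""
    v, b = stages_ms["vector"][bound], stages_ms["bm25"][bound]
    # In parallel, the two retrieval legs cost only as much as the slower one.
    retrieval = max(v, b) if parallel_retrieval else v + b
    return (
        retrieval
        + stages_ms["rrf"][bound]
        + stages_ms["rerank"][bound]
        + stages_ms["llm"][bound]
    )


sequential = (total(0, False), total(1, False))  # (560, 2141)
parallel = (total(0, True), total(1, True))      # (555, 2121)
rerank_share = stages_ms["rerank"][1] / 2000     # 0.05 of a 2-second budget
```

Either way you land in the same ballpark as the rounded total in the table, and the LLM call dominates.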

Quality numbers

Internal benchmarks I’ve run on a domain-specific RAG corpus:

Setup	Recall@8	Answer quality (LLM-judged)
Vector only	0.71	6.8 / 10
Hybrid (vector + BM25)	0.79	7.5 / 10
Hybrid + reranker	0.91	8.7 / 10
Hybrid + reranker + better embedder	0.93	8.9 / 10

The reranker delivers more lift than upgrading the embedding model. The two stack, but if you can only do one, add the reranker.

For the eval mechanics, see LLM Evaluations.
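For reference, Recall@k itself is a few lines. This is a sketch of the common definition (the fraction of known-relevant documents that appear in the top k), not necessarily the exact harness behind the table. The function names and the tiny labelled set are made up for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 8) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int = 8) -> float:
    """Average Recall@k over (retrieved ids, relevant ids) pairs, one per query."""
    return sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Hypothetical labelled queries: (ranked result ids, ids known to be relevant).
runs = [
    (["a", "b", "c"], {"a", "c"}),  # both relevant docs retrieved -> 1.0
    (["x", "y", "z"], {"y", "q"}),  # one of two retrieved -> 0.5
]
score = mean_recall_at_k(runs, k=3)  # 0.75
```

Run it once on the retriever output and once on the reranked output with the same labels, and the difference is the lift you can attribute to the reranker.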

Cost

For 1M RAG queries / month with 100 candidates each:

Cohere Rerank-3: ~$1k/month.
Voyage rerank-2: ~$50/month.
Self-hosted BGE-Reranker on 1× L4: ~$200/month compute.

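Compare that with the LLM call cost on the same volume (typically $5k+). At Voyage or self-hosted prices the reranker is a rounding error on the bill; even at Cohere’s price it is a fraction of what the LLM costs.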
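The monthly figures follow directly from the per-search prices in the contenders table. Here is that arithmetic as a small sketch, so you can plug in your own volume. The prices are the ones quoted in this post, not live pricing, and `monthly_cost` is just an illustrative helper.

```python
QUERIES_PER_MONTH = 1_000_000

# USD per 1k searches, as quoted in the table earlier in this post.
price_per_1k_searches = {
    "Cohere Rerank-3": 1.00,
    "Voyage rerank-2": 0.05,
}


def monthly_cost(price_per_1k: float, queries: int = QUERIES_PER_MONTH) -> float:
    """One rerank call ("search") per query, billed per thousand searches."""
    return queries / 1000 * price_per_1k


costs = {name: monthly_cost(p) for name, p in price_per_1k_searches.items()}
# Roughly $1,000 for Cohere and $50 for Voyage at 1M queries per month.
price_ratio = costs["Cohere Rerank-3"] / costs["Voyage rerank-2"]  # about 20x
```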

When NOT to use a reranker

Sub-10ms latency budget total (very rare for RAG).
Top-1 retrieval where ranking among the top doesn’t matter.
Workloads where the embedder is already 99%+ accurate (almost never true at scale).

For 95% of RAG systems, add a reranker.

Common mistakes

1. Reranking too few candidates

If you only retrieve 8 and rerank 8, the reranker has nothing to fix. Retrieve 30–100, rerank to 8. Give it room.

2. Reranking after the LLM

Reranking exists to inform the LLM. Doing it after defeats the purpose.

3. No batching

Most reranker APIs accept up to 1000 documents per call. Batch — don’t make 100 single-doc requests.
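If you ever do exceed the per-call limit, split into as few requests as possible and merge by score. A minimal sketch follows, assuming a hypothetical `score_batch(query, docs)` callable that wraps whichever reranker you use and returns one score per document. It also assumes each pair is scored independently, so scores from different batches are comparable.

```python
from typing import Callable

BatchScorer = Callable[[str, list[str]], list[float]]


def rerank_in_batches(
    query: str,
    docs: list[str],
    score_batch: BatchScorer,
    k: int = 8,
    max_batch: int = 1000,
) -> list[int]:
    """Return indices of the top-k docs, using one request per max_batch docs."""
    scores: list[float] = []
    for start in range(0, len(docs), max_batch):
        batch = docs[start:start + max_batch]
        scores.extend(score_batch(query, batch))  # one call per batch
    # Rank all documents by score, highest first, and keep the best k.
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy scorer that counts how many requests were made (illustration only).
calls = []

def toy_scorer(query: str, batch: list[str]) -> list[float]:
    calls.append(len(batch))
    return [float(len(d)) for d in batch]


top = rerank_in_batches("q", ["aa", "a", "aaaa", "aaa", ""], toy_scorer, k=2, max_batch=2)
# top == [2, 3], produced with three requests instead of five
```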

4. Forgetting that reranker = different model

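The reranker is a separate model with its own context limit, often around 512 tokens per (query, document) pair for classic cross-encoders and more for newer long-context ones. If your chunks are larger, truncate them or use a longer-context reranker.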
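A rough sketch of the truncation option is below. It counts whitespace-separated words as a crude stand-in for tokens, which undercounts for most tokenizers, so in practice you would count with the reranker’s own tokenizer. The function name and the `reserved` allowance for special tokens are assumptions.

```python
def truncate_pair(query: str, doc: str, max_tokens: int = 512, reserved: int = 3) -> str:
    """Trim doc so that query + doc fits the per-pair budget.

    Words are used as a rough proxy for tokens (an assumption); `reserved`
    leaves room for separator and special tokens the model adds.
    """
    budget = max_tokens - reserved - len(query.split())
    if budget <= 0:
        raise ValueError("query alone exceeds the reranker's context limit")
    words = doc.split()
    # Keep the start of the chunk; anything past the budget is dropped.
    return " ".join(words[:budget])


doc = " ".join(f"w{i}" for i in range(1, 11))  # "w1 w2 ... w10"
trimmed = truncate_pair("a b", doc, max_tokens=8, reserved=3)
# budget = 8 - 3 - 2 = 3 words, so trimmed == "w1 w2 w3"
```

Keeping the head of the chunk is the simplest policy. If the relevant passage tends to sit mid-chunk, smaller chunks at indexing time are usually a better fix than smarter truncation.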

5. Mixing reranker scores with retrieval scores

Reranker scores are not normalized to vector similarity scores. Don’t average them. The reranker’s order is what you want; that’s the whole signal.

What I’d build today

For a new RAG project in 2026:

OpenAI text-embedding-3-small for embeddings.
pgvector with HNSW for the index.
Postgres FTS for the lexical leg.
RRF fusion of the two.
Cohere Rerank-3 OR Voyage rerank-2 as the final stage.
Top 8 to the LLM.

That’s a state-of-the-art RAG pipeline for 2026: roughly 80% of the quality of any closed-source product, all assembled from boring infrastructure.

Read this next

If you want to see a hybrid + rerank pipeline I’ve shipped, eval harness included, it’s at rajpoot.dev.

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work. See my projects and ways to reach me at rajpoot.dev.
