If your RAG system returns “okay” answers and you’re wondering what to upgrade next, the answer is almost always: add a reranker. One extra API call per query. 10–25% quality improvement on most workloads. Best ROI on the entire RAG stack.
Why rerankers work
Vector search is fast but coarse. The embedding model encodes a document and a query into separate vectors and measures distance. It can’t reason about how this specific query relates to this specific document.
A reranker (cross-encoder) takes both the query and the document together as input, runs a transformer that attends across them, and outputs a relevance score. It’s slow per pair, but you only run it on the top 30–100 candidates from your retriever, not the millions in your corpus.
```text
Query → Embedder → top 100 candidates (fast)
                         │
                         ▼
          Reranker (slower, only on 100)
                         │
                         ▼
              Top 8 reranked → Prompt
```
The two-stage architecture is fast where it needs to be (initial retrieval over millions) and accurate where it matters (final ranking).
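To make the shape of the two stages concrete, here is a toy sketch in plain Python. The names `cheap_score`, `two_stage_search`, and `pair_score` are illustrative and not from any library, and word overlap stands in for a real embedder. The only point is that the expensive scorer runs on the shortlist, never on the whole corpus.

```python
from typing import Callable


def cheap_score(query: str, doc: str) -> float:
    """Stage-1 stand-in: fraction of query words that appear in the doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / max(len(q_words), 1)


def two_stage_search(
    query: str,
    corpus: list[str],
    pair_score: Callable[[str, str], float],
    n_candidates: int = 100,
    k: int = 8,
) -> list[str]:
    # Stage 1: score everything cheaply and keep a generous shortlist.
    shortlist = sorted(
        corpus, key=lambda d: cheap_score(query, d), reverse=True
    )[:n_candidates]
    # Stage 2: run the expensive (query, doc) scorer on the shortlist only.
    reranked = sorted(
        shortlist, key=lambda d: pair_score(query, d), reverse=True
    )
    return reranked[:k]
```

In a real system `cheap_score` would be an ANN index lookup and `pair_score` a cross-encoder call; the control flow stays the same.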
The contenders in 2026
| Model | Quality | Latency | Cost | Self-host |
|---|---|---|---|---|
| Cohere Rerank-3 | Top-tier | ~50ms / 100 docs | $1 / 1k searches | No |
| Voyage rerank-2 | Top-tier | ~80ms / 100 docs | $0.05 / 1k searches | No |
| BGE-Reranker-v2-M3 | Excellent | ~100ms / 100 docs | self-hosted | Yes |
| Jina Reranker v2 | Strong | ~80ms / 100 docs | API or self-host | Partial |
| Custom cross-encoder (sentence-transformers) | Variable | ~200ms / 100 docs | self-hosted | Yes |
For 2026 production:
- Hosted, easy: Cohere Rerank-3.
- Hosted, cheap: Voyage rerank-2 (often 20× cheaper).
- Self-hosted: BGE-Reranker-v2-M3 on a single GPU.
Wiring one in
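```python
import os

import cohere

# AsyncClient keeps the rerank call from blocking the event loop.
co = cohere.AsyncClient(api_key=os.environ["COHERE_API_KEY"])


async def retrieve(query: str, k: int = 8) -> list[Doc]:
    # Stage 1: vector search returns the top 100 candidates.
    # `pgvector_search` and `Doc` are defined elsewhere in the app.
    candidates = await pgvector_search(query, k=100)

    # Stage 2: rerank the candidate texts against the query.
    docs_text = [c.content for c in candidates]
    response = await co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=docs_text,
        top_n=k,
    )
    # Each result carries the index of its document in `docs_text`.
    return [candidates[r.index] for r in response.results]
```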
Three lines of business logic. The reranker takes 100 candidates and gives you the top k=8.
For self-hosted with BGE-Reranker:
```python
from FlagEmbedding import FlagReranker

# fp16 speeds up GPU inference at a small cost in precision.
reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)


def rerank(query: str, candidates: list[str], k: int = 8) -> list[int]:
    """Return the indices of the top-k candidates, best first."""
    pairs = [[query, c] for c in candidates]
    scores = reranker.compute_score(pairs)
    indexed = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    return [i for i, _ in indexed[:k]]
```
Run inference on a single L4 GPU. At the ~100ms per 100 documents listed in the table above, that works out to roughly 1k query-document pairs per second.
Hybrid + rerank — the production pattern
The full 2026 RAG retrieval pattern:
```text
Query
  │
  ├──▶ Vector search (top 100)
  │
  └──▶ BM25 / lexical (top 100)
            │
            ▼
   RRF fusion (top 100 unique)
            │
            ▼
     Reranker (top 8)
            │
            ▼
           LLM
```
Hybrid retrieval provides candidate diversity. The reranker picks the best of them. Together: quality that prompt engineering alone can’t match.
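The fusion step is small enough to write by hand. Below is a minimal sketch of Reciprocal Rank Fusion, assuming each leg returns an ordered list of document IDs. The function name `rrf_fuse` is mine, and `k=60` is the constant commonly used for RRF rather than something tuned for this pipeline.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 100) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, so the top hit in a list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Each document appears once in `scores`, so the output is already unique.
    fused = sorted(scores, key=lambda d: scores[d], reverse=True)
    return fused[:top_n]
```

Because RRF uses only ranks, it never has to reconcile cosine distances with BM25 scores. A document that ranks well in both legs rises above one that ranks first in only one.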
Latency budget
| Stage | Latency |
|---|---|
| Vector search (pgvector HNSW) | 5–20ms |
| BM25 (Postgres FTS) | 5–20ms |
| RRF fusion (in-process) | <1ms |
| Reranker (Cohere) | 50–100ms |
| LLM generate | 500–2000ms |
| Total | ~600–2200ms |
The reranker adds ~100ms. If your end-to-end budget is 2 seconds, that’s 5%. For the quality lift, it’s free.
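The arithmetic behind that budget can be checked in a few lines. This is a sketch using the numbers from the table; the `parallel_retrieval` flag reflects my assumption that the vector and BM25 legs can run concurrently, which the table itself does not state.

```python
# Stage latencies in ms as (low, high), copied from the table above.
# "<1ms" for RRF is entered as (0, 1).
STAGES = {
    "vector": (5, 20),
    "bm25": (5, 20),
    "rrf": (0, 1),
    "rerank": (50, 100),
    "llm": (500, 2000),
}


def total_latency(stages, parallel_retrieval=True):
    """Sum the (low, high) bounds, optionally overlapping the two retrieval legs."""
    totals = []
    for bound in (0, 1):
        vector, bm25 = stages["vector"][bound], stages["bm25"][bound]
        retrieval = max(vector, bm25) if parallel_retrieval else vector + bm25
        rest = sum(stages[name][bound] for name in ("rrf", "rerank", "llm"))
        totals.append(retrieval + rest)
    return tuple(totals)


def rerank_share(rerank_ms, budget_ms):
    """Fraction of the end-to-end budget spent in the reranker."""
    return rerank_ms / budget_ms
```

Either way the total lands near the table's rounded figure, and `rerank_share(100, 2000)` gives the 5% quoted above. The LLM call dominates; the reranker does not.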
Quality numbers
Internal benchmarks I’ve run on a domain-specific RAG corpus:
| Setup | Recall@8 | Answer quality (LLM-judged) |
|---|---|---|
| Vector only | 0.71 | 6.8 / 10 |
| Hybrid (vector + BM25) | 0.79 | 7.5 / 10 |
| Hybrid + reranker | 0.91 | 8.7 / 10 |
| Hybrid + reranker + better embedder | 0.93 | 8.9 / 10 |
The reranker delivers more lift than upgrading the embedding model. Both stack — but if you can only do one, do reranker.
For the eval mechanics see LLM Evaluations.
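For readers who want to reproduce the Recall@8 column on their own corpus, here is a sketch of the usual definition: the fraction of known-relevant documents that appear in the top k results, averaged over queries. The function names and the shape of `runs` are illustrative assumptions, not the harness used for the numbers above.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 8) -> float:
    """Fraction of the relevant doc IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("need at least one relevant document per query")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int = 8) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    return sum(recall_at_k(got, rel, k) for got, rel in runs) / len(runs)
```

Running `mean_recall_at_k` once per setup (vector only, hybrid, hybrid plus reranker) on the same labelled queries gives a like-for-like comparison.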
Cost
For 1M RAG queries / month with 100 candidates each:
- Cohere Rerank-3: ~$1k/month.
- Voyage rerank-2: ~$50/month.
- Self-hosted BGE-Reranker on 1× L4: ~$200/month compute.
Compare that to the LLM call cost on the same volume ($5k+ typically). The reranker is a rounding error on the bill.
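The hosted figures follow directly from the per-search prices in the comparison table. A quick sketch, assuming one query with its 100 candidates is billed as a single search (check your provider's billing unit before relying on this):

```python
def monthly_rerank_cost(queries_per_month: int, usd_per_1k_searches: float) -> float:
    """Monthly reranker bill, assuming one billed search per query."""
    return queries_per_month / 1000 * usd_per_1k_searches


cohere_cost = monthly_rerank_cost(1_000_000, 1.00)  # price from the table above
voyage_cost = monthly_rerank_cost(1_000_000, 0.05)  # price from the table above
```

The ratio of the two results is where the 20× figure quoted earlier comes from.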
When NOT to use a reranker
- Sub-10ms latency budget total (very rare for RAG).
- Top-1 retrieval where ranking among the top doesn’t matter.
- Workloads where the embedder is already 99%+ accurate (almost never true at scale).
For 95% of RAG systems, add a reranker.
Common mistakes
1. Reranking too few candidates
If you only retrieve 8 and rerank 8, the reranker has nothing to fix. Retrieve 30–100, rerank to 8. Give it room.
2. Reranking after the LLM
Reranking exists to inform the LLM. Doing it after defeats the purpose.
3. No batching
Most reranker APIs accept up to 1000 documents per call. Batch — don’t make 100 single-doc requests.
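4. Forgetting that the reranker is a different model
The reranker has its own context limit per query-document pair: 512 tokens for many open cross-encoders, several thousand for some newer hosted models. Check the limit for the model you use; if your chunks are larger, truncate them or switch to a longer-context reranker.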
5. Mixing reranker scores with retrieval scores
Reranker scores are not normalized to vector similarity scores. Don’t average them. The reranker’s order is what you want; that’s the whole signal.
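Mistakes 3 to 5 can be avoided together with a thin wrapper. The sketch below is one way to do it, not a library API: `score_batch` is a hypothetical callable wrapping whichever reranker you use, `truncate_words` is a crude whitespace-based stand-in for real token counting, and the default `max_words=300` is an arbitrary placeholder. It also assumes the reranker scores each pair independently, so scores from different batches are comparable.

```python
from typing import Callable

# Hypothetical scorer signature: takes a query and a batch of texts and
# returns one relevance score per text, in the same order.
BatchScorer = Callable[[str, list[str]], list[float]]


def truncate_words(text: str, max_words: int) -> str:
    """Crude truncation by whitespace words; a real tokenizer counts differently."""
    return " ".join(text.split()[:max_words])


def rerank_in_batches(
    query: str,
    docs: list[str],
    score_batch: BatchScorer,
    k: int = 8,
    batch_size: int = 1000,
    max_words: int = 300,
) -> list[int]:
    """Return the indices of the top-k docs, best first."""
    # Mistake 4: keep every pair inside the reranker's context limit.
    clipped = [truncate_words(d, max_words) for d in docs]
    # Mistake 3: send documents in large batches, not one request per doc.
    scores: list[float] = []
    for start in range(0, len(clipped), batch_size):
        scores.extend(score_batch(query, clipped[start:start + batch_size]))
    # Mistake 5: order by reranker score alone; retrieval scores are not mixed in.
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

Returning indices instead of texts lets the caller map results back to the original, untruncated chunks before building the prompt.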
What I’d build today
For a new RAG project in 2026:
- OpenAI text-embedding-3-small for embeddings.
- pgvector with HNSW for the index.
- Postgres FTS for the lexical leg.
- RRF fusion of the two.
- Cohere Rerank-3 OR Voyage rerank-2 as the final stage.
- Top 8 to the LLM.
That’s a SOTA-2026 RAG pipeline: roughly 80% of the quality of any closed-source product, all assembled from boring infrastructure.
Read this next
- Build a Production RAG App with pgvector and FastAPI
- Embedding Models in 2026
- pgvector Deep Dive
- LLM Evaluations
If you want a hybrid + rerank pipeline I’ve shipped, with eval harness, it’s at rajpoot.dev.