The embedding model is the most consequential choice in any RAG / semantic-search system. Get it wrong and no amount of prompt engineering downstream rescues you. In 2026 the landscape has matured; this post is the working comparison and decision guide.
The contenders
| Model | Provider | Dim | Cost / 1M tokens | Open? |
|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 (or trunc) | $0.02 | No |
| text-embedding-3-large | OpenAI | 3072 (or trunc) | $0.13 | No |
| voyage-3 | Voyage | 1024 | $0.06 | No |
| voyage-3-large | Voyage | 2048 | $0.18 | No |
| voyage-code-3 | Voyage | 1024 | $0.18 | No |
| Cohere embed-v4 | Cohere | 1536 | $0.05 | No |
| BGE-M3 | BAAI | 1024 | self-hosted | Yes |
| Nomic Embed v2 | Nomic | 768 | self-hosted | Yes |
| jina-embeddings-v3 | Jina | 1024 | self-hosted or API | Partial |
| Snowflake arctic-embed-l | Snowflake | 1024 | self-hosted | Yes |
Quality (rough MTEB ranking)
Approximate order by average MTEB score, best first (positions will shift as models and the benchmark are updated):
- voyage-3-large
- voyage-3
- text-embedding-3-large
- cohere embed-v4
- BGE-M3
- text-embedding-3-small
- Nomic Embed v2
Real-world: differences between the top 5 are small. Differences on your data can flip the order. Always evaluate.
How to pick
1. Start with general purpose
For a typical RAG over English text:
- Hosted, cost-aware: text-embedding-3-small. Cheap, good, ubiquitous.
- Hosted, quality-first: voyage-3 or voyage-3-large. Top of MTEB, especially for retrieval.
- Self-hosted: BGE-M3 (multilingual) or Nomic Embed v2.
2. Domain-specialized when possible
- Code: voyage-code-3 (or BGE-Code-V1.5). Trained on code; significantly better than generic models for code search.
- Multilingual: BGE-M3 covers 100+ languages. Voyage and Cohere have multilingual variants too.
- Legal / medical: Specialized embeddings exist; benchmark them against a general-purpose model. Often the gain is small once you have a good reranker.
3. Match the task
Two RAG sub-tasks:
- Symmetric similarity (find similar items). Most general embeddings work.
- Asymmetric retrieval (find documents matching a question). Use voyage-3, cohere embed-v4, or models with explicit asymmetric instruction prefixes.
For a setup like the one in Build a RAG App, asymmetric is what you want.
Dimensions and Matryoshka
Smaller dimensions = smaller index, faster search, less storage.
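# OpenAI: ask for a shorter dimension directly
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    model="text-embedding-3-large",
    input=texts,  # list of strings to embed
    dimensions=1024,  # Matryoshka: quality drops gracefully
)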
Approximate quality loss when truncating text-embedding-3-large:
- 3072 → 2048: ~1%
- 3072 → 1024: ~2–3%
- 3072 → 512: ~5%
For most RAG workloads, 1024 dimensions is the sweet spot: a much smaller index for a marginal quality cost. See pgvector Deep Dive for storage math.
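If you already have full-length vectors stored, the same shortening can be done client-side instead of through the API call above. A minimal sketch, assuming the model was trained Matryoshka-style so that the leading components carry most of the signal; the helper name `truncate_and_normalize` is mine, not part of any SDK:

```python
import math


def truncate_and_normalize(vector, dim):
    """Keep the first `dim` components and rescale to unit length."""
    if dim <= 0 or dim > len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero vector: nothing to rescale
    return [x / norm for x in head]


# Toy 3-dim vector shortened to 2 dims.
short = truncate_and_normalize([3.0, 4.0, 12.0], 2)
```

Re-normalizing matters if your index uses inner product or assumes unit vectors; with plain cosine distance the rescale is harmless but redundant.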
Self-hosting embeddings
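For high-volume or privacy-critical workloads:
- BGE-M3 on a single L4 GPU does ~1k embed/sec. That is roughly $200/month for the GPU. A billion-token-per-day pipeline (about 30B tokens/month) at the OpenAI prices in the table above costs about $600/month on text-embedding-3-small and about $3,900/month on text-embedding-3-large.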
- Inference servers like vLLM, TEI (Text Embeddings Inference), or sentence-transformers serve them.
Tradeoff: you take on the ops work, and the GPU costs the same whether it is busy or idle, which hurts at low volume. Below 100M tokens/month, hosted APIs win.
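To see where raw GPU cost and API cost cross, a rough break-even calculation helps. This is a sketch that only compares the monthly GPU bill against per-token API pricing; it ignores engineering and on-call time, which is a big part of why hosted wins at low volume. The function name is hypothetical:

```python
def breakeven_tokens_per_month(gpu_cost_per_month, api_price_per_million):
    """Monthly token volume at which the hosted API bill equals the GPU bill."""
    if api_price_per_million <= 0:
        raise ValueError("api_price_per_million must be positive")
    return gpu_cost_per_month / api_price_per_million * 1_000_000


# $200/month GPU vs text-embedding-3-small at $0.02 per 1M tokens.
tokens = breakeven_tokens_per_month(200.0, 0.02)
print(f"{tokens:,.0f} tokens/month")
```

With those two inputs the crossover sits around 10 billion tokens per month on hardware cost alone; add whatever you think your ops time is worth to `gpu_cost_per_month` for a more honest number.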
Evaluating on your data
Don’t trust leaderboards. Build your eval:
- 30 queries representative of real usage.
- For each, gold documents that should be retrieved.
- Recall@k for each candidate model.
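def recall_at_k(model, queries, k=10):
    """Fraction of queries with at least one gold document in the top k."""
    correct = 0
    for q in queries:
        # retrieve() is your own search function; it returns the top-k docs
        docs = retrieve(model, q.text, k=k)
        if any(d.id in q.gold_ids for d in docs):
            correct += 1
    return correct / len(queries)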
A 5% recall@10 difference on real data is worth more than 0.5% MTEB difference.
For a deeper eval framework see LLM Evaluations.
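For a small eval set you don't need a vector database at all. The sketch below runs the same "at least one gold doc in the top k" check as `recall_at_k` above, using brute-force cosine similarity over vectors you have already computed with a candidate model. The `Doc` and `Query` classes and the toy 2-dim vectors are illustrative assumptions, not part of any library:

```python
import math
from dataclasses import dataclass


@dataclass
class Doc:
    id: str
    vector: list[float]


@dataclass
class Query:
    vector: list[float]  # query embedding from the candidate model
    gold_ids: set[str]   # ids of documents that should be retrieved


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, docs, k):
    """Brute-force nearest neighbours by cosine similarity."""
    return sorted(docs, key=lambda d: cosine(query_vec, d.vector), reverse=True)[:k]


def hit_rate_at_k(queries, docs, k=10):
    """Fraction of queries with at least one gold document in the top k."""
    hits = sum(
        1 for q in queries
        if any(d.id in q.gold_ids for d in top_k(q.vector, docs, k))
    )
    return hits / len(queries)


docs = [Doc("a", [1.0, 0.0]), Doc("b", [0.0, 1.0]), Doc("c", [0.7, 0.7])]
queries = [Query([0.9, 0.1], {"a"}), Query([0.1, 0.9], {"a"})]
score = hit_rate_at_k(queries, docs, k=1)
```

Embed the corpus and the queries once per candidate model, run this for each, and compare the scores side by side.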
Migration
Switching embedding models means re-embedding everything. Plan:
- Dual-store the new column: ALTER TABLE chunks ADD COLUMN embedding_v2 vector(1024).
- Backfill in batches with the new model.
- A/B compare retrieval on a held-out query set.
- Cut over reads to the new column.
- Drop old column after a release cycle of stability.
Don’t try to do this in-place. Keep both for at least one release.
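The backfill step in the plan above can be sketched as a plain batching loop. Everything here is an assumption for illustration: `embed_batch` stands in for a call to the new embedding model, and `write_batch` stands in for an UPDATE that fills `embedding_v2`; neither is a real library function.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def backfill(rows, embed_batch, write_batch, batch_size=100):
    """Embed (chunk_id, text) rows in batches; return the number written."""
    written = 0
    for batch in batched(rows, batch_size):
        ids = [chunk_id for chunk_id, _ in batch]
        vectors = embed_batch([text for _, text in batch])
        if len(vectors) != len(ids):
            raise ValueError("embed_batch returned a different number of vectors")
        write_batch(list(zip(ids, vectors)))  # one write per batch
        written += len(ids)
    return written


# Toy run with fake embed and write functions.
store = []
rows = [(i, "x" * i) for i in range(5)]
count = backfill(rows, lambda texts: [[float(len(t))] for t in texts], store.append, batch_size=2)
```

Writing one batch at a time keeps the job resumable: if it dies halfway, rows whose `embedding_v2` is still NULL tell you where to pick up.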
Cost in production
For a SaaS embedding 10M docs/month + 1M queries/month:
- OpenAI 3-small: ~$2k/month total.
- Voyage 3: ~$5k/month.
- BGE-M3 self-hosted on 1× L4: ~$300/month + ops.
For most early-stage products, hosted APIs win on TCO. Self-host when volume justifies it.
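Hosted cost scales with tokens, not documents, so the estimates above move a lot with average document length. A small sketch for plugging in your own numbers; the per-document and per-query token averages below are assumptions I picked for illustration, and the prices come from the table at the top:

```python
def monthly_api_cost(docs, tokens_per_doc, queries, tokens_per_query, price_per_million):
    """Estimated monthly embedding spend in dollars for a hosted API."""
    total_tokens = docs * tokens_per_doc + queries * tokens_per_query
    return total_tokens / 1_000_000 * price_per_million


# 10M docs and 1M queries per month; the token averages are assumptions.
for name, price in [("text-embedding-3-small", 0.02), ("voyage-3", 0.06)]:
    cost = monthly_api_cost(10_000_000, 10_000, 1_000_000, 20, price)
    print(f"{name}: ${cost:,.0f}/month")
```

At an assumed 10,000 tokens per document, text-embedding-3-small lands near the ~$2k/month figure above. Halve the document length and the bill roughly halves too, since query traffic is a rounding error next to ingestion at these volumes.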
Hybrid is mandatory
No matter the embedding model, hybrid retrieval (vector + BM25) outperforms vector-only. See Build a RAG App with pgvector for the RRF fusion pattern.
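For reference, reciprocal rank fusion itself is only a few lines. This is a generic sketch, not the code from the linked post: each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The value k=60 is a commonly used default; the function and variable names are mine.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=10):
    """Reciprocal rank fusion over several ranked lists of document ids."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]


vector_hits = ["d1", "d2", "d3"]  # ids ranked by vector similarity
bm25_hits = ["d3", "d1", "d4"]    # ids ranked by BM25
fused = rrf_fuse([vector_hits, bm25_hits])
```

RRF only looks at ranks, so you never have to reconcile cosine scores with BM25 scores, which live on different scales.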
Common mistakes
1. Picking by leaderboard, not eval
MTEB averages many tasks. Your task may not match the average. Eval.
2. Forgetting to truncate
3072-dim float32 vectors are 12 KB each. For 10M chunks that’s 120 GB. Truncating to 1024 dim brings that down to 40 GB. The quality loss is often imperceptible; the infra savings are massive.
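The arithmetic is easy to rerun for your own corpus. A minimal sketch that counts raw vector bytes only, assuming 4-byte float32 values by default; index structures such as HNSW add overhead on top that this does not model:

```python
def raw_vector_storage_gb(num_vectors, dim, bytes_per_value=4):
    """Raw vector storage in decimal gigabytes (no index overhead)."""
    return num_vectors * dim * bytes_per_value / 1e9


full = raw_vector_storage_gb(10_000_000, 3072)       # untruncated
truncated = raw_vector_storage_gb(10_000_000, 1024)  # truncated to 1024 dim
print(f"{full:.1f} GB vs {truncated:.1f} GB")
```

The exact results are slightly above the rounded 120 GB and 40 GB figures because 3072 × 4 bytes is 12,288 bytes rather than an even 12,000.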
3. Mixing dimensions
embedding vector(1536) -- table schema
client.embeddings.create(model="text-embedding-3-small") -- 1536 ✓
client.embeddings.create(model="text-embedding-3-large") -- 3072 ✗ wrong
Pick a model and stick with it across reads and writes.
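A cheap way to catch this early is to check the vector length in application code before it reaches the database. A sketch, with `EXPECTED_DIM` and `validate_embedding` as hypothetical names you would wire into your own write and query paths:

```python
EXPECTED_DIM = 1536  # must match the vector(N) column in the schema


def validate_embedding(vector, expected_dim=EXPECTED_DIM):
    """Return the vector unchanged, or raise if its length is wrong."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"expected {expected_dim} dimensions, got {len(vector)}; "
            "check that reads and writes use the same model"
        )
    return vector


ok = validate_embedding([0.0] * 1536)
```

Failing in the application gives you an error message that names the likely cause, instead of a generic dimension error from the database.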
4. No periodic re-eval
Embedding models update. Run your eval set when a new version drops. Decide before migrating.
5. Treating embeddings as the whole solution
Embeddings + reranker > embeddings alone. After top-30 from vector search, rerank with Cohere Rerank or BGE-Reranker → top-10. Often a bigger win than chasing better embeddings.
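The two-stage shape looks like this in code. It is a sketch only: `score_fn` is a hypothetical stand-in for whatever reranker you call (Cohere Rerank, BGE-Reranker, or anything else that scores a query and document pair), and the word-overlap scorer exists purely so the example runs without a model.

```python
def retrieve_then_rerank(query, candidates, score_fn, top_n=10):
    """Re-order first-stage candidates by a pairwise relevance score."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    # Stable sort: candidates with equal scores keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def word_overlap(query, doc):
    """Toy scorer: number of distinct query words that appear in the doc."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


candidates = [
    "pgvector index tuning",
    "choosing an embedding model",
    "embedding model cost comparison",
]
best = retrieve_then_rerank("embedding model cost", candidates, word_overlap, top_n=2)
```

In production `candidates` would be the top 30 from vector or hybrid search and `top_n` would be the 10 you pass to the LLM.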
Read this next
- Build a Production RAG App with pgvector and FastAPI
- pgvector Deep Dive — HNSW, IVFFlat, and Tuning
- LLM Evaluations
- Self-Hosted LLMs in 2026 — same pattern for embedding inference.
If you want my embedding-model evaluation harness with MTEB-style metrics on your data, it’s at rajpoot.dev.