pgvector turns Postgres into a respectable vector database. The defaults are fine for prototypes and bad for production. This post is the missing operator’s manual: how the indexes work, what the knobs do, and the settings that actually matter.
If you want the building side of vector search, see Build a RAG App with pgvector and FastAPI . This post is about the database internals.
The vector type
CREATE EXTENSION vector;
CREATE TABLE items (
id BIGSERIAL PRIMARY KEY,
embedding vector(1536)
);
A vector(d) is a 4 * d byte array of float4 plus a length header. 1536-dim embeddings = ~6 KB per row. A million rows = ~6 GB on disk. Plan accordingly.
Distance operators
| Operator | Distance | Use when |
|---|---|---|
<-> | L2 (Euclidean) | Embeddings already L2-normalized; image features |
<=> | Cosine | Most LLM embeddings (OpenAI, Cohere, Voyage) |
<#> | Negative inner product | Same as cosine when vectors are unit-norm |
<+> | L1 (Manhattan) | Rare; specific feature spaces |
Pick the operator that matches your embedding model’s training objective. OpenAI’s text-embedding-3-* are trained for cosine similarity → use <=>. Mismatching distance and embedding space silently tanks recall.
The two indexes
IVFFlat — fast to build, less recall
CREATE INDEX items_emb_ivfflat
ON items USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
How it works: clusters vectors into lists Voronoi cells. Queries scan probes cells (default 1, set higher for recall).
SET ivfflat.probes = 10; -- query-time recall knob
Tradeoffs:
- Pros: Faster build, smaller index, simpler to reason about.
- Cons: Recall is sensitive to the dataset; needs roughly
sqrt(rows)lists for good defaults; updates can degrade quality over time (re-cluster needed).
I no longer reach for IVFFlat in 2026. HNSW is just better at almost every scale.
HNSW — the default in 2026
CREATE INDEX items_emb_hnsw
ON items USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
How it works: a multi-layer graph where each node has up to m outgoing edges. Queries do greedy search top-down. Closer to a graph algorithm than a clustering algorithm.
Tradeoffs:
- Pros: Excellent recall, no clustering parameter to guess, supports updates without retrain, works at every scale.
- Cons: Slower to build (O(N · ef_construction · m)), bigger index, RAM-hungry (you really want it to fit in
shared_buffers).
Build parameters (HNSW)
m — graph connectivity
Each node has up to m neighbors. Higher m = better recall, bigger index, slower build.
m | When |
|---|---|
| 8 | Memory-constrained, ≤1M vectors, recall not critical |
| 16 | Default, most workloads |
| 32 | Hard recall targets (>0.95 @ k=10), willing to pay 2x build time |
| 48 | Diminishing returns; rarely worth it |
ef_construction — build-time search width
How aggressively neighbors are searched while building. Higher = better recall, much slower build.
ef_construction | When |
|---|---|
| 40 | Quick prototyping |
| 64 | Default |
| 100 | Production, recall matters |
| 200 | Last resort; halves build speed for marginal recall gain |
Build cost grows linearly with both m and ef_construction. A 10M-vector index at m=32, ef_construction=200 can take hours; the same index at m=16, ef_construction=64 builds in tens of minutes.
Query parameters (HNSW)
ef_search — query-time search width
SET hnsw.ef_search = 100; -- per session
Or per-query in pgvector ≥ 0.7 with a hint. The single most important knob in production.
ef_search | Behavior |
|---|---|
| 40 (default) | Fast, lower recall |
| 100 | Sweet spot for k=10–30 retrieval |
| 200+ | Recall priority, latency cost |
Set ef_search ≥ 2 × k as a rule of thumb. If you’re retrieving top 30, use 60+. If you’re rerankering downstream and fetching top 100, use 200.
Bench your real workload
You will be tempted to copy somebody else’s benchmark numbers. Don’t. Your embedding distribution, your filters, your hardware, and your concurrency change everything.
Minimal harness:
# bench.py
import time, statistics, random, asyncpg, asyncio
async def main():
pool = await asyncpg.create_pool(DATABASE_URL, min_size=8, max_size=8)
# Pre-warm the index into memory.
async with pool.acquire() as c:
await c.execute("SET hnsw.ef_search = 100")
await c.execute("SELECT pg_prewarm('items_emb_hnsw')")
queries = load_query_embeddings(500)
async def one(q):
t = time.perf_counter()
async with pool.acquire() as c:
await c.fetch(
"SELECT id FROM items ORDER BY embedding <=> $1 LIMIT 30",
q,
)
return (time.perf_counter() - t) * 1000
# 8 concurrent workers
sem = asyncio.Semaphore(8)
async def gated(q):
async with sem:
return await one(q)
times = await asyncio.gather(*(gated(q) for q in queries))
print(f"p50: {statistics.median(times):.1f}ms "
f"p95: {sorted(times)[int(len(times)*0.95)]:.1f}ms "
f"p99: {sorted(times)[int(len(times)*0.99)]:.1f}ms")
Run it before and after each tuning change. Always at concurrency that matches production.
Recall measurement
Latency is half the picture. Recall is the other half — and it can drop silently if you change ef_search, m, or your data distribution.
-- Ground truth via brute-force scan (drop the index temporarily, or use SET enable_indexscan = off)
SET enable_indexscan = off;
SELECT id FROM items ORDER BY embedding <=> $1 LIMIT 30; -- ground truth
RESET enable_indexscan;
SELECT id FROM items ORDER BY embedding <=> $1 LIMIT 30; -- HNSW
Recall@30 = (intersection size) / 30, averaged over your test queries. Aim for ≥ 0.95 for RAG, ≥ 0.99 for safety-critical retrieval.
Filters with vector search
The killer feature of pgvector vs. dedicated vector DBs is filtering with normal SQL:
SELECT id
FROM items
WHERE tenant_id = $1 AND category = $2
ORDER BY embedding <=> $3
LIMIT 30;
But filtering interacts badly with HNSW. The planner has two paths:
- Pre-filter — run the WHERE first, then sequential-scan + sort by distance. Slow on big tables.
- Post-filter — use the index, then drop rows that don’t match. Can lose results if the filter is selective.
Two strategies that work:
Strategy 1 — Partial indexes per tenant
CREATE INDEX items_emb_tenant_42
ON items USING hnsw (embedding vector_cosine_ops)
WHERE tenant_id = 42;
Best when tenants are large and stable. The planner uses the right index, search stays fast.
Strategy 2 — Iterative scan (pgvector ≥ 0.8)
pgvector 0.8+ added iterative HNSW scans that play nicely with filters. The planner walks the index returning candidates that match the filter, refilling as needed.
SET hnsw.iterative_scan = strict_order;
-- or 'relaxed_order' if you want fewer guarantees but lower latency
This dramatically improves filtered search recall in 2026. Use it.
Storage and memory
- One HNSW edge per neighbor:
m× 4 bytes per node, plus the vector itself. - A 1M-vector, 1536-dim,
m=16HNSW index ≈ 6 GB vectors + 60 MB graph metadata. - For best latency, both must fit in
shared_buffersor the OS page cache.
# postgresql.conf — sane defaults for a vector workload on a 32 GB box
shared_buffers = 8GB
effective_cache_size = 24GB
maintenance_work_mem = 4GB # speeds up index builds dramatically
work_mem = 32MB
maintenance_work_mem is the magic one for build speed. The default 64 MB is laughable for an HNSW index of any size. Bump to multiple GB during build, set back after.
Quantization (pgvector 0.7+)
For very large datasets, you can store vectors at lower precision:
-- Half-precision (float16) — 50% storage, ~no recall loss for most embeddings
CREATE INDEX items_emb_half
ON items USING hnsw ((embedding::halfvec(1536)) halfvec_cosine_ops);
-- Binary (1 bit per dim) — ~32x storage cut, recall drops; useful as a coarse filter
CREATE INDEX items_emb_bit
ON items USING hnsw ((binary_quantize(embedding)::bit(1536)) bit_hamming_ops);
Pattern at scale: use binary HNSW for the coarse top-1000, then re-rank with full vectors. You get sub-50ms latency on tens of millions of vectors.
Re-embedding and dimension changes
You will at some point switch embedding models. The new vectors are not comparable to the old ones. Plan for it:
ALTER TABLE chunks ADD COLUMN embedding_v2 vector(3072); -- new dim
-- backfill in batches
UPDATE chunks SET embedding_v2 = embed(content) WHERE id BETWEEN ...;
-- new index
CREATE INDEX chunks_emb_v2_hnsw
ON chunks USING hnsw (embedding_v2 vector_cosine_ops) WITH (m = 16, ef_construction = 64);
-- traffic-shift: dual-read, then drop the old.
Don’t try to do an in-place column swap. Keep both for a release cycle, A/B retrieval, then drop.
A real-world tuning recipe
You inherited a 2M-row, 1536-dim table that’s slow. Steps:
- Confirm the index exists and is HNSW. (
\d+ tablein psql.) - Run
EXPLAIN ANALYZEwith a representative query. Confirm it’s using the HNSW index. - Bump
maintenance_work_memand rebuild ifmoref_constructionare too low. - Pre-warm:
SELECT pg_prewarm('items_emb_hnsw'); - Set
ef_search = 100as a session GUC; measure latency and recall. - Increase
shared_buffersto fit the index. - Add a partial index for hot filters.
- Bench the full workload, not single queries.
Eight steps. Most of them take minutes. None of them are exotic.
When pgvector stops being enough
I’d stop forcing pgvector and migrate at:
- >50M vectors with strict p99 latency budgets.
- Hot multi-tenant workloads with thousands of small tenants where partial indexes blow up your catalog.
- Streaming ingestion at >1k vec/s sustained.
At that point look at Qdrant, Weaviate, Milvus, or a managed offering. Below that line, Postgres + pgvector is the right call — and it lets you keep one database to back up.
Read this next
- Build a RAG App with pgvector and FastAPI — the application side.
- PostgreSQL Indexing and EXPLAIN — the basics this post built on.
- PostgreSQL Full-Text Search — the BM25 half of hybrid search.
If you want my full pgvector tuning checklist as a runbook, it’s at rajpoot.dev .
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .