pgvector turns Postgres into a respectable vector database. The defaults are fine for prototypes and bad for production. This post is the missing operator’s manual: how the indexes work, what the knobs do, and the settings that actually matter.

If you want the building side of vector search, see Build a RAG App with pgvector and FastAPI . This post is about the database internals.

The vector type

CREATE EXTENSION vector;

CREATE TABLE items (
  id        BIGSERIAL PRIMARY KEY,
  embedding vector(1536)
);

A vector(d) is a 4 * d byte array of float4 plus a length header. 1536-dim embeddings = ~6 KB per row. A million rows = ~6 GB on disk. Plan accordingly.

Distance operators

OperatorDistanceUse when
<->L2 (Euclidean)Embeddings already L2-normalized; image features
<=>CosineMost LLM embeddings (OpenAI, Cohere, Voyage)
<#>Negative inner productSame as cosine when vectors are unit-norm
<+>L1 (Manhattan)Rare; specific feature spaces

Pick the operator that matches your embedding model’s training objective. OpenAI’s text-embedding-3-* are trained for cosine similarity → use <=>. Mismatching distance and embedding space silently tanks recall.

The two indexes

IVFFlat — fast to build, less recall

CREATE INDEX items_emb_ivfflat
  ON items USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);

How it works: clusters vectors into lists Voronoi cells. Queries scan probes cells (default 1, set higher for recall).

SET ivfflat.probes = 10;          -- query-time recall knob

Tradeoffs:

  • Pros: Faster build, smaller index, simpler to reason about.
  • Cons: Recall is sensitive to the dataset; needs roughly sqrt(rows) lists for good defaults; updates can degrade quality over time (re-cluster needed).

I no longer reach for IVFFlat in 2026. HNSW is just better at almost every scale.

HNSW — the default in 2026

CREATE INDEX items_emb_hnsw
  ON items USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

How it works: a multi-layer graph where each node has up to m outgoing edges. Queries do greedy search top-down. Closer to a graph algorithm than a clustering algorithm.

Tradeoffs:

  • Pros: Excellent recall, no clustering parameter to guess, supports updates without retrain, works at every scale.
  • Cons: Slower to build (O(N · ef_construction · m)), bigger index, RAM-hungry (you really want it to fit in shared_buffers).

Build parameters (HNSW)

m — graph connectivity

Each node has up to m neighbors. Higher m = better recall, bigger index, slower build.

mWhen
8Memory-constrained, ≤1M vectors, recall not critical
16Default, most workloads
32Hard recall targets (>0.95 @ k=10), willing to pay 2x build time
48Diminishing returns; rarely worth it

ef_construction — build-time search width

How aggressively neighbors are searched while building. Higher = better recall, much slower build.

ef_constructionWhen
40Quick prototyping
64Default
100Production, recall matters
200Last resort; halves build speed for marginal recall gain

Build cost grows linearly with both m and ef_construction. A 10M-vector index at m=32, ef_construction=200 can take hours; the same index at m=16, ef_construction=64 builds in tens of minutes.

Query parameters (HNSW)

ef_search — query-time search width

SET hnsw.ef_search = 100;     -- per session

Or per-query in pgvector ≥ 0.7 with a hint. The single most important knob in production.

ef_searchBehavior
40 (default)Fast, lower recall
100Sweet spot for k=10–30 retrieval
200+Recall priority, latency cost

Set ef_search ≥ 2 × k as a rule of thumb. If you’re retrieving top 30, use 60+. If you’re rerankering downstream and fetching top 100, use 200.

Bench your real workload

You will be tempted to copy somebody else’s benchmark numbers. Don’t. Your embedding distribution, your filters, your hardware, and your concurrency change everything.

Minimal harness:

# bench.py
import time, statistics, random, asyncpg, asyncio


async def main():
    pool = await asyncpg.create_pool(DATABASE_URL, min_size=8, max_size=8)

    # Pre-warm the index into memory.
    async with pool.acquire() as c:
        await c.execute("SET hnsw.ef_search = 100")
        await c.execute("SELECT pg_prewarm('items_emb_hnsw')")

    queries = load_query_embeddings(500)

    async def one(q):
        t = time.perf_counter()
        async with pool.acquire() as c:
            await c.fetch(
                "SELECT id FROM items ORDER BY embedding <=> $1 LIMIT 30",
                q,
            )
        return (time.perf_counter() - t) * 1000

    # 8 concurrent workers
    sem = asyncio.Semaphore(8)
    async def gated(q):
        async with sem:
            return await one(q)

    times = await asyncio.gather(*(gated(q) for q in queries))
    print(f"p50: {statistics.median(times):.1f}ms  "
          f"p95: {sorted(times)[int(len(times)*0.95)]:.1f}ms  "
          f"p99: {sorted(times)[int(len(times)*0.99)]:.1f}ms")

Run it before and after each tuning change. Always at concurrency that matches production.

Recall measurement

Latency is half the picture. Recall is the other half — and it can drop silently if you change ef_search, m, or your data distribution.

-- Ground truth via brute-force scan (drop the index temporarily, or use SET enable_indexscan = off)
SET enable_indexscan = off;
SELECT id FROM items ORDER BY embedding <=> $1 LIMIT 30;  -- ground truth
RESET enable_indexscan;

SELECT id FROM items ORDER BY embedding <=> $1 LIMIT 30;  -- HNSW

Recall@30 = (intersection size) / 30, averaged over your test queries. Aim for ≥ 0.95 for RAG, ≥ 0.99 for safety-critical retrieval.

The killer feature of pgvector vs. dedicated vector DBs is filtering with normal SQL:

SELECT id
FROM items
WHERE tenant_id = $1 AND category = $2
ORDER BY embedding <=> $3
LIMIT 30;

But filtering interacts badly with HNSW. The planner has two paths:

  1. Pre-filter — run the WHERE first, then sequential-scan + sort by distance. Slow on big tables.
  2. Post-filter — use the index, then drop rows that don’t match. Can lose results if the filter is selective.

Two strategies that work:

Strategy 1 — Partial indexes per tenant

CREATE INDEX items_emb_tenant_42
  ON items USING hnsw (embedding vector_cosine_ops)
  WHERE tenant_id = 42;

Best when tenants are large and stable. The planner uses the right index, search stays fast.

Strategy 2 — Iterative scan (pgvector ≥ 0.8)

pgvector 0.8+ added iterative HNSW scans that play nicely with filters. The planner walks the index returning candidates that match the filter, refilling as needed.

SET hnsw.iterative_scan = strict_order;
-- or 'relaxed_order' if you want fewer guarantees but lower latency

This dramatically improves filtered search recall in 2026. Use it.

Storage and memory

  • One HNSW edge per neighbor: m × 4 bytes per node, plus the vector itself.
  • A 1M-vector, 1536-dim, m=16 HNSW index ≈ 6 GB vectors + 60 MB graph metadata.
  • For best latency, both must fit in shared_buffers or the OS page cache.
# postgresql.conf — sane defaults for a vector workload on a 32 GB box
shared_buffers = 8GB
effective_cache_size = 24GB
maintenance_work_mem = 4GB        # speeds up index builds dramatically
work_mem = 32MB

maintenance_work_mem is the magic one for build speed. The default 64 MB is laughable for an HNSW index of any size. Bump to multiple GB during build, set back after.

Quantization (pgvector 0.7+)

For very large datasets, you can store vectors at lower precision:

-- Half-precision (float16) — 50% storage, ~no recall loss for most embeddings
CREATE INDEX items_emb_half
  ON items USING hnsw ((embedding::halfvec(1536)) halfvec_cosine_ops);

-- Binary (1 bit per dim) — ~32x storage cut, recall drops; useful as a coarse filter
CREATE INDEX items_emb_bit
  ON items USING hnsw ((binary_quantize(embedding)::bit(1536)) bit_hamming_ops);

Pattern at scale: use binary HNSW for the coarse top-1000, then re-rank with full vectors. You get sub-50ms latency on tens of millions of vectors.

Re-embedding and dimension changes

You will at some point switch embedding models. The new vectors are not comparable to the old ones. Plan for it:

ALTER TABLE chunks ADD COLUMN embedding_v2 vector(3072);   -- new dim
-- backfill in batches
UPDATE chunks SET embedding_v2 = embed(content) WHERE id BETWEEN ...;

-- new index
CREATE INDEX chunks_emb_v2_hnsw
  ON chunks USING hnsw (embedding_v2 vector_cosine_ops) WITH (m = 16, ef_construction = 64);

-- traffic-shift: dual-read, then drop the old.

Don’t try to do an in-place column swap. Keep both for a release cycle, A/B retrieval, then drop.

A real-world tuning recipe

You inherited a 2M-row, 1536-dim table that’s slow. Steps:

  1. Confirm the index exists and is HNSW. (\d+ table in psql.)
  2. Run EXPLAIN ANALYZE with a representative query. Confirm it’s using the HNSW index.
  3. Bump maintenance_work_mem and rebuild if m or ef_construction are too low.
  4. Pre-warm: SELECT pg_prewarm('items_emb_hnsw');
  5. Set ef_search = 100 as a session GUC; measure latency and recall.
  6. Increase shared_buffers to fit the index.
  7. Add a partial index for hot filters.
  8. Bench the full workload, not single queries.

Eight steps. Most of them take minutes. None of them are exotic.

When pgvector stops being enough

I’d stop forcing pgvector and migrate at:

  • >50M vectors with strict p99 latency budgets.
  • Hot multi-tenant workloads with thousands of small tenants where partial indexes blow up your catalog.
  • Streaming ingestion at >1k vec/s sustained.

At that point look at Qdrant, Weaviate, Milvus, or a managed offering. Below that line, Postgres + pgvector is the right call — and it lets you keep one database to back up.

Read this next

If you want my full pgvector tuning checklist as a runbook, it’s at rajpoot.dev .


Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .