Chapter 4: Postgres full-text search via SQLAlchemy. tsvector, tsquery, GIN, ranking, websearch.
Setup
from sqlalchemy import Computed, Index
from sqlalchemy.dialects.postgresql import TSVECTOR
class Post(Base):
__tablename__ = "posts"
id: Mapped[int] = mapped_column(primary_key=True)
title: Mapped[str]
body: Mapped[str]
search_vec: Mapped[str] = mapped_column(
TSVECTOR,
Computed(
"setweight(to_tsvector('english', coalesce(title, '')), 'A') || "
"setweight(to_tsvector('english', coalesce(body, '')), 'B')",
persisted=True,
),
)
__table_args__ = (
Index("ix_posts_search_vec", "search_vec", postgresql_using="gin"),
)
Computed(persisted=True) = STORED generated column. setweight lets you boost matches in title vs body during ranking.
Querying
from sqlalchemy import func, text
q = "postgres jsonb"
stmt = (
select(Post)
.where(Post.search_vec.op("@@")(func.websearch_to_tsquery("english", q)))
.order_by(func.ts_rank(Post.search_vec, func.websearch_to_tsquery("english", q)).desc())
.limit(20)
)
@@ is the match operator. websearch_to_tsquery parses Google-like queries ("phrase", or, -not).
Ranking variants
ts_rank: word frequency–based.ts_rank_cd: cover density (proximity-aware).
.order_by(func.ts_rank_cd(Post.search_vec, q_ts).desc())
ts_rank_cd usually feels more relevant.
Highlighting
stmt = select(
Post,
func.ts_headline("english", Post.body, q_ts, "MaxWords=30, MinWords=10, ShortWord=3, MaxFragments=2").label("highlight")
)
Returns a snippet with matched terms wrapped in <b>...</b> (configurable).
Configuration / language
to_tsvector('english', text) # English stemming
to_tsvector('simple', text) # No stemming
For multi-language: store the language column and dispatch:
to_tsvector(language::regconfig, text)
Synonyms / dictionaries
Postgres FTS supports custom dictionaries / thesaurus / synonyms. For most apps: out-of-the-box english is fine.
For domain-specific (medical, legal): build a custom dictionary or use Elasticsearch.
When to use Postgres FTS
- Up to ~10M docs at moderate query rates.
- Single-language or simple multi-language.
- Don’t need typo tolerance or aggressive ranking.
For more: Meilisearch / Typesense / Elasticsearch. See Search System Design .
Trigram fuzzy match
For typo tolerance:
# CREATE EXTENSION pg_trgm
__table_args__ = (
Index("ix_posts_title_trgm", "title", postgresql_using="gin", postgresql_ops={"title": "gin_trgm_ops"}),
)
# Query
stmt = select(Post).where(Post.title.op("%")(q)) # similarity > threshold
stmt = select(Post).where(Post.title.ilike(f"%{q}%")) # uses trigram index
pg_trgm enables fast LIKE '%...%' and similarity matching. Pair with FTS for hybrid.
Hybrid (FTS + trigram)
fts_stmt = select(Post.id).where(Post.search_vec.op("@@")(q_ts))
trgm_stmt = select(Post.id).where(Post.title.op("%")(q))
stmt = select(Post).where(Post.id.in_(fts_stmt.union(trgm_stmt)))
Or use Reciprocal Rank Fusion for combined ranking (see hybrid search ).
Phrase search
.where(Post.search_vec.op("@@")(func.phraseto_tsquery("english", "exact phrase")))
Or in websearch_to_tsquery: "exact phrase" syntax.
Faceted
For “filter by category + search”:
stmt = (
select(Post)
.where(Post.search_vec.op("@@")(q_ts))
.where(Post.category == category)
)
Composite index helps:
__table_args__ = (
Index("ix_posts_category_search", "category", "search_vec", postgresql_using="gin"),
)
Note: GIN multi-column requires extension btree_gin:
CREATE EXTENSION btree_gin;
Update strategies
For large existing tables: backfill the search_vec column in batches; then add the GIN index concurrently.
# Migration
op.add_column("posts", sa.Column("search_vec", postgresql.TSVECTOR))
op.execute("UPDATE posts SET search_vec = setweight(to_tsvector('english', coalesce(title, '')), 'A') || setweight(to_tsvector('english', coalesce(body, '')), 'B')")
op.execute("CREATE INDEX CONCURRENTLY ix_posts_search ON posts USING gin (search_vec)")
For Computed columns added on existing data: PG fills in lazily; might want explicit update.
RUM index (3rd-party)
For ranking-aware GIN: rum extension:
CREATE EXTENSION rum;
Stores ranking metadata in the index. Faster ts_rank queries. Requires explicit config.
Common mistakes
1. tsvector without GIN
Sequential scans every query. Always GIN.
2. Storing tsvector ad-hoc
Inconsistent. Use a generated column.
3. Ignoring coalesce
NULL fields produce NULL tsvector. coalesce(title, '') for safety.
4. websearch with single language
Multi-language needs per-row config.
5. Trying to compete with Elasticsearch
For complex ranking / typo tolerance / faceting at scale: Postgres FTS isn’t the best tool. Match scope to capability.
What’s next
Chapter 5: Locking and advisory locks.
Read this next
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .