Cheatsheet for Postgres replication and failover.
Two engines (writer / reader)
write_engine = create_async_engine(WRITER_URL, pool_size=20, pool_pre_ping=True)
read_engine = create_async_engine(
READER_URL,
pool_size=40,
pool_pre_ping=True,
connect_args={"server_settings": {"default_transaction_read_only": "on"}},
)
WriteSession = async_sessionmaker(write_engine, expire_on_commit=False)
ReadSession = async_sessionmaker(read_engine, expire_on_commit=False)
Read engine forces RO at session start — writes raise loud errors.
DI routing
async def get_write_db() -> AsyncSession:
async with WriteSession() as s: yield s
async def get_read_db() -> AsyncSession:
async with ReadSession() as s: yield s
@app.post("/users")
async def create_user(data: UserIn, db = Depends(get_write_db)):
...
@app.get("/users")
async def list_users(db = Depends(get_read_db)):
...
Multi-host failover (asyncpg)
engine = create_async_engine(
"postgresql+asyncpg://app@primary:5432,replica:5432/db?target_session_attrs=read-write"
)
Tries each host; picks one that’s read-write. On failover, automatically uses the new primary.
Patroni / Stolon front-end
App connects to a stable endpoint (HAProxy / pgcat). Patroni promotes on primary failure; routing layer points to new primary.
Read-after-write strategy
User writes; immediately reads; replica lag may show stale data.
Mark recent writers; route their reads to primary briefly:
RECENT_TTL = 5
async def get_db_smart(request: Request) -> AsyncSession:
user_id = getattr(request.state, "user_id", None)
if user_id and await redis.exists(f"recent_writer:{user_id}"):
SM = WriteSession
else:
SM = ReadSession
async with SM() as s:
yield s
# After every write
await redis.set(f"recent_writer:{user_id}", "1", ex=RECENT_TTL)
Causal token
Stronger guarantee: client passes the LSN of their last write; reader waits until replica replays past it.
# Server
lsn = await session.scalar(text("SELECT pg_current_wal_lsn()"))
# Send `lsn` to client
# On next read with X-Read-After: <lsn>
await session.execute(text("SELECT pg_wait_for_lsn_to_replay(:lsn, :ms)"), {"lsn": ..., "ms": 5000})
(Available in some Postgres distributions / extensions.)
Replica lag check
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), pg_last_wal_replay_lsn());
Bytes behind. Run on primary or replica respectively. Alert if > N MB.
pool_pre_ping after failover
engine = create_async_engine(URL, pool_pre_ping=True, pool_recycle=300)
Dead connections get replaced after the primary restart.
Region-aware routing
# Closest replica
us_east_engine = create_async_engine(US_READER_URL)
eu_engine = create_async_engine(EU_READER_URL)
async def get_local_read_db(request: Request) -> AsyncSession:
region = request.headers.get("cf-ipcountry-region", "us")
SM = ReadSession if region == "us" else EU_ReadSession
async with SM() as s:
yield s
For globally distributed users.
Force read on primary (e.g., immediately after write)
@app.post("/transfer")
async def transfer(data: TransferIn, w = Depends(get_write_db)):
# All reads + writes in this request hit the primary
...
RDS Aurora reader endpoints
AWS Aurora exposes:
- Cluster endpoint (writer; auto-failovers).
- Reader endpoint (load-balanced across replicas).
- Custom endpoints.
write_engine = create_async_engine("postgresql+asyncpg://app@aurora-cluster.../db")
read_engine = create_async_engine("postgresql+asyncpg://app@aurora-cluster-ro.../db")
GCP Cloud SQL read replicas
read_engine = create_async_engine("postgresql+asyncpg://app@replica-1.../db")
Or use the proxy.
Common mistakes
- Sending writes to read replicas — fails (good); but worse if RO not enforced.
- No
pool_pre_ping— stale conns after failover. - Caching read-replica connection without recycle — old data after promotion.
- Not handling replica lag in UX — confused users.
Read this next
If you want my writer/reader routing + lag-monitor template, it’s at rajpoot.dev .
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .