SQLAlchemy + Postgres Cheatsheet 15 — Replication, Read Replicas, and Failover

Cheatsheet for Postgres replication and failover.

Two engines (writer / reader)

write_engine = create_async_engine(WRITER_URL, pool_size=20, pool_pre_ping=True)
read_engine = create_async_engine(
    READER_URL,
    pool_size=40,
    pool_pre_ping=True,
    connect_args={"server_settings": {"default_transaction_read_only": "on"}},
)

WriteSession = async_sessionmaker(write_engine, expire_on_commit=False)
ReadSession = async_sessionmaker(read_engine, expire_on_commit=False)

Read engine forces RO at session start — writes raise loud errors.

DI routing

async def get_write_db() -> AsyncSession:
    async with WriteSession() as s: yield s

async def get_read_db() -> AsyncSession:
    async with ReadSession() as s: yield s

@app.post("/users")
async def create_user(data: UserIn, db = Depends(get_write_db)):
    ...

@app.get("/users")
async def list_users(db = Depends(get_read_db)):
    ...

Multi-host failover (asyncpg)

engine = create_async_engine(
    "postgresql+asyncpg://app@primary:5432,replica:5432/db?target_session_attrs=read-write"
)

Tries each host; picks one that’s read-write. On failover, automatically uses the new primary.

Patroni / Stolon front-end

App connects to a stable endpoint (HAProxy / pgcat). Patroni promotes on primary failure; routing layer points to new primary.

Read-after-write strategy

User writes; immediately reads; replica lag may show stale data.

Mark recent writers; route their reads to primary briefly:

RECENT_TTL = 5

async def get_db_smart(request: Request) -> AsyncSession:
    user_id = getattr(request.state, "user_id", None)
    if user_id and await redis.exists(f"recent_writer:{user_id}"):
        SM = WriteSession
    else:
        SM = ReadSession
    async with SM() as s:
        yield s

# After every write
await redis.set(f"recent_writer:{user_id}", "1", ex=RECENT_TTL)

Causal token

Stronger guarantee: client passes the LSN of their last write; reader waits until replica replays past it.

# Server
lsn = await session.scalar(text("SELECT pg_current_wal_lsn()"))
# Send `lsn` to client

# On next read with X-Read-After: <lsn>
await session.execute(text("SELECT pg_wait_for_lsn_to_replay(:lsn, :ms)"), {"lsn": ..., "ms": 5000})

(Available in some Postgres distributions / extensions.)

Replica lag check

SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), pg_last_wal_replay_lsn());

Bytes behind. Run on primary or replica respectively. Alert if > N MB.

pool_pre_ping after failover

engine = create_async_engine(URL, pool_pre_ping=True, pool_recycle=300)

Dead connections get replaced after the primary restart.

Region-aware routing

# Closest replica
us_east_engine = create_async_engine(US_READER_URL)
eu_engine = create_async_engine(EU_READER_URL)

async def get_local_read_db(request: Request) -> AsyncSession:
    region = request.headers.get("cf-ipcountry-region", "us")
    SM = ReadSession if region == "us" else EU_ReadSession
    async with SM() as s:
        yield s

For globally distributed users.

Force read on primary (e.g., immediately after write)

@app.post("/transfer")
async def transfer(data: TransferIn, w = Depends(get_write_db)):
    # All reads + writes in this request hit the primary
    ...

RDS Aurora reader endpoints

AWS Aurora exposes:

Cluster endpoint (writer; auto-failovers).
Reader endpoint (load-balanced across replicas).
Custom endpoints.

write_engine = create_async_engine("postgresql+asyncpg://app@aurora-cluster.../db")
read_engine = create_async_engine("postgresql+asyncpg://app@aurora-cluster-ro.../db")

GCP Cloud SQL read replicas

read_engine = create_async_engine("postgresql+asyncpg://app@replica-1.../db")

Or use the proxy.

Common mistakes

Sending writes to read replicas — fails (good); but worse if RO not enforced.
No pool_pre_ping — stale conns after failover.
Caching read-replica connection without recycle — old data after promotion.
Not handling replica lag in UX — confused users.

Two engines (writer / reader)#

DI routing#

Multi-host failover (asyncpg)#

Patroni / Stolon front-end#

Read-after-write strategy#

Causal token#

Replica lag check#

pool_pre_ping after failover#

Region-aware routing#

Force read on primary (e.g., immediately after write)#

RDS Aurora reader endpoints#

GCP Cloud SQL read replicas#

Common mistakes#

Read this next#