Alembic Cheatsheet 09 — Production Patterns

Cheatsheet for running Alembic in production.

Deploy strategy

K8s Job (recommended)

Pre-deploy Job applies migrations; app pods follow.

apiVersion: batch/v1
kind: Job
metadata:
  name: migrate-{{ .Values.image.tag }}
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
spec:
  template:
    spec:
      containers:
        - name: migrate
          image: myapp:{{ .Values.image.tag }}
          command: ["alembic", "upgrade", "head"]
      restartPolicy: OnFailure

Idempotent; Alembic skips already-applied.

App startup

@asynccontextmanager
async def lifespan(app):
    cfg = Config("alembic.ini")
    command.upgrade(cfg, "head")
    yield

Works for single replica. For multi-replica: race conditions.

App startup with advisory lock

async def run_migrations():
    async with engine.connect() as conn:
        await conn.execute(text("SELECT pg_advisory_xact_lock(99999)"))
        await asyncio.to_thread(_apply_migrations)

@asynccontextmanager
async def lifespan(app):
    await run_migrations()
    yield

Only one replica acquires the lock; others wait. Cheap; works.

Migrations that take hours

For really long migrations:

1. Low-traffic window.
2. Standalone Job (not in deploy critical path).
3. Monitor replica lag.
4. Have abort plan.

For non-blocking ops (CONCURRENTLY index): can run during normal hours.

Forward-only in prod

Don’t alembic downgrade -1 in prod. Roll forward with a corrective migration.

For rollback: write a forward migration that undoes the change. Apply like any other.

Stamp (mark as applied without running)

alembic stamp head             # set current = head
alembic stamp <rev>            # specific

Use cases:

Bootstrapping from Base.metadata.create_all().
Recovering after manual schema fixes.
Skipping a migration applied out-of-band.

Recovering from inconsistent state

Migration partially applied; alembic_version not updated:

# Check state
psql -c "SELECT version_num FROM alembic_version"
psql -c "\d users"          # check actual schema

# If partial: clean up manually
psql -c "DROP INDEX IF EXISTS partial_index_left_over"

# Stamp the right version
alembic stamp <rev>

Backup before risky migrations

pg_dump -h prod -d mydb -Fc > pre-migration-$(date +%Y%m%d-%H%M%S).dump

Or use snapshot (managed DB).

Replica lag during migration

SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), pg_last_wal_replay_lsn()) FROM pg_stat_replication;

Big spike → consider pausing or splitting the migration.

Migration testing in staging

Stage with prod-shaped data:

pg_dump prod > prod-snap.sql
psql -h staging -c "DROP DATABASE app; CREATE DATABASE app"
psql -h staging -d app < prod-snap.sql

# Run new migrations
alembic upgrade head
# Time + verify

Catches “this migration is fast on dev but takes hours on real data.”

Multi-app coordinate

For monorepos / multi-app:

Each app has its own alembic.ini and migration tree.
Shared schema modules require shared migration discipline.

Production checklist

Tested in staging with prod-shaped data.
Estimated runtime documented.
Backup available.
Replica lag plan.
Rollback strategy (forward).
Comm to ops if long.
Migration runs as K8s Job (not multi-replica app startup).
Compatible with running app version (expand-contract).
Monitoring active.

Common mistakes

App startup migration with N replicas → race conditions.
VACUUM FULL or CREATE INDEX without CONCURRENTLY on hot tables → outage.
Stamping without understanding consequences → DB diverges from history.
Editing already-applied migration files → deployed instances diverge.

Deploy strategy#

K8s Job (recommended)#

App startup#

App startup with advisory lock#

Migrations that take hours#

Forward-only in prod#

Stamp (mark as applied without running)#

Recovering from inconsistent state#

Backup before risky migrations#

Replica lag during migration#

Migration testing in staging#

Multi-app coordinate#

Production checklist#

Common mistakes#

Read this next#