Alembic Textbook Ch. 8 — Production Patterns and Recovery

Chapter 8, the final chapter: Alembic in production. Deploy strategies, rollback, recovery from broken migrations.

Deploy strategy

Three options:

1. K8s Job (recommended)

apiVersion: batch/v1
kind: Job
metadata: { name: migrate-{{ .Values.image.tag }} }
spec:
  template:
    spec:
      containers:
        - name: migrate
          image: myapp:{{ .Values.image.tag }}
          command: ["alembic", "upgrade", "head"]
      restartPolicy: OnFailure

Pre-deploy Job runs migrations once. App pods follow.

2. App startup

@asynccontextmanager
async def lifespan(app):
    cfg = Config("alembic.ini")
    command.upgrade(cfg, "head")
    yield

Simpler; works for single replica. For multi-replica: race conditions (only one should run).

Use pg_advisory_xact_lock to ensure single-runner:

async def run_migrations():
    async with engine.connect() as conn:
        await conn.execute(text("SELECT pg_advisory_xact_lock(99999)"))
        # ... run migrations under lock ...

3. CI / CD pipeline

CI runs alembic upgrade head against prod after deploy approval. Manual oversight.

For most teams: Job is the cleanest.

Migrations that take hours

For really long migrations:

1. Run during low-traffic window.
2. Notify team / ops.
3. Run as standalone (not in deploy hook).
4. Monitor for replica lag, locks, disk usage.
5. Have abort plan.

For non-blocking changes (e.g., CONCURRENTLY index): can run during normal hours.

What if a migration breaks

The classic failure:

1. Run alembic upgrade head.
2. Migration fails halfway.
3. alembic_version is unchanged (good — transaction rolled back).
4. Investigate; fix migration; re-run.

For migrations that aren’t transactional (CONCURRENTLY):

1. Migration fails halfway.
2. Some objects created (e.g., invalid index).
3. Manual cleanup needed.
4. Re-run migration.

For complex partial failures: read the error; clean up by hand.

stamp

Tells Alembic the DB is at a specific revision without running migrations:

alembic stamp head
alembic stamp <rev>

Use case 1: bootstrapping a DB from Base.metadata.create_all() — stamp head so Alembic doesn’t try to re-apply.

Use case 2: migration applied manually; tell Alembic.

Use case 3: recovering after a failed migration where you fixed by hand.

Recovering from inconsistent state

Scenario: migration partially applied; alembic_version not updated.

# Find current state
psql -c "SELECT * FROM alembic_version"

# Inspect what was applied
psql -c "\d users"  # check if expected columns exist

# Fix manually if needed (drop partial constructs)

# Stamp the right version
alembic stamp <rev>

This is rare with transactional migrations; common with CONCURRENTLY or DDL+DML mixes.

Migrations and read replicas

DDL replicates; data migrations replicate. Big migrations cause replica lag.

Mitigations:

Watch lag during migration.
Backfill in batches with pauses.
For critical: temporarily disable some replicas.

Migrations and connection pools

A long migration holds a connection. Pool needs free capacity:

Total connections = app pool + migration runner + replication
                  ≤ max_connections

Allow at least one extra connection for the migration.

What if you need to skip a migration

# Mark as applied without running
alembic stamp <rev>

For migrations that don’t apply to your DB / environment / state. Document why.

Old migrations: prune?

Generally: keep all. Forward replays from initial.

For very long histories: squash:

# Generate a new "initial" from current schema
alembic revision --autogenerate -m "squashed initial"
# Stamp existing DBs to the new initial
alembic stamp <new_rev>
# Delete old revisions

Risky; only do after careful planning. Most projects don’t need it.

Backup before risky migrations

pg_dump -h prod -U app db > backup-$(date +%Y%m%d-%H%M%S).sql

Always. Easier than a multi-hour recovery.

For managed DB: snapshot.

Migration testing in staging

Stage with prod-shaped data:

# Restore prod backup to staging
pg_restore -h staging -d app prod-backup.sql

# Run new migrations
alembic upgrade head

# Time it; verify

Catches “this migration takes 4 hours on real data” before prod.

Alembic-utils for views, triggers, functions

from alembic_utils.pg_function import PGFunction

func = PGFunction(
    schema="public",
    signature="update_modified_column()",
    definition="...",
)

# Auto-tracks changes; generates migrations for view / trigger / function changes

For PostgresQL-specific objects beyond what stock Alembic tracks.

Production checklist

Before deploying a migration:

Migration tested in staging with prod-shaped data.
Estimated runtime documented.
Backup taken.
Replica lag plan.
Rollback strategy.
Comm to ops if long-running.
Monitoring active.
Compatible with running app version (expand-contract).

Common mistakes

1. Migration without staging test

“It’s just a column add.” Until it locks for hours on a 100GB table.

2. Migrating in app startup with multiple replicas

Race; lock contention. Use a Job.

3. No backup before risky changes

You can’t undo a DROP TABLE without backup.

4. Mixing DDL and DML in one transaction (PG)

Can work; can also surprise you. Explicit transactions.

5. Forgetting to update schema_version on hand-fixes

DB and Alembic disagree. alembic stamp to reconcile.

You’ve finished

Across this textbook:

Setup and env.py.
Autogenerate.
Manual migrations.
Branching and merging.
Online schema changes.
Multi-DB / multi-tenant.
CI integration.
Production and recovery.

Pair with:

If you want my full FastAPI + SQLAlchemy + Pydantic + Alembic production starter, it’s at rajpoot.dev .

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .

Deploy strategy#

1. K8s Job (recommended)#

2. App startup#

3. CI / CD pipeline#

Migrations that take hours#

What if a migration breaks#

stamp#

Recovering from inconsistent state#

Migrations and read replicas#

Migrations and connection pools#

What if you need to skip a migration#

Old migrations: prune?#

Backup before risky migrations#

Migration testing in staging#

Alembic-utils for views, triggers, functions#

Production checklist#

Common mistakes#

1. Migration without staging test#

2. Migrating in app startup with multiple replicas#

3. No backup before risky changes#

4. Mixing DDL and DML in one transaction (PG)#

5. Forgetting to update schema_version on hand-fixes#

You’ve finished#