Chapter 8, the final chapter: Alembic in production. Deploy strategies, rollback, recovery from broken migrations.
Deploy strategy
Three options:
1. K8s Job (recommended)
apiVersion: batch/v1
kind: Job
metadata: { name: migrate-{{ .Values.image.tag }} }
spec:
template:
spec:
containers:
- name: migrate
image: myapp:{{ .Values.image.tag }}
command: ["alembic", "upgrade", "head"]
restartPolicy: OnFailure
Pre-deploy Job runs migrations once. App pods follow.
2. App startup
@asynccontextmanager
async def lifespan(app):
cfg = Config("alembic.ini")
command.upgrade(cfg, "head")
yield
Simpler; works for single replica. For multi-replica: race conditions (only one should run).
Use pg_advisory_xact_lock to ensure single-runner:
async def run_migrations():
async with engine.connect() as conn:
await conn.execute(text("SELECT pg_advisory_xact_lock(99999)"))
# ... run migrations under lock ...
3. CI / CD pipeline
CI runs alembic upgrade head against prod after deploy approval. Manual oversight.
For most teams: Job is the cleanest.
Migrations that take hours
For really long migrations:
1. Run during low-traffic window.
2. Notify team / ops.
3. Run as standalone (not in deploy hook).
4. Monitor for replica lag, locks, disk usage.
5. Have abort plan.
For non-blocking changes (e.g., CONCURRENTLY index): can run during normal hours.
What if a migration breaks
The classic failure:
1. Run alembic upgrade head.
2. Migration fails halfway.
3. alembic_version is unchanged (good — transaction rolled back).
4. Investigate; fix migration; re-run.
For migrations that aren’t transactional (CONCURRENTLY):
1. Migration fails halfway.
2. Some objects created (e.g., invalid index).
3. Manual cleanup needed.
4. Re-run migration.
For complex partial failures: read the error; clean up by hand.
stamp
Tells Alembic the DB is at a specific revision without running migrations:
alembic stamp head
alembic stamp <rev>
Use case 1: bootstrapping a DB from Base.metadata.create_all() — stamp head so Alembic doesn’t try to re-apply.
Use case 2: migration applied manually; tell Alembic.
Use case 3: recovering after a failed migration where you fixed by hand.
Recovering from inconsistent state
Scenario: migration partially applied; alembic_version not updated.
# Find current state
psql -c "SELECT * FROM alembic_version"
# Inspect what was applied
psql -c "\d users" # check if expected columns exist
# Fix manually if needed (drop partial constructs)
# Stamp the right version
alembic stamp <rev>
This is rare with transactional migrations; common with CONCURRENTLY or DDL+DML mixes.
Migrations and read replicas
DDL replicates; data migrations replicate. Big migrations cause replica lag.
Mitigations:
- Watch lag during migration.
- Backfill in batches with pauses.
- For critical: temporarily disable some replicas.
Migrations and connection pools
A long migration holds a connection. Pool needs free capacity:
Total connections = app pool + migration runner + replication
≤ max_connections
Allow at least one extra connection for the migration.
What if you need to skip a migration
# Mark as applied without running
alembic stamp <rev>
For migrations that don’t apply to your DB / environment / state. Document why.
Old migrations: prune?
Generally: keep all. Forward replays from initial.
For very long histories: squash:
# Generate a new "initial" from current schema
alembic revision --autogenerate -m "squashed initial"
# Stamp existing DBs to the new initial
alembic stamp <new_rev>
# Delete old revisions
Risky; only do after careful planning. Most projects don’t need it.
Backup before risky migrations
pg_dump -h prod -U app db > backup-$(date +%Y%m%d-%H%M%S).sql
Always. Easier than a multi-hour recovery.
For managed DB: snapshot.
Migration testing in staging
Stage with prod-shaped data:
# Restore prod backup to staging
pg_restore -h staging -d app prod-backup.sql
# Run new migrations
alembic upgrade head
# Time it; verify
Catches “this migration takes 4 hours on real data” before prod.
Alembic-utils for views, triggers, functions
from alembic_utils.pg_function import PGFunction
func = PGFunction(
schema="public",
signature="update_modified_column()",
definition="...",
)
# Auto-tracks changes; generates migrations for view / trigger / function changes
For PostgresQL-specific objects beyond what stock Alembic tracks.
Production checklist
Before deploying a migration:
- Migration tested in staging with prod-shaped data.
- Estimated runtime documented.
- Backup taken.
- Replica lag plan.
- Rollback strategy.
- Comm to ops if long-running.
- Monitoring active.
- Compatible with running app version (expand-contract).
Common mistakes
1. Migration without staging test
“It’s just a column add.” Until it locks for hours on a 100GB table.
2. Migrating in app startup with multiple replicas
Race; lock contention. Use a Job.
3. No backup before risky changes
You can’t undo a DROP TABLE without backup.
4. Mixing DDL and DML in one transaction (PG)
Can work; can also surprise you. Explicit transactions.
5. Forgetting to update schema_version on hand-fixes
DB and Alembic disagree. alembic stamp to reconcile.
You’ve finished
Across this textbook:
- Setup and env.py.
- Autogenerate.
- Manual migrations.
- Branching and merging.
- Online schema changes.
- Multi-DB / multi-tenant.
- CI integration.
- Production and recovery.
Pair with:
- SQLAlchemy DB-Agnostic Textbook
- Postgres-Focused SQLAlchemy Textbook
- Pydantic v2 Textbook
- FastAPI Textbook
If you want my full FastAPI + SQLAlchemy + Pydantic + Alembic production starter, it’s at rajpoot.dev .
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .