Chapter 5: zero-downtime migrations. The expand-contract pattern, batched backfills, concurrent indexes, big-table strategies. Postgres-focused (other DBs have similar patterns).
Expand-contract
For non-trivial schema changes (rename, type change, FK changes):
Phase 1 (expand): Add new column / table / index. Code writes both old and new. Reads new with fallback.
Phase 2 (contract): Code reads/writes new only. Drop old.
Each phase is its own migration + deploy. Between: system functional.
Example: rename column
Bad:
op.alter_column("users", "email", new_column_name="email_address")
In-flight requests using email break.
Good:
# Migration 1 (expand)
def upgrade():
op.add_column("users", sa.Column("email_address", sa.String(255), nullable=True))
op.execute("UPDATE users SET email_address = email")
# Code writes both; reads from email_address with fallback
Wait, observe.
# Migration 2 (contract)
def upgrade():
op.alter_column("users", "email_address", nullable=False)
op.drop_column("users", "email")
# Code now uses email_address only
Two deploys. No downtime.
NOT NULL on existing column
# Bad: hours of locking
op.alter_column("users", "status", nullable=False)
The DB validates every row. Rewrites the table for nullable changes (in older PG).
PG 11+: SET NOT NULL doesn’t rewrite. But still scans every row.
Better:
# 1. Add nullable
op.add_column("users", sa.Column("status", sa.String(20), nullable=True))
# 2. Backfill in batches (separate migration or one-off)
op.execute("UPDATE users SET status = 'active' WHERE status IS NULL") # potentially fast if few NULLs after PG 11+
# 3. Mark NOT NULL
op.alter_column("users", "status", nullable=False)
For really big tables: backfill outside Alembic in chunks.
Adding NOT NULL with DEFAULT
PG 11+:
op.add_column("users", sa.Column("status", sa.String(20), nullable=False, server_default="active"))
Doesn’t rewrite the table. Existing rows get the default lazily. Fast.
For older PG: rewrites entire table; locks; bad.
Batched backfill
For huge tables, run outside Alembic:
async def backfill():
while True:
async with AsyncSessionLocal() as s:
result = await s.execute(
text("UPDATE users SET status = 'active' "
"WHERE id IN (SELECT id FROM users WHERE status IS NULL ORDER BY id LIMIT 5000)")
)
await s.commit()
if result.rowcount == 0:
break
await asyncio.sleep(0.1)
Locks per-batch are tiny. No replica lag spike.
Concurrent indexes
def upgrade():
op.execute("CREATE INDEX CONCURRENTLY ix_users_email ON users (email)")
def downgrade():
op.execute("DROP INDEX CONCURRENTLY ix_users_email")
CONCURRENTLY can’t run in a transaction. Configure:
# In env.py or alembic.ini
[alembic]
transaction_per_migration = false
Or:
# In the migration
from alembic import op
def upgrade():
op.execute("COMMIT")
op.execute("CREATE INDEX CONCURRENTLY ...")
Be aware: a failed CONCURRENTLY index leaves an INVALID index. Drop manually before retrying.
NOT VALID FK
For adding FK to a big table:
# Bad
op.create_foreign_key("fk_orders_user", "orders", "users", ["user_id"], ["id"])
# locks for validation
# Better
op.create_foreign_key("fk_orders_user", "orders", "users", ["user_id"], ["id"], postgresql_not_valid=True)
# Then later:
op.execute("ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_user")
NOT VALID adds the constraint without checking existing rows. VALIDATE CONSTRAINT checks them later (less locking).
Adding columns with computed defaults
op.add_column("posts", sa.Column("slug", sa.String(255)))
op.execute("UPDATE posts SET slug = lower(regexp_replace(title, '[^a-zA-Z0-9]+', '-', 'g'))")
op.alter_column("posts", "slug", nullable=False)
op.create_index("ix_posts_slug", "posts", ["slug"], unique=True)
Backfill, then constraint, then index.
For the unique constraint to succeed: slugs must be unique post-backfill. Validate in code.
Type changes
# Bad: rewrites table
op.alter_column("users", "id", type_=sa.BigInteger)
# Better
# 1. Add new column
op.add_column("users", sa.Column("id_new", sa.BigInteger))
# 2. Backfill in batches
# 3. Add unique constraint
# 4. Switch FKs to point to id_new
# 5. Drop old id
# 6. Rename
Multi-phase. Hours of orchestration. Worth it for big tables.
DROP in production
def upgrade():
op.drop_column("users", "old_field")
Drops happen instantly (PG marks as deleted; vacuum reclaims). Safe.
But: if your code still references old_field, deploy fails. Sequence: deploy code that doesn’t use it; THEN drop.
pg_repack for big-table changes
For changes that need full table rewrite (e.g., changing primary key, adding column with non-default value):
pg_repack -h db -U user -d mydb -t huge_table
Online table rewrite with minimal locking. Use for:
- Bloat reclamation.
- Adding partitioning to existing tables (pre-pg_partman 5).
- Major restructuring.
Not Alembic-native; run separately.
Replica considerations
Big migrations cause replica lag. For HA setups: monitor.
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) FROM pg_stat_replication;
Big spike during long migration → replica falls behind. Pause migration if too far behind.
Migration timeouts
def upgrade():
op.execute("SET LOCAL statement_timeout = '5min'")
op.execute("CREATE INDEX CONCURRENTLY ix_users_email ON users (email)")
Bound migration time. Better to fail and split than to lock for hours.
Reversibility
For prod: roll forward only. downgrade() is for dev convenience.
For “rollback”: write a forward migration that reverses what you did.
What I’d ship today
For non-trivial schema changes:
- Expand-contract discipline.
- CONCURRENTLY for indexes.
- NOT VALID + VALIDATE for FK.
- Batched backfill outside Alembic.
- Replica lag monitoring during long migrations.
- Forward-only in production.
What’s next
Chapter 6: multi-database / multi-tenant migrations.
Read this next
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .