Alembic Textbook Ch. 5 — Online Schema Changes (Zero-Downtime Patterns)

Chapter 5: zero-downtime migrations. The expand-contract pattern, batched backfills, concurrent indexes, big-table strategies. Postgres-focused (other DBs have similar patterns).

Expand-contract

For non-trivial schema changes (rename, type change, FK changes):

Phase 1 (expand): Add new column / table / index. Code writes both old and new. Reads new with fallback.

Phase 2 (contract): Code reads/writes new only. Drop old.

Each phase is its own migration + deploy. Between: system functional.

Example: rename column

Bad:

op.alter_column("users", "email", new_column_name="email_address")

In-flight requests using email break.

Good:

# Migration 1 (expand)
def upgrade():
    op.add_column("users", sa.Column("email_address", sa.String(255), nullable=True))
    op.execute("UPDATE users SET email_address = email")
    # Code writes both; reads from email_address with fallback

Wait, observe.

# Migration 2 (contract)
def upgrade():
    op.alter_column("users", "email_address", nullable=False)
    op.drop_column("users", "email")
    # Code now uses email_address only

Two deploys. No downtime.

NOT NULL on existing column

# Bad: hours of locking
op.alter_column("users", "status", nullable=False)

The DB validates every row. Rewrites the table for nullable changes (in older PG).

PG 11+: SET NOT NULL doesn’t rewrite. But still scans every row.

Better:

# 1. Add nullable
op.add_column("users", sa.Column("status", sa.String(20), nullable=True))

# 2. Backfill in batches (separate migration or one-off)
op.execute("UPDATE users SET status = 'active' WHERE status IS NULL")  # potentially fast if few NULLs after PG 11+

# 3. Mark NOT NULL
op.alter_column("users", "status", nullable=False)

For really big tables: backfill outside Alembic in chunks.

Adding NOT NULL with DEFAULT

PG 11+:

op.add_column("users", sa.Column("status", sa.String(20), nullable=False, server_default="active"))

Doesn’t rewrite the table. Existing rows get the default lazily. Fast.

For older PG: rewrites entire table; locks; bad.

Batched backfill

For huge tables, run outside Alembic:

async def backfill():
    while True:
        async with AsyncSessionLocal() as s:
            result = await s.execute(
                text("UPDATE users SET status = 'active' "
                     "WHERE id IN (SELECT id FROM users WHERE status IS NULL ORDER BY id LIMIT 5000)")
            )
            await s.commit()
            if result.rowcount == 0:
                break
        await asyncio.sleep(0.1)

Locks per-batch are tiny. No replica lag spike.

Concurrent indexes

def upgrade():
    op.execute("CREATE INDEX CONCURRENTLY ix_users_email ON users (email)")

def downgrade():
    op.execute("DROP INDEX CONCURRENTLY ix_users_email")

CONCURRENTLY can’t run in a transaction. Configure:

# In env.py or alembic.ini
[alembic]
transaction_per_migration = false

Or:

# In the migration
from alembic import op

def upgrade():
    op.execute("COMMIT")
    op.execute("CREATE INDEX CONCURRENTLY ...")

Be aware: a failed CONCURRENTLY index leaves an INVALID index. Drop manually before retrying.

NOT VALID FK

For adding FK to a big table:

# Bad
op.create_foreign_key("fk_orders_user", "orders", "users", ["user_id"], ["id"])
# locks for validation

# Better
op.create_foreign_key("fk_orders_user", "orders", "users", ["user_id"], ["id"], postgresql_not_valid=True)
# Then later:
op.execute("ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_user")

NOT VALID adds the constraint without checking existing rows. VALIDATE CONSTRAINT checks them later (less locking).

Adding columns with computed defaults

op.add_column("posts", sa.Column("slug", sa.String(255)))
op.execute("UPDATE posts SET slug = lower(regexp_replace(title, '[^a-zA-Z0-9]+', '-', 'g'))")
op.alter_column("posts", "slug", nullable=False)
op.create_index("ix_posts_slug", "posts", ["slug"], unique=True)

Backfill, then constraint, then index.

For the unique constraint to succeed: slugs must be unique post-backfill. Validate in code.

Type changes

# Bad: rewrites table
op.alter_column("users", "id", type_=sa.BigInteger)

# Better
# 1. Add new column
op.add_column("users", sa.Column("id_new", sa.BigInteger))
# 2. Backfill in batches
# 3. Add unique constraint
# 4. Switch FKs to point to id_new
# 5. Drop old id
# 6. Rename

Multi-phase. Hours of orchestration. Worth it for big tables.

DROP in production

def upgrade():
    op.drop_column("users", "old_field")

Drops happen instantly (PG marks as deleted; vacuum reclaims). Safe.

But: if your code still references old_field, deploy fails. Sequence: deploy code that doesn’t use it; THEN drop.

pg_repack for big-table changes

For changes that need full table rewrite (e.g., changing primary key, adding column with non-default value):

pg_repack -h db -U user -d mydb -t huge_table

Online table rewrite with minimal locking. Use for:

Bloat reclamation.
Adding partitioning to existing tables (pre-pg_partman 5).
Major restructuring.

Not Alembic-native; run separately.

Replica considerations

Big migrations cause replica lag. For HA setups: monitor.

SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) FROM pg_stat_replication;

Big spike during long migration → replica falls behind. Pause migration if too far behind.

Migration timeouts

def upgrade():
    op.execute("SET LOCAL statement_timeout = '5min'")
    op.execute("CREATE INDEX CONCURRENTLY ix_users_email ON users (email)")

Bound migration time. Better to fail and split than to lock for hours.

Reversibility

For prod: roll forward only. downgrade() is for dev convenience.

For “rollback”: write a forward migration that reverses what you did.

What I’d ship today

For non-trivial schema changes:

Expand-contract discipline.
CONCURRENTLY for indexes.
NOT VALID + VALIDATE for FK.
Batched backfill outside Alembic.
Replica lag monitoring during long migrations.
Forward-only in production.

What’s next

Chapter 6: multi-database / multi-tenant migrations.

Read this next

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .

Expand-contract#

Example: rename column#

NOT NULL on existing column#

Adding NOT NULL with DEFAULT#

Batched backfill#

Concurrent indexes#

NOT VALID FK#

Adding columns with computed defaults#

Type changes#

DROP in production#

pg_repack for big-table changes#

Replica considerations#

Migration timeouts#

Reversibility#

What I’d ship today#

What’s next#

Read this next#