What's the most-skipped migration step?

Backfill in batches. Doing `UPDATE big_table SET new_col = ...` in one shot locks tables for hours. Always batch 1k–10k rows with pauses.

Should I use a migration framework?

Yes — Alembic / Flyway / golang-migrate / Diesel / sqlx-migrate / Prisma / Drizzle. They give versioning, ordering, and rollback. Don't roll your own.

Database Migrations in 2026 — Zero-Downtime Patterns That Actually Work

Database migrations are where many teams trip. The framework runs the SQL; the discipline of writing safe SQL is on you. This post is the working playbook.

The expand-contract pattern

For breaking schema changes, never one-shot it. Two phases:

Phase 1 (expand):
- Add new column / table / index.
- Backfill from old.
- Code writes to BOTH old and new.
- Code reads from new (with fallback to old).

Phase 2 (contract):
- Code reads/writes new only.
- Drop old column / table / index.

Each phase is its own deploy. Between phases, the system is fully functional. If something goes wrong: stop and roll back.

Example: rename a column

Bad: ALTER TABLE users RENAME COLUMN email_addr TO email; — breaks all in-flight code.

Good:

-- Migration 1
ALTER TABLE users ADD COLUMN email text;
UPDATE users SET email = email_addr;  -- (in batches; see below)
-- Code now writes both, reads `email`
-- Wait, observe.

-- Migration 2 (after deploy stabilized)
ALTER TABLE users DROP COLUMN email_addr;

Two deploys. Each safe. No downtime.

Adding a NOT NULL column

Bad: ALTER TABLE users ADD COLUMN status text NOT NULL DEFAULT 'active'; — locks the table on big tables (writes all rows in one tx).

Postgres ≥ 11: DEFAULT doesn’t rewrite the table; only future inserts get the default. So ADD COLUMN ... DEFAULT 'active' NOT NULL is fast.

Otherwise:

-- 1. Add nullable
ALTER TABLE users ADD COLUMN status text;

-- 2. Backfill in batches (see below)

-- 3. Add NOT NULL after backfill
ALTER TABLE users ALTER COLUMN status SET NOT NULL;

Backfill in batches

-- BAD: locks for hours
UPDATE users SET status = 'active';

-- GOOD: batch
DO $$
DECLARE batch_size int := 5000;
DECLARE last_id bigint := 0;
DECLARE updated int;
BEGIN
    LOOP
        UPDATE users
        SET status = 'active'
        WHERE id IN (
            SELECT id FROM users WHERE id > last_id AND status IS NULL
            ORDER BY id LIMIT batch_size
        )
        RETURNING id INTO last_id;
        
        GET DIAGNOSTICS updated = ROW_COUNT;
        EXIT WHEN updated = 0;
        
        PERFORM pg_sleep(0.1);  -- breathing room
    END LOOP;
END $$;

Or do it from the application:

async def backfill():
    while True:
        rows = await db.fetch(
            "SELECT id FROM users WHERE status IS NULL ORDER BY id LIMIT 5000"
        )
        if not rows: break
        await db.execute(
            "UPDATE users SET status = 'active' WHERE id = ANY($1)",
            [r["id"] for r in rows]
        )
        await asyncio.sleep(0.5)

Indexes — concurrently

-- Locks the table:
CREATE INDEX idx ON huge_table (col);

-- Doesn't lock:
CREATE INDEX CONCURRENTLY idx ON huge_table (col);

Always CONCURRENTLY on production. Slower; doesn’t block writes.

For removing: DROP INDEX CONCURRENTLY.

Foreign keys

-- Locks for validation:
ALTER TABLE orders ADD CONSTRAINT fk_user FOREIGN KEY (user_id) REFERENCES users(id);

-- Two-step (safer):
ALTER TABLE orders ADD CONSTRAINT fk_user FOREIGN KEY (user_id)
    REFERENCES users(id) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT fk_user;

NOT VALID adds the constraint without checking existing rows; VALIDATE checks them with a less-blocking lock.

Big table rebuilds (pg_repack)

For reclaiming bloat, changing primary key, partitioning a non-partitioned table:

pg_repack -h db -U user -d mydb -t huge_table

pg_repack rebuilds the table online with minimal locking. Use for:

Bloat from heavy UPDATE / DELETE.
Compaction after column drops.
Adding partitioning to existing tables (pre-pg_repack 1.5).

Requires the extension installed; takes time + temporary 2x disk.

Migration tools

	Lang	Notes
Alembic	Python	SQLAlchemy ecosystem
Flyway	JVM	Mature, widely used
Liquibase	JVM	XML/YAML; structured
golang-migrate	Go	Simple
sqlx migrate	Rust	If using SQLx
Diesel	Rust	Diesel ecosystem
Drizzle Kit	TS	Modern; auto-gen from schema
Prisma Migrate	TS/JS	Prisma ecosystem

All track migrations + versions; let you up/down. Pick the one that matches your stack.

Forward-only

In production, roll forward only. Down migrations are useful in dev; in prod, they often can’t faithfully reverse a migration that touched data.

For “rollback”: write a forward migration that reverses what you did, applied like any other migration. Treat it as new work, not “undo.”

Migrations in CI

- name: Apply migrations to test DB
  run: alembic upgrade head

- name: Run tests
  run: pytest

Catch broken migrations before merge.

Migrations on deploy

# In container start command:
alembic upgrade head && exec uvicorn app:app

Or run as a Job before the rolling deploy:

# Kubernetes Job that runs migrations
apiVersion: batch/v1
kind: Job
metadata: { name: migrate-{{ .Values.image.tag }} }
spec:
  template:
    spec:
      containers:
        - name: migrate
          image: myapp:{{ .Values.image.tag }}
          command: ["alembic", "upgrade", "head"]
      restartPolicy: OnFailure

Pre-deploy: run migrations. Then roll out new pods.

Online schema change tools (MySQL)

gh-ost (GitHub) — popular for MySQL.
pt-online-schema-change (Percona).

For Postgres: pg_repack + concurrent indexes cover most cases.

Dependent code changes

Coordinate code and schema:

Deploy 1: schema migration (expand) + code that writes to old AND new.
Deploy 2: code that reads new only.
Deploy 3: schema migration (contract).

Don’t combine schema migration + code change that depends on it in one deploy.

Common mistakes

1. Migration runs in deploy with code that depends on it

Migration ran on instance A; instance B is on old code; code uses old schema; B errors. Always expand-contract.

2. Big UPDATEs in one statement

Hours of lock; replicas lag; everything fails. Batch.

3. CREATE INDEX without CONCURRENTLY

Production downtime. Always concurrently.

4. Adding NOT NULL to existing column

ALTER TABLE users ALTER COLUMN x SET NOT NULL;  -- can take forever

Backfill first; then SET NOT NULL.

5. No migration testing

Migration has a bug; finds out in prod. Run migrations against a copy of prod data in CI.

What I’d ship today

For DB-heavy apps:

Alembic / Flyway as the migration tool.
Expand-contract discipline.
CONCURRENTLY for indexes.
Batched backfills.
Migrations as Jobs in K8s.
Forward-only in production.
Production-data-shaped staging for testing.

Read this next

If you want my migration playbook + Alembic templates, it’s at rajpoot.dev .

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .

The expand-contract pattern#

Example: rename a column#

Adding a NOT NULL column#

Backfill in batches#

Indexes — concurrently#

Foreign keys#

Big table rebuilds (pg_repack)#

Migration tools#

Forward-only#

Migrations in CI#

Migrations on deploy#

Online schema change tools (MySQL)#

Dependent code changes#

Common mistakes#

1. Migration runs in deploy with code that depends on it#

2. Big UPDATEs in one statement#

3. CREATE INDEX without CONCURRENTLY#

4. Adding NOT NULL to existing column#

5. No migration testing#

What I’d ship today#

Read this next#