Database migrations are where many teams trip. The framework runs the SQL; the discipline of writing safe SQL is on you. This post is the working playbook.

The expand-contract pattern

For breaking schema changes, never one-shot it. Two phases:

Phase 1 (expand):
- Add new column / table / index.
- Backfill from old.
- Code writes to BOTH old and new.
- Code reads from new (with fallback to old).

Phase 2 (contract):
- Code reads/writes new only.
- Drop old column / table / index.

Each phase is its own deploy. Between phases, the system is fully functional. If something goes wrong: stop and roll back.

Example: rename a column

Bad: ALTER TABLE users RENAME COLUMN email_addr TO email; — breaks all in-flight code.

Good:

-- Migration 1
ALTER TABLE users ADD COLUMN email text;
UPDATE users SET email = email_addr;  -- (in batches; see below)
-- Code now writes both, reads `email`
-- Wait, observe.

-- Migration 2 (after deploy stabilized)
ALTER TABLE users DROP COLUMN email_addr;

Two deploys. Each safe. No downtime.

Adding a NOT NULL column

Bad: ALTER TABLE users ADD COLUMN status text NOT NULL DEFAULT 'active'; — locks the table on big tables (writes all rows in one tx).

Postgres ≥ 11: DEFAULT doesn’t rewrite the table; only future inserts get the default. So ADD COLUMN ... DEFAULT 'active' NOT NULL is fast.

Otherwise:

-- 1. Add nullable
ALTER TABLE users ADD COLUMN status text;

-- 2. Backfill in batches (see below)

-- 3. Add NOT NULL after backfill
ALTER TABLE users ALTER COLUMN status SET NOT NULL;

Backfill in batches

-- BAD: locks for hours
UPDATE users SET status = 'active';

-- GOOD: batch
DO $$
DECLARE batch_size int := 5000;
DECLARE last_id bigint := 0;
DECLARE updated int;
BEGIN
    LOOP
        UPDATE users
        SET status = 'active'
        WHERE id IN (
            SELECT id FROM users WHERE id > last_id AND status IS NULL
            ORDER BY id LIMIT batch_size
        )
        RETURNING id INTO last_id;
        
        GET DIAGNOSTICS updated = ROW_COUNT;
        EXIT WHEN updated = 0;
        
        PERFORM pg_sleep(0.1);  -- breathing room
    END LOOP;
END $$;

Or do it from the application:

async def backfill():
    while True:
        rows = await db.fetch(
            "SELECT id FROM users WHERE status IS NULL ORDER BY id LIMIT 5000"
        )
        if not rows: break
        await db.execute(
            "UPDATE users SET status = 'active' WHERE id = ANY($1)",
            [r["id"] for r in rows]
        )
        await asyncio.sleep(0.5)

Indexes — concurrently

-- Locks the table:
CREATE INDEX idx ON huge_table (col);

-- Doesn't lock:
CREATE INDEX CONCURRENTLY idx ON huge_table (col);

Always CONCURRENTLY on production. Slower; doesn’t block writes.

For removing: DROP INDEX CONCURRENTLY.

Foreign keys

-- Locks for validation:
ALTER TABLE orders ADD CONSTRAINT fk_user FOREIGN KEY (user_id) REFERENCES users(id);

-- Two-step (safer):
ALTER TABLE orders ADD CONSTRAINT fk_user FOREIGN KEY (user_id)
    REFERENCES users(id) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT fk_user;

NOT VALID adds the constraint without checking existing rows; VALIDATE checks them with a less-blocking lock.

Big table rebuilds (pg_repack)

For reclaiming bloat, changing primary key, partitioning a non-partitioned table:

pg_repack -h db -U user -d mydb -t huge_table

pg_repack rebuilds the table online with minimal locking. Use for:

  • Bloat from heavy UPDATE / DELETE.
  • Compaction after column drops.
  • Adding partitioning to existing tables (pre-pg_repack 1.5).

Requires the extension installed; takes time + temporary 2x disk.

Migration tools

LangNotes
AlembicPythonSQLAlchemy ecosystem
FlywayJVMMature, widely used
LiquibaseJVMXML/YAML; structured
golang-migrateGoSimple
sqlx migrateRustIf using SQLx
DieselRustDiesel ecosystem
Drizzle KitTSModern; auto-gen from schema
Prisma MigrateTS/JSPrisma ecosystem

All track migrations + versions; let you up/down. Pick the one that matches your stack.

Forward-only

In production, roll forward only. Down migrations are useful in dev; in prod, they often can’t faithfully reverse a migration that touched data.

For “rollback”: write a forward migration that reverses what you did, applied like any other migration. Treat it as new work, not “undo.”

Migrations in CI

- name: Apply migrations to test DB
  run: alembic upgrade head

- name: Run tests
  run: pytest

Catch broken migrations before merge.

Migrations on deploy

# In container start command:
alembic upgrade head && exec uvicorn app:app

Or run as a Job before the rolling deploy:

# Kubernetes Job that runs migrations
apiVersion: batch/v1
kind: Job
metadata: { name: migrate-{{ .Values.image.tag }} }
spec:
  template:
    spec:
      containers:
        - name: migrate
          image: myapp:{{ .Values.image.tag }}
          command: ["alembic", "upgrade", "head"]
      restartPolicy: OnFailure

Pre-deploy: run migrations. Then roll out new pods.

Online schema change tools (MySQL)

  • gh-ost (GitHub) — popular for MySQL.
  • pt-online-schema-change (Percona).

For Postgres: pg_repack + concurrent indexes cover most cases.

Dependent code changes

Coordinate code and schema:

Deploy 1: schema migration (expand) + code that writes to old AND new.
Deploy 2: code that reads new only.
Deploy 3: schema migration (contract).

Don’t combine schema migration + code change that depends on it in one deploy.

Common mistakes

1. Migration runs in deploy with code that depends on it

Migration ran on instance A; instance B is on old code; code uses old schema; B errors. Always expand-contract.

2. Big UPDATEs in one statement

Hours of lock; replicas lag; everything fails. Batch.

3. CREATE INDEX without CONCURRENTLY

Production downtime. Always concurrently.

4. Adding NOT NULL to existing column

ALTER TABLE users ALTER COLUMN x SET NOT NULL;  -- can take forever

Backfill first; then SET NOT NULL.

5. No migration testing

Migration has a bug; finds out in prod. Run migrations against a copy of prod data in CI.

What I’d ship today

For DB-heavy apps:

  • Alembic / Flyway as the migration tool.
  • Expand-contract discipline.
  • CONCURRENTLY for indexes.
  • Batched backfills.
  • Migrations as Jobs in K8s.
  • Forward-only in production.
  • Production-data-shaped staging for testing.

Read this next

If you want my migration playbook + Alembic templates, it’s at rajpoot.dev .


Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .