Design a URL Shortener — A Complete System Design Walkthrough

The URL shortener is the canonical system design interview question because it touches every interesting topic: ID generation, caching, schema design, capacity, analytics, abuse. Here’s how I’d actually design it. Whether it’s an interview answer or a production service, the structure is the same.

Requirements (this is the trick)

The first job is gathering requirements. Fail here and the rest is wasted.

Functional

Given a long URL, return a short URL.
Given a short URL, redirect to the long URL.
Optional: custom alias, expiration, click analytics.

Non-functional

Read-heavy: roughly 100:1 reads to writes for a typical shortener.
Low-latency redirect: sub-100ms p99.
High availability: a dead short link breaks every page, email, and post that embeds it.
Durable: losing a mapping is unacceptable.

Out of scope

User accounts, OAuth, billing UI. Pretend these exist via a sibling service.

Back-of-envelope capacity

Metric	Value
New URLs per day	100M
Reads per day	10B (100:1 ratio)
New URLs / sec (avg)	~1,200
Reads / sec (avg)	~120k
Reads / sec (peak, 3×)	~360k
URL row size	~500 bytes (URL + metadata)
Storage / year	100M × 365 × 500B ≈ 18 TB/year

Conclusion: writes are easy. Reads are not. The whole design pivots around making reads cheap.
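The table is easy to re-derive. Here is a minimal sketch of the arithmetic; the function name and its parameters are my own, not part of any library. The exact figures come out slightly below the rounded ones in the table (for example, peak reads are about 347k/s when you triple the unrounded average rather than the rounded 120k).

```python
SECONDS_PER_DAY = 24 * 60 * 60


def capacity(new_urls_per_day: int, read_write_ratio: int,
             row_bytes: int, peak_factor: int = 3) -> dict[str, float]:
    """Back-of-envelope load and storage figures for a shortener."""
    reads_per_day = new_urls_per_day * read_write_ratio
    return {
        "writes_per_sec": new_urls_per_day / SECONDS_PER_DAY,
        "reads_per_sec": reads_per_day / SECONDS_PER_DAY,
        "peak_reads_per_sec": peak_factor * reads_per_day / SECONDS_PER_DAY,
        # decimal terabytes: 1 TB = 10**12 bytes
        "storage_tb_per_year": new_urls_per_day * 365 * row_bytes / 1e12,
    }


estimate = capacity(new_urls_per_day=100_000_000, read_write_ratio=100, row_bytes=500)
for name, value in estimate.items():
    print(f"{name}: {value:,.2f}")
```

Changing the read:write ratio or the row size is a one-argument tweak, which is handy when the interviewer moves the goalposts.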

API

POST /api/shorten
  body: {"url": "https://example.com/...", "ttl_days": 365}
  → 201 {"short": "https://r.pt/aB3xQ"}

GET /{code}
  → 301 Location: <long_url>     (or 404 / 410 expired)

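Why 301 and not 302? A 301 is permanent and cacheable: browsers and CDNs can store it and answer repeat clicks without touching your servers. For a service handling 10B redirects/day, that’s an enormous saving. The cost is that a cached redirect never reaches you, so those clicks go uncounted and an expired or disabled link can keep working until the cached copy ages out. Use 301 by default; switch to 302 only if you need real-time analytics on every click.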
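A sketch of how the redirect response might be assembled under that rule. The function name, the track_every_click flag, and the one-hour max-age are assumptions for illustration; the status codes and the Cache-Control directives are standard HTTP.

```python
def redirect_response(long_url: str, track_every_click: bool = False,
                      max_age: int = 3600) -> tuple[int, dict[str, str]]:
    """Return (status, headers) for a redirect to long_url."""
    if track_every_click:
        # 302 plus no-store: every click has to come back to the service.
        return 302, {"Location": long_url, "Cache-Control": "no-store"}
    # 301 with an explicit max-age bounds how long a stale copy can live.
    return 301, {
        "Location": long_url,
        "Cache-Control": f"public, max-age={max_age}",
    }


status, headers = redirect_response("https://example.com/some/long/path")
print(status, headers)
```

Setting an explicit max-age on the 301 is a hedge: it keeps the caching benefit while putting an upper bound on how long an expired link can survive in a browser or CDN.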

ID generation — pick wisely

The short code is just an ID encoded into a short string. Three patterns:

1. Random hash, retry on collision

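import secrets
import string

ALPHABET = string.digits + string.ascii_letters   # 62 characters, no "-" or "_"

def shorten(url):
    for _ in range(5):                             # bounded retries
        code = "".join(secrets.choice(ALPHABET) for _ in range(7))
        if try_insert(code, url):                  # assumed falsy on a key conflict
            return code
    raise RuntimeError("no free code after 5 attempts")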

Pros: Stateless. Trivial to scale.
Cons: Each collision costs an extra DB round trip. With 100M URLs stored in the 7-char base62 space (~3.5T codes), a fresh code collides about once in 35,000 inserts: low but nonzero, and it rises as the table fills.
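To see how that probability moves as the table fills, here is a small sketch. It assumes codes are drawn uniformly at random from the full 7-char space; the function names are mine.

```python
CODE_SPACE = 62 ** 7   # number of distinct 7-char base62 codes


def collision_probability(stored: int, space: int = CODE_SPACE) -> float:
    """Chance that one fresh random code is already taken."""
    return stored / space


def all_retries_fail(stored: int, attempts: int = 5,
                     space: int = CODE_SPACE) -> float:
    """Chance that every one of `attempts` independent draws collides."""
    return collision_probability(stored, space) ** attempts


for stored in (100_000_000, 36_500_000_000):   # one day vs. one year of URLs
    p = collision_probability(stored)
    print(f"{stored:>14,} stored: p(collision)={p:.6f}, "
          f"p(5 retries fail)={all_retries_fail(stored):.3e}")
```

Even after a year of writes at 100M/day the per-insert collision rate is around 1%, so the five-attempt loop essentially never exhausts; the real cost is the occasional extra round trip.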

2. Hash of URL (deterministic)

import hashlib
code = base62(int.from_bytes(hashlib.sha256((url + salt).encode()).digest(), "big"))[:7]

Pros: Same URL → same short. Caches well.
Cons: Same URL → same short, even if the user wanted a fresh one. Salting and rehashing on collision still requires a DB check.

3. Counter + base62

A monotonic counter (or batched range) per shard, encoded base62 to make it short.

n = next_counter()       # 1, 2, 3, ...
code = base62(n)         # "1", "2", ..., "z", "10", ..., "aB3xQ"

Pros: No collisions. Sequential. Compresses well.
Cons: Counter is global state. Snowflake-style (timestamp + machine ID + sequence) sidesteps that.

My pick: counter-based with a Snowflake-like generator. Each app server gets a machine ID, generates (timestamp << 22) | (machine_id << 12) | sequence, and encodes the result in base62. No coordination on the hot path.

def base62(n: int) -> str:
    A = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    if n == 0:
        return A[0]              # the loop below would return "" for 0
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(A[r])         # least-significant digit first
    return "".join(reversed(out))

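7 base62 chars = 62⁷ ≈ 3.5T codes, which lasts a plain counter nearly a century at 100M new URLs/day. A Snowflake-style ID is wider than that: a 63-bit value encodes to as many as 11 base62 chars, so shrink the timestamp, machine, or sequence fields if 7-char codes matter.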
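The Snowflake-like layout above could be sketched roughly as follows. The shifts (22 and 12) come from the formula in the text, which implies a 10-bit machine ID and a 12-bit sequence; the custom epoch, the class name, and the way clock regressions are handled are my assumptions, not a reference implementation.

```python
import threading
import time

CUSTOM_EPOCH_MS = 1_700_000_000_000   # assumed service epoch (mid-November 2023)
MACHINE_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class SnowflakeGenerator:
    """Per-process ID generator: (timestamp << 22) | (machine_id << 12) | sequence."""

    def __init__(self, machine_id: int) -> None:
        if not 0 <= machine_id < (1 << MACHINE_BITS):
            raise ValueError("machine_id must fit in 10 bits")
        self._machine_id = machine_id
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self) -> int:
        with self._lock:
            now_ms = int(time.time() * 1000) - CUSTOM_EPOCH_MS
            if now_ms < self._last_ms:
                # Wall clock went backwards (or we ran ahead of it):
                # keep the last timestamp so IDs stay unique and increasing.
                now_ms = self._last_ms
            if now_ms == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # 4096 IDs used in this millisecond: borrow the next one.
                    now_ms += 1
            else:
                self._sequence = 0
            self._last_ms = now_ms
            return ((now_ms << (MACHINE_BITS + SEQUENCE_BITS))
                    | (self._machine_id << SEQUENCE_BITS)
                    | self._sequence)


gen = SnowflakeGenerator(machine_id=7)
print(gen.next_id(), gen.next_id())
```

The design choice worth calling out: instead of sleeping when the sequence wraps or the clock regresses, this sketch lets the logical timestamp run slightly ahead of wall time. That keeps the hot path non-blocking at the price of timestamps that are only approximately real under extreme load.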

Schema

CREATE TABLE urls (
    code         TEXT PRIMARY KEY,
    long_url     TEXT NOT NULL,
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    expires_at   TIMESTAMPTZ,
    user_id      BIGINT,
    is_active    BOOLEAN NOT NULL DEFAULT true
);

CREATE INDEX urls_user ON urls (user_id);
CREATE INDEX urls_expires ON urls (expires_at) WHERE is_active = true;

Using code as the primary key gives us the unique constraint for free, and the most common query (lookup by code) is a primary-key index hit.

Read path (the hot path)

client → CDN → app server → cache → database

Layered caching:

CDN cache (1 hour) — most popular URLs served entirely from edge. Free.
Redis (24 hour) — hot working set in memory.
Database — the source of truth.

async def lookup(code: str) -> str | None:
    cached = await redis.get(f"u:{code}")
    if cached is not None:
        # b"" is the negative-cache marker for not-found
        return cached.decode() or None

    row = await db.fetchrow(
        "SELECT long_url FROM urls WHERE code = $1 AND is_active"
        " AND (expires_at IS NULL OR expires_at > now())", code)
    if row is None:
        await redis.set(f"u:{code}", "", ex=60)   # cache the not-found for 1 min
        return None
    await redis.set(f"u:{code}", row["long_url"], ex=86400)
    return row["long_url"]

Two production patterns most tutorials skip:

Negative caching for 404s prevents brute-force scanning from hammering the DB.
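TTL ≠ delete-on-expiry. A 24-hour cache entry outlives a link that expires or is deactivated in the meantime, so long TTLs are fine when expiry and deactivation also invalidate the cached entry; they’re risky when they don’t, because dead links keep redirecting.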
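One way to keep a cached entry from outliving its link is to cap the cache TTL at the link’s remaining lifetime. A minimal sketch, assuming expires_at is a timezone-aware datetime or None; the function name and MAX_CACHE_TTL constant are mine. Manual deactivation still needs an explicit delete of the cache key, which this does not cover.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_CACHE_TTL = 86_400   # the 24-hour Redis TTL used in the read path


def cache_ttl_seconds(expires_at: Optional[datetime],
                      now: Optional[datetime] = None) -> int:
    """TTL for a cache entry; 0 means the link is already expired, so skip caching."""
    if expires_at is None:
        return MAX_CACHE_TTL            # link never expires
    if now is None:
        now = datetime.now(timezone.utc)
    remaining = int((expires_at - now).total_seconds())
    return max(0, min(MAX_CACHE_TTL, remaining))


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(cache_ttl_seconds(now + timedelta(hours=1), now))    # link dies in an hour
print(cache_ttl_seconds(now + timedelta(days=400), now))   # link outlives the cap
```

The result would be passed as the ex= argument when populating the cache, with a zero result meaning "do not cache at all".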

Write path

from datetime import datetime, timedelta, timezone

async def shorten(url: str, ttl_days: int = 365) -> str:
    code = base62(snowflake_id())
    expires_at = datetime.now(timezone.utc) + timedelta(days=ttl_days)
    await db.execute(
        "INSERT INTO urls (code, long_url, expires_at) VALUES ($1, $2, $3)",
        code, url, expires_at,
    )
    await redis.set(f"u:{code}", url, ex=86400)     # warm cache on insert
    return code

Warming the cache on insert pays off when users immediately share the link.

Capacity again, sharper

Reads at 360k/s peak. With a 95% cache hit ratio:

5% miss → 18k/s to DB.
A single well-provisioned Postgres instance can typically serve that as primary-key lookups. Add read replicas if needed.
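The miss arithmetic is worth keeping as a one-liner, because the database load is very sensitive to the hit ratio. A quick sketch (the function name is mine):

```python
def db_reads_per_sec(peak_reads_per_sec: float, hit_ratio: float) -> float:
    """Reads that fall through the cache and reach the database."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return peak_reads_per_sec * (1.0 - hit_ratio)


for ratio in (0.90, 0.95, 0.99):
    load = db_reads_per_sec(360_000, ratio)
    print(f"{ratio:.0%} hit ratio -> {load:,.0f} reads/s to the DB")
```

Dropping from a 95% to a 90% hit ratio doubles the database load, which is why the cache hit ratio is the number to watch in production.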

Storage: 18TB/year. Postgres can hold that with partitioning, but at multi-year scales, consider:

Cold-data archive: URLs not hit in N days → S3 + parquet, served via a fallback path.
Sharded Postgres or a managed KV store (DynamoDB, Spanner, FoundationDB).

Analytics

Don’t write analytics on the hot path. The redirect should be one DB read, no write.

Pattern:

client → app → emit click event to Kafka
                           ↓
                      analytics consumer
                           ↓
                   write to ClickHouse / BigQuery

The redirect path does no synchronous write, so analytics adds nothing to its latency. Analytics can lag a few seconds; nobody clicks “show me my clicks” expecting microsecond freshness.
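The shape of that pipeline can be sketched in-process, with an asyncio.Queue standing in for Kafka and a plain list standing in for the ClickHouse writer. Both stand-ins, and the function names, are assumptions for illustration; the point is that the redirect handler only enqueues and never waits.

```python
import asyncio
import time


def record_click(queue: asyncio.Queue, code: str) -> bool:
    """Called from the redirect handler: enqueue without blocking.

    Returns False (the event is dropped) if the buffer is full, so a slow
    analytics backend can never stall a redirect.
    """
    try:
        queue.put_nowait({"code": code, "ts": time.time()})
        return True
    except asyncio.QueueFull:
        return False


async def analytics_consumer(queue: asyncio.Queue, sink: list,
                             batch_size: int = 100) -> None:
    """Drain events in batches; `sink` stands in for a bulk insert."""
    while True:
        batch = [await queue.get()]
        while len(batch) < batch_size and not queue.empty():
            batch.append(queue.get_nowait())
        sink.append(batch)
        for _ in batch:
            queue.task_done()


async def demo() -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=1000)
    sink: list = []
    consumer = asyncio.create_task(analytics_consumer(queue, sink))
    for code in ("aB3xQ", "aB3xQ", "Zk91p"):
        record_click(queue, code)
    await queue.join()          # wait until every event has been processed
    consumer.cancel()
    return sink


batches = asyncio.run(demo())
print(sum(len(b) for b in batches), "events written")
```

Dropping events when the buffer is full is a deliberate choice here: losing a click count is cheaper than slowing a redirect.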

Abuse and security

A URL shortener is a phishing/spam vector. Defenses:

URL validation at creation: reject non-http(s), malformed, very long URLs.
Block lists for known-bad domains (Google Safe Browsing API).
Rate limit on the create endpoint per IP and per user.
CAPTCHA for anonymous creates above a threshold.
Click-time scanning: at redirect, check whether the domain is on a recent block list; if so, show an interstitial warning rather than redirecting blindly.
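The creation-time checks in the list above might look something like this. It is a sketch: the length limit and the in-memory BLOCKED_HOSTS set are placeholders I made up, and a real service would consult an external block list rather than a hardcoded set.

```python
from urllib.parse import urlsplit

MAX_URL_LENGTH = 2048                      # assumed limit
BLOCKED_HOSTS = {"malware.example"}        # placeholder block list


def validate_url(url: str) -> tuple[bool, str]:
    """Return (ok, reason) for a URL submitted to the create endpoint."""
    if len(url) > MAX_URL_LENGTH:
        return False, "too long"
    try:
        parts = urlsplit(url)
        host = parts.hostname              # lowercased by urllib
    except ValueError:
        return False, "malformed"
    if parts.scheme not in ("http", "https"):
        return False, "scheme not allowed"
    if not host:
        return False, "missing host"
    # Block the listed domain and any of its subdomains.
    if any(host == bad or host.endswith("." + bad) for bad in BLOCKED_HOSTS):
        return False, "blocked domain"
    return True, "ok"


for candidate in ("https://example.com/post", "javascript:alert(1)",
                  "http://cdn.malware.example/x"):
    print(candidate, "->", validate_url(candidate))
```

Returning a reason alongside the boolean makes it easier to log why a submission was rejected, which helps when tuning the block list.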

Custom aliases

POST /api/shorten
  body: {"url": "...", "alias": "blog-post-42"}

Same flow but the user picks the code. Reserve a separate namespace so customs can’t collide with auto-generated codes:

import re

def is_valid_alias(s: str) -> bool:
    return 4 <= len(s) <= 32 and re.fullmatch(r"[a-zA-Z0-9-]+", s) is not None and "-" in s

The “must contain a hyphen” rule is a cheap separator: aB3xQ is auto, my-link is custom. They can’t collide.

Multi-region

For a global service:

Read replicas in every region (Postgres logical replication or a managed multi-region DB).
Writes to a primary region; reads from local replica.
CDN at the edge — serves most redirects without hitting any backend.
Asynchronous propagation for analytics.

The interesting tradeoff: write linearizability across regions costs you write availability during partitions. For a URL shortener, eventual consistency on creation is fine — a fresh link being briefly invisible in another region is acceptable.

What interviewers love to dig into

“How do you handle a celebrity click storm?” → CDN, request coalescing, pre-warm the cache when a traffic spike is detected.
“How do you migrate the schema?” → Add the column as nullable; backfill in batches; switch reads to it; then tighten the constraint (NOT NULL) once every row is populated.
“What if Redis goes down?” → DB has the answer, slightly slower; circuit-break Redis to fail fast; consider a second cache layer.
“How do you reach 1M shortens/sec?” → Snowflake IDs (no coordination), partition the urls table on code, and buffer inserts through a durable queue so the database absorbs them in batches.
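Request coalescing, mentioned in the click-storm answer, is the least familiar item on that list, so here is a minimal asyncio sketch. The SingleFlight class and its do method are hypothetical names (borrowed from the pattern’s common name); fetch stands in for the cache-miss path that goes to the database.

```python
import asyncio
from typing import Awaitable, Callable


class SingleFlight:
    """Collapse concurrent lookups for the same key into one backend call."""

    def __init__(self) -> None:
        self._inflight: dict[str, asyncio.Task] = {}

    async def do(self, key: str,
                 fetch: Callable[[str], Awaitable[str]]) -> str:
        task = self._inflight.get(key)
        if task is None:
            # First caller for this key starts the real fetch.
            task = asyncio.create_task(fetch(key))
            self._inflight[key] = task
            task.add_done_callback(lambda _t: self._inflight.pop(key, None))
        # shield: one cancelled waiter must not cancel the shared fetch.
        return await asyncio.shield(task)


async def demo() -> tuple[int, set]:
    calls = 0

    async def slow_db_lookup(code: str) -> str:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)          # pretend this is a DB round trip
        return f"https://example.com/{code}"

    flight = SingleFlight()
    results = await asyncio.gather(
        *(flight.do("aB3xQ", slow_db_lookup) for _ in range(50)))
    return calls, set(results)


print(asyncio.run(demo()))
```

Fifty concurrent misses on the same code turn into a single database read; everyone else waits on the same task. This only coalesces within one process, so the CDN and Redis layers still do most of the work during a storm.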

What I’d actually do day one

For a production URL shortener built today:

Postgres (with urls partitioned by code hash) for source of truth.
Redis/Valkey as the cache.
A small Go or Rust service for the redirect (latency budget is tight).
Cloudflare in front (CDN + DDoS).
Kafka → ClickHouse for click analytics.
Rate limiting and Safe Browsing checks for abuse.

That’s a system that scales from 0 to 1B redirects/day with the same architecture.
