Most “distributed systems” problems are the same handful of tradeoffs reshuffled. Once you internalize them, system design stops being a memory test and starts being applied judgment. This is the mental model I wish I’d had on day one.
Rule zero: partial failure is normal
Single-machine programs have two states: working or crashed. Distributed programs have a third: half-working.
- A network call that succeeded on one node but the response was lost.
- A leader that thinks it’s still leader after a partition.
- Three replicas with three different opinions on what the latest value is.
Every design decision below exists because partial failure is real. If your design doesn’t have a story for “what if this RPC neither succeeded nor failed cleanly,” it’s not done.
CAP — and why people misuse it
CAP says: in the presence of a network partition (P), you must choose either consistency (C) or availability (A). You cannot have both.
Two clarifications most people miss:
- Partitions happen. You don’t get to “not choose P.” The decision is what to do when a partition occurs.
- CAP is about strict (linearizable) consistency. Most real systems are AP at the linearizable level and CP at a weaker level (eventual consistency, read-your-writes). The dichotomy is a teaching tool, not a switch.
In 2026 the more useful framing is PACELC: in case of Partition, Availability or Consistency; Else (no partition), Latency or Consistency. Even when nothing’s on fire, the consistency dial costs latency.
Replication
You have one machine. It dies. You lose data. Solution: replicate.
Three patterns:
Single-leader (most systems)
One node accepts writes, replicates to followers. Examples: classic Postgres, MySQL, Redis primary/replica.
- Pros: Simple model. Reads from any replica. No write conflicts.
- Cons: Leader is a bottleneck. Failover is a moment — the cluster is down briefly. Followers can lag.
Multi-leader
Multiple nodes accept writes, replicate to each other. Examples: Cassandra (in some configs), Couchbase, multi-region active-active.
- Pros: Higher write availability. No single failover. Geo-local writes.
- Cons: Conflicts must be resolved. CRDTs or LWW (last-write-wins) get complex fast.
Leaderless
Every node accepts every operation; quorums decide. Examples: DynamoDB, Cassandra, Riak.
- Pros: Maximum availability. Smooth scaling.
- Cons: Tunable but tricky consistency. Read-your-writes requires care.
90% of services run on single-leader and are happy. Only reach for multi-leader/leaderless when you need it — usually for write availability across regions.
Partitioning (sharding)
You have more data than one machine can hold. Solution: split.
Two strategies:
Hash partitioning
shard = hash(key) % N. Even distribution, no hot spots.
- Pros: Simple, balanced.
- Cons: Range queries become “ask everyone.” Resharding is painful.
Range partitioning
Keys fall into ordered ranges, each on a node.
- Pros: Range scans are fast. You can rebalance smoothly.
- Cons: Hot spots are possible if access skews to a range.
Consistent hashing is the industry’s answer to “hash partitioning makes resharding painful.” Each key hashes onto a virtual ring; each node owns an arc; adding a node only moves a small slice. Read Caching Strategies for the cache flavor; the principle is the same.
In practice, sharded systems use hash partitioning + secondary indexes for the few queries that need range. Pure range partitioning is reserved for time-series data or naturally ordered keys.
Consistency models, simplest to strongest
Most arguments about “consistency” are arguments about which of these you’re talking about.
1. Eventual
“Given enough time with no writes, replicas converge.” Weakest useful model. DNS, Cassandra defaults, S3 list-after-write (historically).
2. Read-your-writes
“After I write, my own reads see my write.” The minimum a sane app needs for user experience. Usually achieved by routing a user’s reads to the same replica they wrote to (sticky sessions) or reading from leader.
3. Monotonic reads
“I never see time go backwards.” Once I read v3, I don’t subsequently read v2. Usually a session-scoped guarantee.
4. Causal
“If A happened-before B, all observers see A before B.” Vector clocks; CRDTs lean here.
5. Sequential / linearizable
“Operations appear to happen one at a time, in some single global order.” Atomic counters, leader elections, financial transactions. Expensive — requires consensus (Raft/Paxos).
The art is choosing the weakest model that still satisfies the use case. Linearizable everywhere is for tutorials, not for systems that scale.
Time, clocks, and ordering
Two laws:
- Wall clocks lie. NTP drift, leap seconds, VM pause skew. You cannot use a
Date.now()to say “this happened first.” - Logical clocks are how you order distributed events. Lamport timestamps, vector clocks, hybrid logical clocks (HLCs).
Spanner’s TrueTime is the famous exception: it bounds clock uncertainty with GPS + atomic clocks. Outside Google scale, treat physical time as advisory.
For ordering events across services, use:
- Per-record version numbers (optimistic concurrency) for “did this change between read and write?”
- Snowflake / KSUID / UUIDv7 for sortable, time-ordered IDs.
- Vector clocks when you need to reason about cause and effect across nodes.
The three patterns that come up everywhere
1. Idempotency
Network can deliver a request 0, 1, or many times. The receiver must produce the same outcome regardless. The mechanism is an idempotency key — client supplies a unique ID per logical operation, server dedups.
POST /payments
Idempotency-Key: ord-42-charge-1
{ "amount": 1000 }
Server stores (key, response) for some retention window. Repeated request → same response, no duplicate charge.
I’ll cover this in depth in Idempotency, Retries, and Exactly-Once Illusions .
2. Outbox / saga
You wrote to your DB and now need to publish an event. If you do them as separate operations, either can fail. The fix:
- Outbox: write the event into an
outboxtable in the same DB transaction. A separate worker reads outbox rows and publishes to the broker, marking them dispatched. - Saga: long-running workflow modeled as a chain of steps, each compensated by an inverse step on failure.
These patterns make distributed mutations safe without distributed transactions (which mostly don’t exist outside of a few systems).
3. Backpressure
Producers can outrun consumers. Without backpressure, queues grow unbounded; latency explodes; the system collapses. Mechanisms:
- Bounded queues (block when full)
- Rate limiting (token buckets)
- Load shedding (drop requests at the edge)
- Adaptive concurrency (TCP-style AIMD)
Every production-quality system has explicit backpressure somewhere. Implicit backpressure (= unbounded queues) is a footgun.
Two failure modes you’ll meet
Cascading failure
Service A retries failures against B. B is overloaded. A’s retries pile on. B gets worse. A’s queue fills. A goes down. Now C and D, which depend on A, fail.
Defenses: circuit breakers (stop calling B when failure rate is high), deadline propagation (cancel work whose budget is spent), bulkheads (separate connection pools so one bad downstream can’t sink the whole service).
Thundering herd
A cache key expires; 1000 clients miss simultaneously; 1000 requests hit the database. Defenses: request coalescing (only one fetch in flight per key), probabilistic early expiration, stale-while-revalidate.
What I’d actually study
If you’re prepping a system design interview or designing a real system, the working set:
- HTTP, TCP, DNS at the level where you can debug them.
- One DB deeply — typically PostgreSQL. Read-replicas, indexing, transactions, isolation levels.
- One cache — Redis/Valkey. TTLs, eviction, hot-key handling.
- One message broker — Kafka, NATS, or RabbitMQ. See Choosing a Message Queue .
- CAP, PACELC, consistency models, partitioning, replication. This post.
- Idempotency and outbox. See linked posts.
- Observability: SLOs, error budgets, traces. See Observability .
Master these and you can reason about almost any distributed system you’ll meet.
Read this next
If you want a curated reading list and worked-out design problems with diagrams, it’s at rajpoot.dev .
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .