In a distributed system, every network call has three outcomes: success, failure, or the dreaded unknown. The third one is where bugs live. This post covers production-quality patterns for handling it correctly.
We’ll cover idempotency keys (the way Stripe does it), retry budgets, the outbox pattern, and why “exactly-once delivery” is a marketing term that doesn’t survive contact with reality.
The fundamental problem
Client sends a request. Server processes it. Network drops the response. Now what?
Client → POST /payments → Server (charged $100)
← [TIMEOUT, RST, ECONNRESET, ...] ←
Two interpretations, both consistent with what the client saw:
- The request never reached the server. Retry to make sure.
- The server processed it; only the response was lost. Retry → double charge.
The client cannot distinguish them. The fix is to make retries safe.
Idempotency keys (the Stripe pattern)
The client generates a unique key per logical operation and sends it on every retry:
POST /payments
Idempotency-Key: ord-2026-04-1234567
{ "amount": 10000, "currency": "INR" }
Server logic:
- Look up Idempotency-Key. If found, return the stored response. Don’t re-process.
- If not found, process and store (key, response, ttl).
The contract:
- Client guarantees the key is unique per logical operation.
- Server guarantees identical responses for identical keys.
That’s it. Retries are now safe.
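To make the contract concrete, here is a minimal in-memory sketch. The class name `IdempotentEndpoint` and the `charge` handler are hypothetical, and the sketch ignores TTLs, persistence, and concurrency; it only shows that a replayed key returns the stored response without repeating the side effect.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class IdempotentEndpoint:
    """Toy server that remembers the response produced for each key."""
    handler: Callable[[dict], dict]
    _responses: Dict[str, dict] = field(default_factory=dict)

    def handle(self, key: str, body: dict) -> dict:
        if key in self._responses:
            # Seen before: replay the stored response, do not re-process.
            return self._responses[key]
        response = self.handler(body)
        self._responses[key] = response
        return response


charges = []  # side effects that must happen once per logical operation


def charge(body: dict) -> dict:
    charges.append(body["amount"])
    return {"charge_id": len(charges), "amount": body["amount"]}


endpoint = IdempotentEndpoint(charge)
payload = {"amount": 10000, "currency": "INR"}
first = endpoint.handle("ord-2026-04-1234567", payload)
retry = endpoint.handle("ord-2026-04-1234567", payload)  # simulated client retry
```

Both calls return the same response, and `charges` holds a single entry, which is the property the rest of this post builds on.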
Implementation
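import json

from fastapi import HTTPException, Request, Response

# Assumed to exist elsewhere: `db` (an asyncpg-style connection),
# `PaymentBody` (the request model), and `charge_card`.


async def create_payment(req: Request, body: PaymentBody, idempotency_key: str):
    # Step 1: existence check + insert in one query.
    # The scalar subqueries always yield exactly one row. The INSERT and the
    # outer SELECT share one snapshot, so a row inserted by this statement (or
    # committed by a concurrent request while it ran) is not visible to the
    # lookups; new_key alone says whether this request won the insert.
    cached = await db.fetchrow(
        """
        WITH ins AS (
            INSERT INTO idempotency (key, status, created_at)
            VALUES ($1, 'in_progress', now())
            ON CONFLICT (key) DO NOTHING
            RETURNING key
        )
        SELECT
            (SELECT key FROM ins) AS new_key,
            (SELECT response_status FROM idempotency WHERE key = $1) AS response_status,
            (SELECT response_body FROM idempotency WHERE key = $1) AS response_body
        """,
        idempotency_key,
    )
    if cached["new_key"] is None:
        # Existed before (or a concurrent request inserted it first)
        if cached["response_status"] is None:
            # Another request is in flight; tell the client to retry later
            raise HTTPException(409, "in_progress")
        return Response(
            content=cached["response_body"],
            status_code=cached["response_status"],
            media_type="application/json",
        )

    # Process the actual payment
    result = await charge_card(body)

    # Persist response so retries return it
    await db.execute(
        """
        UPDATE idempotency
        SET status = 'completed', response_status = $2, response_body = $3
        WHERE key = $1
        """,
        idempotency_key, 200, json.dumps(result),
    )
    return result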
Three production details:
- The CTE is atomic: there is no race where two requests both think they’re new.
- A row in 'in_progress' state catches concurrent duplicates (rare but real with retries during network blips).
- The response body is stored verbatim. A retry sees a byte-identical response, which matters for clients that hash responses.
Validation
- Reject mismatched payloads: if the same key arrives with a different body, return 422. Otherwise, a buggy client that reuses a key would silently receive the stored response for a different operation.
- Bound the key length (e.g., 1–255 chars).
- TTL: 24 hours is typical. Stripe uses 24h. After expiry, the same key works as new.
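One way to implement the first two checks is to store a fingerprint of the request body next to the key and compare it on every replay. The sketch below is an illustration, not a prescribed design: `fingerprint`, `validate`, and `IdempotencyError` are hypothetical names, and returning 400 for an out-of-range key length is an assumption (the text above only fixes 422 for mismatched bodies).

```python
import hashlib
import json
from typing import Optional

MAX_KEY_LENGTH = 255


class IdempotencyError(Exception):
    """Carries the HTTP status the endpoint should respond with."""

    def __init__(self, status: int, detail: str):
        super().__init__(detail)
        self.status = status


def fingerprint(body: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so that key order and
    # whitespace do not change the hash.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def validate(key: str, body: dict, stored_fingerprint: Optional[str]) -> str:
    """Return the body fingerprint, or raise if the request must be rejected.

    stored_fingerprint is None when the key has not been seen before.
    """
    if not 1 <= len(key) <= MAX_KEY_LENGTH:
        raise IdempotencyError(400, "Idempotency-Key must be 1-255 characters")
    fp = fingerprint(body)
    if stored_fingerprint is not None and stored_fingerprint != fp:
        raise IdempotencyError(422, "Idempotency-Key reused with a different body")
    return fp
```

In the table-backed implementation above, the fingerprint would live in an extra column written by the initial insert and read back by the lookup.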
Retry strategies (client side)
Retries are not free. Three rules:
1. Exponential backoff with jitter
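import asyncio
import random


class RetryableError(Exception):
    """Raised by fn for failures that are worth retrying (assumed type)."""


async def call_with_retry(fn, max_attempts=5, base_ms=200, max_ms=10_000):
    for attempt in range(1, max_attempts + 1):
        try:
            return await fn()
        except RetryableError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Exponential cap: base, 2x base, 4x base, ... up to max_ms
            sleep_ms = min(max_ms, base_ms * 2 ** (attempt - 1))
            sleep_ms = random.uniform(0, sleep_ms)  # full jitter
            await asyncio.sleep(sleep_ms / 1000)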
Without jitter, a fleet retries in unison — thundering herd against a recovering downstream. Jitter spreads them out.
2. Don’t retry non-retryable errors
| HTTP code | Retry? |
|---|---|
| 408 Request Timeout | Yes |
| 425 Too Early | Yes |
| 429 Too Many Requests | Yes (respect Retry-After) |
| 500 Internal | Yes |
| 502 Bad Gateway | Yes |
| 503 Service Unavailable | Yes |
| 504 Gateway Timeout | Yes |
| 4xx (other) | No |
| Network errors | Yes |
Retrying a 400 from your own validation layer spends budget on a request that will never succeed.
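The table translates into a small classifier. This is a sketch under assumptions: `should_retry` and `retry_after_seconds` are hypothetical helper names, `status=None` stands for "no HTTP response arrived", headers are modeled as a plain dict with exact-case keys, and only the delta-seconds form of Retry-After is parsed.

```python
from typing import Optional

# Status codes marked "Yes" in the table above.
RETRYABLE_STATUSES = {408, 425, 429, 500, 502, 503, 504}


def should_retry(status: Optional[int]) -> bool:
    """status is None when the call failed before any response arrived."""
    if status is None:
        return True  # network error
    return status in RETRYABLE_STATUSES


def retry_after_seconds(headers: dict) -> Optional[float]:
    """Return the Retry-After delay in seconds, or None if absent or not numeric."""
    value = headers.get("Retry-After")
    if value is None:
        return None
    try:
        return float(max(0, int(value)))
    except ValueError:
        return None  # HTTP-date form is not handled in this sketch
```

A caller would typically sleep for the larger of the Retry-After value and its own jittered backoff before the next attempt.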
3. Bound the retries
A naive client can DDoS its own backend during a partial outage. Two limits:
- Per-request attempts — usually 3–5.
- Per-process retry budget — e.g., “no more than 10% of total requests can be retries.” When the budget is exhausted, fail fast.
Retry budgets are how Google’s gRPC clients prevent retry storms. Worth borrowing.
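A per-process budget can be sketched as two counters. The class below is an illustration only: `RetryBudget` and its method names are hypothetical, it counts over the whole process lifetime instead of a sliding window, and it is not how any particular gRPC client implements throttling.

```python
class RetryBudget:
    """Grant a retry only if retries stay within `ratio` of all requests sent."""

    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio
        self.requests = 0  # first attempts
        self.retries = 0   # retries that were granted

    def record_request(self) -> None:
        """Call once for every first attempt."""
        self.requests += 1

    def try_acquire(self) -> bool:
        """Return True and count the retry if the budget allows it."""
        total_after = self.requests + self.retries + 1
        if self.retries + 1 <= self.ratio * total_after:
            self.retries += 1
            return True
        return False  # budget exhausted: fail fast instead of retrying


budget = RetryBudget(ratio=0.1)
for _ in range(9):
    budget.record_request()

first_retry = budget.try_acquire()   # 1 retry out of 10 total requests
second_retry = budget.try_acquire()  # would be 2 out of 11
```

With nine first attempts on the books, the first retry fits inside the 10% budget and the second does not. A production version would decay the counters over time so that an old burst of traffic does not fund retries forever.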
At-most, at-least, exactly-once
Three delivery semantics for messaging:
- At-most-once. Send and pray. Fast. Loses messages on failure.
- At-least-once. Retry until acknowledged. Never loses; may duplicate.
- Exactly-once. “Each message is delivered exactly once.” Marketing copy.
There is no exactly-once over an unreliable network. What you can build is at-least-once delivery + idempotent processing, which together produce effectively-once outcomes. That’s the actual goal.
Kafka’s “exactly-once semantics” (EoS) is a constrained version: idempotent producers plus transactions that atomically commit produced records and consumer offsets, all within Kafka. It does not extend to your downstream HTTP call to Stripe. You still need application-level idempotency.
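A small simulation shows how at-least-once delivery plus an idempotent handler yields an effectively-once outcome. Everything here is hypothetical scaffolding: `deliver_at_least_once` stands in for a broker that redelivers until an ack arrives, and `ack_results` scripts which acks get through.

```python
processed_ids = set()
balance = {"value": 0}


def handle(event: dict) -> None:
    """Idempotent handler: applies each event id at most once."""
    if event["id"] in processed_ids:
        return  # duplicate delivery, nothing to do
    balance["value"] += event["amount"]
    processed_ids.add(event["id"])


def deliver_at_least_once(event: dict, handler, ack_results) -> int:
    """Redeliver until an ack gets through; return the number of deliveries."""
    deliveries = 0
    for ack_arrived in ack_results:
        handler(event)
        deliveries += 1
        if ack_arrived:
            break  # broker saw the ack and stops redelivering
    return deliveries


# The first two acks are lost, so the broker delivers the event three times.
deliveries = deliver_at_least_once(
    {"id": "evt-1", "amount": 100}, handle, [False, False, True]
)
```

The event is delivered three times, but the balance changes once.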
The outbox pattern
You’ve written a row in your DB and now need to publish an event. If you do them as separate operations, either can fail:
async with db.transaction():
    await db.execute("INSERT INTO orders ...")
await broker.publish("order.created", payload)  # ⛔ if this fails, no event
The fix:
async with db.transaction():
    await db.execute("INSERT INTO orders ...")
    await db.execute(
        "INSERT INTO outbox (topic, payload) VALUES ($1, $2)",
        "order.created", json.dumps(payload),
    )
# Separate worker reads outbox and publishes
A separate worker:
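import asyncio


async def relay_outbox(poll_interval: float = 1.0):
    while True:
        # Row locks last only as long as the enclosing transaction, so the
        # select, the publish, and the mark-as-published share one transaction.
        async with db.transaction():
            rows = await db.fetch(
                "SELECT id, topic, payload FROM outbox WHERE published_at IS NULL "
                "ORDER BY id LIMIT 100 FOR UPDATE SKIP LOCKED"
            )
            for r in rows:
                await broker.publish(r["topic"], r["payload"])
                await db.execute("UPDATE outbox SET published_at = now() WHERE id = $1", r["id"])
        if not rows:
            await asyncio.sleep(poll_interval)  # avoid a busy loop when idle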
Why this works:
- The DB row + outbox row are in the same transaction. They commit or roll back together.
- The relay is at-least-once. Consumers must be idempotent, but they should be anyway.
- FOR UPDATE SKIP LOCKED lets multiple relay workers run in parallel, as long as the select and the updates run inside one transaction (the locks are released when it ends).
This is the boring, correct pattern. It’s what every dual-write system should be doing.
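The pattern can be tried end to end with the standard-library `sqlite3` module. This is a sketch under assumptions: the table layout, `create_order`, and `relay_once` are hypothetical, SQLite stands in for Postgres (so there is no FOR UPDATE SKIP LOCKED and only one relay), and `publish` is any callable standing in for the broker.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total INTEGER NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        topic TEXT NOT NULL,
        payload TEXT NOT NULL,
        published_at TEXT
    );
    """
)


def create_order(order_id: int, total: int, topic="order.created") -> None:
    # `with conn` commits on success and rolls back if anything raises,
    # so the order row and the outbox row appear together or not at all.
    with conn:
        conn.execute("INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            (topic, json.dumps({"order_id": order_id, "total": total})),
        )


def relay_once(publish) -> int:
    """Publish pending outbox rows in id order; return how many were pending."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox "
        "WHERE published_at IS NULL ORDER BY id LIMIT 100"
    ).fetchall()
    for row_id, row_topic, payload in rows:
        publish(row_topic, payload)  # if this raises, the row stays pending
        with conn:
            conn.execute(
                "UPDATE outbox SET published_at = datetime('now') WHERE id = ?",
                (row_id,),
            )
    return len(rows)


published = []
create_order(1, 10000)
try:
    # topic=None violates NOT NULL on outbox, so the order insert rolls back too.
    create_order(2, 500, topic=None)
except sqlite3.IntegrityError:
    pass
pending = relay_once(lambda t, p: published.append((t, json.loads(p))))
```

Order 2 never appears in `orders`, because its outbox insert failed inside the same transaction, and the relay publishes exactly the one event for order 1.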
Compensating actions (sagas)
When a multi-step workflow fails halfway, you can’t roll back across services (no distributed transactions). Instead, compensate:
Reserve inventory → Charge card → Ship
                    ↓ fail        ↓ fail
                    Release       Refund +
                    inventory     Release inventory
Each step has an inverse. If step 3 fails, run inverses of 1 and 2.
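The compensation rule can be sketched in a few lines. This is for illustration only: `run_saga` and the step names are hypothetical, nothing is persisted, and compensations are assumed never to fail, which is exactly the part a real orchestrator handles for you.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    completed: List[Step] = []
    for step in steps:
        _, action, _ = step
        try:
            action()
        except Exception:
            for _, _, compensate in reversed(completed):
                compensate()
            return False
        completed.append(step)
    return True


log = []


def fail_shipping() -> None:
    raise RuntimeError("carrier unavailable")


ok = run_saga([
    ("reserve", lambda: log.append("reserve inventory"), lambda: log.append("release inventory")),
    ("charge", lambda: log.append("charge card"), lambda: log.append("refund")),
    ("ship", fail_shipping, lambda: None),
])
```

Here shipping fails, so the log ends with the refund followed by the inventory release: the inverses of steps 2 and 1, newest first.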
Frameworks: Temporal, Cadence, Restate, AWS Step Functions. They orchestrate the saga, persist state, retry steps, and drive compensations on failure. For any workflow with more than two steps that touch external services, use a saga framework rather than rolling your own.
Deduplication on the consumer side
When you consume a stream and might see duplicates:
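async def handle(msg):
    seen = await db.fetchval("SELECT 1 FROM consumed WHERE event_id = $1", msg.event_id)
    if seen:
        await msg.ack()  # already processed; ack so the broker stops redelivering
        return
    async with db.transaction():
        await process_event(msg)
        # Assumes a unique constraint on consumed.event_id, so a concurrent
        # duplicate fails here and its work rolls back.
        await db.execute(
            "INSERT INTO consumed (event_id, ts) VALUES ($1, now())",
            msg.event_id,
        )
    await msg.ack()  # only after the commit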
Two important details:
- The consumed insert is in the same transaction as the work.
- The ack is after the commit. If we crash between commit and ack, the next consumer sees the duplicate but the dedup catches it.
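For high-volume streams, a Bloom filter or a TTL’d sorted set in Redis is faster than a DB check. The trade-off is accuracy: a Bloom filter can report a false positive, which silently drops an event that was never processed, and a TTL’d set forgets IDs once they expire.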
Operations and visibility
Whatever you build, instrument it:
- Metric: count of duplicate-detected requests/messages. Spikes mean upstream is retrying more.
- Metric: idempotency-key TTL evictions. Should be near zero in normal operation.
- Metric: outbox lag (oldest unpublished row’s age). Page if over threshold.
- Trace: correlate the original request, the retries, and the eventual success. Without traces, you’ll spend hours on phantom incidents.
Common mistakes
1. Idempotency without atomicity
existing = await db.fetchrow("SELECT * FROM idempotency WHERE key = $1", key)
if existing:
    return existing
# ... race window here ...
result = await process()
await db.execute("INSERT INTO idempotency ...", key, result)
Two concurrent retries with the same key both see “no existing” and both process. The check + insert must be atomic (CTE, upsert, advisory lock).
2. Retries on POST without idempotency keys
POST is not idempotent by default. Retrying without an idempotency mechanism is double charges, double emails, double everything.
3. No retry budget
A flaky downstream causes your service to retry 10×. CPU climbs. Connection pool fills. Now your service is the outage. Bound retries.
4. Outbox without indexing
SELECT * FROM outbox WHERE published_at IS NULL ORDER BY id becomes a sequential scan as the table grows. Partial index:
CREATE INDEX outbox_unpublished
ON outbox (id) WHERE published_at IS NULL;
5. Ignoring “in progress” state
If a duplicate request arrives while the original is still processing, returning “404 not found” or letting it process again is a bug. Mark in_progress, return 409 to retries, or block until done.
A working checklist
For any production service that takes mutating requests:
- Mutating endpoints accept Idempotency-Key.
- Idempotency table with atomic upsert.
- TTL on idempotency rows (24h typical).
- Reject mismatched bodies with 422.
- Outbox table for cross-system writes.
- Outbox relay worker with FOR UPDATE SKIP LOCKED.
- Consumer-side dedup on the events table.
- Client retries: exponential backoff + jitter + budget.
- Don’t retry 4xx (other than 408/425/429).
- Metrics on duplicate count, outbox lag, retry rate.
Read this next
If you want a worked-out idempotency + outbox example for FastAPI and Postgres, it’s at rajpoot.dev.