Idempotency, Retries, and the Exactly-Once Illusion

In a distributed system, every network call has three outcomes: success, failure, or the dreaded unknown. The third one is where bugs live. This post covers production-quality patterns for handling it correctly.

We’ll cover idempotency keys (the way Stripe does it), retry budgets, the outbox pattern, and why “exactly-once delivery” is a marketing term that doesn’t survive contact with reality.

The fundamental problem

Client sends a request. Server processes it. Network drops the response. Now what?

Client    →    POST /payments    →    Server    (charged $100)
       ←    [TIMEOUT, RST, ECONNRESET, ...]    ←

Two interpretations, both consistent with what the client saw:

The request never reached the server. Retry to make sure.
The server processed it; only the response was lost. Retry → double charge.

The client cannot distinguish them. The fix is to make retries safe.

Idempotency keys (the Stripe pattern)

The client generates a unique key per logical operation and sends it on every retry:

POST /payments
Idempotency-Key: ord-2026-04-1234567
{ "amount": 10000, "currency": "INR" }

Server logic:

Look up Idempotency-Key. If found, return the stored response. Don’t re-process.
If not found, process and store (key, response, ttl).

The contract:

Client guarantees the key is unique per logical operation.
Server guarantees identical responses for identical keys.

That’s it. Retries are now safe.
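As a minimal in-memory sketch of that contract (the class `IdempotentServer` and its `_charge` helper are illustrative names, not part of any real API), the server remembers one response per key and replays it on a retry. It is single-threaded and has no TTL or persistence, so it only shows the contract, not a production design:

```python
import itertools


class IdempotentServer:
    """Toy server: stores one response per idempotency key (no TTL, no persistence)."""

    def __init__(self):
        self._responses = {}                    # key -> stored response
        self._charge_ids = itertools.count(1)
        self.charges_made = 0                   # how many times the side effect ran

    def _charge(self, amount):
        # Stand-in for the real side effect (hypothetical).
        self.charges_made += 1
        return {"charge_id": next(self._charge_ids), "amount": amount}

    def create_payment(self, idempotency_key, amount):
        if idempotency_key in self._responses:
            # Known key: replay the stored response, do not re-process.
            return self._responses[idempotency_key]
        response = self._charge(amount)
        self._responses[idempotency_key] = response
        return response


server = IdempotentServer()
first = server.create_payment("ord-2026-04-1234567", 10000)
retry = server.create_payment("ord-2026-04-1234567", 10000)  # client never saw `first`
print(first == retry, server.charges_made)
```

The retry gets the same response back and the card is charged once; a different key would trigger a new charge.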

Implementation

import json

from fastapi import HTTPException, Request, Response

async def create_payment(req: Request, body: PaymentBody, idempotency_key: str):
    # Step 1: claim the key. The unique constraint on `key` makes this atomic:
    # exactly one concurrent request gets a row back from RETURNING.
    claimed = await db.fetchval(
        """
        INSERT INTO idempotency (key, status, created_at)
        VALUES ($1, 'in_progress', now())
        ON CONFLICT (key) DO NOTHING
        RETURNING key
        """,
        idempotency_key,
    )

    if claimed is None:
        # The key existed before. Read it in a separate statement so the lookup
        # also sees a row committed while the INSERT was waiting on the conflict.
        cached = await db.fetchrow(
            "SELECT response_status, response_body FROM idempotency WHERE key = $1",
            idempotency_key,
        )
        if cached is None or cached["response_status"] is None:
            # Another request is in flight; tell the client to retry later
            raise HTTPException(409, "in_progress")
        return Response(
            content=cached["response_body"],
            status_code=cached["response_status"],
            media_type="application/json",
        )

    # Process the actual payment
    result = await charge_card(body)

    # Persist response so retries return it
    response_body = json.dumps(result)
    await db.execute(
        """
        UPDATE idempotency
        SET status = 'completed', response_status = $2, response_body = $3
        WHERE key = $1
        """,
        idempotency_key, 200, response_body,
    )
    # Return the stored bytes so the first response matches every replay
    return Response(content=response_body, status_code=200, media_type="application/json")

Three production details:

The insert is atomic: the unique constraint on key lets exactly one of two concurrent requests create the row.
A row in 'in_progress' state catches concurrent duplicates (rare but real with retries during network blips).
The response body is stored verbatim, so a retry sees a byte-identical response. That matters for clients that hash responses.

Validation

Reject mismatched payloads: if the same key arrives with a different body, return 422. Otherwise, a buggy client that reuses a key would silently get back the stored response for a different request.
Bound the key length (e.g., 1–255 chars).
TTL: 24 hours is typical, and it is what Stripe uses. After expiry, a reused key is treated as a new request.
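One way to detect a mismatched payload is to store a fingerprint of the request body next to the key and compare it on every retry. The sketch below assumes a JSON body; `fingerprint`, `validate`, and the choice of 400 for an out-of-range key length are illustrative assumptions, not something the rules above prescribe:

```python
import hashlib
import json
from typing import Optional

MAX_KEY_LEN = 255


def fingerprint(body: dict) -> str:
    """Hash a canonical JSON encoding so that key order does not matter."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def validate(key: str, body: dict, stored_fingerprint: Optional[str]) -> int:
    """Return an HTTP status: 200 to proceed or replay, 400 or 422 to reject."""
    if not 1 <= len(key) <= MAX_KEY_LEN:
        return 400                      # assumption: bad key length is a plain 400
    if stored_fingerprint is not None and stored_fingerprint != fingerprint(body):
        return 422                      # same key, different payload
    return 200


original = {"amount": 10000, "currency": "INR"}
stored = fingerprint(original)
print(validate("ord-1", original, stored))                                # same body
print(validate("ord-1", {"amount": 20000, "currency": "INR"}, stored))    # changed body
```

Hashing a canonical encoding keeps the stored value small and avoids false mismatches when a client serializes the same fields in a different order.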

Retry strategies (client side)

Retries are not free. Three rules:

1. Exponential backoff with jitter

import asyncio
import random

class RetryableError(Exception):
    """Raised by fn for failures that are safe to retry (timeouts, 503s, ...)."""

async def call_with_retry(fn, max_attempts=5, base_ms=200, max_ms=10_000):
    for attempt in range(1, max_attempts + 1):
        try:
            return await fn()
        except RetryableError:
            if attempt == max_attempts:
                raise                                     # out of attempts: surface the error
            cap_ms = min(max_ms, base_ms * 2 ** (attempt - 1))
            sleep_ms = random.uniform(0, cap_ms)          # full jitter
            await asyncio.sleep(sleep_ms / 1000)

Without jitter, a fleet retries in unison — thundering herd against a recovering downstream. Jitter spreads them out.
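To make the arithmetic concrete, here is the backoff cap per attempt with the same defaults (`base_ms=200`, `max_ms=10_000`). The helper names `backoff_cap_ms` and `full_jitter_ms` are only for this illustration:

```python
import random


def backoff_cap_ms(attempt, base_ms=200, max_ms=10_000):
    """Upper bound on the sleep after failed attempt number `attempt` (1-based)."""
    return min(max_ms, base_ms * 2 ** (attempt - 1))


def full_jitter_ms(attempt, rng=random):
    """Pick a uniform delay in [0, cap], so clients wake up at different times."""
    return rng.uniform(0, backoff_cap_ms(attempt))


caps = [backoff_cap_ms(a) for a in range(1, 8)]
print(caps)  # [200, 400, 800, 1600, 3200, 6400, 10000]
```

The cap doubles until it hits `max_ms`; full jitter then draws the actual sleep uniformly below the cap, which is what spreads a fleet's retries apart.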

2. Don’t retry non-retryable errors

HTTP code	Retry?
408 Request Timeout	Yes
425 Too Early	Yes
429 Too Many Requests	Yes (respect `Retry-After`)
500 Internal Server Error	Yes
502 Bad Gateway	Yes
503 Service Unavailable	Yes
504 Gateway Timeout	Yes
4xx (other)	No
Network errors	Yes

Retrying a 400 from request validation spends budget on a call that will never succeed.
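The table translates directly into a small classifier. This is a sketch: `is_retryable` and `retry_after_seconds` are illustrative helpers, headers are assumed to be a plain dict with the canonical `Retry-After` casing, and only the delay-seconds form of the header is parsed (the HTTP-date form is ignored):

```python
from typing import Optional

# Status codes the table marks as retryable; everything else is treated as final.
RETRYABLE_STATUSES = frozenset({408, 425, 429, 500, 502, 503, 504})


def is_retryable(status: int) -> bool:
    return status in RETRYABLE_STATUSES


def retry_after_seconds(headers: dict) -> Optional[int]:
    """Return the server-requested delay in seconds, or None if absent or unparseable."""
    value = headers.get("Retry-After")
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return int(value)
    return None                 # e.g. an HTTP-date, which this sketch does not handle


print(is_retryable(503), is_retryable(400))
print(retry_after_seconds({"Retry-After": "120"}))
```

Network errors never produce a status code, so a real client would classify them separately, for example by exception type.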

3. Bound the retries

A naive client can DDoS its own backend during a partial outage. Two limits:

Per-request attempts — usually 3–5.
Per-process retry budget — e.g., “no more than 10% of total requests can be retries.” When the budget is exhausted, fail fast.

Retry budgets are described in Google’s SRE book as a defense against retry storms, and gRPC’s retry throttling applies the same idea. Worth borrowing.
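A per-process budget can be as simple as two counters. The sketch below is one possible shape, not how any particular library implements it: `RetryBudget` is a hypothetical class, it counts over the whole process lifetime rather than a rolling window, and it uses integer arithmetic to avoid float rounding at the boundary:

```python
class RetryBudget:
    """Allow a retry only while retries stay within `percent` of requests seen."""

    def __init__(self, percent=10):
        self.percent = percent
        self.requests = 0       # first attempts recorded so far
        self.retries = 0        # retries granted so far

    def record_request(self):
        self.requests += 1

    def try_acquire_retry(self) -> bool:
        # Grant the retry only if it keeps retries <= percent% of requests.
        if (self.retries + 1) * 100 > self.percent * self.requests:
            return False        # budget exhausted: caller should fail fast
        self.retries += 1
        return True


budget = RetryBudget(percent=10)
for _ in range(100):
    budget.record_request()
granted = sum(budget.try_acquire_retry() for _ in range(15))
print(granted)  # 10
```

A production version would usually decay the counters over time (for example with a sliding window) so that an old burst of traffic does not fund retries forever.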

At-most, at-least, exactly-once

Three delivery semantics for messaging:

At-most-once. Send and pray. Fast. Loses messages on failure.
At-least-once. Retry until acknowledged. Never loses; may duplicate.
Exactly-once. “Each message is delivered exactly once.” Marketing copy.

There is no exactly-once over an unreliable network. What you can build is at-least-once delivery + idempotent processing, which together produce effectively-once outcomes. That’s the actual goal.
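A small simulation shows how the two pieces combine. Everything here is a toy under stated assumptions: `deliver_at_least_once` models a sender that redelivers whenever the ack is lost (with a made-up 30% loss rate), and `Account` is a hypothetical consumer that remembers which message IDs it has applied:

```python
import random


def deliver_at_least_once(messages, rng, ack_loss_rate=0.3):
    """Yield each message until its ack 'arrives'; a lost ack causes a redelivery."""
    for msg in messages:
        while True:
            yield msg
            if rng.random() >= ack_loss_rate:   # ack made it back to the sender
                break


class Account:
    """Idempotent consumer: applying the same message twice has no extra effect."""

    def __init__(self):
        self.balance = 0
        self.applied = set()

    def apply(self, msg):
        if msg["id"] in self.applied:
            return                               # duplicate: ignore
        self.applied.add(msg["id"])
        self.balance += msg["amount"]


rng = random.Random(7)
messages = [{"id": i, "amount": 10} for i in range(100)]
deliveries = list(deliver_at_least_once(messages, rng))

account = Account()
for msg in deliveries:
    account.apply(msg)
print(len(deliveries), account.balance)
```

However many duplicates the sender produces, the balance ends at 1000: the delivery is at-least-once, the outcome is effectively-once.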

Kafka’s “exactly-once semantics” (EOS) is a constrained version: idempotent producer writes, plus transactions that atomically commit produced records and consumer offsets, all within Kafka. It does not extend to your downstream HTTP call to Stripe. You still need application-level idempotency.

The outbox pattern

You’ve written a row in your DB and now need to publish an event. If you do them as separate operations, either can fail:

async with db.transaction():
    await db.execute("INSERT INTO orders ...")
await broker.publish("order.created", payload)         # ⛔ if this fails, no event

The fix:

async with db.transaction():
    await db.execute("INSERT INTO orders ...")
    await db.execute(
        "INSERT INTO outbox (topic, payload) VALUES ($1, $2)",
        "order.created", json.dumps(payload),
    )
# Separate worker reads outbox and publishes

A separate worker:

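import asyncio

async def relay_outbox(poll_interval_s=1.0):
    while True:
        # Row locks from FOR UPDATE are released at commit, so the select,
        # the publish, and the update must share one transaction.
        async with db.transaction():
            rows = await db.fetch(
                "SELECT id, topic, payload FROM outbox WHERE published_at IS NULL "
                "ORDER BY id LIMIT 100 FOR UPDATE SKIP LOCKED"
            )
            for r in rows:
                await broker.publish(r["topic"], r["payload"])
                await db.execute("UPDATE outbox SET published_at = now() WHERE id = $1", r["id"])
        if not rows:
            await asyncio.sleep(poll_interval_s)      # nothing to do: avoid a hot loop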

Why this works:

The DB row + outbox row are in the same transaction. They commit or roll back together.
The relay is at-least-once. Consumers must be idempotent — but they should be anyway.
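FOR UPDATE SKIP LOCKED lets multiple relay workers run in parallel, provided each worker selects and marks its rows inside one transaction so the locks are held until the update commits.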

This is the boring, correct pattern. It’s what every dual-write system should be doing.
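The atomicity argument can be checked end to end with an in-memory SQLite database. This is a sketch under assumptions: SQLite stands in for Postgres (so there is no `FOR UPDATE SKIP LOCKED`), the schema is invented for the demo, and `publish` is any callable standing in for the broker:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY,
        topic TEXT NOT NULL,
        payload TEXT NOT NULL,
        published_at TEXT
    );
    """
)


def create_order(item, fail=False):
    with conn:  # one transaction: commits on success, rolls back on exception
        conn.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        if fail:
            raise RuntimeError("simulated crash before commit")
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order.created", json.dumps({"item": item})),
        )


def relay_once(publish):
    """Publish every unpublished row in id order, marking each one afterwards."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published_at IS NULL ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)
        with conn:
            conn.execute(
                "UPDATE outbox SET published_at = datetime('now') WHERE id = ?",
                (row_id,),
            )
    return len(rows)


create_order("book")
try:
    create_order("lamp", fail=True)     # rolls back the order and emits no event
except RuntimeError:
    pass

published = []
relay_once(lambda topic, payload: published.append((topic, payload)))
print(published)
```

The failed order leaves neither an `orders` row nor an `outbox` row, and the successful one produces exactly one event. If the process died between `publish` and the `UPDATE`, the row would be published again on the next pass, which is the at-least-once behavior described above.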

Compensating actions (sagas)

When a multi-step workflow fails halfway, you can’t roll back across services (no distributed transactions). Instead, compensate:

Reserve inventory → Charge card → Ship
       ↓ fail
   Release inventory
                     ↓ fail
              Refund + Release inventory

Each step has an inverse. If step 3 fails, run the inverses of the completed steps in reverse order: first 2, then 1.

Frameworks: Temporal, Cadence, Restate, AWS Step Functions. They orchestrate the saga, persist state, retry steps, and drive compensations on failure. For any workflow with more than two steps that touch external services, use a saga framework rather than rolling your own.
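To show only the control flow those frameworks drive, here is a toy in-process saga runner. `run_saga` and the step functions are hypothetical, and nothing is persisted, so a crash mid-saga would lose track of what to compensate; that gap is exactly what the frameworks fill:

```python
def run_saga(steps):
    """Run (action, compensation) pairs; on failure, undo completed steps in reverse."""
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            raise
        completed.append(compensate)


log = []


def ship():
    raise RuntimeError("carrier unavailable")


steps = [
    (lambda: log.append("reserve inventory"), lambda: log.append("release inventory")),
    (lambda: log.append("charge card"), lambda: log.append("refund")),
    (ship, lambda: log.append("cancel shipment")),
]

try:
    run_saga(steps)
except RuntimeError:
    pass
print(log)
```

The log ends with the refund followed by the inventory release, and the failed step's own compensation never runs because that step never completed.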

Deduplication on the consumer side

When you consume a stream and might see duplicates:

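async def handle(msg):
    async with db.transaction():
        # Claim the event first. This assumes a unique constraint on event_id,
        # which makes the duplicate check atomic with the work below.
        claimed = await db.fetchval(
            "INSERT INTO consumed (event_id, ts) VALUES ($1, now()) "
            "ON CONFLICT (event_id) DO NOTHING RETURNING event_id",
            msg.event_id,
        )
        if claimed is not None:
            await process_event(msg)             # first time we see this event
    # Ack after the commit, for duplicates too, so they stop being redelivered
    await msg.ack()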

Two important details:

The consumed insert is in the same transaction as the work.
The ack is after the commit. If we crash between commit and ack, the message is redelivered, but the dedup check catches it.

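For high-volume streams, a Bloom filter or a TTL’d sorted set in Redis is faster than a DB check, but each has a cost. A Bloom filter saves memory at the price of false positives, so a new event is occasionally mistaken for a duplicate and dropped. A TTL’d set is exact, but it only catches duplicates that arrive within the TTL window.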
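The window trade-off is easy to see in a small in-memory stand-in for a TTL’d set. `TtlDedup` is a hypothetical class, not a Redis client; it purges expired entries with a linear scan on every call, which is fine for a demo but not for real volume:

```python
import time


class TtlDedup:
    """Remember event IDs for `ttl_seconds`; duplicates inside the window are caught."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._expires_at = {}           # event_id -> time at which it is forgotten

    def first_time(self, event_id) -> bool:
        now = self.clock()
        # Forget entries whose window has passed.
        for key in [k for k, t in self._expires_at.items() if t <= now]:
            del self._expires_at[key]
        if event_id in self._expires_at:
            return False                # duplicate within the window
        self._expires_at[event_id] = now + self.ttl
        return True


fake_now = [0.0]                        # injectable clock so the demo is deterministic
dedup = TtlDedup(ttl_seconds=60, clock=lambda: fake_now[0])
print(dedup.first_time("evt-1"))        # first sighting
print(dedup.first_time("evt-1"))        # duplicate inside the window
fake_now[0] = 61.0
print(dedup.first_time("evt-1"))        # window expired: treated as new again
```

The last call is the risk being traded: a duplicate that arrives after the TTL is processed again, so the window has to be longer than the longest redelivery delay you expect.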

Operations and visibility

Whatever you build, instrument it:

Metric: count of duplicate-detected requests/messages. Spikes mean upstream is retrying more.
Metric: idempotency-key TTL evictions. Should be near zero in normal operation.
Metric: outbox lag (oldest unpublished row’s age). Page if over threshold.
Trace: correlate the original request, the retries, and the eventual success. Without traces, you’ll spend hours on phantom incidents.

Common mistakes

1. Idempotency without atomicity

existing = await db.fetchrow("SELECT * FROM idempotency WHERE key = $1", key)
if existing:
    return existing
# ... race window here ...
result = await process()
await db.execute("INSERT INTO idempotency ...", key, result)

Two concurrent retries with the same key both see “no existing” and both process. The check + insert must be atomic (CTE, upsert, advisory lock).

2. Retries on POST without idempotency keys

POST is not idempotent by default. Retrying without an idempotency mechanism leads to double charges, double emails, double everything.

3. No retry budget

A flaky downstream causes your service to retry 10×. CPU climbs. Connection pool fills. Now your service is the outage. Bound retries.

4. Outbox without indexing

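SELECT * FROM outbox WHERE published_at IS NULL ORDER BY id gets slower as the table grows, because the scan has to step over every already-published row. A partial index keeps the lookup proportional to the unpublished backlog: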

CREATE INDEX outbox_unpublished
  ON outbox (id) WHERE published_at IS NULL;

5. Ignoring “in progress” state

If a duplicate request arrives while the original is still processing, returning “404 not found” or letting it process again is a bug. Mark in_progress, return 409 to retries, or block until done.

A working checklist

For any production service that takes mutating requests:

Mutating endpoints accept Idempotency-Key.
Idempotency table with atomic upsert.
TTL on idempotency rows (24h typical).
Reject mismatched bodies with 422.
Outbox table for cross-system writes.
Outbox relay worker with FOR UPDATE SKIP LOCKED.
Consumer-side dedup keyed on event ID (the consumed table).
Client retries: exponential backoff + jitter + budget.
Don’t retry 4xx (other than 408/425/429).
Metrics on duplicate count, outbox lag, retry rate.

Read this next

If you want a worked-out idempotency + outbox example for FastAPI and Postgres, it’s at rajpoot.dev .

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .
