Sooner or later you ask: “How do I make a multi-service operation atomic?” The honest answer: you don’t. You compensate. This post is the working set for cross-service consistency.
Why 2PC mostly doesn’t fit
Two-phase commit:
Coordinator: "Prepare?" → all services
Each service: ← "Prepared" / "Abort"
Coordinator: "Commit" → all services (or "Abort" if any participant voted to abort)
Each service: ← "Committed"
Issues:
- Every participant must support 2PC. Most modern services (Stripe, SendGrid, your own HTTP services) don’t.
- Blocking: if the coordinator dies between prepare and commit, participants are stuck holding locks.
- Tight coupling: every participant must stay reachable and honor its promise to commit, so one slow or failed service stalls the whole transaction.
- Performance: extra round trips per transaction.
For inside-the-DB transactions across shards: 2PC works. For cross-service distributed transactions: rarely.
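To make the blocking problem concrete, here is a toy in-memory simulation. It is a sketch, not a real protocol implementation: `Participant`, `two_phase_commit`, and the `crash_after_prepare` flag are hypothetical names invented for this illustration, and a real participant would be holding database locks rather than a string.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical in-memory participant; a real one would hold DB locks."""
    name: str
    can_commit: bool = True
    state: str = "idle"  # idle -> prepared -> committed / aborted

    def prepare(self) -> bool:
        # Voting yes is a promise to commit later, so locks stay held.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants, crash_after_prepare=False):
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    if crash_after_prepare:
        # Coordinator dies here: nobody tells the participants what to do.
        return "coordinator crashed"
    # Phase 2: commit only if every vote was yes, otherwise abort everyone.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"


services = [Participant("inventory"), Participant("payment")]
outcome = two_phase_commit(services, crash_after_prepare=True)
stuck = [p.name for p in services if p.state == "prepared"]
print(outcome, stuck)
```

Both participants end up in `prepared` with no way to decide on their own, which is the blocking issue from the list above.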
Sagas: the alternative
A saga is a series of local transactions, each with a compensating action that reverses it.
Order saga:
1. Reserve inventory → compensate: release inventory
2. Charge payment → compensate: refund
3. Ship order → compensate: cancel shipment
4. Send confirmation → compensate: send cancellation
If step 3 fails: run compensations for 2 and 1.
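The order saga above can be sketched as a list of (name, action, compensation) triples plus a small runner. This is a minimal synchronous illustration: `run_saga`, `fail_shipping`, and the no-op lambdas are hypothetical stand-ins for real service calls.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
            log.append(f"done:{name}")
            completed.append(step)
        except Exception:
            log.append(f"failed:{name}")
            # Undo only the steps that finished, newest first.
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            return False
    return True


def fail_shipping() -> None:
    raise RuntimeError("carrier unavailable")  # simulated failure at step 3


log: List[str] = []
ok = run_saga(
    [
        ("reserve_inventory", lambda: None, lambda: None),
        ("charge_payment", lambda: None, lambda: None),
        ("ship_order", fail_shipping, lambda: None),
        ("send_confirmation", lambda: None, lambda: None),
    ],
    log,
)
print(ok, log)
```

The log shows the payment compensation running before the inventory release: compensations run in reverse order of the steps that completed, and the failed step itself is not compensated.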
Orchestration
A central coordinator drives the saga:
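async def order_saga(order_id):
    # Record each step's result so compensate() knows what to undo.
    state = SagaState(order_id)
    try:
        state.inventory = await reserve_inventory(order_id)
        state.payment = await charge_payment(order_id)
        state.shipment = await ship_order(order_id)
        await send_confirmation(order_id)
    except Exception:
        # Undo the completed steps in reverse order, then surface the error.
        await compensate(state)
        raise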
Pros: explicit; debuggable; one place to read the flow.
Cons: the coordinator sits on the critical path of every saga, and it must persist its state to survive a crash.
Choreography
Services react to events:
order.created → InventoryService listens → reserves
→ emits inventory.reserved
inventory.reserved → PaymentService listens → charges
→ emits payment.captured
...
Pros: loosely coupled; each service self-contained.
Cons: hard to debug; flow is implicit; “who reacts to what” scattered.
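A minimal in-memory sketch of the choreography flow above, assuming a toy event bus. `subscribe`, `emit`, and `drain` are hypothetical helpers written for this example, not the API of any real broker.

```python
from collections import defaultdict, deque
from typing import Callable, Dict, List

handlers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)
queue: deque = deque()
trace: List[str] = []


def subscribe(event_type: str, handler: Callable[[dict], None]) -> None:
    handlers[event_type].append(handler)


def emit(event_type: str, payload: dict) -> None:
    queue.append((event_type, payload))


def drain() -> None:
    # Deliver events in FIFO order until no handler emits anything new.
    while queue:
        event_type, payload = queue.popleft()
        trace.append(event_type)
        for handler in handlers[event_type]:
            handler(payload)


# Each "service" only knows which event it consumes and which it emits.
subscribe("order.created", lambda e: emit("inventory.reserved", e))
subscribe("inventory.reserved", lambda e: emit("payment.captured", e))

emit("order.created", {"order_id": "o-1"})
drain()
print(trace)
```

Nothing in this code states the full flow in one place. It only emerges from the subscriptions, which is the debugging cost listed in the cons.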
For most teams: orchestration is the default. Choreography for high-scale or when coupling truly hurts.
Persistent saga state
CREATE TABLE sagas (
    id          uuid PRIMARY KEY,
    type        text NOT NULL,
    state       text NOT NULL,  -- 'pending', 'compensating', 'completed', 'failed'
    payload     jsonb NOT NULL,
    started_at  timestamptz DEFAULT now(),
    updated_at  timestamptz DEFAULT now()
);

CREATE TABLE saga_steps (
    saga_id  uuid REFERENCES sagas(id),
    step     int NOT NULL,
    name     text NOT NULL,
    status   text NOT NULL,  -- 'pending', 'completed', 'failed', 'compensated'
    result   jsonb,
    PRIMARY KEY (saga_id, step)
);
After every step, persist. On crash: rehydrate, continue from where you stopped.
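Here is a sketch of "persist after every step, resume after a crash", using the standard library's `sqlite3` as a stand-in for Postgres and a trimmed-down `saga_steps` table. The `run` function and its `crash_before` parameter are hypothetical and exist only to simulate a process dying mid-saga.

```python
import sqlite3
from typing import List, Optional

STEPS = ["reserve_inventory", "charge_payment", "ship_order", "send_confirmation"]

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE saga_steps (saga_id TEXT, step INTEGER, name TEXT, status TEXT,"
    " PRIMARY KEY (saga_id, step))"
)


def run(saga_id: str, crash_before: Optional[str] = None) -> List[str]:
    """Run the steps not yet recorded as completed; return their names."""
    done = {
        row[0]
        for row in db.execute(
            "SELECT step FROM saga_steps WHERE saga_id = ? AND status = 'completed'",
            (saga_id,),
        )
    }
    executed: List[str] = []
    for index, name in enumerate(STEPS, start=1):
        if index in done:
            continue  # finished before the crash; skip on resume
        if name == crash_before:
            raise RuntimeError(f"simulated crash before {name}")
        executed.append(name)  # the real service call would go here
        with db:  # commit the step record before moving on
            db.execute(
                "INSERT INTO saga_steps VALUES (?, ?, ?, 'completed')",
                (saga_id, index, name),
            )
    return executed


try:
    run("saga-1", crash_before="ship_order")
except RuntimeError:
    pass  # the process "restarts" below

resumed = run("saga-1")
print(resumed)
```

The second call skips the two recorded steps and picks up at `ship_order`. A crash between the service call and the insert would re-run that step on resume, so each step still needs to be idempotent.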
Compensations need idempotency
Compensation might run twice (retry, partial failure). Make sure refunding twice doesn’t double-refund:
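import asyncio
import stripe

async def refund_payment(payment_id, idempotency_key):
    # Local guard: skip the call if we already recorded this refund.
    if await already_refunded(payment_id):
        return
    # stripe.Refund.create is a blocking call, so run it in a worker thread.
    # The idempotency key lets Stripe deduplicate retried requests.
    await asyncio.to_thread(
        stripe.Refund.create,
        payment_intent=payment_id,
        idempotency_key=idempotency_key,
    )
    await mark_refunded(payment_id)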
See Idempotency.
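The key only helps if every retry sends the same one. One way to get that, sketched here as an assumption rather than a prescribed scheme, is to derive the key deterministically from the saga and step. `SAGA_NAMESPACE` and the `idempotency_key` helper are hypothetical names.

```python
import uuid

# Hypothetical namespace for this application; any fixed UUID works.
SAGA_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def idempotency_key(saga_id: str, step: str, action: str) -> str:
    # uuid5 is a deterministic hash: the same inputs always give the same key.
    return str(uuid.uuid5(SAGA_NAMESPACE, f"{saga_id}:{step}:{action}"))


first = idempotency_key("saga-1", "charge_payment", "compensate")
retry = idempotency_key("saga-1", "charge_payment", "compensate")
other = idempotency_key("saga-2", "charge_payment", "compensate")
print(first == retry, first == other)  # True False
```

Because the key is computed rather than stored, a restarted process that lost its in-memory state still produces the same key for the same compensation.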
Compensations aren’t always possible
What if “ship order” already happened and the truck has left?
Options:
- Mark for return (asynchronous compensation).
- Refund and apologize (semantic compensation).
- Prevent the failure modes that would require non-recoverable compensation (validate harder upfront).
Compensations are best-effort. Some operations can’t be undone. Plan for it.
Outbox pattern
Combine local transaction + event publishing atomically:
BEGIN;
INSERT INTO orders ...;
INSERT INTO outbox (event_type, payload) VALUES ('order.created', ...);
COMMIT;
A worker reads from outbox and publishes to your event bus:
import asyncio

async def outbox_worker():
    while True:
        events = await db.fetch(
            "SELECT * FROM outbox WHERE published_at IS NULL LIMIT 100"
        )
        for e in events:
            # Publish first, then mark. A crash in between re-publishes the
            # event on restart, so delivery is at-least-once.
            await bus.publish(e["event_type"], e["payload"])
            await db.execute("UPDATE outbox SET published_at = now() WHERE id = $1", e["id"])
        await asyncio.sleep(0.1)
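Either the order row and the outbox row are both written or neither is (the local transaction guarantees it). Event delivery is asynchronous and at-least-once: the worker can publish an event and crash before marking it published, so consumers must tolerate duplicates.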
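The atomicity claim can be checked end to end with a small runnable sketch. It uses `sqlite3` in place of Postgres; `create_order`, `relay_once`, and the `fail` flag are hypothetical names for this illustration.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY)")
db.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " event_type TEXT, payload TEXT, published_at TEXT)"
)


def create_order(order_id: str, fail: bool = False) -> None:
    # `with db` commits on success and rolls back if the block raises.
    with db:
        db.execute("INSERT INTO orders (id) VALUES (?)", (order_id,))
        if fail:
            raise RuntimeError("simulated failure before commit")
        db.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("order.created", json.dumps({"order_id": order_id})),
        )


def relay_once(publish) -> int:
    """Publish every pending outbox row once; return how many were sent."""
    rows = db.execute(
        "SELECT id, event_type, payload FROM outbox"
        " WHERE published_at IS NULL ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))
        with db:
            db.execute(
                "UPDATE outbox SET published_at = datetime('now') WHERE id = ?",
                (row_id,),
            )
    return len(rows)


create_order("o-1")
try:
    create_order("o-2", fail=True)
except RuntimeError:
    pass  # the orders row for o-2 was rolled back with the transaction

sent = []
relay_once(lambda event_type, payload: sent.append((event_type, payload)))
print(sent)
```

Only `o-1` produces an event. The failed `o-2` leaves neither an order row nor an outbox row behind.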
Inbox pattern
Mirror on the consumer side:
CREATE TABLE inbox (
    id            uuid PRIMARY KEY,
    received_at   timestamptz DEFAULT now(),
    processed_at  timestamptz
);
INSERT INTO inbox (id) VALUES ($1) ON CONFLICT DO NOTHING RETURNING id;
-- if returned, it's new; process. if not, already seen.
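Idempotent consumption: the same event delivered twice is processed once. Run the inbox insert and the event’s side effects in one local transaction, so a crash cannot record an event as seen without processing it.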
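A runnable sketch of the inbox check, again with `sqlite3` standing in for Postgres. SQLite's `INSERT OR IGNORE` plays the role of `ON CONFLICT DO NOTHING`, and the `balances` table and `handle_payment_captured` handler are hypothetical examples of a side effect.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE inbox (id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER)")
db.execute("INSERT INTO balances VALUES ('acct-1', 0)")
db.commit()


def handle_payment_captured(event_id: str, account: str, amount: int) -> bool:
    """Apply the event once; return False if it was already processed."""
    with db:  # inbox insert and side effect commit together or not at all
        cur = db.execute("INSERT OR IGNORE INTO inbox (id) VALUES (?)", (event_id,))
        if cur.rowcount == 0:
            return False  # duplicate delivery: the row already existed
        db.execute(
            "UPDATE balances SET amount = amount + ? WHERE account = ?",
            (amount, account),
        )
        return True


results = [
    handle_payment_captured("evt-1", "acct-1", 50),
    handle_payment_captured("evt-1", "acct-1", 50),  # redelivery
]
balance = db.execute("SELECT amount FROM balances").fetchone()[0]
print(results, balance)
```

The redelivered event is detected by the primary-key conflict, so the balance is credited once.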
Workflow engines
For complex multi-step operations:
- Temporal: durable workflows; built-in retries, timers, signals.
- Cadence: predecessor of Temporal.
- Camunda: BPMN-based.
- AWS Step Functions: managed.
These handle saga state management for you. See Temporal Workflow Engine.
Event sourcing fit
Sagas pair well with event sourcing: saga state is itself a stream of events. See Event Sourcing.
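As a small illustration of saga state being a stream of events, the current state can be rebuilt by folding over the saga's event log. The event names, `TRANSITIONS` table, and `apply` function below are hypothetical and chosen only for this sketch.

```python
from functools import reduce

# Hypothetical event names; each event records one fact about the saga.
TRANSITIONS = {
    "saga.started": "pending",
    "step.failed": "compensating",
    "compensation.finished": "failed",
    "saga.finished": "completed",
}


def apply(state: dict, event: dict) -> dict:
    # Return a new state instead of mutating, so replay is repeatable.
    new_state = dict(state)
    if event["type"] in TRANSITIONS:
        new_state["status"] = TRANSITIONS[event["type"]]
    if event["type"] == "step.completed":
        new_state["completed_steps"] = state["completed_steps"] + [event["step"]]
    return new_state


events = [
    {"type": "saga.started"},
    {"type": "step.completed", "step": "reserve_inventory"},
    {"type": "step.completed", "step": "charge_payment"},
    {"type": "step.failed", "step": "ship_order"},
]
initial = {"status": "new", "completed_steps": []}
current = reduce(apply, events, initial)
print(current)
```

Replaying the log yields a saga in `compensating` with two completed steps, which is what a recovering coordinator needs to know which compensations to run.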
Common mistakes
1. 2PC across services
You make all services pretend to support 2PC; coordinator failures lock everything; reliability tanks. Use sagas.
2. No compensation for some steps
“This step always succeeds.” Until it doesn’t. Plan compensation for every step.
3. Compensations not idempotent
Retry refunds → double refund. Always pass an idempotency key.
4. No persistent saga state
Process crashes mid-saga; state lost; orphaned partial transactions. Persist after every step.
5. Optimism
“Network never fails between us.” It will. Build for retries from day one.
What I’d ship today
For multi-service operations:
- Saga (orchestration) for clarity.
- Persistent saga state in your DB.
- Idempotent compensations.
- Outbox pattern for atomic local-write + event-emit.
- Temporal if the workflows get complex.
- Tracing through the whole saga via distributed tracing.
Read this next
- Idempotency, Retries, and Exactly-Once Illusions
- Event Sourcing 2026
- Temporal Workflow Engine 2026
- Distributed Tracing 2026
If you want my saga + outbox starter (Postgres + Python), it’s at rajpoot.dev.