What is durable execution?

Code that survives any crash, restart, or network failure and resumes from exactly where it left off. The platform persists every step's intent and result; if a worker dies, another picks up from the same point. The application logic doesn't have to think about it.

When should I use Temporal vs a queue?

Queues handle one job. Temporal handles workflows — multi-step processes that span seconds to months, with retries, branches, and compensations. Use a queue (NATS / RabbitMQ / Celery) for fire-and-forget. Use Temporal when the workflow has more than two steps, dependencies between them, and must complete reliably.

Is Temporal overkill for small projects?

Often yes. For a small startup with a single Python service and a queue of background jobs, Celery / arq / Dramatiq is simpler. Reach for Temporal when you have payments, agent orchestration, multi-day flows, or you've felt the pain of writing your own retry / saga code.

Temporal and Durable Execution in 2026 — The Reliability Layer

By 2026 “workflow engine” is becoming as standard as “message queue” in production stacks. Temporal leads the category — $5B valuation in February, OpenAI / Block / Yum running it for mission-critical paths, and a model called durable execution that’s quietly reshaping how we build reliable systems.

This post is the working knowledge.

What durable execution is

Most application code looks like:

def process_order(order_id):
    charge = stripe.charge(order_id)
    inventory.reserve(order_id)
    shipping.schedule(order_id)
    notifier.send_confirmation(order_id)

Now ask: what happens if the process crashes between stripe.charge and inventory.reserve? Charged but no inventory reserved. You write retries. You write compensations. You write idempotency keys. You build a state machine in Postgres. You debug it for six months.
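To make that cost concrete, here is a minimal sketch of the hand-rolled checkpointing you end up writing. The in-memory `completed_steps` dict is a hypothetical stand-in for a Postgres table, and the step names are invented for illustration:

```python
from typing import Callable, Optional

# Hypothetical stand-in for a Postgres table: order_id -> names of finished steps.
completed_steps: dict[str, set[str]] = {}
calls: list[str] = []  # records every side effect that really ran


def run_step(order_id: str, name: str, action: Callable[[str], None]) -> None:
    """Run `action` once per order; skip it if an earlier attempt finished it."""
    done = completed_steps.setdefault(order_id, set())
    if name in done:
        return
    action(order_id)
    # A crash between the action and this write repeats the action on retry,
    # which is why real code also needs idempotency keys.
    done.add(name)


def process_order(order_id: str, crash_after: Optional[str] = None) -> None:
    for name in ["charge", "reserve", "ship", "notify"]:
        run_step(order_id, name, lambda oid, n=name: calls.append(f"{n}:{oid}"))
        if name == crash_after:
            raise RuntimeError(f"simulated crash after {name}")


try:
    process_order("o-1", crash_after="charge")
except RuntimeError:
    pass
process_order("o-1")  # the retry skips "charge" and runs the remaining steps
print(calls)
```

Even this toy leaves out timeouts, backoff, compensation, and the question of who re-invokes `process_order` after the crash.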

Durable execution flips this. The platform persists every step. If a worker crashes mid-flow, another worker resumes exactly where it left off. Your code looks like a regular function — the platform handles the durability:

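from datetime import timedelta
from temporalio import workflow

# stripe_charge, reserve_inventory, etc. are activities defined elsewhere.

@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order_id: str):
        # Timeouts are timedelta values, not bare numbers
        await workflow.execute_activity(stripe_charge, order_id, schedule_to_close_timeout=timedelta(seconds=60))
        await workflow.execute_activity(reserve_inventory, order_id, schedule_to_close_timeout=timedelta(seconds=30))
        await workflow.execute_activity(schedule_shipping, order_id, schedule_to_close_timeout=timedelta(seconds=30))
        await workflow.execute_activity(send_confirmation, order_id, schedule_to_close_timeout=timedelta(seconds=10))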

Logically, this function runs once and to completion, even if the process running it dies a hundred times.

How it actually works

Temporal records an event history for each workflow:

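1. WorkflowExecutionStarted
2. WorkflowTaskScheduled
3. WorkflowTaskStarted        → worker picked up
4. WorkflowTaskCompleted      → workflow asked to run "stripe_charge"
5. ActivityTaskScheduled      → "stripe_charge"
6. ActivityTaskStarted
7. ActivityTaskCompleted      → result: { charge_id: "..." }
8. WorkflowTaskScheduled
9. WorkflowTaskStarted
10. WorkflowTaskCompleted     → workflow asked to run "reserve_inventory"
11. ActivityTaskScheduled     → "reserve_inventory"
   ... worker dies here; the attempt times out and the server retries it ...
12. ActivityTaskStarted       → the attempt that finally ran (earlier attempts are not separate events)
13. ActivityTaskCompleted
14. ...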

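When a worker resumes a workflow, it re-runs the workflow function from the top and replays the history to rebuild the in-memory state. Activities that already completed are not executed again; their recorded results are returned straight from the history. From the application’s perspective, time skipped forward; the function picks up after the last completed step.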
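Replay is easier to see in a toy model. The sketch below is not how the Temporal SDK is implemented; `ToyRuntime`, `WorkerCrash`, and the activity functions are hypothetical names. It only shows the idea that recorded results are returned from history and new steps run for real:

```python
from typing import Any, Callable


class WorkerCrash(Exception):
    """Simulates a worker process dying mid-workflow."""


class ToyRuntime:
    """Toy model of replay, for illustration only."""

    def __init__(self, history: list[tuple[str, Any]]) -> None:
        self.history = history  # durable: survives worker crashes
        self.position = 0       # in-memory: starts at zero on every replay

    def execute_activity(self, name: str, fn: Callable[[], Any]) -> Any:
        if self.position < len(self.history):
            recorded_name, result = self.history[self.position]
            assert recorded_name == name, "workflow code diverged from history"
        else:
            result = fn()                        # first execution: run for real
            self.history.append((name, result))  # then persist the result
        self.position += 1
        return result


executions: list[str] = []


def charge() -> str:
    executions.append("charge")
    return "ch_123"


def reserve(crash: bool) -> str:
    if crash:
        raise WorkerCrash()
    executions.append("reserve")
    return "reserved"


def order_workflow(rt: ToyRuntime, crash: bool) -> str:
    charge_id = rt.execute_activity("charge", charge)
    rt.execute_activity("reserve", lambda: reserve(crash))
    return f"fulfilled with {charge_id}"


history: list[tuple[str, Any]] = []
try:
    order_workflow(ToyRuntime(history), crash=True)  # first worker dies in reserve
except WorkerCrash:
    pass
result = order_workflow(ToyRuntime(history), crash=False)  # second worker replays
print(result, executions)
```

The second run calls `order_workflow` from the top, yet `charge` executes only once because its result is already in `history`.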

The key constraint: workflow code must be deterministic. No datetime.now(), no random.random(), no I/O. Use workflow.now(), workflow.uuid4(), workflow.execute_activity() — Temporal-provided primitives that produce the same result on replay.
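Why determinism matters shows up as soon as a branch depends on a random draw. In this sketch (all names are hypothetical, and the check is a simplification of what a real replay does), the replayed code must request the same steps the history recorded:

```python
import random
from typing import Callable


class NonDeterminismError(Exception):
    """Raised when replayed code asks for different steps than the history holds."""


class FixedDraw:
    """Stub RNG standing in for an unseeded random.random() call."""

    def __init__(self, value: float) -> None:
        self.value = value

    def random(self) -> float:
        return self.value


def coupon_workflow(schedule: Callable[[str], None], rng) -> None:
    schedule("charge")
    if rng.random() < 0.5:  # the branch depends on the draw
        schedule("send_coupon")


def replay(recorded: list[str], rng) -> list[str]:
    requested: list[str] = []
    coupon_workflow(requested.append, rng)
    if requested != recorded:
        raise NonDeterminismError(f"history has {recorded}, code asked for {requested}")
    return requested


# Unseeded case: the first run drew 0.1, the replay draws 0.9.
first_run: list[str] = []
coupon_workflow(first_run.append, FixedDraw(0.1))
try:
    replay(first_run, FixedDraw(0.9))
    diverged = False
except NonDeterminismError:
    diverged = True

# Seeded case: the same seed yields the same draw, so the replay matches.
seed = 1234
seeded_run: list[str] = []
coupon_workflow(seeded_run.append, random.Random(seed))
replayed = replay(seeded_run, random.Random(seed))
print(diverged, replayed == seeded_run)
```

The seeded variant is the shape of the fix: a generator whose state can be reproduced on replay, which is what the Temporal-provided primitives give you.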

When to reach for Temporal

Strong fits:

Payments + side effects. Charge, reserve, fulfill, notify. Each step matters.
Agent orchestration. AI agent calls tools across systems with retries, timeouts, human-in-the-loop. (See AI Agents with LangGraph.)
Long-running flows. Subscription renewals, fraud reviews that span days, multi-step ML training.
Sagas. Multi-service operations with compensations on failure.
Periodic jobs that must complete. Replace cron + queue + retry logic with one workflow definition.

Weak fits:

Stateless request/response APIs.
Pure data pipelines (use Airflow / Dagster).
High-throughput simple jobs (a million-per-second event processor — use Kafka + plain consumers).

A working example (Python SDK)

# activities.py
from temporalio import activity
from typing import TypedDict


class ChargeResult(TypedDict):
    charge_id: str
    amount: int


@activity.defn
async def stripe_charge(order_id: str) -> ChargeResult:
    # Real call to Stripe; raises on error → Temporal retries per policy
    ...


@activity.defn
async def reserve_inventory(order_id: str) -> None:
    ...


@activity.defn
async def refund_charge(charge_id: str) -> None:
    ...

# workflow.py
from datetime import timedelta
from temporalio import workflow
from temporalio.common import RetryPolicy

with workflow.unsafe.imports_passed_through():
    from .activities import stripe_charge, reserve_inventory, refund_charge


@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order_id: str) -> str:
        retry = RetryPolicy(maximum_attempts=5, initial_interval=timedelta(seconds=1))

        charge = await workflow.execute_activity(
            stripe_charge, order_id,
            start_to_close_timeout=timedelta(seconds=60),
            retry_policy=retry,
        )

        try:
            await workflow.execute_activity(
                reserve_inventory, order_id,
                start_to_close_timeout=timedelta(seconds=30),
                retry_policy=retry,
            )
        except Exception:
            # Compensation — refund the charge
            await workflow.execute_activity(
                refund_charge, charge["charge_id"],
                start_to_close_timeout=timedelta(seconds=30),
            )
            raise

        return f"order {order_id} fulfilled"

# worker.py
import asyncio
from temporalio.client import Client
from temporalio.worker import Worker
from .workflow import OrderWorkflow
from .activities import stripe_charge, reserve_inventory, refund_charge


async def main():
    client = await Client.connect("temporal:7233")
    worker = Worker(
        client,
        task_queue="orders",
        workflows=[OrderWorkflow],
        activities=[stripe_charge, reserve_inventory, refund_charge],
    )
    await worker.run()


if __name__ == "__main__":
    asyncio.run(main())

# starter (e.g. from FastAPI)
client = await Client.connect("temporal:7233")
handle = await client.start_workflow(
    OrderWorkflow.run,
    order_id,
    id=f"order-{order_id}",
    task_queue="orders",
)
result = await handle.result()

That’s a complete durable order pipeline. The crash story is free — kill the worker mid-flow, restart, the workflow continues.

The saga pattern, simplified

The “if step 3 fails, compensate steps 2 and 1” pattern that takes hundreds of lines without a workflow engine becomes a try/except with await workflow.execute_activity(compensate_X, ...) calls in the except block. Temporal makes the compensation logic indistinguishable from happy-path logic.

I covered the principle in Idempotency, Retries, and Exactly-Once Illusions. Temporal automates 80% of the boilerplate.
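For sagas with more than one compensation, a common shape is a stack of undo steps that unwinds in reverse. The sketch below uses plain asyncio so it runs anywhere; `step` and `run_saga` are hypothetical helpers, and inside a workflow each awaited call would be a `workflow.execute_activity(...)`:

```python
import asyncio
from typing import Awaitable, Callable

Step = Callable[[], Awaitable[None]]
log: list[str] = []


def step(name: str, fail: bool = False) -> Step:
    """Build a hypothetical async step that logs itself, optionally failing."""
    async def run() -> None:
        if fail:
            raise RuntimeError(f"{name} failed")
        log.append(name)
    return run


async def run_saga(steps: list[tuple[Step, Step]]) -> None:
    """Run (action, compensation) pairs; on failure, undo finished steps in reverse."""
    compensations: list[Step] = []
    try:
        for action, compensate in steps:
            await action()
            compensations.append(compensate)  # registered only after success
    except Exception:
        for compensate in reversed(compensations):
            await compensate()
        raise


async def main() -> None:
    try:
        await run_saga([
            (step("charge"), step("refund")),
            (step("reserve"), step("release")),
            (step("ship", fail=True), step("cancel_shipment")),
        ])
    except RuntimeError:
        pass  # the saga re-raises after compensating


asyncio.run(main())
print(log)
```

Registering a compensation only after its action succeeds keeps the unwind from undoing work that never happened.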

Temporal for AI agents

Agentic workflows are perfect for Temporal:

Long-running (LLM calls take seconds; workflows take minutes/hours).
Multi-step with branches.
Tool calls that fail and need retries.
Human-in-the-loop pauses.

from datetime import timedelta
from temporalio import workflow
from temporalio.common import RetryPolicy

# llm_plan, run_tool and llm_synthesize are activities defined elsewhere.

@workflow.defn
class ResearchAgent:
    def __init__(self) -> None:
        # Per-execution state, set by the signal handlers below
        self._approved = False
        self._rejected = False

    @workflow.run
    async def run(self, question: str) -> str:
        plan = await workflow.execute_activity(
            llm_plan, question, start_to_close_timeout=timedelta(seconds=30),
        )

        results = []
        for step in plan["steps"]:
            if step.get("require_approval"):
                # Pause before the risky tool call until an external signal arrives
                await workflow.wait_condition(lambda: self._approved or self._rejected)
                if self._rejected:
                    return "user rejected"
                self._approved = False  # each risky step needs its own approval

            r = await workflow.execute_activity(
                run_tool, step,
                start_to_close_timeout=timedelta(seconds=60),
                retry_policy=RetryPolicy(maximum_attempts=3),
            )
            results.append(r)

        answer = await workflow.execute_activity(
            llm_synthesize, {"question": question, "results": results},
            start_to_close_timeout=timedelta(seconds=60),
        )
        return answer

    @workflow.signal
    def approve(self) -> None:
        self._approved = True

    @workflow.signal
    def reject(self) -> None:
        self._rejected = True

This single workflow has:

LLM planning step.
Tool execution loop with retries.
Human approval (signal-based) for risky tool calls.
Final synthesis.

If the worker crashes during step 3 of 7, a fresh worker resumes at step 3. State, including which tools have run, is reconstructed from the event history.

Temporal vs alternatives

	Temporal	Airflow	AWS Step Functions	Inngest	Restate
Code-first	Yes	Python DSL	Visual / JSON	Yes	Yes
Long-running	Excellent	Limited	Excellent	Excellent	Excellent
Self-host	Yes	Yes	No	Yes (cloud + OSS)	Yes
AI agent fit	Excellent	Mediocre	Mediocre	Strong	Strong
SDKs	Go, Java, Python, TS, .NET, Ruby, PHP	Python	Many	TS first	Multi-language
Maturity	High	High	High	Mid	Newer

Pick:

Temporal for serious production workflows that span services and time.
Airflow / Dagster for batch data pipelines (ETL).
Step Functions if you’re fully on AWS and prefer managed.
Inngest for TS-first lightweight workflows.
Restate for the newest, simpler durable execution model with a deeper distributed-transactions story.

Operating Temporal

You can run Temporal three ways:

Temporal Cloud — managed. ~$200+/month minimum but no ops.
Self-hosted on Kubernetes — Helm chart works. Cassandra or Postgres backend.
Temporal CLI dev server (temporal server start-dev, formerly Temporalite), for local dev only.

Operationally:

Scaling. Workers are horizontal; servers depend on the persistence backend.
Observability. Built-in UI shows every workflow execution. Add OpenTelemetry for cross-service traces; see OpenTelemetry End-to-End.
Versioning. Long-running workflows are sensitive to code changes, because new code replayed against an old history can fail with a non-determinism error. Use workflow.patched() or Worker Versioning to evolve workflow code safely.
Retention. Configure per namespace how long completed workflow histories are kept. The default depends on the deployment (30 days is the Temporal Cloud default), so set it explicitly.

Common mistakes

1. Doing I/O in workflow code

@workflow.run
async def run(self):
    response = httpx.get("https://...")        # ⛔ I/O in workflow

Workflow code must be deterministic. I/O goes in activities. Activities can do anything.

2. Using `datetime.now()`

now = datetime.now()                           # ⛔
now = workflow.now()                           # ✅

Workflow code must produce the same values on replay. Use Temporal-provided primitives.

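3. Relying on the default retry policy

await workflow.execute_activity(act, ...)      # default = retry indefinitely with backoff

Without an explicit policy, activities retry with exponential backoff and no attempt limit, bounded only by the activity timeouts. A permanent failure (bad input, a declined card) then retries until the timeout expires. Set retry_policy=RetryPolicy(...) with maximum_attempts and non_retryable_error_types so those failures surface quickly.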
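To choose sensible values, it helps to see the wait schedule a policy produces. This is a rough sketch of plain exponential backoff; the parameter names mirror RetryPolicy fields, but `retry_delays` is a hypothetical helper and the arithmetic is my reading of the policy, not code taken from Temporal:

```python
from datetime import timedelta


def retry_delays(
    initial_interval: timedelta,
    backoff_coefficient: float,
    maximum_interval: timedelta,
    maximum_attempts: int,
) -> list[timedelta]:
    """Return the wait before each retry, i.e. before attempts 2..maximum_attempts."""
    delays: list[timedelta] = []
    delay = initial_interval
    for _ in range(maximum_attempts - 1):
        delays.append(min(delay, maximum_interval))  # cap each individual wait
        delay = delay * backoff_coefficient
    return delays


delays = retry_delays(
    initial_interval=timedelta(seconds=1),
    backoff_coefficient=2.0,
    maximum_interval=timedelta(seconds=10),
    maximum_attempts=6,
)
print([d.total_seconds() for d in delays])
```

Summing the delays gives a lower bound on how long a failing activity occupies the workflow, which is the number to compare against your timeouts.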

4. Workflow code that grows unbounded history

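A loop that runs forever grows the event history without bound. Temporal caps history size (roughly 50K events or 50 MB per execution), and replay slows down long before that. Use Continue-As-New to checkpoint and start fresh: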

if iteration > 1000:
    workflow.continue_as_new(...)
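The effect is easier to reason about as a chain of short runs that hand a cursor to each other. A toy model in plain Python, where `HISTORY_LIMIT`, `run_once`, and `run_to_completion` are hypothetical names chosen for this example:

```python
HISTORY_LIMIT = 1000  # hypothetical per-run event budget, chosen for the example


def run_once(cursor: int, total: int) -> tuple[int, int, bool]:
    """Process items from `cursor`; return (new_cursor, events_in_this_run, finished)."""
    events = 0
    while cursor < total:
        if events >= HISTORY_LIMIT:
            return cursor, events, False  # hand the cursor to a fresh run
        cursor += 1   # stands in for one activity call
        events += 1
    return cursor, events, True


def run_to_completion(total: int) -> list[int]:
    """Chain runs the way Continue-As-New does; return each run's history size."""
    sizes: list[int] = []
    cursor, finished = 0, False
    while not finished:
        cursor, events, finished = run_once(cursor, total)
        sizes.append(events)
    return sizes


print(run_to_completion(2500))
```

No single run's history exceeds the budget, and the only state carried across runs is the cursor, which is the argument you would pass to `workflow.continue_as_new(...)`.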

5. Treating it as a queue

A workflow per HTTP request creates serious overhead. Use Temporal for orchestration of multi-step work; for single-job fire-and-forget use a real queue. See Background Jobs in Python.

When I’d reach for it day one

Building a payments product (subscriptions, refunds, partial fulfillment).
Building an AI agent platform (multi-step, tool-using).
Replacing a tangle of Celery + cron + state-in-Postgres that broke too often.

For a small SaaS without these shapes, defer adopting Temporal until you feel the pain it solves. Don’t add infrastructure pre-emptively.
