By 2026 “workflow engine” is becoming as standard as “message queue” in production stacks. Temporal leads the category — $5B valuation in February, OpenAI / Block / Yum running it for mission-critical paths, and a model called durable execution that’s quietly reshaping how we build reliable systems.
This post covers the working knowledge you need to use it.
What durable execution is
Most application code looks like:
def process_order(order_id):
    charge = stripe.charge(order_id)
    inventory.reserve(order_id)
    shipping.schedule(order_id)
    notifier.send_confirmation(order_id)
Now ask: what happens if the process crashes between stripe.charge and inventory.reserve? Charged but no inventory reserved. You write retries. You write compensations. You write idempotency keys. You build a state machine in Postgres. You debug it for six months.
Durable execution flips this. The platform persists every step. If a worker crashes mid-flow, another worker resumes exactly where it left off. Your code looks like a regular function — the platform handles the durability:
from datetime import timedelta
from temporalio import workflow

@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order_id: str):
        # Timeouts are timedelta values, not bare numbers
        await workflow.execute_activity(stripe_charge, order_id, schedule_to_close_timeout=timedelta(seconds=60))
        await workflow.execute_activity(reserve_inventory, order_id, schedule_to_close_timeout=timedelta(seconds=30))
        await workflow.execute_activity(schedule_shipping, order_id, schedule_to_close_timeout=timedelta(seconds=30))
        await workflow.execute_activity(send_confirmation, order_id, schedule_to_close_timeout=timedelta(seconds=10))
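Logically, this function runs once and to completion, even if the process running it dies a hundred times. An individual activity can still be attempted more than once (a retry after a timeout, for example), so activities should be idempotent.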
How it actually works
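Temporal records an event history for each workflow. A simplified history for the order workflow:
1. WorkflowExecutionStarted
2. WorkflowTaskScheduled
3. WorkflowTaskStarted → worker picked up
4. WorkflowTaskCompleted → worker asked for "stripe_charge"
5. ActivityTaskScheduled → "stripe_charge"
6. ActivityTaskStarted
7. ActivityTaskCompleted → result: { charge_id: "..." }
8. WorkflowTaskScheduled, WorkflowTaskStarted, WorkflowTaskCompleted
9. ActivityTaskScheduled → "reserve_inventory"
... worker dies here; the attempt times out and the server retries it under the retry policy ...
10. ActivityTaskStarted → attempt 2
11. ActivityTaskCompleted
12. ...
Retried attempts do not each get their own events. The history records the final outcome, and ActivityTaskTimedOut appears only if the retries are exhausted or the overall timeout expires.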
When a worker resumes a workflow, it replays the history to rebuild the in-memory state. From the application’s perspective, time skipped forward; the function picks up after the last completed step.
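Replay is easier to see in miniature. The sketch below is a plain-Python toy, not Temporal's API: `ToyEngine`, `step`, and `crash_after` are hypothetical names invented for illustration. It records each completed step's result, simulates a worker crash, and shows that a second run reuses the recorded result rather than charging again.

```python
class Crash(Exception):
    """Simulates a worker dying mid-workflow."""


class ToyEngine:
    """Hypothetical in-memory stand-in for an event history; not Temporal's API."""

    def __init__(self):
        self.history = []     # (step name, result) for every completed step
        self.executions = []  # names of steps whose side effects actually ran

    def run(self, workflow_fn, crash_after=None):
        cursor = 0  # how far this run has advanced through the history

        def step(name, fn):
            nonlocal cursor
            if cursor < len(self.history):
                # Replay: the step already completed, so reuse its recorded result
                recorded_name, result = self.history[cursor]
                assert recorded_name == name, "workflow code diverged from history"
            else:
                if crash_after is not None and len(self.history) >= crash_after:
                    raise Crash(name)
                result = fn()
                self.executions.append(name)
                self.history.append((name, result))
            cursor += 1
            return result

        return workflow_fn(step)


def order_workflow(step):
    charge = step("stripe_charge", lambda: {"charge_id": "ch_1"})
    step("reserve_inventory", lambda: None)
    step("schedule_shipping", lambda: None)
    return f"fulfilled with {charge['charge_id']}"


engine = ToyEngine()
try:
    engine.run(order_workflow, crash_after=1)  # dies before reserve_inventory runs
except Crash:
    pass
result = engine.run(order_workflow)  # a "new worker" replays, then continues
print(result, engine.executions)
```

The workflow function is re-entered from the top on the second run, yet `stripe_charge` appears only once in `executions`. That is the sense in which the function "picks up after the last completed step".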
The key constraint: workflow code must be deterministic. No datetime.now(), no random.random(), no I/O. Use workflow.now(), workflow.uuid4(), workflow.execute_activity() — Temporal-provided primitives that produce the same result on replay.
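To see why determinism matters, here is a small stdlib-only sketch (the `discount_workflow` and `replay_matches` names are made up for illustration). It treats a workflow as a function that emits a sequence of commands, runs it twice as an original execution and a replay would, and compares the two sequences. A seeded generator, which is roughly the idea behind engine-provided randomness, always agrees with itself; an unseeded one can take a different branch on replay.

```python
import random


def discount_workflow(rng):
    """Return the sequence of commands this workflow would issue."""
    commands = ["charge"]
    # Branching on a random value decides which activity gets scheduled
    if rng.random() < 0.5:
        commands.append("apply_discount")
    commands.append("send_receipt")
    return commands


def replay_matches(make_rng):
    """Run the workflow twice, as an original execution and a replay would."""
    original = discount_workflow(make_rng())
    replayed = discount_workflow(make_rng())
    return original == replayed


# Seeded generator: every replay issues the same commands as the original run.
deterministic = all(replay_matches(lambda: random.Random(42)) for _ in range(100))

# Unseeded generator: some replays take the other branch and diverge.
mismatches = sum(not replay_matches(random.Random) for _ in range(1000))
print(deterministic, mismatches > 0)
```

When a replay diverges like this in a real engine, the recorded history no longer matches what the code wants to do, and the workflow cannot be resumed safely.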
When to reach for Temporal
Strong fits:
- Payments + side effects. Charge, reserve, fulfill, notify. Each step matters.
- Agent orchestration. AI agent calls tools across systems with retries, timeouts, human-in-the-loop. (See AI Agents with LangGraph.)
- Long-running flows. Subscription renewals, fraud reviews that span days, multi-step ML training.
- Sagas. Multi-service operations with compensations on failure.
- Periodic jobs that must complete. Replace cron + queue + retry logic with one workflow definition.
Weak fits:
- Stateless request/response APIs.
- Pure data pipelines (use Airflow / Dagster).
- High-throughput simple jobs (a million-per-second event processor — use Kafka + plain consumers).
A working example (Python SDK)
# activities.py
from typing import TypedDict

from temporalio import activity

class ChargeResult(TypedDict):
    charge_id: str
    amount: int

@activity.defn
async def stripe_charge(order_id: str) -> ChargeResult:
    # Real call to Stripe; raises on error → Temporal retries per policy
    ...

@activity.defn
async def reserve_inventory(order_id: str) -> None:
    ...

@activity.defn
async def refund_charge(charge_id: str) -> None:
    ...
# workflow.py
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

with workflow.unsafe.imports_passed_through():
    from .activities import stripe_charge, reserve_inventory, refund_charge

@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order_id: str) -> str:
        retry = RetryPolicy(maximum_attempts=5, initial_interval=timedelta(seconds=1))
        charge = await workflow.execute_activity(
            stripe_charge, order_id,
            start_to_close_timeout=timedelta(seconds=60),
            retry_policy=retry,
        )
        try:
            await workflow.execute_activity(
                reserve_inventory, order_id,
                start_to_close_timeout=timedelta(seconds=30),
                retry_policy=retry,
            )
        except Exception:
            # Compensation: refund the charge, then re-raise so the workflow fails
            await workflow.execute_activity(
                refund_charge, charge["charge_id"],
                start_to_close_timeout=timedelta(seconds=30),
            )
            raise
        return f"order {order_id} fulfilled"
# worker.py
# The relative imports below mean this file runs as a module inside a package
import asyncio

from temporalio.client import Client
from temporalio.worker import Worker

from .workflow import OrderWorkflow
from .activities import stripe_charge, reserve_inventory, refund_charge

async def main():
    client = await Client.connect("temporal:7233")
    worker = Worker(
        client,
        task_queue="orders",
        workflows=[OrderWorkflow],
        activities=[stripe_charge, reserve_inventory, refund_charge],
    )
    await worker.run()

if __name__ == "__main__":
    asyncio.run(main())
# starter (e.g. from FastAPI); runs inside an async handler where order_id is known
client = await Client.connect("temporal:7233")
handle = await client.start_workflow(
    OrderWorkflow.run,
    order_id,
    id=f"order-{order_id}",
    task_queue="orders",
)
result = await handle.result()
That’s a complete durable order pipeline. The crash story is free — kill the worker mid-flow, restart, the workflow continues.
The saga pattern, simplified
The “if step 3 fails, compensate steps 2 and 1” pattern that takes hundreds of lines without a workflow engine becomes a try/except with await workflow.execute_activity(compensate_X, ...) calls in the except block. Temporal makes the compensation logic indistinguishable from happy-path logic.
I covered the principle in Idempotency, Retries, and Exactly-Once Illusions. Temporal automates 80% of the boilerplate.
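With more than one step to undo, a common shape is a stack of compensations that unwinds in reverse. The sketch below uses plain asyncio so it runs without a Temporal server; `run_saga` and `record` are hypothetical helpers, and in a real workflow each action and compensation would be an `await workflow.execute_activity(...)` call.

```python
import asyncio


async def run_saga(steps):
    """Run (action, compensation) pairs; on failure, undo completed steps in reverse."""
    compensations = []
    try:
        for action, compensation in steps:
            await action()
            # Register the undo only after the action has succeeded
            compensations.append(compensation)
    except Exception:
        for compensation in reversed(compensations):
            await compensation()
        raise


log = []


def record(name, fail=False):
    """Build a fake step that logs its name, or raises if fail is set."""
    async def _step():
        if fail:
            raise RuntimeError(f"{name} failed")
        log.append(name)
    return _step


steps = [
    (record("charge"), record("refund")),
    (record("reserve"), record("release")),
    (record("ship", fail=True), record("cancel_shipment")),
]

try:
    asyncio.run(run_saga(steps))
except RuntimeError as exc:
    log.append(str(exc))
print(log)
```

The failed step's own compensation never runs, because it is only registered after its action succeeds. Whether that is right depends on whether the action can fail after a partial side effect; if it can, register the compensation first and make it tolerate "nothing to undo".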
Temporal for AI agents
Agentic workflows are perfect for Temporal:
- Long-running (LLM calls take seconds; workflows take minutes/hours).
- Multi-step with branches.
- Tool calls that fail and need retries.
- Human-in-the-loop pauses.
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

with workflow.unsafe.imports_passed_through():
    # Assumed module: llm_plan, run_tool, llm_synthesize are activities defined elsewhere
    from .activities import llm_plan, run_tool, llm_synthesize

@workflow.defn
class ResearchAgent:
    def __init__(self) -> None:
        self._approved = False
        self._rejected = False

    @workflow.run
    async def run(self, question: str) -> str:
        plan = await workflow.execute_activity(
            llm_plan, question, start_to_close_timeout=timedelta(seconds=30),
        )
        results = []
        for step in plan["steps"]:
            if step.get("require_approval"):
                # Pause before the risky tool call until an external signal arrives
                await workflow.wait_condition(lambda: self._approved or self._rejected)
                if self._rejected:
                    return "user rejected"
                self._approved = False  # each risky step needs its own approval
            r = await workflow.execute_activity(
                run_tool, step,
                start_to_close_timeout=timedelta(seconds=60),
                retry_policy=RetryPolicy(maximum_attempts=3),
            )
            results.append(r)
        answer = await workflow.execute_activity(
            llm_synthesize, {"question": question, "results": results},
            start_to_close_timeout=timedelta(seconds=60),
        )
        return answer

    @workflow.signal
    def approve(self) -> None:
        self._approved = True

    @workflow.signal
    def reject(self) -> None:
        self._rejected = True
This single workflow has:
- LLM planning step.
- Tool execution loop with retries.
- Human approval (signal-based) for risky tool calls.
- Final synthesis.
If the worker crashes during step 3 of 7, a fresh worker resumes at step 3. State, including which tools have run, is reconstructed from the event history.
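One thing the workflow above leaves open is what happens if nobody ever answers the approval request. A common answer is to bound the wait. The sketch below models that with plain asyncio rather than the Temporal SDK; `ApprovalGate` and its methods are hypothetical names. In a workflow, the equivalent would be passing a timeout to the wait and handling the timeout error.

```python
import asyncio


class ApprovalGate:
    """Toy approval gate: wait for a decision, but give up after a timeout."""

    def __init__(self):
        self._decision = None
        self._event = asyncio.Event()

    def approve(self):
        self._decision = "approved"
        self._event.set()

    def reject(self):
        self._decision = "rejected"
        self._event.set()

    async def wait(self, timeout):
        try:
            await asyncio.wait_for(self._event.wait(), timeout)
        except asyncio.TimeoutError:
            return "timed_out"
        return self._decision


async def demo():
    # Gates are created inside the running loop so the Event binds to it
    answered = ApprovalGate()
    asyncio.get_running_loop().call_later(0.01, answered.approve)
    first = await answered.wait(timeout=1.0)

    ignored = ApprovalGate()  # nobody ever responds to this one
    second = await ignored.wait(timeout=0.05)
    return first, second


outcome = asyncio.run(demo())
print(outcome)
```

What to do on a timeout (treat it as a rejection, escalate, or remind the approver and wait again) is a product decision; the point is that the workflow makes that choice explicitly instead of waiting forever.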
Temporal vs alternatives
| | Temporal | Airflow | AWS Step Functions | Inngest | Restate |
|---|---|---|---|---|---|
| Code-first | Yes | Python DSL | Visual / JSON | Yes | Yes |
| Long-running | Excellent | Limited | Excellent | Excellent | Excellent |
| Self-host | Yes | Yes | No | Yes (cloud + OSS) | Yes |
| AI agent fit | Excellent | Mediocre | Mediocre | Strong | Strong |
| SDKs | Go, Java, Python, TS, .NET, Ruby, PHP | Python | Many | TS first | Multi-language |
| Maturity | High | High | High | Mid | Newer |
Pick:
- Temporal for serious production workflows that span services and time.
- Airflow / Dagster for batch data pipelines (ETL).
- Step Functions if you’re fully on AWS and prefer managed.
- Inngest for TS-first lightweight workflows.
- Restate for the newest, simpler durable execution model with deeper distributed transactions story.
Operating Temporal
You can run Temporal three ways:
- Temporal Cloud — managed. ~$200+/month minimum but no ops.
- Self-hosted on Kubernetes — Helm chart works. Cassandra or Postgres backend.
- Temporal CLI dev server (temporal server start-dev, the successor to Temporalite): for local dev only.
Operationally:
- Scaling. Workers are horizontal; servers depend on the persistence backend.
- Observability. Built-in UI shows every workflow execution. Add OpenTelemetry for cross-service traces; see OpenTelemetry End-to-End.
- Versioning. Long-running workflows are sensitive to code changes. Use workflow.patched() or Worker Versioning to evolve workflow code safely.
- Retention. Configure how long completed workflow histories are kept. Default 30 days; confirm the value for your namespace.
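The versioning point is the least intuitive of these, so here is a toy model of how a patch check behaves. This is plain Python, not the SDK: the real `workflow.patched()` takes only a patch id, while this hypothetical `patched` takes the marker set and a replay flag explicitly so the behaviour is visible.

```python
def patched(patch_id, markers, replaying):
    """Toy model of a patch check (assumption: simplified from the real semantics)."""
    if replaying:
        # An execution being replayed takes the new branch only if it recorded the marker
        return patch_id in markers
    markers.add(patch_id)  # a fresh execution records the marker and takes the new branch
    return True


def notify(markers, replaying):
    if patched("use-sms", markers, replaying):
        return "send_sms"    # new code path
    return "send_email"      # original path, kept for workflows already in flight


# A workflow that started before the patch was deployed has no marker in its history.
old_run = notify(markers=set(), replaying=True)

# A workflow that starts after the deploy records the marker, then replays consistently.
new_markers = set()
new_run = notify(markers=new_markers, replaying=False)
new_replay = notify(markers=new_markers, replaying=True)
print(old_run, new_run, new_replay)
```

Both branches have to stay in the code until every workflow that started on the old path has finished, which is why long retention and long-running workflows make versioning a real operational concern.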
Common mistakes
1. Doing I/O in workflow code
@workflow.run
async def run(self):
    response = httpx.get("https://...")  # ⛔ I/O in workflow
Workflow code must be deterministic. I/O goes in activities. Activities can do anything.
2. Using datetime.now()
now = datetime.now() # ⛔
now = workflow.now() # ✅
Workflow code must produce the same values on replay. Use Temporal-provided primitives.
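3. Relying on the default retry policy
await workflow.execute_activity(act, ...) # default policy: retries with backoff, no attempt limit
Without an explicit policy, Temporal retries a failing activity with exponential backoff until a timeout expires, so a permanent failure (bad input, a declined card) keeps retrying instead of surfacing. Set retry_policy=RetryPolicy(maximum_attempts=..., non_retryable_error_types=[...]) to bound the retries and fail fast on errors that will never succeed.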
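It helps to work out what a given policy means in wall-clock terms before picking numbers. The helper below is a stdlib sketch (`retry_intervals` is a hypothetical function, not part of the SDK) whose parameter names mirror the `RetryPolicy` fields. It assumes the usual exponential scheme: each wait is the previous one times the backoff coefficient, capped at the maximum interval.

```python
from datetime import timedelta


def retry_intervals(initial_interval, backoff_coefficient, maximum_interval, maximum_attempts):
    """Waits between attempts: initial * coefficient**n, capped at maximum_interval."""
    intervals = []
    wait = initial_interval
    for _ in range(maximum_attempts - 1):  # no wait follows the final attempt
        intervals.append(min(wait, maximum_interval))
        wait = wait * backoff_coefficient
    return intervals


schedule = retry_intervals(
    initial_interval=timedelta(seconds=1),
    backoff_coefficient=2.0,
    maximum_interval=timedelta(seconds=10),
    maximum_attempts=6,
)
total_wait = sum(schedule, timedelta())
print([s.total_seconds() for s in schedule], total_wait.total_seconds())
```

Six attempts with these values spend 25 seconds waiting in total, on top of however long each attempt runs. If the activity's overall timeout is shorter than that, the timeout ends the retries first, not the attempt limit.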
4. Workflow code that grows unbounded history
A loop that runs forever grows the event history without bound, which slows replay and eventually runs into Temporal's history size limits. Use Continue-As-New to checkpoint and start fresh:
if iteration > 1000:
    workflow.continue_as_new(...)
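The mechanics are simple once you treat each run as a function from carried-over state to either a result or a request to restart. This is a plain-Python toy with hypothetical names (`poll_workflow`, the 2500-item workload); the list stands in for one run's event history.

```python
def poll_workflow(cursor, max_iterations=1000):
    """Process up to max_iterations items, then hand the remaining state to a new run."""
    history = []  # stands in for this run's event history
    for _ in range(max_iterations):
        if cursor >= 2500:  # hypothetical end of the work
            return ("done", cursor, len(history))
        history.append(f"processed {cursor}")
        cursor += 1
    # Budget used up: restart with only the cursor, dropping the accumulated history
    return ("continue_as_new", cursor, len(history))


runs = []  # history length of each run
state = 0
while True:
    status, state, events = poll_workflow(state)
    runs.append(events)
    if status == "done":
        break
print(runs, state)
```

No single run ever holds more than 1000 events, even though 2500 items are processed overall. The only thing that crosses the boundary is the state you pass explicitly, so anything the next run needs has to be in those arguments.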
5. Treating it as a queue
A workflow per HTTP request creates serious overhead. Use Temporal for orchestration of multi-step work; for single-job fire-and-forget use a real queue. See Background Jobs in Python.
When I’d reach for it day one
- Building a payments product (subscriptions, refunds, partial fulfillment).
- Building an AI agent platform (multi-step, tool-using).
- Replacing a tangle of Celery + cron + state-in-Postgres that broke too often.
For a small SaaS without these shapes, defer adopting Temporal until you feel the pain it solves. Don’t add infrastructure pre-emptively.
Read this next
- Idempotency, Retries, and Exactly-Once Illusions — the fundamentals Temporal automates.
- Distributed Systems Fundamentals
- AI Agents with LangGraph in 2026 — the agent orchestration alternative.
- Background Jobs in Python
If you want a small Temporal + FastAPI starter that wires up a saga + AI agent workflow with OTel tracing, it’s at rajpoot.dev.
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.