Agents fail in shapes single-LLM calls don’t. Loops, runaway tool calls, infinite “let me try one more thing.” This post is the working playbook for keeping agents bounded and recoverable.

Hard step caps

The single most important guardrail:

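MAX_STEPS = 12

async def run_agent(state):
    for step in range(MAX_STEPS):
        response = await llm.invoke(state)
        if response.is_done:
            return response
        # Fold tool results back into the running state instead of replacing it.
        state = await execute_tools(state, response.tool_calls)
    # Only reached when every step was used without finishing.
    raise AgentExceededSteps(f"didn't finish in {MAX_STEPS} steps")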

If the agent doesn’t finish in 12 steps, stop. Document. Escalate.

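In LangGraph, the equivalent is recursion_limit in the run config, e.g. config={"recursion_limit": 12}. It counts graph super-steps rather than LLM turns, so a model-then-tools loop uses roughly two per agent step.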
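
One way to make the “Document” step concrete, sketched here with a stand-in model: have the exception carry a trace of what the agent tried, so whoever picks up the escalation does not have to reconstruct it. `StuckModel`, `Response`, and the `trace` attribute are hypothetical names for illustration, not part of any library.

```python
import asyncio
from dataclasses import dataclass, field

MAX_STEPS = 12


class AgentExceededSteps(Exception):
    """Step budget exhausted; carries the trace so the failure can be documented."""

    def __init__(self, max_steps, trace):
        super().__init__(f"didn't finish in {max_steps} steps")
        self.trace = trace


@dataclass
class Response:
    is_done: bool
    tool_calls: list = field(default_factory=list)


class StuckModel:
    """Hypothetical stand-in for an LLM client that keeps asking for the same tool."""

    async def invoke(self, state):
        return Response(is_done=False, tool_calls=["search"])


async def run_agent(llm, state, max_steps=MAX_STEPS):
    trace = []
    for step in range(max_steps):
        response = await llm.invoke(state)
        if response.is_done:
            return response
        trace.append({"step": step, "tool_calls": response.tool_calls})
        # Stand-in for real tool execution: append results to the history.
        state = state + [f"result of {name}" for name in response.tool_calls]
    raise AgentExceededSteps(max_steps, trace)
```

Catching `AgentExceededSteps` then gives you `exc.trace` to attach to a ticket or log line.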

Tool error semantics

A tool error gives the model information. Useful errors:

  • ✅ “Order #42 not found. The user may have mistyped the ID.”
  • ✅ “Database temporarily unavailable. Try again in a moment.”
  • ✅ “You don’t have permission to view orders for tenant 7.”

Useless errors:

  • ❌ “Error: 500”
  • ❌ “Internal server error”
  • ❌ “Exception: NullPointerException at…”

Useful errors guide the model. Useless ones cause loops.
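
A minimal sketch of how a tool wrapper might turn exceptions into the useful kind of message. The exception classes and `lookup_order` are hypothetical; the point is that every failure path returns a sentence that tells the model what to do next, and raw internals never reach the prompt.

```python
class OrderNotFound(Exception):
    """Hypothetical domain error: the order ID does not exist."""


class DatabaseUnavailable(Exception):
    """Hypothetical transient infrastructure error."""


def describe_error(exc: Exception, order_id: int) -> str:
    """Turn an exception into a message that tells the model what to do next."""
    if isinstance(exc, OrderNotFound):
        return f"Order #{order_id} not found. The user may have mistyped the ID."
    if isinstance(exc, DatabaseUnavailable):
        return "Database temporarily unavailable. Try again in a moment."
    if isinstance(exc, PermissionError):
        return "You don't have permission to view this order. Do not retry; tell the user."
    # Unknown failure: keep internals out of the prompt, but still give a next step.
    return "The order lookup failed unexpectedly. Do not retry; report the failure to the user."


def lookup_order(order_id: int, orders: dict) -> str:
    """Tool wrapper: always returns a string the model can act on."""
    try:
        if order_id not in orders:
            raise OrderNotFound(order_id)
        return f"Order #{order_id}: {orders[order_id]}"
    except Exception as exc:
        return describe_error(exc, order_id)
```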

See also: tool design patterns.

Tool-level retries

For transient errors, retry inside the tool wrapper:

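import asyncio
import httpx

@tool
async def fetch_url(url: str) -> str:
    attempts = 3
    for attempt in range(attempts):
        try:
            return await httpx_get(url)
        except httpx.NetworkError:
            if attempt < attempts - 1:   # no point sleeping after the last try
                await asyncio.sleep(0.5 * 2 ** attempt)   # 0.5s, then 1s
    return f"Network error fetching {url} after {attempts} attempts. The site may be down."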

The agent doesn’t know about the retries — it just gets the eventual result or a clean error message.

Fallback paths

For critical operations, define fallbacks:

async def search(query: str):
    try:
        return await primary_search(query)
    except SearchUnavailable:
        return await fallback_search(query)   # weaker but available

The agent never sees the failure. The system degrades gracefully.

Whole-agent retries

For full agent runs that fail:

async def run_agent_with_retry(input):
    for attempt in range(3):
        try:
            async with asyncio.timeout(120):
                return await agent.run(input)
        except (AgentExceededSteps, asyncio.TimeoutError):
            if attempt == 2:
                raise
            await asyncio.sleep(2 ** attempt)

Retry the whole flow. Useful for transient issues; useless for persistent ones.

Escalation

When the agent can’t proceed:

  • Human-in-the-loop: pause; ask the user.
  • Hand off to bigger model: Sonnet failed → try Opus.
  • Open a ticket: log the failed run; engineer follows up.

if attempts_exceeded:
    return AgentResult(
        status="failed",
        partial=last_response,
        escalation_url=create_ticket(input, history),
    )

For durable execution, Temporal patterns help here.
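
The hand-off and ticket options can be chained into one ladder. This is a sketch under assumptions: `AgentFailed`, the runner functions, and the `open_ticket` callback are hypothetical stand-ins for your own agent runs and ticketing client.

```python
import asyncio


class AgentFailed(Exception):
    """Hypothetical error raised when one agent run gives up."""


async def run_with_escalation(task, runners, open_ticket):
    """Try each (name, runner) pair in order, cheapest first; open a ticket if all fail."""
    errors = []
    for name, runner in runners:
        try:
            output = await runner(task)
            return {"status": "ok", "model": name, "output": output}
        except AgentFailed as exc:
            errors.append(f"{name}: {exc}")
    # Every rung failed: hand the run to a human with the full error list.
    return {
        "status": "failed",
        "errors": errors,
        "escalation_url": open_ticket(task, errors),
    }


async def small_model(task):
    raise AgentFailed("exceeded step budget")  # stand-in for the cheaper model


async def big_model(task):
    return f"answer to {task!r}"  # stand-in for the larger model
```

Calling `run_with_escalation(task, [("small", small_model), ("big", big_model)], open_ticket)` tries the cheaper runner first and only pays for the larger one when the first gives up.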

Cost circuit breakers

async def run_agent_with_budget(input, max_cost_usd=1.0):
    cost = 0
    for step in range(MAX_STEPS):
        if cost > max_cost_usd:
            raise CostBudgetExceeded(cost)
        response = await llm.invoke(state)
        cost += response.usage.cost_usd
        ...

A bug in your prompt → agent loops 50 times → $20 burnt before you notice. Bound by cost.
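
If your client reports token counts rather than a dollar figure, the breaker can do the arithmetic itself. A sketch, assuming made-up prices in `PRICE_PER_MTOK` (replace them with your provider’s rates) and a hypothetical `CostMeter` helper:

```python
from dataclasses import dataclass

# Assumed prices in USD per million tokens; these are placeholders, not real rates.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}


class CostBudgetExceeded(Exception):
    pass


@dataclass
class CostMeter:
    max_cost_usd: float = 1.0
    spent_usd: float = 0.0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        """Add one call's cost and trip the breaker once the budget is exceeded."""
        self.spent_usd += (
            input_tokens * PRICE_PER_MTOK["input"]
            + output_tokens * PRICE_PER_MTOK["output"]
        ) / 1_000_000
        if self.spent_usd > self.max_cost_usd:
            raise CostBudgetExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.max_cost_usd:.2f}"
            )
```

Call `meter.charge(...)` right after each LLM response. This version trips immediately after the call that crosses the line instead of at the top of the next step; either way the overshoot is at most one call.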

State checkpointing

For long-running agents, persist state at each step:

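@workflow.run
async def agent_workflow(self, input):
    # Activities need a timeout; 60s is an assumed value, tune it per activity.
    opts = dict(start_to_close_timeout=timedelta(seconds=60))
    state = await workflow.execute_activity(init_state, input, **opts)
    for step in range(MAX_STEPS):
        decision = await workflow.execute_activity(llm_decide, state, **opts)
        if decision.done:
            return decision.output
        # The Python SDK takes multiple activity arguments via args=[...].
        state = await workflow.execute_activity(run_tool, args=[decision.tool, decision.args], **opts)
        await workflow.sleep(timedelta(milliseconds=10))   # safe checkpoint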

If the worker dies, Temporal replays the workflow history on another worker and resumes after the last completed activity.
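
Without a workflow engine, the same idea can be sketched with a checkpoint file. This is a simplified illustration, not a replacement for durable execution: the `run` loop and the `crash_at` parameter are hypothetical and exist only to show that a restarted run picks up where the last one stopped.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def save_checkpoint(path: Path, step: int, state: dict) -> None:
    """Write the checkpoint atomically so a crash never leaves a half-written file."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump({"step": step, "state": state}, tmp)
    os.replace(tmp_name, path)  # atomic rename on the same filesystem


def load_checkpoint(path: Path) -> Tuple[int, dict]:
    """Return (next_step, state), or a fresh start if no checkpoint exists."""
    if not path.exists():
        return 0, {"history": []}
    data = json.loads(path.read_text())
    return data["step"], data["state"]


def run(path: Path, max_steps: int, crash_at: Optional[int] = None) -> dict:
    step, state = load_checkpoint(path)
    while step < max_steps:
        if step == crash_at:
            raise RuntimeError("simulated worker crash")
        state["history"].append(f"step {step}")  # stand-in for one LLM + tool step
        step += 1
        save_checkpoint(path, step, state)
    return state
```

A step that crashes before its checkpoint is written will run again on resume, so tools with side effects still need to be idempotent.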

Common failure modes

Loop on a missing field

The model expects “user.email” in tool output; tool returned “email”. Model says “let me try again with email.” Loops.

Fix: descriptive error or stable schema.
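
A sketch of both fixes in one wrapper, assuming a hypothetical `normalize_user` helper and a schema of `user.id` plus `user.email`: the tool output is normalized to one shape, and a missing field produces an error that says retrying with different arguments will not help.

```python
def normalize_user(raw: dict) -> dict:
    """Map the tool's raw payload onto the schema promised in the tool description."""
    user = raw.get("user", raw)  # accept both nested and flat payloads
    missing = [key for key in ("id", "email") if key not in user]
    if missing:
        raise ValueError(
            f"Tool output is missing {missing}. This is a tool bug, not a bad "
            "argument; do not retry with different inputs."
        )
    # Always return the nested shape, whatever the upstream API sent.
    return {"user": {"id": user["id"], "email": user["email"]}}
```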

Loop on stale data

Cache returns yesterday’s value; model thinks it’s wrong; calls again. Same data. Loop.

Cause: the agent has no way to bust the cache. Fix: expose a force-refresh tool or parameter. Or just bound steps.
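
A force-refresh escape hatch might look like the sketch below. `CachedTool` and its `force_refresh` flag are hypothetical; the useful part is that a cached answer says it is cached, so the model can ask for live data once instead of repeating the same call.

```python
import time


class CachedTool:
    """Wrap a fetch function with a TTL cache the model can explicitly bypass."""

    def __init__(self, fetch, ttl_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # key -> (fetched_at, value)

    def __call__(self, key, force_refresh=False):
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and not force_refresh and now - hit[0] < self._ttl:
            age = now - hit[0]
            # Tell the model the data is cached so it can decide, once, to refresh.
            return f"{hit[1]} (cached {age:.0f}s ago; pass force_refresh=True for live data)"
        value = self._fetch(key)
        self._cache[key] = (now, value)
        return f"{value} (fresh)"
```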

Loop on auth

API returned 401; model retries with the same token; 401; loops.

Fix: don’t retry on 4xx (rate limits like 429 aside). Surface to user.
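
One way to encode that rule is a small classifier in the HTTP tool wrapper. The `RETRYABLE` set is my assumption about which statuses are worth retrying; adjust it to the API you are calling.

```python
# Assumption: these are the statuses worth retrying. 408 and 429 are 4xx codes
# that can succeed later, so they are treated as exceptions to "no retry on 4xx".
RETRYABLE = {408, 429, 500, 502, 503, 504}


def classify_status(status: int) -> tuple:
    """Return (action, message_for_model) for an HTTP status code."""
    if 200 <= status < 300:
        return "ok", ""
    if status in (401, 403):
        return "stop", (
            f"The API rejected the credentials (HTTP {status}). Retrying with the "
            "same token will fail again. Ask the user to re-authenticate."
        )
    if status in RETRYABLE:
        return "retry", f"Temporary failure (HTTP {status}). The wrapper will retry."
    if 400 <= status < 500:
        return "stop", f"The request was rejected (HTTP {status}). Do not retry unchanged."
    return "stop", f"Unexpected HTTP status {status}. Report this to the user."
```

Only the "retry" action should stay inside the wrapper; a "stop" message goes back to the model so it can surface the problem instead of looping.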

Hallucinated tool

Model invents a tool that doesn’t exist. Tool dispatcher returns “unknown tool”; model retries with another invented name.

Fix: clear error: “Tool X doesn’t exist. Available tools: [list]. Use one of these or finish.”
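
A dispatcher that produces this kind of error, as a minimal sketch. The `dispatch` function and the tool registry shape (a dict of name to callable) are assumptions; the "did you mean" hint uses the standard library’s `difflib`.

```python
import difflib


def dispatch(name, args, tools):
    """Call a registered tool, or return an error that names the real options."""
    if name not in tools:
        available = ", ".join(sorted(tools))
        close = difflib.get_close_matches(name, list(tools), n=1)
        hint = f" Did you mean '{close[0]}'?" if close else ""
        return (
            f"Tool '{name}' doesn't exist.{hint} "
            f"Available tools: [{available}]. Use one of these or finish."
        )
    return tools[name](**args)
```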

Cost spiral

Each step generates a long thought trace. Per-step cost grows. Total balloons.

Fix: cost circuit breaker. Cap reasoning tokens.

Observability

Log every agent run:

  • Input.
  • Number of steps taken.
  • Tools called + outputs.
  • Final outcome.
  • Cost.
  • Latency.

Aggregate: which agents loop most? Which tools error most? Which prompts cost most?
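
A sketch of what one run record and the three aggregate questions could look like. The `RunRecord` fields mirror the list above; the field names and outcome labels are hypothetical, so map them onto whatever your logging stack already uses.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass, field


@dataclass
class RunRecord:
    agent: str
    input: str
    steps: int
    outcome: str          # e.g. "ok", "exceeded_steps", "over_budget"
    cost_usd: float
    latency_s: float
    tool_calls: list = field(default_factory=list)  # [{"tool": str, "ok": bool}, ...]


def to_log_line(record: RunRecord) -> str:
    """One JSON object per run, ready for any line-oriented log pipeline."""
    return json.dumps(asdict(record), sort_keys=True)


def loop_counts(records) -> Counter:
    """Which agents loop most? Counts runs that hit the step cap."""
    return Counter(r.agent for r in records if r.outcome == "exceeded_steps")


def tool_error_counts(records) -> Counter:
    """Which tools error most?"""
    return Counter(
        call["tool"] for r in records for call in r.tool_calls if not call["ok"]
    )


def cost_by_agent(records) -> dict:
    """Which agents cost most?"""
    totals = Counter()
    for r in records:
        totals[r.agent] += r.cost_usd
    return dict(totals)
```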

See also: LLM observability.

What I’d ship today

For an agent product:

  1. MAX_STEPS = 12, hard.
  2. Cost cap per run ($1 default).
  3. Per-tool retries with sane backoff.
  4. Descriptive tool errors.
  5. Whole-agent fallback to bigger model on hard failures.
  6. Escalation hook to human / ticket on terminal failure.
  7. Observability on every run.

Boring. Bounded. Survivable.

Read this next

If you want my agent-error-recovery toolkit + observability dashboard, it’s at rajpoot.dev.

