In 2026, most serious LLM applications sit behind an AI gateway. The gateway gives you one API surface for many providers, plus automatic fallback, caching, rate limiting, observability, and per-team cost tracking and attribution. This post is a working guide to picking and using one.
What an AI gateway does
           ┌──────────────────────────┐
App ──────▶│        AI Gateway        │
           │  • Auth & rate limit     │
           │  • Routing & fallback    │
           │  • Caching               │
           │  • Cost tracking         │
           │  • Observability         │
           └──────┬───────────────────┘
                  │
    ┌─────────────┼──────────────┐
    ▼             ▼              ▼
Anthropic      OpenAI         Google
               / vLLM         / Together
Your app talks to one URL. The gateway:
- Routes — picks the right provider/model per request.
- Falls back — if Anthropic is down, retry with OpenAI.
- Caches — same prompt → cached response.
- Rate limits — per team, per model, per dollar.
- Tracks — every call logged for billing and debugging.
- Standardizes — OpenAI-compatible API, regardless of upstream.
The “OpenAI-compatible façade” is the common shape: gateways accept the Chat Completions API; your app uses the OpenAI SDK with base_url= pointing at the gateway.
Why you want one
Even for a single-provider single-model app:
- Caching — repeated prompts hit cache, cut cost 30–80%.
- Retry/fallback — one provider being down doesn’t take you down.
- Observability — every call traced; per-user / per-feature attribution.
- Cost control — set hard limits per team / API key.
- Rate limiting — protect your provider quotas and your budget from your own traffic spikes.
- Easy provider swap — when a model retires, change config not code.
The cost is one extra hop (~5–20ms). For most LLM apps this is rounding error.
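Of the benefits above, rate limiting is the easiest to picture in code. Here is a minimal token-bucket sketch of the kind of check a gateway might apply per key. The `TokenBucket` class and its parameters are hypothetical names for illustration, not any particular gateway's implementation, and the clock is passed in explicitly so the behavior is easy to reason about.

```python
class TokenBucket:
    """Hypothetical per-key limiter: refills at rate_per_s, holds at most capacity."""

    def __init__(self, rate_per_s: float, capacity: float, now: float = 0.0):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity      # start full so short bursts are allowed
        self.updated = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last check, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                # caller would answer with HTTP 429


bucket = TokenBucket(rate_per_s=1.0, capacity=2.0)
# Three requests at t=0 (burst of 2 allowed), then one more a second later.
decisions = [bucket.allow(now=t) for t in (0.0, 0.0, 0.0, 1.0)]
```

In a real service you would pass `time.monotonic()` as `now` and keep one bucket per virtual key or per model.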
The contenders
LiteLLM (open source, self-host or managed)
- Best for: Teams that want control. Self-host on a small VM or Kubernetes.
- Strengths: Supports 100+ providers, OpenAI-compatible, virtual keys, rate limit + cost budget per key.
- Weaknesses: Operate it yourself; UI is functional, not delightful.
# config.yaml
model_list:
  - model_name: smart              # alias your app uses
    litellm_params:
      model: claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: smart-fallback
    litellm_params:
      model: gpt-5-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: cheap
    litellm_params:
      model: claude-haiku-4-5
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  routing_strategy: simple-shuffle
  fallbacks:
    - smart: [smart-fallback]      # shape is {alias: [fallback aliases]}

litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: redis
# Your app: the OpenAI SDK pointed at LiteLLM
from openai import OpenAI

client = OpenAI(base_url="http://litellm:4000", api_key="sk-virtual-team-1")
resp = client.chat.completions.create(
    model="smart",
    messages=[{"role": "user", "content": "hi"}],
)
LiteLLM is the default I reach for when I want a self-hostable gateway. The Helm chart is solid, and it runs on minimal resources.
Portkey (managed)
- Best for: Teams that want a polished dashboard and minimal ops.
- Strengths: Beautiful observability dashboard, prompt management, virtual keys, automatic retries, semantic caching.
- Weaknesses: Closed source; subscription cost.
Portkey is the SaaS gateway I’d pick for a fast-moving team that doesn’t want to think about gateway operations.
Helicone (managed, observability-focused)
- Best for: Teams that want observability above all.
- Strengths: Excellent traces, cost analytics, prompt versioning, plays nicely with LangChain/LangGraph.
- Weaknesses: Less of a routing/fallback gateway; more of an LLM-traffic-observer with proxy bonus.
Pair Helicone with LiteLLM if you want both worlds: LiteLLM for routing, Helicone for visibility. Or use Helicone alone if you mostly care about “what did our app call, when, why.”
OpenRouter (managed router)
- Best for: Multi-provider experimentation with one billing relationship.
- Strengths: 100+ models behind one API key, including open-source via various hosts.
- Weaknesses: Less control over per-team policy; primarily a router, less of a full gateway.
Great for indie devs and small teams that want access to many models without separate accounts.
Patterns you’ll add quickly
1. Fallback chain
primary:    claude-sonnet-4-6
fallback 1: gpt-5-mini          (the gateway adapts the Anthropic-style payload)
fallback 2: claude-haiku-4-5
When the primary returns a 5xx or a rate-limit error, the gateway tries the next model in the chain. As long as one of them answers, your app sees a successful response.
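To make the fallback chain concrete, here is a minimal sketch of the loop a gateway might run internally. `call_with_fallback`, `ProviderError`, and the two fake provider functions are hypothetical names for illustration; a real gateway also handles timeouts, backoff, and payload translation.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Hypothetical error standing in for an HTTP failure from an upstream."""

    def __init__(self, status: int):
        super().__init__(f"upstream returned {status}")
        self.status = status


# Rate limits and server errors are worth retrying elsewhere; a 400 is not.
RETRYABLE = {429, 500, 502, 503, 504}


def call_with_fallback(
    chain: Sequence[tuple[str, Callable[[str], str]]], prompt: str
) -> tuple[str, str]:
    """Try each (model, call) pair in order; return (model_used, response)."""
    last_error = None
    for model, call in chain:
        try:
            return model, call(prompt)
        except ProviderError as err:
            if err.status not in RETRYABLE:
                raise  # a bad request will fail on every provider
            last_error = err
    raise RuntimeError("all providers in the chain failed") from last_error


def primary_down(prompt: str) -> str:
    raise ProviderError(503)  # simulate an outage on the primary


def fallback_ok(prompt: str) -> str:
    return f"echo: {prompt}"


chain = [("claude-sonnet-4-6", primary_down), ("gpt-5-mini", fallback_ok)]
model_used, text = call_with_fallback(chain, "hi")
```

Note that the sketch still raises when every model in the chain fails, so the app needs its own error path for that case.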
2. Caching
Two flavors:
- Exact match. Identical prompt → identical response. Great for FAQ-style use.
- Semantic. Embedding-based similarity. Risky for conversational use; safe for “what’s the capital of France.”
Cache hit rates of 30%+ on production LLM apps are common. That is roughly 30% of your provider calls you no longer pay for.
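Here is a minimal sketch of the exact-match flavor, assuming the cache key is a hash over the model, the full message history, and any sampling parameters. `cache_key`, `ExactCache`, and `fake_llm` are hypothetical names; real gateways typically back this with Redis and a TTL.

```python
import hashlib
import json


def cache_key(model: str, messages: list[dict], **params) -> str:
    """Hash the model, the whole conversation, and the sampling params."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ExactCache:
    def __init__(self):
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(self, model, messages, call, **params):
        key = cache_key(model, messages, **params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        self._store[key] = call(model, messages)
        return self._store[key]


def fake_llm(model, messages):
    # Stand-in for the upstream provider call.
    return f"answer to: {messages[-1]['content']}"


cache = ExactCache()
faq = [{"role": "user", "content": "What is your refund policy?"}]
cache.get_or_call("cheap", faq, fake_llm)   # miss: first time seen
cache.get_or_call("cheap", faq, fake_llm)   # hit: identical request

# Same final question, but a different history, so it must not reuse the entry.
chat = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
    {"role": "user", "content": "What is your refund policy?"},
]
cache.get_or_call("cheap", chat, fake_llm)  # miss: history differs
```

Keying on the full history is what keeps multi-turn chats from getting another conversation's answer.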
3. Per-team / per-feature virtual keys
sk-team-billing   → max $200/month,  models: [haiku, sonnet]
sk-team-research  → max $5000/month, models: [opus, sonnet]
sk-public-demo    → max $10/day,     models: [haiku]
Each team’s key has its own budget, model allowlist, and rate limit. Centralized policy.
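A rough sketch of how such a policy check could look, assuming the gateway keeps a running spend per key. `VirtualKey`, `authorize`, and `record` are hypothetical names, and the budget period (monthly or daily reset) is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class VirtualKey:
    """Hypothetical per-key policy: a budget plus a model allowlist."""

    name: str
    max_budget_usd: float
    allowed_models: frozenset
    spent_usd: float = 0.0

    def authorize(self, model: str, estimated_cost_usd: float) -> tuple[bool, str]:
        if model not in self.allowed_models:
            return False, f"model '{model}' not allowed for {self.name}"
        if self.spent_usd + estimated_cost_usd > self.max_budget_usd:
            return False, f"budget exceeded for {self.name}"
        return True, "ok"

    def record(self, cost_usd: float) -> None:
        # Called after the upstream responds, with the actual cost.
        self.spent_usd += cost_usd


KEYS = {
    "sk-team-billing": VirtualKey("billing", 200.0, frozenset({"haiku", "sonnet"})),
    "sk-public-demo": VirtualKey("demo", 10.0, frozenset({"haiku"})),
}

demo = KEYS["sk-public-demo"]
ok_model, _ = demo.authorize("haiku", 0.02)                  # allowed
denied_model, reason_model = demo.authorize("opus", 0.02)    # not on the allowlist
demo.record(9.99)                                            # nearly exhaust the budget
denied_budget, reason_budget = demo.authorize("haiku", 0.02)
```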
4. Prompt management
Treat prompts like code. The gateway stores versions; your app references by ID. Update the prompt, A/B test in production, roll back if eval drops. Portkey and Helicone both ship this.
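As an illustration of the idea, here is a tiny in-memory prompt registry with versioning and rollback. `PromptRegistry` and its methods are hypothetical; this is not Portkey's or Helicone's API, just the shape of the workflow.

```python
class PromptRegistry:
    """Hypothetical store: each prompt ID has numbered versions, one active."""

    def __init__(self):
        self._versions: dict[str, list[str]] = {}
        self._active: dict[str, int] = {}

    def publish(self, prompt_id: str, template: str) -> int:
        versions = self._versions.setdefault(prompt_id, [])
        versions.append(template)
        self._active[prompt_id] = len(versions)   # versions are 1-indexed
        return len(versions)

    def rollback(self, prompt_id: str, version: int) -> None:
        if not 1 <= version <= len(self._versions[prompt_id]):
            raise ValueError(f"unknown version {version} for {prompt_id}")
        self._active[prompt_id] = version

    def render(self, prompt_id: str, **variables) -> str:
        template = self._versions[prompt_id][self._active[prompt_id] - 1]
        return template.format(**variables)


registry = PromptRegistry()
registry.publish("support-reply", "Answer politely: {question}")
registry.publish("support-reply", "Answer in one sentence: {question}")
v2_text = registry.render("support-reply", question="Where is my order?")

# Eval scores dropped on v2, so point the ID back at v1 without a deploy.
registry.rollback("support-reply", 1)
v1_text = registry.render("support-reply", question="Where is my order?")
```

The app only ever references `"support-reply"`, which is what makes the rollback a config change rather than a code change.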
5. Audit trail
Every request is logged with timestamp, user, prompt, response, tokens, and cost. For compliance work (HIPAA, GDPR, SOC 2), this record is non-negotiable, and the gateway produces it as a side effect of proxying the traffic.
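A minimal sketch of what one audit record could contain, written as a JSON line. The `AuditRecord` fields mirror the list above; the `PRICES` table is a made-up placeholder, so look up real per-token prices for the models you use.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

# Hypothetical prices in USD per million tokens, for illustration only.
PRICES = {"cheap": {"input": 1.00, "output": 5.00}}


@dataclass
class AuditRecord:
    timestamp: str
    user: str
    model: str
    prompt: str
    response: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


def make_record(user, model, prompt, response, input_tokens, output_tokens):
    price = PRICES[model]
    cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
    return AuditRecord(
        timestamp=datetime.now(timezone.utc).isoformat(),
        user=user,
        model=model,
        prompt=prompt,
        response=response,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        cost_usd=round(cost, 6),
    )


record = make_record("user-42", "cheap", "hi", "hello!", input_tokens=1200, output_tokens=300)
line = json.dumps(asdict(record))  # one JSON line per request, ready for a log sink
```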
A real production setup
A typical 2026 setup for a team running multiple LLM features:
             ┌──────────┐
App A ──────▶│ LiteLLM  │──▶ Anthropic
App B ──────▶│ (self-   │──▶ OpenAI
Agent C ────▶│ hosted)  │──▶ Self-hosted vLLM
             └─────┬────┘
                   │ (logs)
                   ▼
             ┌──────────┐
             │ Helicone │  observability
             └──────────┘
                   │
                   ▼
             ┌──────────┐
             │ Postgres │  cost attribution table
             └──────────┘
LiteLLM as the proxy. Helicone for dashboards. Postgres for the per-team / per-feature cost rollups. ~$50/month total for a startup; ~$500/month at small-mid scale.
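The per-team / per-feature rollup is a plain `GROUP BY`. The sketch below uses an in-memory SQLite database as a stand-in for Postgres so it runs anywhere; the `llm_calls` table, its columns, and the sample rows are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the Postgres instance
conn.execute(
    "CREATE TABLE llm_calls (team TEXT, feature TEXT, model TEXT, cost_usd REAL)"
)
conn.executemany(
    "INSERT INTO llm_calls VALUES (?, ?, ?, ?)",
    [
        ("billing", "invoice-summary", "cheap", 0.5),
        ("billing", "invoice-summary", "smart", 1.5),
        ("research", "paper-qa", "smart", 4.0),
    ],
)

# One row per (team, feature): call count and total spend, biggest spender first.
rollup = conn.execute(
    """
    SELECT team, feature, COUNT(*) AS calls, SUM(cost_usd) AS total_usd
    FROM llm_calls
    GROUP BY team, feature
    ORDER BY total_usd DESC
    """
).fetchall()
```

The same query should work on Postgres unchanged once the gateway's request logs are loaded into a table of this shape.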
Where this fits in your stack
If you’re building anything LLM-shaped:
- Behind your application code: an AI gateway.
- Behind the gateway: providers (Anthropic, OpenAI, self-hosted).
- For observability across services: OpenTelemetry. See OpenTelemetry End-to-End.
- For agent orchestration: LangGraph / Temporal. See AI Agents with LangGraph and Temporal Durable Execution.
- For evaluation: an eval harness in CI. See LLM Evaluations.
Common mistakes
1. Skipping the gateway “because we use one provider”
Then your provider has an outage and your app is down. Or pricing changes and you have no levers. The cost of adopting a gateway later is much higher than starting with one.
2. Caching conversational chats
Caching is great for stateless prompts. For multi-turn chats, you’ll get weird behavior unless you scope cache by full conversation history. Most gateways handle this; verify your config.
3. Storing PII in cache or logs
Your gateway sees full prompts and responses. If those contain PII, you have a compliance surface. Either redact before sending or pick a self-hosted gateway in your VPC.
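If you go the redaction route, a first pass can be as simple as a few regular expressions applied before the prompt leaves your service. The patterns below (email addresses and US Social Security numbers) are illustrative assumptions and far from complete; production redaction usually needs a dedicated PII detection tool.

```python
import re

# Deliberately simple patterns: they will miss many real-world formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Replace matched PII with placeholders before logging or sending."""
    text = EMAIL.sub("[EMAIL]", text)
    return US_SSN.sub("[SSN]", text)


redacted = redact("Contact jane.doe@example.com, SSN 123-45-6789.")
```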
4. Trusting one fallback chain forever
Provider quality moves. Run evals through the gateway and re-rank fallbacks based on quality + cost + latency. Don’t set and forget.
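One way to do the re-ranking is a weighted score over eval quality, cost, and latency. The sketch below is an assumption about how you might combine them: the weights, the `Candidate` fields, the model names, and all the numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    model: str
    quality: float        # eval pass rate, 0..1 (higher is better)
    cost_per_1k: float    # USD per 1k requests (lower is better)
    p95_latency_s: float  # seconds (lower is better)


def rank(cands, w_quality=0.6, w_cost=0.25, w_latency=0.15):
    """Return model names ordered best-first by a weighted score."""
    max_cost = max(c.cost_per_1k for c in cands)
    max_lat = max(c.p95_latency_s for c in cands)

    def score(c):
        # Cost and latency are normalized against the worst candidate
        # and inverted, so cheaper and faster models score higher.
        return (
            w_quality * c.quality
            + w_cost * (1 - c.cost_per_1k / max_cost)
            + w_latency * (1 - c.p95_latency_s / max_lat)
        )

    return [c.model for c in sorted(cands, key=score, reverse=True)]


candidates = [
    Candidate("model-a", quality=0.92, cost_per_1k=12.0, p95_latency_s=4.0),
    Candidate("model-b", quality=0.90, cost_per_1k=3.0, p95_latency_s=2.0),
    Candidate("model-c", quality=0.70, cost_per_1k=1.0, p95_latency_s=1.0),
]
order = rank(candidates)
```

With these made-up numbers the slightly weaker but much cheaper `model-b` ranks first, which is the kind of shift a set-and-forget chain would never pick up.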
5. Ignoring streaming
Many apps need SSE token streaming. Verify your gateway proxies streaming responses correctly — most do, some have edge cases. Test before shipping.
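A cheap way to test this is to capture the raw SSE lines that come back through the gateway and check that they reassemble into the expected text. The sketch below assumes Chat Completions style chunks (`data: {...}` lines carrying `choices[].delta.content`, terminated by `data: [DONE]`); `iter_deltas` and the captured sample are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from Chat Completions style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and ':' comment keep-alives
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return    # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                yield content


# Sample of what a capture through the gateway might look like.
captured = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    ": keep-alive",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
    'data: {"choices": [{"delta": {"content": "ignored"}}]}',
]
text = "".join(iter_deltas(captured))
```

If the gateway buffers the whole response, drops the `[DONE]` sentinel, or merges chunks oddly, a check like this tends to surface it before users do.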
Read this next
- Anthropic Claude API + Tool Use Guide
- Self-Hosted LLMs in 2026 — Ollama, vLLM
- LLM Evaluations
- SSE vs WebSockets in 2026 — the streaming layer of the gateway you’ll need to verify.
If you want a Docker Compose with LiteLLM + Helicone + Postgres + Redis wired together as a starter AI gateway stack, it’s at rajpoot.dev.