In 2026 every serious LLM application sits behind an AI gateway. The gateway gives you one API surface for many providers, automatic fallback, caching, rate limiting, observability, and cost tracking with clean per-team attribution. This post is a working guide to picking and using one.

What an AI gateway does

              ┌──────────────────────────┐
   App ──────▶│   AI Gateway             │
              │   • Auth & rate limit    │
              │   • Routing & fallback   │
              │   • Caching              │
              │   • Cost tracking        │
              │   • Observability        │
              └──────┬───────────────────┘
       ┌─────────────┼──────────────┐
       ▼             ▼              ▼
   Anthropic      OpenAI         Google
                                 / vLLM / Together

Your app talks to one URL. The gateway:

  • Routes — picks the right provider/model per request.
  • Falls back — if Anthropic is down, retry with OpenAI.
  • Caches — same prompt → cached response.
  • Rate limits — per team, per model, per dollar.
  • Tracks — every call logged for billing and debugging.
  • Standardizes — OpenAI-compatible API, regardless of upstream.

The “OpenAI-compatible façade” is the common shape: gateways accept the Chat Completions API; your app uses the OpenAI SDK with base_url= pointing at the gateway.

Why you want one

Even for a single-provider single-model app:

  • Caching — repeated prompts hit cache, cut cost 30–80%.
  • Retry/fallback — one provider being down doesn’t take you down.
  • Observability — every call traced; per-user / per-feature attribution.
  • Cost control — set hard limits per team / API key.
  • Rate limiting — keep a traffic spike in your own service from exhausting your provider quota or budget.
  • Easy provider swap — when a model retires, change config not code.

The cost is one extra network hop (~5–20 ms). Against LLM calls that take hundreds of milliseconds to several seconds, this is a rounding error for most apps.

The contenders

LiteLLM (open source, self-host or managed)

  • Best for: Teams that want control. Self-host on a small VM or Kubernetes.
  • Strengths: Supports 100+ providers, OpenAI-compatible, virtual keys, rate limit + cost budget per key.
  • Weaknesses: Operate it yourself; UI is functional, not delightful.

# config.yaml
model_list:
  - model_name: smart            # alias your app uses
    litellm_params:
      model: anthropic/claude-sonnet-4-6   # provider/model form
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: smart-fallback
    litellm_params:
      model: openai/gpt-5-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: cheap
    litellm_params:
      model: anthropic/claude-haiku-4-5
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  routing_strategy: simple-shuffle
  fallbacks:
    - smart: [smart-fallback]    # alias: ordered list of aliases to try next

litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: redis

# Your app: the OpenAI SDK pointed at LiteLLM
from openai import OpenAI

client = OpenAI(base_url="http://litellm:4000", api_key="sk-virtual-team-1")

resp = client.chat.completions.create(
    model="smart",
    messages=[{"role": "user", "content": "hi"}],
)

LiteLLM is the default I reach for when I want a self-hostable gateway. The Helm chart is solid, and it runs on minimal resources.

Portkey (managed)

  • Best for: Teams that want a polished dashboard and minimal ops.
  • Strengths: Beautiful observability dashboard, prompt management, virtual keys, automatic retries, and semantic caching.
  • Weaknesses: Closed source; subscription cost.

Portkey is the SaaS gateway I’d pick for a fast-moving team that doesn’t want to think about gateway operations.

Helicone (managed, observability-focused)

  • Best for: Teams that want observability above all.
  • Strengths: Excellent traces, cost analytics, prompt versioning, plays nicely with LangChain/LangGraph.
  • Weaknesses: Less of a routing/fallback gateway; more of an LLM-traffic-observer with proxy bonus.

Pair Helicone with LiteLLM if you want both worlds: LiteLLM for routing, Helicone for visibility. Or use Helicone alone if you mostly care about “what did our app call, when, why.”

OpenRouter (managed router)

  • Best for: Multi-provider experimentation with one billing relationship.
  • Strengths: 100+ models behind one API key, including open-source via various hosts.
  • Weaknesses: Less control over per-team policy; primarily a router, less of a full gateway.

Great for indie devs and small teams that want access to many models without separate accounts.

Patterns you’ll add quickly

1. Fallback chain

primary: claude-sonnet-4-6
fallback 1: gpt-5-mini   (the gateway adapts the payload for each provider)
fallback 2: claude-haiku-4-5

When the primary 5xxs or rate-limits, the gateway tries the fallback. Your app sees a successful response either way.
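
To make the fallback behavior concrete, here is a minimal sketch of the control flow in Python. It is not how any particular gateway implements it: `UpstreamError`, `call_with_fallback`, and the two stand-in provider functions are hypothetical names invented for illustration, and a real gateway would make HTTP calls where the stand-ins return canned values.

```python
from typing import Callable, Optional, Sequence


class UpstreamError(Exception):
    """Hypothetical error carrying the HTTP status of a failed upstream call."""

    def __init__(self, status: int):
        super().__init__(f"upstream returned {status}")
        self.status = status


def is_retryable(status: int) -> bool:
    # Rate limits (429) and server errors (5xx) are worth a fallback.
    return status == 429 or 500 <= status < 600


def call_with_fallback(
    chain: Sequence[tuple[str, Callable[[str], str]]], prompt: str
) -> tuple[str, str]:
    """Try each (model_name, call) pair in order; return (model_name, reply)."""
    last_error: Optional[Exception] = None
    for name, call in chain:
        try:
            return name, call(prompt)
        except UpstreamError as err:
            if not is_retryable(err.status):
                raise  # e.g. a 400 is a bad request; another model will not fix it
            last_error = err
    raise RuntimeError("all models in the fallback chain failed") from last_error


# Stand-in providers for the demo.
def primary_down(prompt: str) -> str:
    raise UpstreamError(503)


def fallback_ok(prompt: str) -> str:
    return f"echo: {prompt}"


chain = [("claude-sonnet-4-6", primary_down), ("gpt-5-mini", fallback_ok)]
used_model, reply = call_with_fallback(chain, "hi")
print(used_model, reply)
```

The one design choice worth copying is the retryable check: falling back on a client error just repeats the same bad request against a second provider.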

2. Caching

Two flavors:

  • Exact match. Identical prompt → identical response. Great for FAQ-style use.
  • Semantic. Embedding-based similarity. Risky for conversational use; safe for “what’s the capital of France.”

A cache hit rate of 30%+ on production LLM apps is common. If cached requests are about as expensive as the rest, that’s roughly 30% of your bill, gone.
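
A minimal sketch of the exact-match flavor, assuming the cache key is a hash of everything that can change the answer (model, messages, sampling parameters). `cache_key`, `ExactMatchCache`, and `fake_llm` are hypothetical names; an in-memory dict stands in for the Redis store a real gateway would use.

```python
import hashlib
import json
from typing import Callable


def cache_key(model: str, messages: list, **params) -> str:
    """Hash the model, the messages, and the sampling parameters together."""
    payload = {"model": model, "messages": messages, "params": params}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ExactMatchCache:
    def __init__(self) -> None:
        self._store: dict = {}  # stand-in for Redis; no TTL or eviction here
        self.hits = 0
        self.misses = 0

    def get_or_call(
        self, model: str, messages: list, call: Callable[[str, list], str], **params
    ) -> str:
        key = cache_key(model, messages, **params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        reply = call(model, messages)
        self._store[key] = reply
        return reply


upstream_calls = []  # records every call that actually reaches the "provider"


def fake_llm(model: str, messages: list) -> str:
    upstream_calls.append(model)
    return "Paris"


cache = ExactMatchCache()
question = [{"role": "user", "content": "What's the capital of France?"}]
cache.get_or_call("cheap", question, fake_llm, temperature=0)    # miss
cache.get_or_call("cheap", question, fake_llm, temperature=0)    # hit
cache.get_or_call("cheap", question, fake_llm, temperature=0.7)  # miss: params differ
print(cache.hits, cache.misses, len(upstream_calls))
```

Including the sampling parameters in the key is deliberate: the same prompt at a different temperature is a different request.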

3. Per-team / per-feature virtual keys

sk-team-billing → max $200/month, models: [haiku, sonnet]
sk-team-research → max $5000/month, models: [opus, sonnet]
sk-public-demo → max $10/day, models: [haiku]

Each team’s key has its own budget, model allowlist, and rate limit. Centralized policy.
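
As a rough sketch of what the gateway checks on each request, here is the policy above as Python. `VirtualKey`, `authorize`, and `record_spend` are hypothetical names, the spend counter lives in memory, and resetting it at the end of each budget window is left out.

```python
from dataclasses import dataclass


@dataclass
class VirtualKey:
    budget_usd: float          # spend cap per budget window
    window: str                # "month" or "day"
    allowed_models: frozenset  # model allowlist
    spent_usd: float = 0.0     # spend so far in the current window


KEYS = {
    "sk-team-billing": VirtualKey(200.0, "month", frozenset({"haiku", "sonnet"})),
    "sk-team-research": VirtualKey(5000.0, "month", frozenset({"opus", "sonnet"})),
    "sk-public-demo": VirtualKey(10.0, "day", frozenset({"haiku"})),
}


def authorize(api_key: str, model: str, estimated_cost_usd: float) -> tuple[bool, str]:
    """Return (allowed, reason) for a request made with this virtual key."""
    key = KEYS.get(api_key)
    if key is None:
        return False, "unknown key"
    if model not in key.allowed_models:
        return False, f"model {model!r} not allowed for this key"
    if key.spent_usd + estimated_cost_usd > key.budget_usd:
        return False, f"budget of ${key.budget_usd:.2f}/{key.window} exceeded"
    return True, "ok"


def record_spend(api_key: str, cost_usd: float) -> None:
    KEYS[api_key].spent_usd += cost_usd


denied_model = authorize("sk-public-demo", "opus", 0.01)
record_spend("sk-public-demo", 9.99)
denied_budget = authorize("sk-public-demo", "haiku", 0.05)
allowed = authorize("sk-team-research", "opus", 1.00)
print(denied_model, denied_budget, allowed)
```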

4. Prompt management

Treat prompts like code. The gateway stores versions; your app references by ID. Update the prompt, A/B test in production, roll back if eval drops. Portkey and Helicone both ship this.
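
The versioning idea can be sketched as a tiny in-memory registry. This is not Portkey's or Helicone's API; `PromptRegistry` and its methods are hypothetical, and it only shows the publish / reference-by-ID / roll-back loop.

```python
class PromptRegistry:
    """Stores every version of a prompt; the app only ever references the ID."""

    def __init__(self) -> None:
        self._versions: dict = {}  # prompt_id -> list of templates, oldest first
        self._active: dict = {}    # prompt_id -> active version number (1-based)

    def publish(self, prompt_id: str, template: str) -> int:
        versions = self._versions.setdefault(prompt_id, [])
        versions.append(template)
        self._active[prompt_id] = len(versions)
        return len(versions)

    def render(self, prompt_id: str, **variables) -> str:
        version = self._active[prompt_id]
        return self._versions[prompt_id][version - 1].format(**variables)

    def rollback(self, prompt_id: str) -> int:
        """Point the ID back at the previous version; history is kept."""
        if self._active[prompt_id] <= 1:
            raise ValueError("no earlier version to roll back to")
        self._active[prompt_id] -= 1
        return self._active[prompt_id]


registry = PromptRegistry()
registry.publish("summarize", "Summarize: {text}")
registry.publish("summarize", "Summarize in one sentence: {text}")
with_v2 = registry.render("summarize", text="gateways")
registry.rollback("summarize")  # e.g. the eval score dropped on v2
with_v1 = registry.render("summarize", text="gateways")
print(with_v2, "|", with_v1)
```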

5. Audit trail

Every request is logged with: timestamp, user, prompt, response, tokens, cost. For compliance (HIPAA, GDPR, SOC 2), this is non-negotiable. The gateway gives it to you for free.
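
For a sense of what one audit record might look like, here is a sketch that builds a JSON log line with the fields listed above. The `AuditRecord` shape is an assumption, not any gateway's actual schema, and the prices in `PRICE_PER_MILLION_USD` are made-up numbers for a placeholder model.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

# Made-up prices in USD per million tokens, for illustration only.
PRICE_PER_MILLION_USD = {"example-model": {"input": 3.00, "output": 15.00}}


@dataclass
class AuditRecord:
    timestamp: str
    user: str
    model: str
    prompt: str
    response: str
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float


def call_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    price = PRICE_PER_MILLION_USD[model]
    total = prompt_tokens * price["input"] + completion_tokens * price["output"]
    return total / 1_000_000


def make_record(user, model, prompt, response, prompt_tokens, completion_tokens):
    return AuditRecord(
        timestamp=datetime.now(timezone.utc).isoformat(),
        user=user,
        model=model,
        prompt=prompt,
        response=response,
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        cost_usd=round(call_cost_usd(model, prompt_tokens, completion_tokens), 6),
    )


record = make_record("user-42", "example-model", "hi", "hello!", 1000, 500)
log_line = json.dumps(asdict(record))  # one JSON object per request
print(log_line)
```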

A real production setup

A typical 2026 setup for a team running multiple LLM features:

                ┌──────────┐
   App A ──────▶│ LiteLLM  │──▶ Anthropic
   App B ──────▶│ (self-   │──▶ OpenAI
   Agent C ────▶│  hosted) │──▶ Self-hosted vLLM
                └─────┬────┘
                      │ (logs)
                ┌──────────┐
                │ Helicone │  observability
                └──────────┘
                ┌──────────┐
                │ Postgres │  cost attribution table
                └──────────┘

LiteLLM as the proxy. Helicone for dashboards. Postgres for the per-team / per-feature cost rollups. ~$50/month total for a startup; ~$500/month at small-mid scale.
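
The cost rollup itself is a plain GROUP BY. The sketch below uses an in-memory SQLite database as a stand-in for Postgres so it runs anywhere; the `llm_calls` table, its columns, and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the Postgres instance
conn.execute(
    """
    CREATE TABLE llm_calls (
        team     TEXT NOT NULL,
        feature  TEXT NOT NULL,
        model    TEXT NOT NULL,
        cost_usd REAL NOT NULL
    )
    """
)

# Sample rows; in practice the gateway's request log feeds this table.
rows = [
    ("billing", "invoice-summary", "haiku", 0.012),
    ("billing", "invoice-summary", "haiku", 0.008),
    ("research", "paper-qa", "opus", 0.150),
    ("research", "paper-qa", "sonnet", 0.050),
    ("research", "search", "sonnet", 0.020),
]
conn.executemany("INSERT INTO llm_calls VALUES (?, ?, ?, ?)", rows)

rollup = conn.execute(
    """
    SELECT team, feature, COUNT(*) AS calls, ROUND(SUM(cost_usd), 6) AS total_usd
    FROM llm_calls
    GROUP BY team, feature
    ORDER BY total_usd DESC, team, feature
    """
).fetchall()

for team, feature, calls, total_usd in rollup:
    print(f"{team:10s} {feature:16s} calls={calls} total=${total_usd:.3f}")
```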

Common mistakes

1. Skipping the gateway “because we use one provider”

Then your provider has an outage and your app is down. Or pricing changes and you have no levers. The cost of adopting a gateway later is much higher than starting with one.

2. Caching conversational chats

Caching is great for stateless prompts. For multi-turn chats, you’ll serve replies that belong to a different conversation unless the cache key covers the full conversation history. Most gateways handle this; verify your config.
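
A small sketch of why the scope matters. Two unrelated chats end with the same user turn: a key built from the last message alone collides, while a key built from the whole history does not. Both key functions are hypothetical illustrations, not a specific gateway's behavior.

```python
import hashlib
import json


def _digest(obj) -> str:
    canonical = json.dumps(obj, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def key_last_message_only(messages: list) -> str:
    """Unsafe for chat: ignores everything before the final turn."""
    return _digest(messages[-1])


def key_full_history(messages: list) -> str:
    """Safe for chat: any difference in earlier turns changes the key."""
    return _digest(messages)


chat_a = [
    {"role": "user", "content": "Should I delete the staging database?"},
    {"role": "assistant", "content": "Do you have a recent backup?"},
    {"role": "user", "content": "Yes."},
]
chat_b = [
    {"role": "user", "content": "Can you translate this email into French?"},
    {"role": "assistant", "content": "Sure. Should I keep the formal tone?"},
    {"role": "user", "content": "Yes."},
]

collides = key_last_message_only(chat_a) == key_last_message_only(chat_b)
distinct = key_full_history(chat_a) != key_full_history(chat_b)
print(collides, distinct)
```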

3. Storing PII in cache or logs

Your gateway sees full prompts and responses. If those contain PII, you have a compliance surface. Either redact before sending or pick a self-hosted gateway in your VPC.
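
A sketch of the redact-before-sending option. The two patterns below (email addresses and US-style SSNs) are illustrative only and nowhere near a complete PII detector; `redact` is a hypothetical helper you would call on every message before it leaves your service.

```python
import re

# Illustrative patterns only; real PII detection needs far more coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Replace matches with placeholders so raw values never reach cache or logs."""
    text = EMAIL.sub("[EMAIL]", text)
    text = SSN.sub("[SSN]", text)
    return text


def redact_messages(messages: list) -> list:
    # Return new dicts so the caller's originals are left untouched.
    return [{**m, "content": redact(m["content"])} for m in messages]


original = [{"role": "user", "content": "Contact jane.doe@example.com, SSN 123-45-6789."}]
safe = redact_messages(original)
print(safe[0]["content"])
```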

4. Trusting one fallback chain forever

Provider quality moves. Run evals through the gateway and re-rank fallbacks based on quality + cost + latency. Don’t set and forget.
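
One way to re-rank is a weighted score over eval quality, cost, and latency. The sketch below is an assumption about how you might combine them, not a standard formula: the weights, the `Candidate` fields, and every number in the sample data are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    eval_pass_rate: float          # 0..1, from your own eval suite
    cost_per_1k_tokens_usd: float
    p95_latency_s: float


def rank(candidates, w_quality=0.6, w_cost=0.2, w_latency=0.2) -> list:
    """Sort best-first: reward quality, penalize relative cost and latency."""
    max_cost = max(c.cost_per_1k_tokens_usd for c in candidates)
    max_latency = max(c.p95_latency_s for c in candidates)

    def score(c: Candidate) -> float:
        return (
            w_quality * c.eval_pass_rate
            - w_cost * c.cost_per_1k_tokens_usd / max_cost
            - w_latency * c.p95_latency_s / max_latency
        )

    return sorted(candidates, key=score, reverse=True)


# Invented measurements for three placeholder models.
candidates = [
    Candidate("model-a", eval_pass_rate=0.92, cost_per_1k_tokens_usd=0.015, p95_latency_s=2.0),
    Candidate("model-b", eval_pass_rate=0.88, cost_per_1k_tokens_usd=0.003, p95_latency_s=1.0),
    Candidate("model-c", eval_pass_rate=0.70, cost_per_1k_tokens_usd=0.001, p95_latency_s=0.5),
]
ranked_names = [c.name for c in rank(candidates)]
print(ranked_names)
```

With these weights the strongest model lands last because it is also the most expensive and slowest; shifting weight toward quality moves it back up, which is the knob to revisit on each eval run.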

5. Ignoring streaming

Many apps need SSE token streaming. Verify your gateway proxies streaming responses correctly — most do, some have edge cases. Test before shipping.
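
A quick streaming check can be as simple as counting how many content chunks arrive. The sketch below assumes chunks shaped like the OpenAI SDK's streamed Chat Completions chunks (`chunk.choices[0].delta.content`); `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake stream stands in for a live gateway so the example runs offline.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> tuple[str, int]:
    """Join streamed deltas; return (full_text, number_of_content_chunks)."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some servers end with a choices-free chunk
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # role-only and final chunks carry no content
            parts.append(delta)
    return "".join(parts), len(parts)


def fake_chunk(text):
    """Build an object with the same attribute path as an SDK stream chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    fake_chunk(None),              # role-only first chunk
    fake_chunk("Hel"),
    fake_chunk("lo"),
    fake_chunk(None),              # finish chunk
    SimpleNamespace(choices=[]),   # trailing chunk without choices
]
text, n_chunks = collect_stream(fake_stream)
print(text, n_chunks)
```

Against a real gateway, pass the result of `client.chat.completions.create(model="smart", messages=[...], stream=True)` to `collect_stream`. A long reply that arrives as a single content chunk suggests something in the path is buffering instead of streaming.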

Read this next

If you want a Docker Compose with LiteLLM + Helicone + Postgres + Redis wired together as a starter AI gateway stack, it’s at rajpoot.dev.

