Deploying LLMs is its own discipline. The architecture choices compound: provider mix, inference server, routing, fallbacks, scaling. This post is the working playbook.

Architectures

[App] → [Router] → [Frontier API (Anthropic/OpenAI)]
                 → [Self-hosted vLLM (fine-tuned 8B)]
                 → [Embedding service]
                 → [Cache (Redis)]
                 → [Fallback chain]

Most production setups are hybrid. Frontier APIs handle the hard cases; self-hosted handles the high-volume narrow ones.

When to API

  • Volume < 1B tokens/month.
  • Latest model required.
  • Spiky load (autoscaling provider).
  • No GPU ops capacity.

When to self-host

  • Volume > 1B tokens/month sustained.
  • Fine-tuned model.
  • Data residency / compliance.
  • Cost-critical narrow use cases.

vLLM

pip install vllm

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching

OpenAI-compatible API on :8000. Batching, paged attention, prefix caching — best throughput available.

For multiple models / adapters:

vllm serve base-model --enable-lora --lora-modules a=/path/a b=/path/b

Switch adapter per request. See LLM Fine-Tuning .

Routing layer

async def route(query, complexity):
    if complexity == "trivial":
        return await self_hosted("haiku-fine-tuned", query)
    if complexity == "moderate":
        return await api("claude-haiku-4-5", query)
    return await api("claude-sonnet-4-6", query)

Router by classification. See LLM Routing .

Fallbacks

When the primary model is degraded:

async def with_fallback(prompt):
    for model in [PRIMARY, SECONDARY, TERTIARY]:
        try:
            return await call(model, prompt)
        except (Timeout, RateLimit, Overloaded):
            continue
    raise AllProvidersDown

Multi-provider lets you survive outages: Anthropic + OpenAI + Bedrock as fallback chain.

Autoscaling GPUs

# Kubernetes HPA on GPU pods
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: vllm-llama }
spec:
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric: { name: vllm_inflight_requests }
        target: { type: AverageValue, averageValue: "32" }

Scale on in-flight requests, not just CPU. Or use Knative Serving / KServe for scale-to-zero on idle.

GPU cold start is 30–60s. Use a warm pool (min replicas > 0) for latency-sensitive workloads.

Cost economics

For Llama 8B on a single A100:

  • ~2k tokens/sec sustained throughput.
  • A100 cost: ~$1.50/hr spot.
  • Cost per 1M tokens: ~$0.21.

Compare to Sonnet API at $3/1M input. ~14× cheaper IF you can saturate the GPU.

If your A100 is at 10% utilization: cheaper per token… still $1.50/hr regardless. Self-host pays only with sustained load.

Caching layer

async def cached(prompt):
    h = sha256(prompt.encode()).hexdigest()
    if hit := await redis.get(f"llm:{h}"):
        return hit.decode()
    result = await llm(prompt)
    await redis.set(f"llm:{h}", result, ex=86400)
    return result

Common queries served from Redis. Free.

For semantic caching (“similar but not identical”): embedding similarity. See Embeddings .

Streaming end-to-end

User → API gateway → vLLM. Don’t buffer. SSE chunks pass through.

@app.get("/chat")
async def chat(q: str):
    async def gen():
        async for chunk in vllm_stream(q):
            yield f"data: {chunk}\n\n"
    return StreamingResponse(gen(), media_type="text/event-stream")

See FastAPI Streaming .

Multi-region

For latency / DR:

Region US: vLLM cluster + frontier API
Region EU: vLLM cluster + frontier API
Region APAC: frontier API only (no GPU footprint)

Route by client region. Comply with data residency.

Observability

Every LLM call logged with:

  • Model.
  • Tokens (input/output/cached).
  • Latency (TTFT, total).
  • Cost computed.
  • User / session.

See LLM Observability .

Rate limiting

API providers enforce theirs; you enforce yours per user / tier:

async def with_rate_limit(user_id, tokens_estimated):
    if not await allow(f"llm:{user_id}", tokens_estimated):
        raise TooManyRequests
    return await llm(...)

Token-bucket on tokens (not requests) — one big request can exhaust budget. See Rate Limiter Design .

Health and circuit breakers

breaker = CircuitBreaker(failure_threshold=5)

async def safe_llm_call(...):
    return await breaker.call(llm.complete, ...)

Provider degraded? Trip the breaker; fall back. See Circuit Breakers .

Cost reporting

Per-feature dashboards:

SELECT feature, SUM(cost_usd) FROM llm_traces
WHERE ts > now() - interval '7 days' GROUP BY feature ORDER BY 2 DESC;

The 80/20 of spend reveals what to optimize. See LLM Cost Optimization .

Common mistakes

1. One provider only

Anthropic outage → your app is down. Use multi-provider with fallback.

2. Underutilized GPUs

Self-host for status; pay full GPU bill at 5% utilization. Hybrid or APIs unless you can saturate.

3. No caching

Same prompt 10k times/day; pay each time. Cache deterministic outputs.

4. No timeouts

LLM call hangs; thread blocked. Always set timeouts (30–60s typical).

5. No cost guardrails

A bug or runaway loop bills $100k overnight. Set per-user / per-feature daily caps.

What I’d ship today

For a new LLM-powered app:

  1. Frontier APIs (Anthropic + OpenAI) with fallback chain.
  2. Routing layer by complexity.
  3. Prompt + response caching in Redis.
  4. Per-user rate limits on tokens.
  5. Observability end-to-end.
  6. Cost dashboards with alerts.
  7. vLLM self-hosting added when one feature crosses cost-justification threshold.

Read this next

If you want my hybrid LLM deployment reference (Kubernetes + vLLM + router), it’s at rajpoot.dev .


Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .