Deploying LLMs is its own discipline. The architecture choices compound: provider mix, inference server, routing, fallbacks, scaling. This post is the working playbook.
Architectures
[App] → [Router] → [Frontier API (Anthropic/OpenAI)]
→ [Self-hosted vLLM (fine-tuned 8B)]
→ [Embedding service]
→ [Cache (Redis)]
→ [Fallback chain]
Most production setups are hybrid. Frontier APIs handle the hard cases; self-hosted handles the high-volume narrow ones.
When to API
- Volume < 1B tokens/month.
- Latest model required.
- Spiky load (autoscaling provider).
- No GPU ops capacity.
When to self-host
- Volume > 1B tokens/month sustained.
- Fine-tuned model.
- Data residency / compliance.
- Cost-critical narrow use cases.
vLLM
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 1 \
--max-num-seqs 256 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching
OpenAI-compatible API on :8000. Batching, paged attention, prefix caching — best throughput available.
For multiple models / adapters:
vllm serve base-model --enable-lora --lora-modules a=/path/a b=/path/b
Switch adapter per request. See LLM Fine-Tuning .
Routing layer
async def route(query, complexity):
if complexity == "trivial":
return await self_hosted("haiku-fine-tuned", query)
if complexity == "moderate":
return await api("claude-haiku-4-5", query)
return await api("claude-sonnet-4-6", query)
Router by classification. See LLM Routing .
Fallbacks
When the primary model is degraded:
async def with_fallback(prompt):
for model in [PRIMARY, SECONDARY, TERTIARY]:
try:
return await call(model, prompt)
except (Timeout, RateLimit, Overloaded):
continue
raise AllProvidersDown
Multi-provider lets you survive outages: Anthropic + OpenAI + Bedrock as fallback chain.
Autoscaling GPUs
# Kubernetes HPA on GPU pods
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: vllm-llama }
spec:
minReplicas: 1
maxReplicas: 10
metrics:
- type: Pods
pods:
metric: { name: vllm_inflight_requests }
target: { type: AverageValue, averageValue: "32" }
Scale on in-flight requests, not just CPU. Or use Knative Serving / KServe for scale-to-zero on idle.
GPU cold start is 30–60s. Use a warm pool (min replicas > 0) for latency-sensitive workloads.
Cost economics
For Llama 8B on a single A100:
- ~2k tokens/sec sustained throughput.
- A100 cost: ~$1.50/hr spot.
- Cost per 1M tokens: ~$0.21.
Compare to Sonnet API at $3/1M input. ~14× cheaper IF you can saturate the GPU.
If your A100 is at 10% utilization: cheaper per token… still $1.50/hr regardless. Self-host pays only with sustained load.
Caching layer
async def cached(prompt):
h = sha256(prompt.encode()).hexdigest()
if hit := await redis.get(f"llm:{h}"):
return hit.decode()
result = await llm(prompt)
await redis.set(f"llm:{h}", result, ex=86400)
return result
Common queries served from Redis. Free.
For semantic caching (“similar but not identical”): embedding similarity. See Embeddings .
Streaming end-to-end
User → API gateway → vLLM. Don’t buffer. SSE chunks pass through.
@app.get("/chat")
async def chat(q: str):
async def gen():
async for chunk in vllm_stream(q):
yield f"data: {chunk}\n\n"
return StreamingResponse(gen(), media_type="text/event-stream")
See FastAPI Streaming .
Multi-region
For latency / DR:
Region US: vLLM cluster + frontier API
Region EU: vLLM cluster + frontier API
Region APAC: frontier API only (no GPU footprint)
Route by client region. Comply with data residency.
Observability
Every LLM call logged with:
- Model.
- Tokens (input/output/cached).
- Latency (TTFT, total).
- Cost computed.
- User / session.
See LLM Observability .
Rate limiting
API providers enforce theirs; you enforce yours per user / tier:
async def with_rate_limit(user_id, tokens_estimated):
if not await allow(f"llm:{user_id}", tokens_estimated):
raise TooManyRequests
return await llm(...)
Token-bucket on tokens (not requests) — one big request can exhaust budget. See Rate Limiter Design .
Health and circuit breakers
breaker = CircuitBreaker(failure_threshold=5)
async def safe_llm_call(...):
return await breaker.call(llm.complete, ...)
Provider degraded? Trip the breaker; fall back. See Circuit Breakers .
Cost reporting
Per-feature dashboards:
SELECT feature, SUM(cost_usd) FROM llm_traces
WHERE ts > now() - interval '7 days' GROUP BY feature ORDER BY 2 DESC;
The 80/20 of spend reveals what to optimize. See LLM Cost Optimization .
Common mistakes
1. One provider only
Anthropic outage → your app is down. Use multi-provider with fallback.
2. Underutilized GPUs
Self-host for status; pay full GPU bill at 5% utilization. Hybrid or APIs unless you can saturate.
3. No caching
Same prompt 10k times/day; pay each time. Cache deterministic outputs.
4. No timeouts
LLM call hangs; thread blocked. Always set timeouts (30–60s typical).
5. No cost guardrails
A bug or runaway loop bills $100k overnight. Set per-user / per-feature daily caps.
What I’d ship today
For a new LLM-powered app:
- Frontier APIs (Anthropic + OpenAI) with fallback chain.
- Routing layer by complexity.
- Prompt + response caching in Redis.
- Per-user rate limits on tokens.
- Observability end-to-end.
- Cost dashboards with alerts.
- vLLM self-hosting added when one feature crosses cost-justification threshold.
Read this next
- Self-Hosted LLMs 2026 — vLLM, Ollama
- LLM Routing 2026
- LLM Cost Optimization 2026
- LLM Observability 2026
If you want my hybrid LLM deployment reference (Kubernetes + vLLM + router), it’s at rajpoot.dev .
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .