By 2026, open-weight models close enough to GPT-4-class are everywhere. Llama 3.3, Qwen 2.5, DeepSeek V3 all sit in the “good enough for most things” tier. The question is no longer “is the open model good enough?” but “is self-hosting it worth the operational cost?”
This post is the working answer. When to self-host, what to use (Ollama vs vLLM), how to size hardware, and the production patterns that make it pencil out.
Should you self-host?
The honest answer is usually no. The hosted APIs (Anthropic, OpenAI, Google) are excellent, billed by token, and require zero ops. For most apps, pip install anthropic is the right answer.
Self-host when one or more of these is true:
- Data residency / compliance — the data cannot leave your infrastructure.
- Latency — you need <100ms first-token; APIs are typically 200–500ms.
- Volume — you’re paying $50k+/month in API costs and have ML/infra capacity.
- Custom fine-tunes — you want to ship a model trained on your domain.
- No-internet operation — air-gapped or edge deployments.
For everything else, use the API. Self-hosting is real work.
The two engines that matter
Ollama — for dev, prototypes, internal tools
ollama pull llama3.3:70b-instruct-q4_K_M
ollama run llama3.3:70b-instruct-q4_K_M
That’s it. Single binary, GGUF models, llama.cpp under the hood. Runs on Mac, Linux, even mid-tier laptops with quantized models.
- Pros: Trivial setup. Sane defaults. Good for dev, hobby, internal tools, edge.
- Cons: Low throughput under concurrent load. Not built for production concurrency. KV cache reuse is limited.
For development and internal tools serving < ~5 concurrent users, Ollama is fine. Past that, switch.
vLLM — for production
pip install vllm
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--max-model-len 32768 \
--port 8000
vLLM is the production engine: continuous batching, PagedAttention KV cache, prefix caching. It serves OpenAI-compatible APIs out of the box.
- Pros: 5–20× higher throughput than naive serving. Solid production behavior. Active community.
- Cons: Requires GPUs (or reasonably modern CPUs for small models). More ops.
SGLang — the alternative
SGLang has caught up to vLLM in 2026 and surpassed it on certain workloads (long contexts, structured generation). Worth benchmarking against vLLM for your specific access pattern.
python -m sglang.launch_server --model meta-llama/Llama-3.3-70B-Instruct --port 30000
Same OpenAI-compatible API. For agent-style workloads with shared prompt prefixes and constrained outputs, SGLang’s RadixAttention (automatic KV cache reuse across requests) pays off.
Picking a model in 2026
Snapshot of the open-weight landscape:
| Model | Size | Use |
|---|---|---|
| Llama 3.3 70B | 70B | General-purpose, GPT-4-class on many benchmarks |
| Qwen 2.5 72B | 72B | Strong multilingual, coding |
| DeepSeek V3 | 671B (MoE) | Frontier-class, MoE serves at 37B activated |
| Mistral Large 2 | 123B | Strong reasoning, multilingual |
| Llama 3.1 8B | 8B | Workhorse small model |
| Qwen 2.5 7B | 7B | Strong small model, Apache 2.0 license |
| Phi-4 | 14B | Microsoft’s small model, good reasoning |
For production: a 70B-class model quantized to int4 or int8 usually hits the sweet spot between quality and cost. Drop to 7–14B for high-volume, low-stakes work (classification, extraction, summarization).
For agents with tool calls, get a model that’s been instruction-tuned for it — Llama 3.3 70B Instruct, Qwen 2.5 72B Instruct, or specifically tool-tuned variants.
Hardware sizing
Rough guidance for serving:
| Model size | Quant | VRAM needed | Typical GPU |
|---|---|---|---|
| 7B | bf16 | ~16 GB | L4, 4090, A10 |
| 7B | int4 | ~6 GB | T4, 3090 |
| 70B | bf16 | ~140 GB | 2× A100 80GB / 4× L40S |
| 70B | int4 | ~40 GB | 1× A100 80GB / 2× L40S |
| 671B (MoE) | int4 | ~335 GB | Multi-node A100/H100 |
Add ~20–50% for KV cache and concurrency headroom. Quantization is now a tax worth paying: the quality loss is usually <2% on real evals (do verify on your data).
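The VRAM column is mostly parameter count times bytes per parameter. Here is a minimal sketch of that arithmetic, assuming decimal gigabytes and the 20–50% headroom range above. The function names are illustrative, and real usage also depends on the engine, activation memory, and quantization metadata, which is why the table rounds up a little.

```python
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in decimal GB."""
    return params_billion * bits_per_param / 8


def vram_range_gb(params_billion: float, bits_per_param: int,
                  headroom=(0.20, 0.50)) -> tuple[float, float]:
    """Weights plus 20-50% headroom for KV cache and concurrency."""
    w = weight_gb(params_billion, bits_per_param)
    return (w * (1 + headroom[0]), w * (1 + headroom[1]))


for name, params, bits in [("7B bf16", 7, 16), ("70B bf16", 70, 16), ("70B int4", 70, 4)]:
    low, high = vram_range_gb(params, bits)
    print(f"{name}: weights {weight_gb(params, bits):.1f} GB, "
          f"with headroom {low:.0f}-{high:.0f} GB")
```

Use it as a first filter when shortlisting GPUs, then confirm with a real load test.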
For most teams, 2× L40S or 1× H100 serving a 70B int4 is the sweet spot. Cost: roughly $2/hour on the cloud spot market in 2026. Compare to your API spend.
Throughput — what to expect
vLLM serving Llama 3.3 70B int4 on 2×L40S:
- First token latency: 50–150 ms
- Tokens / sec / request: 50–100
- Concurrent throughput: 4000–8000 tok/s aggregated across 32 in-flight requests
- Cost per 1M output tokens: ~$0.20–0.50
Compare to hosted (rough 2026 prices):
- GPT-4o-mini: $0.60 / 1M output
- Claude Haiku 4.5: $1.25 / 1M output
- Claude Sonnet 4.6: $15 / 1M output
- Llama 3.3 70B (Together / Anyscale): $0.50–0.90 / 1M output
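The hosted-Llama prices are barely above DIY at moderate scale. Self-hosting wins decisively only when you keep the GPUs busy: a $2/hour node costs $48/day, which buys roughly 50–100M output tokens at hosted-Llama prices, so you need sustained volume well above that, or a non-cost reason.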
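The cost-per-token figure is just hourly price divided by sustained throughput. A rough sketch of that arithmetic follows, assuming the ~$2/hour node from the hardware section. The throughput values and the `utilization` parameter are illustrative, not measurements.

```python
def cost_per_million_tokens(dollars_per_hour: float, tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Dollar cost of 1M output tokens on a node billed by the hour."""
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return dollars_per_hour / tokens_per_hour * 1_000_000


def breakeven_tokens_per_day(dollars_per_hour: float,
                             hosted_price_per_million: float) -> float:
    """Daily output tokens at which an always-on node matches the hosted price."""
    return dollars_per_hour * 24 / hosted_price_per_million * 1_000_000


for tps in (1000, 2000, 4000, 8000):
    print(f"{tps} tok/s -> ${cost_per_million_tokens(2.0, tps):.3f} per 1M tokens")

print(f"break-even vs $0.70/1M: {breakeven_tokens_per_day(2.0, 0.70):,.0f} tokens/day")
```

At $2/hour, the break-even against a $0.70/1M hosted price is roughly 69 million output tokens per day, before counting any engineering time. Idle hours push the real number higher, which is what `utilization` models.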
A production deployment
# vllm-deployment.yaml (Kubernetes)
apiVersion: apps/v1
kind: Deployment
metadata: { name: vllm }
spec:
  replicas: 2
  selector: { matchLabels: { app: vllm } }
  template:
    metadata: { labels: { app: vllm } }
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.7.0
          args:
            - --model=meta-llama/Llama-3.3-70B-Instruct
            - --tensor-parallel-size=2
            - --max-model-len=32768
            - --gpu-memory-utilization=0.9
            - --enable-prefix-caching
            - --enable-chunked-prefill
          ports: [{ containerPort: 8000 }]
          resources:
            limits: { nvidia.com/gpu: "2" }
          readinessProbe:
            httpGet: { path: /health, port: 8000 }
            periodSeconds: 5
A few production knobs:
- enable-prefix-caching: KV cache reuse for shared prefixes (system prompts!). Massive win for chat.
- enable-chunked-prefill: better throughput under mixed prefill/decode workload.
- Two replicas, GPU affinity: for HA.
- Probes: /health is shipped by vLLM. Use it.
In front of vLLM:
- An LB or service mesh (Istio, Linkerd) for routing.
- A small Python or Go gateway for auth, rate limiting, prompt logging.
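The rate-limiting piece of that gateway can be small. Below is a minimal token-bucket sketch, one bucket per API key. The class, the `check_request` helper, and the default limits are all illustrative assumptions, and a multi-replica gateway would need shared state (Redis or similar) rather than an in-process dict.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


buckets: dict[str, TokenBucket] = {}


def check_request(api_key: str, rate: float = 5.0, capacity: int = 10) -> bool:
    """Per-key limiter: a False result means the gateway should answer HTTP 429."""
    if api_key not in buckets:
        buckets[api_key] = TokenBucket(rate, capacity)
    return buckets[api_key].allow()
```

Passing `cost` as the estimated token count instead of 1 turns this into a tokens-per-second limiter, which usually tracks GPU load better than requests-per-second.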
Treat it like an API
Self-hosting doesn’t mean “different API.” vLLM speaks the OpenAI shape:
from openai import OpenAI

client = OpenAI(base_url="http://vllm:8000/v1", api_key="local")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "..."}],
)
Same SDK as hosted. Move from API to self-host with a base URL change.
Streaming, structured outputs, tool use
vLLM/SGLang both support:
- Streaming via SSE — same API as OpenAI.
- Guided/constrained generation — pin output to a JSON schema, regex, or grammar. Free correctness for extraction tasks.
- Tool use — Llama 3.3 and Qwen 2.5 both have tool-call training; vLLM exposes the spec.
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[...],
    tools=[{...}],
    tool_choice="auto",
)
Almost everything you do against the OpenAI/Anthropic API works against vLLM. The patterns from Anthropic Claude API + Tool Use carry over.
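The part that is the same on every backend is the loop around the model: declare tools, run whatever the model asks for, feed the result back. Here is a sketch of the declare-and-dispatch half. `get_order_status`, `TOOLS`, `REGISTRY`, and `run_tool_call` are hypothetical names for illustration, not part of vLLM or the OpenAI SDK.

```python
import json


# Hypothetical tool for illustration only
def get_order_status(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}


# Tool declaration in the OpenAI function-calling shape
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the shipping status of an order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

REGISTRY = {"get_order_status": get_order_status}


def run_tool_call(call_id: str, name: str, arguments_json: str) -> dict:
    """Execute one tool call and build the 'tool' message to append to the chat."""
    if name not in REGISTRY:
        result = {"error": f"unknown tool: {name}"}
    else:
        try:
            result = REGISTRY[name](**json.loads(arguments_json))
        except (json.JSONDecodeError, TypeError) as exc:
            # Malformed or mismatched arguments go back to the model as an error
            result = {"error": str(exc)}
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}
```

With the OpenAI SDK, each entry in `resp.choices[0].message.tool_calls` carries `id`, `function.name`, and `function.arguments`, which map onto the three parameters. On the server side, vLLM needs tool calling switched on at launch (`--enable-auto-tool-choice` plus a `--tool-call-parser` suited to the model); check the docs for the parser name that matches yours.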
Scaling patterns
Multi-replica with sticky prefix routing
For chat with prefix caching, route the same conversation to the same replica so its KV cache hits. Use a hash of the conversation ID as the routing key.
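Here is one way to sketch that routing key. It uses rendezvous hashing rather than a plain hash-modulo, so removing a replica only moves the conversations that lived on it. The replica URLs are hypothetical, and in practice this logic usually sits in the gateway or the load balancer config.

```python
import hashlib


def pick_replica(conversation_id: str, replicas: list[str]) -> str:
    """Rendezvous (highest-random-weight) hashing.

    Every replica gets a score for this conversation; the highest score wins.
    The choice is stable while the winning replica stays in the list.
    """
    def score(replica: str) -> int:
        digest = hashlib.sha256(f"{replica}|{conversation_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(replicas, key=score)


# Hypothetical replica addresses
REPLICAS = ["http://vllm-0:8000/v1", "http://vllm-1:8000/v1"]
target_url = pick_replica("conversation-1234", REPLICAS)
```

Pair it with a health check: if the chosen replica is down, drop it from the list and call `pick_replica` again. The conversation loses its cache once, then sticks to the new replica.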
Speculative decoding
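Run a small “draft” model that proposes 4–8 tokens; the big model verifies them in a single forward pass. Expect 2–3× faster generation on many workloads, mostly at low to moderate concurrency; the gain shrinks when large batches already saturate the GPU. vLLM supports it natively.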
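To make the propose-and-verify loop concrete, here is a toy version for greedy decoding only. The models are plain functions from a token list to the next token id, which is an assumption for illustration. Real engines verify all proposals in one batched forward pass and use rejection sampling when sampling is on.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function: context -> token id


def speculative_step(context: List[int], draft: NextToken, target: NextToken,
                     k: int = 4) -> List[int]:
    """One round: the draft proposes k tokens, the target keeps the agreeing
    prefix and then contributes one token of its own."""
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    accepted: List[int] = []
    for tok in proposed:
        # In a real engine these checks are one batched pass of the big model
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # first disagreement: take the target's token
            return accepted
        accepted.append(tok)
    accepted.append(target(context + accepted))  # all k matched: bonus target token
    return accepted


def generate(context: List[int], draft: NextToken, target: NextToken,
             n: int, k: int = 4) -> List[int]:
    """Generate n tokens; the output matches plain greedy decoding with the target."""
    out: List[int] = []
    while len(out) < n:
        out.extend(speculative_step(context + out, draft, target, k))
    return out[:n]
```

Each round yields between 1 and k+1 tokens for roughly one big-model pass, so the speedup depends entirely on how often the draft agrees with the target.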
Disaggregated prefill/decode
Split prefill (compute-bound: process the whole prompt and build its KV cache) onto separate GPUs from decode (memory-bandwidth-bound: generate tokens one at a time). Best at scale. Production setups in 2026 increasingly do this.
Quantization
If you haven’t quantized, you’re leaving 2–4× throughput on the table. AWQ, GPTQ, FP8 — all viable. Re-eval on your data after quantizing. Most quality drops are <2%.
Observability
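from openai import AsyncOpenAI
from opentelemetry import trace

MODEL = "meta-llama/Llama-3.3-70B-Instruct"
client = AsyncOpenAI(base_url="http://vllm:8000/v1", api_key="local")
tracer = trace.get_tracer(__name__)

# Wrap every call with structured logging + spans
async def complete(messages):
    # Context-manager form works with async code on any OpenTelemetry version
    with tracer.start_as_current_span("llm.complete") as span:
        span.set_attribute("llm.model", MODEL)
        resp = await client.chat.completions.create(model=MODEL, messages=messages)
        # Token counts come from the usage block the server returns
        span.set_attribute("llm.input_tokens", resp.usage.prompt_tokens)
        span.set_attribute("llm.output_tokens", resp.usage.completion_tokens)
        return resp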
Track per-request:
- Tokens in/out
- Time-to-first-token
- Total latency
- Cache hit rate (vLLM exposes prefix cache hits in metrics)
- Failure types (OOM, timeout, validation)
vLLM exposes Prometheus metrics. Wire them into your observability stack; see OpenTelemetry End-to-End.
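Server metrics don't show what the client actually experienced, so it helps to measure time-to-first-token at the caller too. A small sketch follows. `measure_stream` is an illustrative helper, and it takes a function that opens the stream so the clock starts before the request is sent.

```python
import time
from typing import Callable, Iterable


def measure_stream(open_stream: Callable[[], Iterable[str]], clock=time.perf_counter):
    """Consume a stream of text chunks; return the full text and timing metrics."""
    start = clock()
    first = None
    parts = []
    for chunk in open_stream():
        if first is None and chunk:
            first = clock()  # first non-empty chunk = first token the client saw
        parts.append(chunk)
    end = clock()
    return "".join(parts), {
        "ttft_s": None if first is None else first - start,
        "total_s": end - start,
        "chunks": len(parts),
    }


# Usage sketch with the OpenAI SDK (assumes `client`, `MODEL`, `messages` exist):
#
# def open_stream():
#     stream = client.chat.completions.create(model=MODEL, messages=messages, stream=True)
#     for event in stream:
#         if event.choices and event.choices[0].delta.content:
#             yield event.choices[0].delta.content
#
# text, timing = measure_stream(open_stream)
```

Record `ttft_s` and `total_s` as span attributes or histogram observations next to the token counts.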
Common mistakes
1. Underestimating ops cost
GPUs run hot. Drivers update. Models change shape between releases. Operating an LLM service is real work — typically a half-FTE for a non-trivial deployment.
2. Not benchmarking on real prompts
Published benchmark numbers use 2k-token prompts. Your prompts are 8k. Your batching looks different. Measure on your traffic.
3. Picking 70B when 7B fits
A 7B fine-tuned on your data often beats a 70B prompted to do the same. Try the small model first, especially for narrow tasks.
4. No fallback to hosted
When your GPU pod restarts, what serves traffic? An app that can fall back to a hosted API during outages keeps users happy. Same SDK, different base URL.
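The fallback itself can be a few lines. Here is a sketch that tries backends in order and reports which one answered. `first_success` is an illustrative helper, and the names in the commented usage (`HOSTED_BASE_URL`, `HOSTED_MODEL`, and so on) are placeholders.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_success(backends: Sequence[tuple[str, Callable[[], T]]],
                  retry_on: tuple[type[BaseException], ...] = (Exception,)) -> tuple[str, T]:
    """Try each backend in order; return (backend_name, result) from the first success."""
    last_error = None
    for name, call in backends:
        try:
            return name, call()
        except retry_on as exc:
            last_error = exc  # remember the failure and move on to the next backend
    raise RuntimeError("all backends failed") from last_error


# Usage sketch with the OpenAI SDK (placeholder settings):
#
# import openai
# local = openai.OpenAI(base_url="http://vllm:8000/v1", api_key="local", timeout=10)
# hosted = openai.OpenAI(base_url=HOSTED_BASE_URL, api_key=HOSTED_API_KEY)
# name, resp = first_success(
#     [("vllm", lambda: local.chat.completions.create(model=LOCAL_MODEL, messages=msgs)),
#      ("hosted", lambda: hosted.chat.completions.create(model=HOSTED_MODEL, messages=msgs))],
#     retry_on=(openai.APIConnectionError, openai.APITimeoutError, openai.InternalServerError),
# )
```

Two details matter in practice: the model name usually differs between backends, and the local timeout should be short so a restarting pod fails fast. Log the returned backend name so you can see how often the fallback fires.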
5. Forgetting context length
Llama 3.3 70B has 128k context. Most prompts are 2k. Serving with the full 128k limit means the engine has to budget KV cache for sequences that long, which costs VRAM or concurrency you didn’t need to give up. Set --max-model-len to a realistic value to free GPU memory for concurrency.
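The numbers behind this are easy to work out. The sketch below assumes the Llama 3 70B architecture values (80 layers, 8 KV heads with grouped-query attention, head dimension 128) and a bf16 KV cache. Treat those as assumptions and confirm them against the model's config.json.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Keys and values (hence the factor 2) for every layer, for one token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Assumed Llama 3 70B shape: 80 layers, 8 KV heads (GQA), head_dim 128, bf16 cache
PER_TOKEN = kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128)


def kv_cache_gib(tokens: int) -> float:
    """KV cache for one sequence of the given length, in GiB."""
    return tokens * PER_TOKEN / 2**30


for ctx in (2_048, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gib(ctx):.2f} GiB per sequence")
```

Under those assumptions a single full-length 128k sequence needs about 40 GiB of KV cache, a 32k one about 10 GiB, and a 2k prompt well under 1 GiB. The lower the per-request ceiling, the more requests fit in the same memory.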
When I’d build it today
For a non-trivial company in 2026 looking at AI infra:
- Build the app on Anthropic/OpenAI APIs first. Ship product. Measure cost.
- At ~$30k/month spend on a single dominant workload, evaluate self-hosting that workload on Llama 3.3 70B.
- Use a managed inference provider first (Together, Fireworks, Anyscale, AWS Bedrock for open models) before going to bare GPUs.
- Self-host on bare GPUs only when the savings/control justify a half-time engineer’s work.
Read this next
- Anthropic Claude API + Tool Use Guide — the API path, applicable to vLLM too.
- Build a RAG App with pgvector + FastAPI — a typical workload for self-hosting evaluation.
- The vLLM and SGLang docs — both are excellent in 2026.
If you want a Docker Compose vLLM + observability stack you can run locally, it’s at rajpoot.dev.
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.