By 2026, open-weight models close enough to GPT-4-class are everywhere. Llama 3.3, Qwen 2.5, DeepSeek V3 all sit in the “good enough for most things” tier. The question is no longer “is the open model good enough?” but “is self-hosting it worth the operational cost?”

This post is the working answer. When to self-host, what to use (Ollama vs vLLM), how to size hardware, and the production patterns that make it pencil out.

Should you self-host?

The honest answer is usually no. The hosted APIs (Anthropic, OpenAI, Google) are excellent, billed by token, and require zero ops. For most apps, pip install anthropic is the right answer.

Self-host when one or more is true:

  • Data residency / compliance — the data cannot leave your infrastructure.
  • Latency — you need <100ms first-token; APIs are typically 200–500ms.
  • Volume — you’re paying $50k+/month in API costs and have ML/infra capacity.
  • Custom fine-tunes — you want to ship a model trained on your domain.
  • No-internet operation — air-gapped or edge deployments.

For everything else, use the API. Self-hosting is real work.

The two engines that matter

Ollama — for dev, prototypes, internal tools

ollama pull llama3.3:70b-instruct-q4_K_M
ollama run llama3.3:70b-instruct-q4_K_M

That’s it. Single binary, GGUF models, llama.cpp under the hood. Runs on Mac, Linux, even mid-tier laptops with quantized models.

  • Pros: Trivial setup. Sane defaults. Good for dev, hobby, internal tools, edge.
  • Cons: Low throughput under concurrent load. Not built for production concurrency. KV cache reuse is limited.

For development and internal tools serving fewer than about 5 concurrent users, Ollama is fine. Past that, switch to vLLM.
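
Ollama also listens on a local HTTP API (port 11434 by default), so the same model can be called from code. A minimal stdlib-only sketch, assuming a default local install and the model tag pulled above; the helper names are illustrative:

```python
import json
import urllib.request

# Assumes Ollama's default local address; change it if your install differs.
OLLAMA_URL = "http://localhost:11434/api/chat"


def build_chat_request(model, prompt, url=OLLAMA_URL):
    """Build (but do not send) a non-streaming chat request for Ollama."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON response instead of a stream
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, prompt):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["message"]["content"]
```

Calling chat("llama3.3:70b-instruct-q4_K_M", "Say hello") requires the Ollama server to be running; build_chat_request alone has no side effects and is easy to unit test.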

vLLM — for production

pip install vllm
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768 \
  --port 8000

vLLM is the production engine: continuous batching, PagedAttention KV cache, prefix caching. It serves OpenAI-compatible APIs out of the box.

  • Pros: 5–20× higher throughput than naive serving. Solid production behavior. Active community.
  • Cons: Requires GPUs (or reasonably modern CPUs for small models). More ops.
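
To see why continuous batching matters, here is a toy simulation. It is a sketch of the scheduling idea only, not vLLM's scheduler: each request needs some number of decode steps, a static batch waits for its slowest member, and a continuous batch refills a slot as soon as a request finishes. The function names and the workload are made up for illustration.

```python
def static_batch_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batch_steps(lengths, batch_size):
    """Decode steps when freed slots are refilled from the queue every step."""
    pending = list(lengths)  # requests waiting, in arrival order
    active = []              # remaining decode steps for in-flight requests
    steps = 0
    while pending or active:
        # Admit waiting requests into any free slots.
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        steps += 1  # one decode step produces one token for every active request
        active = [n - 1 for n in active if n - 1 > 0]
    return steps


if __name__ == "__main__":
    workload = [8, 2, 2, 2]  # one long request and three short ones
    print(static_batch_steps(workload, batch_size=2))
    print(continuous_batch_steps(workload, batch_size=2))
```

With one long request and several short ones, the short requests no longer wait behind the long one, so the same work finishes in fewer steps.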

SGLang — the alternative

SGLang has caught up to vLLM in 2026 and surpassed it on certain workloads (long contexts, structured generation). Worth benchmarking against vLLM for your specific access pattern.

python -m sglang.launch_server --model meta-llama/Llama-3.3-70B-Instruct --port 30000

Same OpenAI-compatible API. For agent-style workloads with constrained outputs, SGLang’s RadixAttention pays off.

Picking a model in 2026

Snapshot of the open-weight landscape:

| Model | Size | Use |
|---|---|---|
| Llama 3.3 70B | 70B | General-purpose, GPT-4-class on many benchmarks |
| Qwen 2.5 72B | 72B | Strong multilingual, coding |
| DeepSeek V3 | 671B (MoE) | Frontier-class; MoE with 37B parameters activated per token |
| Mistral Large 2 | 123B | Strong reasoning, multilingual |
| Llama 3.1 8B | 8B | Workhorse small model |
| Qwen 2.5 7B | 7B | Strong small model, Apache 2.0 license |
| Phi-4 | 14B | Microsoft’s small model, good reasoning |

For production: a 70B-class model quantized to int4 or int8 usually hits the sweet spot between quality and cost. Drop to 7–14B for high-volume, low-stakes work (classification, extraction, summarization).

For agents with tool calls, get a model that’s been instruction-tuned for it — Llama 3.3 70B Instruct, Qwen 2.5 72B Instruct, or specifically tool-tuned variants.

Hardware sizing

Rough guidance for serving:

| Model size | Quant | VRAM needed | Typical GPU |
|---|---|---|---|
| 7B | bf16 | ~16 GB | L4, 4090, A10 |
| 7B | int4 | ~6 GB | T4, 3090 |
| 70B | bf16 | ~140 GB | 2× A100 80GB / 4× L40S |
| 70B | int4 | ~40 GB | 1× A100 80GB / 2× L40S |
| 671B (MoE) | int4 | ~340 GB | Multi-node A100/H100 |

Add ~20–50% for KV cache and concurrency headroom. Quantization is a tax worth paying: the quality loss is usually under 2% on real evals (verify on your data).
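
The VRAM column follows from simple arithmetic: parameter count times bytes per parameter, plus headroom. A rough sketch, assuming weights dominate memory and ignoring runtime overhead such as activations and CUDA context; the function names are illustrative:

```python
def weight_gb(params_billion, bits_per_param):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


def with_headroom(gb, headroom=0.2):
    """Add a fractional allowance for KV cache and concurrency."""
    return gb * (1 + headroom)


if __name__ == "__main__":
    for name, params, bits in [("7B bf16", 7, 16), ("70B bf16", 70, 16), ("70B int4", 70, 4)]:
        base = weight_gb(params, bits)
        low, high = with_headroom(base, 0.2), with_headroom(base, 0.5)
        print(f"{name}: weights {base:.0f} GB, plan for {low:.0f} to {high:.0f} GB")
```

A 70B model at 4 bits is about 35 GB of weights, which is why it fits on a single 80 GB card with room left for KV cache.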

For most teams, 2× L40S or 1× H100 serving a 70B int4 is the sweet spot. Cost: roughly $2/hour on the cloud spot market in 2026. Compare to your API spend.

Throughput — what to expect

vLLM serving Llama 3.3 70B int4 on 2×L40S:

  • First token latency: 50–150 ms
  • Tokens / sec / request: 50–100
  • Concurrent throughput: 4000–8000 tok/s aggregated across 32 in-flight requests
  • Cost per 1M output tokens: ~$0.20–0.50

Compare to hosted (rough 2026 prices):

  • GPT-4o-mini: $0.60 / 1M output
  • Claude Haiku 4.5: $5 / 1M output
  • Claude Sonnet 4.6: $15 / 1M output
  • Llama 3.3 70B (Together / Anyscale): $0.50–0.90 / 1M output

The hosted-Llama prices are barely above DIY at moderate scale. Self-hosting wins decisively only at sustained volumes of tens of millions of tokens per day or more, or for non-cost reasons.
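
The comparison is easy to rerun with your own numbers. A back-of-the-envelope sketch, using the $2/hour node price and throughput figures quoted above; the utilization parameter is an assumption added here, since GPUs are rarely busy around the clock:

```python
def cost_per_million_tokens(gpu_usd_per_hour, tokens_per_second, utilization=1.0):
    """Self-hosted cost per 1M output tokens at a given average utilization."""
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_usd_per_hour / (tokens_per_hour / 1_000_000)


def breakeven_tokens_per_day(gpu_usd_per_hour, hosted_usd_per_million):
    """Daily volume at which an always-on node costs the same as a hosted API."""
    daily_gpu_cost = gpu_usd_per_hour * 24
    return daily_gpu_cost / hosted_usd_per_million * 1_000_000


if __name__ == "__main__":
    print(round(cost_per_million_tokens(2.0, 4000), 3))        # fully busy
    print(round(cost_per_million_tokens(2.0, 4000, 0.3), 2))   # busy 30% of the time
    print(round(breakeven_tokens_per_day(2.0, 0.70) / 1e6))    # millions of tokens/day
```

At full utilization, 4000 tok/s works out to roughly $0.14 per 1M tokens; the $0.20–0.50 range is what you get when the node is busy about a third of the time. Against a hosted price of $0.70 per 1M, an always-on $2/hour node breaks even near 69M tokens per day, before counting any engineering time.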

A production deployment

# vllm-deployment.yaml — Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata: { name: vllm }
spec:
  replicas: 2
  selector: { matchLabels: { app: vllm } }
  template:
    metadata: { labels: { app: vllm } }
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.7.0
          args:
            - --model=meta-llama/Llama-3.3-70B-Instruct
            - --tensor-parallel-size=2
            - --max-model-len=32768
            - --gpu-memory-utilization=0.9
            - --enable-prefix-caching
            - --enable-chunked-prefill
          ports: [{ containerPort: 8000 }]
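          resources:
            limits: { nvidia.com/gpu: "2" }
          volumeMounts:
            # Tensor parallelism uses shared memory between GPU worker processes
            - { name: dshm, mountPath: /dev/shm }
          readinessProbe:
            # Loading 70B weights takes minutes; the pod stays unready until /health answers
            httpGet: { path: /health, port: 8000 }
            periodSeconds: 5
      volumes:
        - name: dshm
          emptyDir: { medium: Memory }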

A few production knobs:

  • enable-prefix-caching — KV cache reuse for shared prefixes (system prompts!). Massive win for chat.
  • enable-chunked-prefill — better throughput under mixed prefill/decode workload.
  • Two replicas, GPU affinity — for HA.
  • Health probes: vLLM ships a /health endpoint. Point your readiness and liveness probes at it.

In front of vLLM:

  • An LB or service mesh (Istio, Linkerd) for routing.
  • A small Python or Go gateway for auth, rate limiting, prompt logging.
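
The rate-limiting piece of such a gateway can be very small. A sketch of a token-bucket limiter, assuming one bucket per API key held in process memory; the class name and the injectable clock are illustrative choices, and a multi-replica gateway would need shared state instead:

```python
import time


class TokenBucket:
    """Allow `rate_per_s` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)  # start full so the first burst is allowed
        self.clock = clock          # injectable so tests can control time
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if the request may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In a gateway you would keep a dict of buckets keyed by API key and return HTTP 429 when allow() is False. Passing the expected token count as cost is one way to limit by tokens rather than by requests.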

Treat it like an API

Self-hosting doesn’t mean “different API.” vLLM speaks the OpenAI shape:

from openai import OpenAI

client = OpenAI(base_url="http://vllm:8000/v1", api_key="local")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "..."}],
)

Same SDK as hosted. Move from API to self-host with a base URL change.

Streaming, structured outputs, tool use

vLLM/SGLang both support:

  • Streaming via SSE — same API as OpenAI.
  • Guided/constrained generation — pin output to a JSON schema, regex, or grammar. Free correctness for extraction tasks.
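  • Tool use — Llama 3.3 and Qwen 2.5 both have tool-call training; vLLM exposes it through the standard tools parameter when the server is started with --enable-auto-tool-choice and a matching --tool-call-parser.

resp = client.chat.completions.create(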
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[...],
    tools=[{...}],
    tool_choice="auto",
)
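
For completeness, here is what the elided tools list and the handling of a returned tool call might look like. This is a sketch in the OpenAI tools format; get_weather, TOOLS, and REGISTRY are hypothetical names invented for the example, not part of any library:

```python
import json


def get_weather(city):
    """Hypothetical local tool; a real one would call a weather service."""
    return f"Sunny in {city}"


# Tool schema in the OpenAI function-calling format, passed as tools=TOOLS.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Map tool names the model may emit to the Python functions that implement them.
REGISTRY = {"get_weather": get_weather}


def run_tool_call(name, arguments_json):
    """Execute one tool call; the model sends arguments as a JSON string."""
    if name not in REGISTRY:
        raise KeyError(f"model requested unknown tool: {name}")
    return REGISTRY[name](**json.loads(arguments_json))


def tool_result_message(tool_call_id, content):
    """Build the follow-up message that returns the tool output to the model."""
    return {"role": "tool", "tool_call_id": tool_call_id, "content": content}
```

With a live response, each entry in resp.choices[0].message.tool_calls carries an id, a function.name, and function.arguments; feed the last two to run_tool_call and append the resulting tool message before calling the model again.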

Almost everything you do against the OpenAI/Anthropic API works against vLLM. The patterns from Anthropic Claude API + Tool Use carry over.

Scaling patterns

Multi-replica with sticky prefix routing

For chat with prefix caching, route the same conversation to the same replica so its KV cache hits. Use a hash of the conversation ID as the routing key.
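
A minimal sketch of that routing key, assuming a fixed list of replica addresses (the names below are placeholders) and a stable hash so every gateway instance makes the same choice:

```python
import hashlib


def pick_replica(conversation_id, replicas):
    """Map a conversation to one replica, identically on every gateway process."""
    # Python's built-in hash() is salted per process, so use a stable digest.
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]


if __name__ == "__main__":
    replicas = ["http://vllm-0:8000", "http://vllm-1:8000"]  # placeholder addresses
    print(pick_replica("conv-123", replicas))
```

Plain modulo hashing remaps most conversations whenever the replica count changes. If you scale replicas up and down often, consistent or rendezvous hashing limits that churn.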

Speculative decoding

Run a small “draft” model that proposes 4–8 tokens; the big model verifies them in a single forward pass. 2–3× faster decoding on many workloads, with the largest gains at low batch sizes. vLLM supports it natively.
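
The propose-then-verify loop is easier to follow in code. The sketch below is a toy, greedy-only version with stand-in "models" that are plain functions over token lists; real engines verify all proposed positions in one batched forward pass and use a probabilistic acceptance rule when sampling. All names here are illustrative.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """Return the tokens accepted in one draft-and-verify round (greedy variant)."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks each position in order.
    accepted, ctx = [], list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every proposal matched, so the target adds one bonus token.
    accepted.append(target_next(ctx))
    return accepted


def target_next(ctx):
    """Toy target model: count upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    """Toy draft model: agrees with the target except right after a 3."""
    return 9 if ctx[-1] == 3 else (ctx[-1] + 1) % 10
```

The output always matches what the target model alone would have produced; the speedup comes from how many tokens each round accepts, which depends on how often the draft agrees with the target.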

Disaggregated prefill/decode

Split prefill (compute-heavy: process the prompt and fill the KV cache) onto separate GPUs from decode (memory-bandwidth-heavy: generate tokens one at a time). Best at scale. Production setups in 2026 increasingly do this.

Quantization

If you haven’t quantized, you’re leaving 2–4× throughput on the table. AWQ, GPTQ, FP8 — all viable. Re-eval on your data after quantizing. Most quality drops are <2%.

Observability

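from openai import AsyncOpenAI
from opentelemetry import trace

MODEL = "meta-llama/Llama-3.3-70B-Instruct"
client = AsyncOpenAI(base_url="http://vllm:8000/v1", api_key="local")
tracer = trace.get_tracer(__name__)

# Wrap every call in a span; token counts come from the server's usage block
async def complete(messages):
    with tracer.start_as_current_span("llm.complete") as span:
        span.set_attribute("llm.model", MODEL)
        resp = await client.chat.completions.create(model=MODEL, messages=messages)
        span.set_attribute("llm.input_tokens", resp.usage.prompt_tokens)
        span.set_attribute("llm.output_tokens", resp.usage.completion_tokens)
        return resp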

Track per-request:

  • Tokens in/out
  • Time-to-first-token
  • Total latency
  • Cache hit rate (vLLM exposes prefix cache hits in metrics)
  • Failure types (OOM, timeout, validation)
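
Time-to-first-token has to be measured on the client side of a streaming call. A small sketch that wraps any iterable of text chunks (for example, the deltas from a streaming response); the function name and the injectable clock are illustrative:

```python
import time


def timed_stream(chunks, clock=time.monotonic):
    """Consume a stream of text chunks and record latency figures."""
    start = clock()
    first_chunk_at = None
    pieces = []
    for piece in chunks:
        if first_chunk_at is None:
            first_chunk_at = clock()  # moment the first chunk arrived
        pieces.append(piece)
    end = clock()
    return {
        "text": "".join(pieces),
        "ttft_s": None if first_chunk_at is None else first_chunk_at - start,
        "total_s": end - start,
        "chunks": len(pieces),
    }
```

Pass it a generator that yields each streamed delta, then attach ttft_s and total_s to the span or log line for that request.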

vLLM exposes Prometheus metrics. Wire them into your observability stack (see OpenTelemetry End-to-End).

Common mistakes

1. Underestimating ops cost

GPUs run hot. Drivers update. Models change shape between releases. Operating an LLM service is real work — typically a half-FTE for a non-trivial deployment.

2. Not benchmarking on real prompts

Published benchmark numbers usually assume short prompts, around 2k tokens. Your prompts may be 8k, and your batching will look different. Measure on your traffic.
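
Once you have replayed real prompts and collected per-request latencies, summarize them with percentiles rather than averages. A minimal sketch using the nearest-rank method; the function names are illustrative:

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) using integer math
    return ordered[rank - 1]


def summarize(latencies_s):
    """Reduce a list of request latencies to the figures worth alerting on."""
    return {
        "p50": percentile(latencies_s, 50),
        "p95": percentile(latencies_s, 95),
        "max": max(latencies_s),
    }
```

Run it separately for time-to-first-token and total latency, and at the concurrency you expect in production, since both shift under load.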

3. Picking 70B when 7B fits

A 7B fine-tuned on your data often beats a 70B prompted to do the same. Try the small model first, especially for narrow tasks.

4. No fallback to hosted

When your GPU pod restarts, what serves traffic? An app that can fall back to a hosted API during outages keeps users happy. Same SDK, different base URL.
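
A sketch of that fallback, written against a generic send function so it stays independent of any SDK. The backend names and the hosted URL are placeholders, and catching every Exception is a simplification; in practice you would narrow it to connection errors and timeouts:

```python
def complete_with_fallback(backends, send):
    """Try each (name, base_url) in order and return the first success."""
    errors = []
    for name, base_url in backends:
        try:
            return name, send(base_url)
        except Exception as exc:  # simplification: narrow this in real code
            errors.append((name, repr(exc)))
    raise RuntimeError(f"all backends failed: {errors}")


BACKENDS = [
    ("self-hosted", "http://vllm:8000/v1"),
    ("hosted", "https://hosted.example/v1"),  # placeholder URL
]
```

Here send would be a small function that builds a client for the given base URL and makes the chat call. Keep in mind that the hosted backend will likely expect a different model name and API key.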

5. Forgetting context length

Llama 3.3 70B has 128k context. Most prompts are 2k. Serving the full 128k window means the engine must budget KV cache for worst-case sequences, and a single long request can crowd out many short ones. Set --max-model-len to a realistic value to keep GPU memory available for concurrency.
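
The size of the problem is easy to estimate. The sketch below assumes a 16-bit KV cache and the architecture values from Llama 3.3 70B's published config (80 layers, 8 key/value heads, head dimension 128); check them against the config of the model you actually serve.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gib(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache needed to hold `tokens` tokens of context, in GiB."""
    return tokens * kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value) / 1024**3


if __name__ == "__main__":
    llama_70b = dict(layers=80, kv_heads=8, head_dim=128)  # assumed from the public config
    print(kv_cache_gib(128 * 1024, **llama_70b))  # one full-length sequence
    print(kv_cache_gib(2 * 1024, **llama_70b))    # one typical 2k sequence
```

Under these assumptions a single 128k-token sequence needs about 40 GiB of KV cache, while a 2k-token sequence needs well under 1 GiB, so the same memory holds dozens of typical requests.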

When I’d build it today

For a non-trivial company in 2026 looking at AI infra:

  • Build the app on Anthropic/OpenAI APIs first. Ship product. Measure cost.
  • At ~$30k/month spend on a single dominant workload, evaluate self-hosting that workload on Llama 3.3 70B.
  • Use a managed inference provider first (Together, Fireworks, Anyscale, AWS Bedrock for open models) before going to bare GPUs.
  • Self-host on bare GPUs only when the savings/control justify a half-time engineer’s work.

Read this next

If you want a Docker Compose vLLM + observability stack you can run locally, it’s at rajpoot.dev.

