Self-Hosted LLMs in 2026 — Ollama, vLLM, and When to Skip the API

By 2026, open-weight models close enough to GPT-4-class are everywhere. Llama 3.3, Qwen 2.5, DeepSeek V3 all sit in the “good enough for most things” tier. The question is no longer “is the open model good enough?” but “is self-hosting it worth the operational cost?”

This post is the working answer: when to self-host, which engine to use (Ollama vs vLLM), how to size hardware, and the production patterns that make it pencil out.

Should you self-host?

The honest answer is usually no. The hosted APIs (Anthropic, OpenAI, Google) are excellent, billed by token, and require zero ops. For most apps, pip install anthropic is the right answer.

Self-host when one or more of these is true:

Data residency / compliance — the data cannot leave your infrastructure.
Latency — you need <100ms first-token; APIs are typically 200–500ms.
Volume — you’re paying $50k+/month in API costs and have ML/infra capacity.
Custom fine-tunes — you want to ship a model trained on your domain.
No-internet operation — air-gapped or edge deployments.

For everything else, use the API. Self-hosting is real work.

The two engines that matter

Ollama — for dev, prototypes, internal tools

ollama pull llama3.3:70b-instruct-q4_K_M
ollama run llama3.3:70b-instruct-q4_K_M

That’s it. Single binary, GGUF models, llama.cpp under the hood. Runs on Mac, Linux, and Windows, and even on mid-tier laptops if you pick a smaller quantized model (the 70B above wants roughly 40 GB of memory).

Pros: Trivial setup. Sane defaults. Good for dev, hobby, internal tools, edge.
Cons: Low throughput under concurrency (parallel requests are configurable but limited). Not built for production concurrency. KV cache reuse is limited.

For development and internal tools serving < ~5 concurrent users, Ollama is fine. Past that, switch.
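Ollama also listens on a local HTTP port, so the pulled model can be called from code as well as from the CLI. A minimal sketch using only the standard library, assuming a default install on localhost:11434 and its /api/chat endpoint; the helper names build_chat_request and chat are illustrative, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default local address


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming chat request for a local Ollama server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```

With the server running, chat("llama3.3:70b-instruct-q4_K_M", "Say hello in five words.") returns the reply as a string.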

vLLM — for production

pip install vllm
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768 \
  --port 8000

vLLM is the production engine: continuous batching, PagedAttention KV cache, prefix caching. It serves OpenAI-compatible APIs out of the box.

Pros: 5–20× higher throughput than naive serving. Solid production behavior. Active community.
Cons: Requires GPUs (or reasonably modern CPUs for small models). More ops.

SGLang — the alternative

SGLang has caught up to vLLM in 2026 and surpassed it on certain workloads (long contexts, structured generation). Worth benchmarking against vLLM for your specific access pattern.

python -m sglang.launch_server --model-path meta-llama/Llama-3.3-70B-Instruct --port 30000

Same OpenAI-compatible API. For agent-style workloads with long shared prefixes and constrained outputs, SGLang’s RadixAttention (automatic KV cache reuse across requests) and fast constrained decoding pay off.

Picking a model in 2026

Snapshot of the open-weight landscape:

Model	Size	Use
Llama 3.3 70B	70B	General-purpose, GPT-4-class on many benchmarks
Qwen 2.5 72B	72B	Strong multilingual, coding
DeepSeek V3	671B (MoE)	Frontier-class, MoE serves at 37B activated
Mistral Large 2	123B	Strong reasoning, multilingual
Llama 3.1 8B	8B	Workhorse small model
Qwen 2.5 7B	7B	Strong small model, Apache 2.0 license
Phi-4	14B	Microsoft’s small model, good reasoning

For production: a 70B-class model quantized to int4 or int8 usually hits the sweet spot between quality and cost. Drop to 7–14B for high-volume, low-stakes work (classification, extraction, summarization).

For agents with tool calls, get a model that’s been instruction-tuned for it — Llama 3.3 70B Instruct, Qwen 2.5 72B Instruct, or specifically tool-tuned variants.

Hardware sizing

Rough guidance for serving:

Model size	Quant	VRAM needed	Typical GPU
7B	bf16	~16 GB	L4, 4090, A10
7B	int4	~6 GB	T4, 3090
70B	bf16	~140 GB	2× A100 80GB / 4× L40S
70B	int4	~40 GB	1× A100 80GB / 2× L40S
671B (MoE)	int4	~340 GB	8× A100/H100 80GB or multi-node

Add ~20–50% for KV cache and concurrency headroom. Quantization is now a tax worth paying: the quality loss is usually <2% on real evals (do verify on your data).
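The VRAM column is mostly arithmetic: parameter count times bytes per weight, plus headroom. A rough sketch of that calculation follows; the function names and the 30% default headroom are assumptions for illustration, and the result covers weights only, so it comes out slightly below table figures that appear to include runtime overhead.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def vram_budget_gb(params_billion: float, bits_per_weight: int, headroom: float = 0.3) -> float:
    """Weights plus a fractional allowance for KV cache and concurrency."""
    return weight_memory_gb(params_billion, bits_per_weight) * (1 + headroom)


for name, params, bits in [("7B bf16", 7, 16), ("70B bf16", 70, 16), ("70B int4", 70, 4)]:
    print(f"{name}: weights {weight_memory_gb(params, bits):.1f} GB, "
          f"budget {vram_budget_gb(params, bits):.1f} GB")
```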

For most teams, 2× L40S or 1× H100 serving a 70B int4 is the sweet spot. Cost: roughly $2/hour on the cloud spot market in 2026. Compare to your API spend.

Throughput — what to expect

vLLM serving Llama 3.3 70B int4 on 2×L40S:

First token latency: 50–150 ms
Tokens / sec / request: 50–100
Concurrent throughput: 4000–8000 tok/s aggregated across 32 in-flight requests
Cost per 1M output tokens: ~$0.20–0.50

Compare to hosted (rough 2026 prices):

GPT-4o-mini: $0.60 / 1M output
Claude Haiku 4.5: $5 / 1M output
Claude Sonnet 4.6: $15 / 1M output
Llama 3.3 70B (Together / Anyscale): $0.50–0.90 / 1M output

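The hosted-Llama prices are barely above DIY at moderate scale. Self-hosting wins decisively only at sustained volumes in the hundreds of millions of tokens per day, or for non-cost reasons: a $2/hour node alone runs about $1,450/month, which hosted Llama pricing only exceeds at roughly 2B tokens/month, before counting engineering time.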
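To redo this comparison with your own numbers, the arithmetic is short. The sketch below assumes a node that is kept fully busy; the 2,000 tok/s throughput and the $0.70 hosted price are illustrative inputs rather than measurements, and the function names are made up for this example.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost of generating 1M tokens on a node kept fully busy."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


def breakeven_tokens_per_month(fixed_monthly_cost: float, hosted_price_per_million: float) -> float:
    """Monthly token volume at which a fixed self-hosting bill equals the hosted bill."""
    return fixed_monthly_cost / hosted_price_per_million * 1_000_000


node_hourly = 2.0                     # rough spot price for the node
node_monthly = node_hourly * 24 * 30  # node left running all month
print(f"${cost_per_million_tokens(node_hourly, 2000):.2f} per 1M tokens at 2,000 tok/s")
print(f"{breakeven_tokens_per_month(node_monthly, 0.70) / 1e9:.1f}B tokens/month "
      f"to match $0.70 per 1M hosted")
```

Add the engineering time to fixed_monthly_cost to see how far the break-even point moves; an idle node still bills by the hour, so low utilization pushes the real cost per token up.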

A production deployment

# vllm-deployment.yaml — Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata: { name: vllm }
spec:
  replicas: 2
  selector: { matchLabels: { app: vllm } }
  template:
    metadata: { labels: { app: vllm } }
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.7.0
          args:
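            # bf16 weights are ~140 GB, so 2 GPUs only fit this with a quantized
            # checkpoint; otherwise raise the GPU count and tensor-parallel size.
            - --model=meta-llama/Llama-3.3-70B-Instruct
            - --tensor-parallel-size=2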
            - --max-model-len=32768
            - --gpu-memory-utilization=0.9
            - --enable-prefix-caching
            - --enable-chunked-prefill
          ports: [{ containerPort: 8000 }]
          resources:
            limits: { nvidia.com/gpu: "2" }
          readinessProbe:
            httpGet: { path: /health, port: 8000 }
            periodSeconds: 5

A few production knobs:

enable-prefix-caching — KV cache reuse for shared prefixes (system prompts!). Massive win for chat.
enable-chunked-prefill — better throughput under mixed prefill/decode workload.
Two replicas, GPU affinity — for HA.
Probes — /health is shipped by vLLM. Use it.

In front of vLLM:

An LB or service mesh (Istio, Linkerd) for routing.
A small Python or Go gateway for auth, rate limiting, prompt logging.
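The core of such a gateway is small. As a sketch of the admission logic only (no HTTP server, no logging), here is a per-key token bucket; the class names, the in-memory key set, and the 401/429 status codes are assumptions for illustration, not a prescribed design.

```python
import time
from typing import Callable, Dict, Set


class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate` requests per second."""

    def __init__(self, rate: float, capacity: float, clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class Gateway:
    """Per-API-key admission check to run before forwarding a request upstream."""

    def __init__(self, keys: Set[str], rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.keys = keys
        self.buckets: Dict[str, TokenBucket] = {k: TokenBucket(rate, capacity, clock) for k in keys}

    def check(self, api_key: str) -> int:
        """Return the HTTP status the gateway would answer with (200 means forward)."""
        if api_key not in self.keys:
            return 401
        return 200 if self.buckets[api_key].allow() else 429
```

The injectable clock is there so the limiter can be tested without sleeping. With several gateway instances the buckets would need to live in shared storage.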

Treat it like an API

Self-hosting doesn’t mean “different API.” vLLM speaks the OpenAI shape:

from openai import OpenAI

client = OpenAI(base_url="http://vllm:8000/v1", api_key="local")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "..."}],
)

Same SDK as hosted. Move from API to self-host with a base URL change.

Streaming, structured outputs, tool use

vLLM/SGLang both support:

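Streaming via SSE: same API as OpenAI.
Guided/constrained generation: pin output to a JSON schema, regex, or grammar. The output is guaranteed to parse, which removes a whole class of extraction failures (the values can still be wrong).
Tool use: Llama 3.3 and Qwen 2.5 both have tool-call training; vLLM accepts OpenAI-style tools once the server is started with --enable-auto-tool-choice and a matching --tool-call-parser.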

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[...],
    tools=[{...}],
    tool_choice="auto",
)

Almost everything you do against the OpenAI API works against vLLM unchanged. The patterns from Anthropic Claude API + Tool Use carry over too, although the request shape differs.
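The client-side half of tool use looks the same as with a hosted API: declare tools in the OpenAI format, then execute whatever calls come back. A sketch with one hypothetical tool (get_order_status and its stub implementation are invented for illustration); the error handling reflects that models sometimes emit malformed arguments.

```python
import json

# Hypothetical tool for illustration; the schema follows the OpenAI "tools" format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the status of an order by its ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]


def get_order_status(order_id: str) -> str:
    # Stand-in for a real lookup.
    return f"order {order_id}: shipped"


HANDLERS = {"get_order_status": get_order_status}


def run_tool_call(call_id: str, name: str, arguments_json: str) -> dict:
    """Execute one tool call and build the 'tool' message to send back to the model."""
    if name not in HANDLERS:
        content = f"error: unknown tool {name!r}"
    else:
        try:
            content = HANDLERS[name](**json.loads(arguments_json))
        except (json.JSONDecodeError, TypeError) as exc:
            # Malformed JSON or arguments that do not match the function signature.
            content = f"error: bad arguments ({exc})"
    return {"role": "tool", "tool_call_id": call_id, "content": content}
```

Pass TOOLS as the tools argument; for each entry in resp.choices[0].message.tool_calls, call run_tool_call(call.id, call.function.name, call.function.arguments), append the result to messages, and request another completion.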

Scaling patterns

Multi-replica with sticky prefix routing

For chat with prefix caching, route the same conversation to the same replica so its KV cache hits. Use a hash of the conversation ID as the routing key.
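A minimal sketch of that routing key, assuming the gateway knows the list of replica addresses (the addresses and the pick_replica name below are hypothetical):

```python
import hashlib
from typing import Sequence


def pick_replica(conversation_id: str, replicas: Sequence[str]) -> str:
    """Map a conversation to one replica, stably across processes and restarts."""
    if not replicas:
        raise ValueError("no replicas available")
    # hashlib rather than the built-in hash(): Python salts str hashes per process.
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]


replicas = ["http://vllm-0:8000", "http://vllm-1:8000"]  # hypothetical pod addresses
print(pick_replica("conv-1234", replicas))
```

Plain modulo hashing remaps most conversations whenever the replica count changes. If you scale up and down often, consistent or rendezvous hashing keeps more of the cache warm.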

Speculative decoding

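Run a small “draft” model that proposes 4–8 tokens; the big model verifies them in a single forward pass and keeps the ones it agrees with. Expect up to 2–3× faster decoding on many workloads, with the largest gains at low batch sizes. vLLM supports it natively.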
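The propose-and-verify loop is easier to see in a toy. The sketch below uses greedy decoding with two stand-in "models" that are plain functions from a token prefix to the next token; it illustrates the acceptance rule only and is not how vLLM implements it. A real engine verifies all drafted positions in one batched forward pass, where this toy calls the target once per position.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function over a token prefix


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out[len(prompt):]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the result always equals target-only decoding."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1. The draft model cheaply proposes up to k tokens.
        proposal, ctx = [], list(out)
        for _ in range(min(k, goal - len(out))):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target checks each proposed position in order.
        for tok in proposal:
            expected = target(out)
            out.append(expected)   # the target's token is always what gets emitted
            if expected != tok:    # first mismatch: discard the rest of the proposal
                break
    return out[len(prompt):]


def toy_target(seq: List[int]) -> int:
    return (seq[-1] * 3 + 1) % 7


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except after token 5, to force some rejections.
    return 0 if seq[-1] == 5 else (seq[-1] * 3 + 1) % 7


assert speculative_decode(toy_target, toy_draft, [1], 10) == greedy_decode(toy_target, [1], 10)
```

The speedup comes from step 2 being a single pass over several positions in practice; the more often the draft agrees with the target, the more tokens each expensive pass yields.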

Disaggregated prefill/decode

Split prefill (compute-heavy: processes the whole prompt to build the KV cache) onto separate GPUs from decode (memory-bandwidth-heavy: generates tokens one at a time). Best at scale. Production setups in 2026 increasingly do this.

Quantization

If you haven’t quantized, you may be leaving 2–4× throughput on the table. AWQ, GPTQ, and FP8 are all viable. Re-eval on your data after quantizing. Most quality drops are <2%.
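For intuition about where the quality loss comes from, here is the simplest possible scheme: symmetric round-to-nearest with a single scale per tensor. This is deliberately not AWQ or GPTQ, which choose scales and roundings far more carefully; the weights below are made-up values.

```python
from typing import List, Tuple


def quantize_absmax(weights: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization with one scale per tensor."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8, 7 for int4
    peak = max(abs(w) for w in weights)
    scale = peak / qmax if peak > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate float weights."""
    return [v * scale for v in q]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # made-up example values
for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits)
    err = max(abs(a - b) for a, b in zip(weights, dequantize(q, scale)))
    print(f"int{bits}: max abs error {err:.4f} (bound {scale / 2:.4f})")
```

The per-weight error is bounded by half the scale, and the scale is set by the largest weight, which is why outliers hurt and why production methods use per-group scales.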

Observability

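from openai import AsyncOpenAI
from opentelemetry import trace

MODEL = "meta-llama/Llama-3.3-70B-Instruct"
tracer = trace.get_tracer(__name__)
client = AsyncOpenAI(base_url="http://vllm:8000/v1", api_key="local")

# Wrap every call in a span and record token usage as span attributes
async def complete(messages):
    # The context-manager form keeps the span open across the await
    with tracer.start_as_current_span("llm.complete") as span:
        span.set_attribute("llm.model", MODEL)
        resp = await client.chat.completions.create(model=MODEL, messages=messages)
        # Use the server-reported counts; no client-side tokenizer needed
        span.set_attribute("llm.input_tokens", resp.usage.prompt_tokens)
        span.set_attribute("llm.output_tokens", resp.usage.completion_tokens)
        return resp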

Track per-request:

Tokens in/out
Time-to-first-token
Total latency
Cache hit rate (vLLM exposes prefix cache hits in metrics)
Failure types (OOM, timeout, validation)

vLLM exposes Prometheus metrics. Wire them into your observability stack; see OpenTelemetry End-to-End.
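Time-to-first-token is the one metric in that list you have to measure client-side with streaming. A small sketch, assuming you hand it a function that opens the stream so the timer starts before the request is sent; StreamTiming and time_stream are names invented here.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamTiming:
    ttft_s: Optional[float]  # time to first chunk; None if the stream was empty
    total_s: float           # time until the stream ended
    chunks: int              # chunks received (close to, but not exactly, tokens)


def time_stream(open_stream: Callable[[], Iterable],
                clock: Callable[[], float] = time.perf_counter) -> StreamTiming:
    """Open a streaming response, consume it, and record first-chunk and total latency."""
    start = clock()
    ttft = None
    count = 0
    for _chunk in open_stream():
        if ttft is None:
            ttft = clock() - start
        count += 1
    return StreamTiming(ttft_s=ttft, total_s=clock() - start, chunks=count)
```

With the OpenAI SDK this would be called as time_stream(lambda: client.chat.completions.create(model=MODEL, messages=messages, stream=True)), and the three fields go onto the span as attributes.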

Common mistakes

1. Underestimating ops cost

GPUs run hot. Drivers update. Models change shape between releases. Operating an LLM service is real work — typically a half-FTE for a non-trivial deployment.

2. Not benchmarking on real prompts

Published benchmark numbers typically use 2k-token prompts. Your prompts are 8k. Your batching looks different. Measure on your traffic.

3. Picking 70B when 7B fits

A 7B fine-tuned on your data often beats a 70B prompted to do the same. Try the small model first, especially for narrow tasks.

4. No fallback to hosted

When your GPU pod restarts, what serves traffic? An app that can fall back to a hosted API during outages keeps users happy. Same SDK, different base URL.
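A fallback chain can be a few lines. In this sketch each backend is just a named callable, so it runs without any SDK; in a real app each callable would wrap an OpenAI-SDK client with its own base URL and a short timeout. The two stub backends are placeholders invented for the example.

```python
from typing import Callable, List, Sequence, Tuple

Backend = Tuple[str, Callable[[List[dict]], str]]  # (name, function that completes a chat)


def complete_with_fallback(backends: Sequence[Backend], messages: List[dict]) -> Tuple[str, str]:
    """Try each backend in order; return (backend_name, reply) from the first that succeeds."""
    errors = []
    for name, call in backends:
        try:
            return name, call(messages)
        except Exception as exc:  # connection refused, timeout, 5xx, ...
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


def stub_self_hosted_down(messages: List[dict]) -> str:
    # Placeholder for a call to the vLLM endpoint while the pod is restarting.
    raise ConnectionError("vllm pod restarting")


def stub_hosted(messages: List[dict]) -> str:
    # Placeholder for a call to a hosted API.
    return "reply from hosted API"


chain = [("self-hosted", stub_self_hosted_down), ("hosted", stub_hosted)]
print(complete_with_fallback(chain, [{"role": "user", "content": "hi"}]))
```

Log which backend answered. A fallback that fires silently can hide an outage and produce a surprise API bill.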

5. Forgetting context length

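Llama 3.3 70B supports a 128k context, but most prompts are around 2k. Serving with the full 128k limit makes the engine budget memory for sequences you will never send (a single 128k sequence needs tens of GB of KV cache on its own). Set --max-model-len to a realistic value to leave GPU memory for concurrency.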
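The KV cache cost per token follows directly from the model shape. A back-of-the-envelope sketch, assuming the published Llama 3 70B configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and an fp16 cache; check your own model's config.json before trusting the numbers.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gib(tokens: int, **shape: int) -> float:
    """KV cache size in GiB for a single sequence of the given length."""
    return tokens * kv_bytes_per_token(**shape) / 2**30


# Assumed shape for Llama 3 70B with an fp16 KV cache.
llama_70b = dict(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)
for tokens in (2_048, 32_768, 131_072):
    print(f"{tokens:>7} tokens: {kv_cache_gib(tokens, **llama_70b):.2f} GiB per sequence")
```

Under these assumptions one full 128k sequence takes 40 GiB, while a 2k prompt takes well under 1 GiB, which is the gap a realistic --max-model-len lets you spend on concurrent requests instead.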

When I’d build it today

For a non-trivial company in 2026 looking at AI infra:

Build the app on Anthropic/OpenAI APIs first. Ship product. Measure cost.
At ~$30k/month spend on a single dominant workload, evaluate self-hosting that workload on Llama 3.3 70B.
Use a managed inference provider first (Together, Fireworks, Anyscale, AWS Bedrock for open models) before going to bare GPUs.
Self-host on bare GPUs only when the savings/control justify a half-time engineer’s work.
