When does self-hosting break even?

Around 1B tokens/month sustained on a single model. Below that, frontier APIs are cheaper after you count GPU ops + downtime + SRE time. Above that, self-hosting saves real money.

vLLM for production (highest throughput; multi-GPU; tensor parallelism). Ollama for local dev / personal / single-user. Don't use Ollama in prod for multi-user load.

Self-Hosting LLMs in 2026 — When the Math Actually Works

Self-hosting LLMs is appealing — fixed costs! No vendor lock-in! In practice, the math only works at sustained scale, on the right model, with operational discipline. This post is the honest economics.

The math

For Llama 3.1 70B-Instruct on an H100:

Throughput: ~500-1500 tokens/sec (depends on quantization, batch size).
GPU cost: H100 ~$2-3/hr on-demand; ~$1.50 spot.
Cost per 1M tokens: $1-3 (rough).

For Sonnet API: $3/MTok input, $15/MTok output. Self-hosted Llama 70B beats input cost decisively at sustained load.

For Llama 3.1 8B on an A100:

Throughput: ~3000-6000 tokens/sec.
A100 cost: ~$1-1.50/hr.
Cost per 1M tokens: $0.10-0.30.

5-15× cheaper than Sonnet — IF you can saturate the GPU.

The catch: utilization

H100 at 100% utilization: $2/hr → $0.50/Mtok (great).
H100 at 10% utilization: $2/hr → $5/Mtok (terrible).

If your traffic is bursty, the GPU sits idle. You’re paying for capacity, not usage.

API providers amortize across many customers; self-host doesn’t have that luxury.

Break-even calculation

Roughly:

Sonnet API: $3/MTok input.
H100 self-host (Llama 70B): ~$1/MTok at saturation; ~$10/MTok at 10% util.

Break-even: ~70-90% sustained GPU utilization for Llama 70B to beat Sonnet API.

Plus: SRE time, downtime, GPU procurement, model updates. Add 30-50% to the cost calc.

For most apps under 1B tokens/month: API wins. For 5B+ tokens/month sustained: self-host wins.

vLLM in production

vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.92 \
    --max-num-seqs 256 \
    --enable-prefix-caching \
    --quantization awq

Key flags:

--tensor-parallel-size: split across N GPUs.
--max-num-seqs: concurrent requests.
--gpu-memory-utilization: how much of GPU memory to use.
--enable-prefix-caching: KV cache reuse for common prefixes.
--quantization: AWQ / GPTQ for memory savings.

Throughput dominates at high batch sizes; latency suffers. Tune for your workload.

Quantization

FP16: full quality; 2 bytes/param.
INT8: ~half memory; tiny quality loss.
INT4 / AWQ / GPTQ: 4× smaller; small but visible quality loss.

For Llama 70B FP16: ~140GB memory (>1× H100). With INT4: ~40GB (fits 1 H100).

For most production: AWQ or GPTQ INT4. Quality is fine for most tasks; cost drops dramatically.

Inference servers

	Strengths
vLLM	Highest throughput; broad model support; active dev
TGI (HuggingFace)	Mature; HF ecosystem
SGLang	Structured output speedup
TensorRT-LLM	Highest perf on NVIDIA; complex build
llama.cpp / Ollama	CPU/GPU; single-user; not production-multi-user

For multi-user prod: vLLM is the default in 2026.

Hybrid: API + self-host

Common production pattern:

Frontier hard tasks: Anthropic / OpenAI API.
High-volume narrow tasks: Self-hosted fine-tuned 8B.
Embedding: cheap API or self-host.

Route by task. See LLM Routing .

Real example: a team I worked with replaced Sonnet on a 50M-tokens/day classifier with a fine-tuned Llama 8B on vLLM. Saved $40k/month. Training cost: $100. Self-hosted serving infra: $1500/month.

Operational concerns

GPU procurement: H100s are still constrained; lead times.
Failover: GPU dies; spare capacity? Multi-region?
Monitoring: GPU utilization, memory, queue depth, p99 latency.
Updates: model upgrades require coordinated deploys.
Capacity planning: traffic grows; need more GPUs.

These are real ops costs. Budget an SRE-equivalent’s time.

Multi-tenant on shared GPUs

vllm serve <base-model> --enable-lora --max-loras 8 \
  --lora-modules m1=/path/m1 m2=/path/m2 ...

Multiple LoRA adapters on one base model. Per-request adapter routing. Multi-tenant fine-tunes share GPU.

Big cost saver if you have many narrow fine-tunes.

Spot / preemptible GPUs

GPU on spot is 50-70% cheaper but interruptible. For inference:

Hot-swap: kill spot instance; on-demand backup takes over.
Mixed pool: 70% spot + 30% on-demand baseline.
Stateless serving: spot fits fine for inference (no state to lose).

Standard pattern for cost reduction.

Model size sweet spots

Llama 3.1 8B / Mistral 7B: fits one A100 / L4; great for fine-tuned narrow tasks.
Llama 3.1 70B: needs A100/H100; competitive with Sonnet on many tasks.
Llama 3.3 405B: matches frontier on many; requires multi-GPU; expensive to host.
Mixtral 8x22B: MoE; smart middle ground.

For 2026: 8B fine-tuned is the sweet spot for most narrow tasks.

Embedding self-host

Embedding models (BAAI bge-m3, jina-v3, etc.) are tiny. ~1k tokens / 50ms on a single CPU or budget GPU.

docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference \
  --model-id BAAI/bge-m3

Self-hosted embeddings break even much lower than chat models (small model + high reuse).

Common mistakes

1. Self-host without measuring API spend

You assume self-host is cheaper. Run the math; you’re at $5k/month spend; self-host costs $3k for hardware + 1 SRE-day/month. Maybe wash.

2. Underutilized expensive GPU

H100 idle at 5% util because traffic is bursty. APIs win this scenario.

3. Ollama in production

Single-user inference engine. Multi-user load → terrible throughput. Use vLLM.

4. No quantization

FP16 only. Memory-constrained; serving fewer concurrent requests; cost-per-token high. AWQ/GPTQ for prod.

5. No fallback

Self-hosted goes down; app down. Always have an API fallback.

What I’d ship today

For self-hosted serving:

vLLM for chat / completion.
Text Embeddings Inference for embeddings.
AWQ / GPTQ INT4 quantization.
Spot instances + on-demand baseline.
Multi-LoRA for multi-tenant fine-tunes.
API fallback for outages.
GPU monitoring in Prometheus.

For most apps: API for now; self-host one feature at a time once volume justifies.

Read this next

If you want my vLLM production reference (config + monitoring + fallback), it’s at rajpoot.dev .

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .

The math#

The catch: utilization#

Break-even calculation#

vLLM in production#

Quantization#

Inference servers#

Hybrid: API + self-host#

Operational concerns#

Multi-tenant on shared GPUs#

Spot / preemptible GPUs#

Model size sweet spots#

Embedding self-host#

Common mistakes#

1. Self-host without measuring API spend#

2. Underutilized expensive GPU#

3. Ollama in production#

4. No quantization#

5. No fallback#

What I’d ship today#

Read this next#