Self-hosting LLMs is appealing — fixed costs! No vendor lock-in! In practice, the math only works at sustained scale, on the right model, with operational discipline. This post is the honest economics.
The math
For Llama 3.1 70B-Instruct on an H100:
- Throughput: ~500-1500 tokens/sec (depends on quantization, batch size).
- GPU cost: H100 ~$2-3/hr on-demand; ~$1.50 spot.
- Cost per 1M tokens: $1-3 (rough).
For Sonnet API: $3/MTok input, $15/MTok output. Self-hosted Llama 70B beats input cost decisively at sustained load.
For Llama 3.1 8B on an A100:
- Throughput: ~3000-6000 tokens/sec.
- A100 cost: ~$1-1.50/hr.
- Cost per 1M tokens: $0.10-0.30.
5-15× cheaper than Sonnet — IF you can saturate the GPU.
The catch: utilization
H100 at 100% utilization: $2/hr → $0.50/Mtok (great).
H100 at 10% utilization: $2/hr → $5/Mtok (terrible).
If your traffic is bursty, the GPU sits idle. You’re paying for capacity, not usage.
API providers amortize across many customers; self-host doesn’t have that luxury.
Break-even calculation
Roughly:
Sonnet API: $3/MTok input.
H100 self-host (Llama 70B): ~$1/MTok at saturation; ~$10/MTok at 10% util.
Break-even: ~70-90% sustained GPU utilization for Llama 70B to beat Sonnet API.
Plus: SRE time, downtime, GPU procurement, model updates. Add 30-50% to the cost calc.
For most apps under 1B tokens/month: API wins. For 5B+ tokens/month sustained: self-host wins.
vLLM in production
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.92 \
--max-num-seqs 256 \
--enable-prefix-caching \
--quantization awq
Key flags:
--tensor-parallel-size: split across N GPUs.--max-num-seqs: concurrent requests.--gpu-memory-utilization: how much of GPU memory to use.--enable-prefix-caching: KV cache reuse for common prefixes.--quantization: AWQ / GPTQ for memory savings.
Throughput dominates at high batch sizes; latency suffers. Tune for your workload.
Quantization
- FP16: full quality; 2 bytes/param.
- INT8: ~half memory; tiny quality loss.
- INT4 / AWQ / GPTQ: 4× smaller; small but visible quality loss.
For Llama 70B FP16: ~140GB memory (>1× H100). With INT4: ~40GB (fits 1 H100).
For most production: AWQ or GPTQ INT4. Quality is fine for most tasks; cost drops dramatically.
Inference servers
| Strengths | |
|---|---|
| vLLM | Highest throughput; broad model support; active dev |
| TGI (HuggingFace) | Mature; HF ecosystem |
| SGLang | Structured output speedup |
| TensorRT-LLM | Highest perf on NVIDIA; complex build |
| llama.cpp / Ollama | CPU/GPU; single-user; not production-multi-user |
For multi-user prod: vLLM is the default in 2026.
Hybrid: API + self-host
Common production pattern:
Frontier hard tasks: Anthropic / OpenAI API.
High-volume narrow tasks: Self-hosted fine-tuned 8B.
Embedding: cheap API or self-host.
Route by task. See LLM Routing .
Real example: a team I worked with replaced Sonnet on a 50M-tokens/day classifier with a fine-tuned Llama 8B on vLLM. Saved $40k/month. Training cost: $100. Self-hosted serving infra: $1500/month.
Operational concerns
- GPU procurement: H100s are still constrained; lead times.
- Failover: GPU dies; spare capacity? Multi-region?
- Monitoring: GPU utilization, memory, queue depth, p99 latency.
- Updates: model upgrades require coordinated deploys.
- Capacity planning: traffic grows; need more GPUs.
These are real ops costs. Budget an SRE-equivalent’s time.
Multi-tenant on shared GPUs
vllm serve <base-model> --enable-lora --max-loras 8 \
--lora-modules m1=/path/m1 m2=/path/m2 ...
Multiple LoRA adapters on one base model. Per-request adapter routing. Multi-tenant fine-tunes share GPU.
Big cost saver if you have many narrow fine-tunes.
Spot / preemptible GPUs
GPU on spot is 50-70% cheaper but interruptible. For inference:
- Hot-swap: kill spot instance; on-demand backup takes over.
- Mixed pool: 70% spot + 30% on-demand baseline.
- Stateless serving: spot fits fine for inference (no state to lose).
Standard pattern for cost reduction.
Model size sweet spots
- Llama 3.1 8B / Mistral 7B: fits one A100 / L4; great for fine-tuned narrow tasks.
- Llama 3.1 70B: needs A100/H100; competitive with Sonnet on many tasks.
- Llama 3.3 405B: matches frontier on many; requires multi-GPU; expensive to host.
- Mixtral 8x22B: MoE; smart middle ground.
For 2026: 8B fine-tuned is the sweet spot for most narrow tasks.
Embedding self-host
Embedding models (BAAI bge-m3, jina-v3, etc.) are tiny. ~1k tokens / 50ms on a single CPU or budget GPU.
docker run --gpus all -p 8080:80 \
ghcr.io/huggingface/text-embeddings-inference \
--model-id BAAI/bge-m3
Self-hosted embeddings break even much lower than chat models (small model + high reuse).
Common mistakes
1. Self-host without measuring API spend
You assume self-host is cheaper. Run the math; you’re at $5k/month spend; self-host costs $3k for hardware + 1 SRE-day/month. Maybe wash.
2. Underutilized expensive GPU
H100 idle at 5% util because traffic is bursty. APIs win this scenario.
3. Ollama in production
Single-user inference engine. Multi-user load → terrible throughput. Use vLLM.
4. No quantization
FP16 only. Memory-constrained; serving fewer concurrent requests; cost-per-token high. AWQ/GPTQ for prod.
5. No fallback
Self-hosted goes down; app down. Always have an API fallback.
What I’d ship today
For self-hosted serving:
- vLLM for chat / completion.
- Text Embeddings Inference for embeddings.
- AWQ / GPTQ INT4 quantization.
- Spot instances + on-demand baseline.
- Multi-LoRA for multi-tenant fine-tunes.
- API fallback for outages.
- GPU monitoring in Prometheus.
For most apps: API for now; self-host one feature at a time once volume justifies.
Read this next
- Self-Hosted LLMs (vLLM, Ollama) 2026
- LLM Cost Optimization 2026
- LLM Fine-Tuning LoRA / QLoRA
- LLM Deployment Patterns 2026
If you want my vLLM production reference (config + monitoring + fallback), it’s at rajpoot.dev .
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .