Self-hosting LLMs is appealing — fixed costs! No vendor lock-in! In practice, the math only works at sustained scale, on the right model, with operational discipline. This post is the honest economics.

The math

For Llama 3.1 70B-Instruct on an H100:

  • Throughput: ~500-1500 tokens/sec (depends on quantization, batch size).
  • GPU cost: H100 ~$2-3/hr on-demand; ~$1.50 spot.
  • Cost per 1M tokens: $1-3 (rough).

For Sonnet API: $3/MTok input, $15/MTok output. Self-hosted Llama 70B beats input cost decisively at sustained load.

For Llama 3.1 8B on an A100:

  • Throughput: ~3000-6000 tokens/sec.
  • A100 cost: ~$1-1.50/hr.
  • Cost per 1M tokens: $0.10-0.30.

5-15× cheaper than Sonnet — IF you can saturate the GPU.

The catch: utilization

H100 at 100% utilization: $2/hr → $0.50/Mtok (great).
H100 at 10% utilization: $2/hr → $5/Mtok (terrible).

If your traffic is bursty, the GPU sits idle. You’re paying for capacity, not usage.

API providers amortize across many customers; self-host doesn’t have that luxury.

Break-even calculation

Roughly:

Sonnet API: $3/MTok input.
H100 self-host (Llama 70B): ~$1/MTok at saturation; ~$10/MTok at 10% util.

Break-even: ~70-90% sustained GPU utilization for Llama 70B to beat Sonnet API.

Plus: SRE time, downtime, GPU procurement, model updates. Add 30-50% to the cost calc.

For most apps under 1B tokens/month: API wins. For 5B+ tokens/month sustained: self-host wins.

vLLM in production

vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.92 \
    --max-num-seqs 256 \
    --enable-prefix-caching \
    --quantization awq

Key flags:

  • --tensor-parallel-size: split across N GPUs.
  • --max-num-seqs: concurrent requests.
  • --gpu-memory-utilization: how much of GPU memory to use.
  • --enable-prefix-caching: KV cache reuse for common prefixes.
  • --quantization: AWQ / GPTQ for memory savings.

Throughput dominates at high batch sizes; latency suffers. Tune for your workload.

Quantization

  • FP16: full quality; 2 bytes/param.
  • INT8: ~half memory; tiny quality loss.
  • INT4 / AWQ / GPTQ: 4× smaller; small but visible quality loss.

For Llama 70B FP16: ~140GB memory (>1× H100). With INT4: ~40GB (fits 1 H100).

For most production: AWQ or GPTQ INT4. Quality is fine for most tasks; cost drops dramatically.

Inference servers

Strengths
vLLMHighest throughput; broad model support; active dev
TGI (HuggingFace)Mature; HF ecosystem
SGLangStructured output speedup
TensorRT-LLMHighest perf on NVIDIA; complex build
llama.cpp / OllamaCPU/GPU; single-user; not production-multi-user

For multi-user prod: vLLM is the default in 2026.

Hybrid: API + self-host

Common production pattern:

Frontier hard tasks: Anthropic / OpenAI API.
High-volume narrow tasks: Self-hosted fine-tuned 8B.
Embedding: cheap API or self-host.

Route by task. See LLM Routing .

Real example: a team I worked with replaced Sonnet on a 50M-tokens/day classifier with a fine-tuned Llama 8B on vLLM. Saved $40k/month. Training cost: $100. Self-hosted serving infra: $1500/month.

Operational concerns

  • GPU procurement: H100s are still constrained; lead times.
  • Failover: GPU dies; spare capacity? Multi-region?
  • Monitoring: GPU utilization, memory, queue depth, p99 latency.
  • Updates: model upgrades require coordinated deploys.
  • Capacity planning: traffic grows; need more GPUs.

These are real ops costs. Budget an SRE-equivalent’s time.

Multi-tenant on shared GPUs

vllm serve <base-model> --enable-lora --max-loras 8 \
  --lora-modules m1=/path/m1 m2=/path/m2 ...

Multiple LoRA adapters on one base model. Per-request adapter routing. Multi-tenant fine-tunes share GPU.

Big cost saver if you have many narrow fine-tunes.

Spot / preemptible GPUs

GPU on spot is 50-70% cheaper but interruptible. For inference:

  • Hot-swap: kill spot instance; on-demand backup takes over.
  • Mixed pool: 70% spot + 30% on-demand baseline.
  • Stateless serving: spot fits fine for inference (no state to lose).

Standard pattern for cost reduction.

Model size sweet spots

  • Llama 3.1 8B / Mistral 7B: fits one A100 / L4; great for fine-tuned narrow tasks.
  • Llama 3.1 70B: needs A100/H100; competitive with Sonnet on many tasks.
  • Llama 3.3 405B: matches frontier on many; requires multi-GPU; expensive to host.
  • Mixtral 8x22B: MoE; smart middle ground.

For 2026: 8B fine-tuned is the sweet spot for most narrow tasks.

Embedding self-host

Embedding models (BAAI bge-m3, jina-v3, etc.) are tiny. ~1k tokens / 50ms on a single CPU or budget GPU.

docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference \
  --model-id BAAI/bge-m3

Self-hosted embeddings break even much lower than chat models (small model + high reuse).

Common mistakes

1. Self-host without measuring API spend

You assume self-host is cheaper. Run the math; you’re at $5k/month spend; self-host costs $3k for hardware + 1 SRE-day/month. Maybe wash.

2. Underutilized expensive GPU

H100 idle at 5% util because traffic is bursty. APIs win this scenario.

3. Ollama in production

Single-user inference engine. Multi-user load → terrible throughput. Use vLLM.

4. No quantization

FP16 only. Memory-constrained; serving fewer concurrent requests; cost-per-token high. AWQ/GPTQ for prod.

5. No fallback

Self-hosted goes down; app down. Always have an API fallback.

What I’d ship today

For self-hosted serving:

  • vLLM for chat / completion.
  • Text Embeddings Inference for embeddings.
  • AWQ / GPTQ INT4 quantization.
  • Spot instances + on-demand baseline.
  • Multi-LoRA for multi-tenant fine-tunes.
  • API fallback for outages.
  • GPU monitoring in Prometheus.

For most apps: API for now; self-host one feature at a time once volume justifies.

Read this next

If you want my vLLM production reference (config + monitoring + fallback), it’s at rajpoot.dev .


Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .