Local LLMs cheatsheet.
Why self-host
- Privacy (no data leaves).
- Cost at scale.
- No API rate limits.
- Offline.
- Custom fine-tunes.
Why NOT
- GPU required for decent speed.
- Quality typically below GPT-5 / Claude.
- Ops burden.
Ollama
Easiest local runner:
brew install ollama # or curl install
ollama serve
ollama pull llama4:8b
ollama pull qwen3:7b
ollama pull deepseek-r1:32b
ollama run llama4 "Hi"
ollama list
OpenAI-compatible API:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
r = client.chat.completions.create(model="llama4:8b", messages=[...])
vLLM (production)
Throughput-optimized server:
pip install vllm
vllm serve meta-llama/Llama-4-8B-Instruct \
--port 8000 \
--gpu-memory-utilization 0.9 \
--max-model-len 32768
OpenAI-compatible:
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
Continuous batching, paged attention. Best throughput.
llama.cpp
CPU/Apple Silicon focused:
brew install llama.cpp
llama-server -m model.gguf --host 0.0.0.0 --port 8080
GGUF quantized models run on consumer hardware.
Hugging Face TGI
docker run --gpus all -p 8080:80 -v $(pwd)/data:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Llama-4-8B-Instruct
Production-ready, supports quantization.
Quantization
- fp16 / bf16: half precision, ~2x smaller than fp32.
- int8: ~4x smaller.
- int4 (GGUF Q4_K_M): ~8x. Mild quality loss.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-4-8B", quantization_config=bnb_config)
Hardware
| Model size | VRAM (fp16) | VRAM (int4) |
|---|---|---|
| 7B | 14GB | 4GB |
| 13B | 26GB | 8GB |
| 70B | 140GB | 40GB |
| 405B | 800GB | 200GB |
Apple M-series: unified memory; 64GB Mac runs 70B int4.
Models worth trying (2026)
- Llama 4 (Meta): general, strong.
- Qwen 3 (Alibaba): coding, math.
- DeepSeek R1: reasoning.
- Mistral Small 3: efficient.
- Phi 4 (Microsoft): small + good.
- Gemma 3 (Google): on-device.
Open-LLM Leaderboard / lmsys arena for current ranks.
Speed
- t/s (tokens per second): ~50 for 7B on M-series, ~200 on A100.
- Prefill speed: prompt processing speed.
- Generation speed: output token rate.
Batching dramatically improves throughput (vLLM, TGI).
Embedding models
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")
embeddings = model.encode(texts, batch_size=32)
Multi-modal
LLaVA / Qwen-VL run locally. Same Ollama API:
ollama run llava "describe this image" --image ./pic.jpg
Fine-tuning
- LoRA / QLoRA: cheap fine-tunes (run on consumer GPU).
- Axolotl: framework.
- Unsloth: 2x faster training.
pip install unsloth
# scripts at github.com/unslothai/unsloth
When to fine-tune vs prompt
Fine-tune when:
- Specific output format / style not achievable via prompt.
- Need to compress long prompts.
- High-volume use case (cost win).
Otherwise: prompts + few-shot are easier.
Inference behind nginx
location / {
proxy_pass http://localhost:8000;
proxy_buffering off;
proxy_read_timeout 300s;
}
Multi-tenant
vLLM handles multiple concurrent requests well. For tight isolation: separate processes / GPU partitions.
Cost vs API
Self-hosted breakeven varies. Rough: if spending $1k+/month on API for a workload that a 70B model can serve → consider self-host on RunPod / cloud GPU.
Common mistakes
- Running 70B on 8GB GPU → OOM.
- No batching → 1x throughput vs 10-100x with vLLM.
- Quantize too aggressive → quality cliff.
- Forgetting GPU monitoring → silent throttling.
- Treating local as drop-in for GPT-5 quality.
Read this next
If you want my Ollama + vLLM dev setup, it’s at rajpoot.dev .
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .