Local LLMs cheatsheet.

Why self-host

  • Privacy (no data leaves).
  • Cost at scale.
  • No API rate limits.
  • Offline.
  • Custom fine-tunes.

Why NOT

  • GPU required for decent speed.
  • Quality typically below GPT-5 / Claude.
  • Ops burden.

Ollama

Easiest local runner:

brew install ollama          # or curl install
ollama serve

ollama pull llama4:8b
ollama pull qwen3:7b
ollama pull deepseek-r1:32b

ollama run llama4 "Hi"
ollama list

OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
r = client.chat.completions.create(model="llama4:8b", messages=[...])

vLLM (production)

Throughput-optimized server:

pip install vllm
vllm serve meta-llama/Llama-4-8B-Instruct \
    --port 8000 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768

OpenAI-compatible:

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

Continuous batching, paged attention. Best throughput.

llama.cpp

CPU/Apple Silicon focused:

brew install llama.cpp
llama-server -m model.gguf --host 0.0.0.0 --port 8080

GGUF quantized models run on consumer hardware.

Hugging Face TGI

docker run --gpus all -p 8080:80 -v $(pwd)/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-4-8B-Instruct

Production-ready, supports quantization.

Quantization

  • fp16 / bf16: half precision, ~2x smaller than fp32.
  • int8: ~4x smaller.
  • int4 (GGUF Q4_K_M): ~8x. Mild quality loss.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-4-8B", quantization_config=bnb_config)

Hardware

Model sizeVRAM (fp16)VRAM (int4)
7B14GB4GB
13B26GB8GB
70B140GB40GB
405B800GB200GB

Apple M-series: unified memory; 64GB Mac runs 70B int4.

Models worth trying (2026)

  • Llama 4 (Meta): general, strong.
  • Qwen 3 (Alibaba): coding, math.
  • DeepSeek R1: reasoning.
  • Mistral Small 3: efficient.
  • Phi 4 (Microsoft): small + good.
  • Gemma 3 (Google): on-device.

Open-LLM Leaderboard / lmsys arena for current ranks.

Speed

  • t/s (tokens per second): ~50 for 7B on M-series, ~200 on A100.
  • Prefill speed: prompt processing speed.
  • Generation speed: output token rate.

Batching dramatically improves throughput (vLLM, TGI).

Embedding models

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")
embeddings = model.encode(texts, batch_size=32)

Multi-modal

LLaVA / Qwen-VL run locally. Same Ollama API:

ollama run llava "describe this image" --image ./pic.jpg

Fine-tuning

  • LoRA / QLoRA: cheap fine-tunes (run on consumer GPU).
  • Axolotl: framework.
  • Unsloth: 2x faster training.
pip install unsloth
# scripts at github.com/unslothai/unsloth

When to fine-tune vs prompt

Fine-tune when:

  • Specific output format / style not achievable via prompt.
  • Need to compress long prompts.
  • High-volume use case (cost win).

Otherwise: prompts + few-shot are easier.

Inference behind nginx

location / {
    proxy_pass http://localhost:8000;
    proxy_buffering off;
    proxy_read_timeout 300s;
}

Multi-tenant

vLLM handles multiple concurrent requests well. For tight isolation: separate processes / GPU partitions.

Cost vs API

Self-hosted breakeven varies. Rough: if spending $1k+/month on API for a workload that a 70B model can serve → consider self-host on RunPod / cloud GPU.

Common mistakes

  • Running 70B on 8GB GPU → OOM.
  • No batching → 1x throughput vs 10-100x with vLLM.
  • Quantize too aggressive → quality cliff.
  • Forgetting GPU monitoring → silent throttling.
  • Treating local as drop-in for GPT-5 quality.

Read this next

If you want my Ollama + vLLM dev setup, it’s at rajpoot.dev .


Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .