AI/LLM Cheatsheet 12 — Local LLMs (Ollama, vLLM)

Local LLMs cheatsheet.

Why self-host

Privacy (no data leaves).
Cost at scale.
No API rate limits.
Offline.
Custom fine-tunes.

Why NOT

GPU required for decent speed.
Quality typically below GPT-5 / Claude.
Ops burden.

Ollama

Easiest local runner:

brew install ollama          # or curl install
ollama serve

ollama pull llama4:8b
ollama pull qwen3:7b
ollama pull deepseek-r1:32b

ollama run llama4 "Hi"
ollama list

OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
r = client.chat.completions.create(model="llama4:8b", messages=[...])

vLLM (production)

Throughput-optimized server:

pip install vllm
vllm serve meta-llama/Llama-4-8B-Instruct \
    --port 8000 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768

OpenAI-compatible:

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

Continuous batching, paged attention. Best throughput.

llama.cpp

CPU/Apple Silicon focused:

brew install llama.cpp
llama-server -m model.gguf --host 0.0.0.0 --port 8080

GGUF quantized models run on consumer hardware.

Hugging Face TGI

docker run --gpus all -p 8080:80 -v $(pwd)/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-4-8B-Instruct

Production-ready, supports quantization.

Quantization

fp16 / bf16: half precision, ~2x smaller than fp32.
int8: ~4x smaller.
int4 (GGUF Q4_K_M): ~8x. Mild quality loss.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-4-8B", quantization_config=bnb_config)

Hardware

Model size	VRAM (fp16)	VRAM (int4)
7B	14GB	4GB
13B	26GB	8GB
70B	140GB	40GB
405B	800GB	200GB

Apple M-series: unified memory; 64GB Mac runs 70B int4.

Models worth trying (2026)

Llama 4 (Meta): general, strong.
Qwen 3 (Alibaba): coding, math.
DeepSeek R1: reasoning.
Mistral Small 3: efficient.
Phi 4 (Microsoft): small + good.
Gemma 3 (Google): on-device.

Open-LLM Leaderboard / lmsys arena for current ranks.

Speed

t/s (tokens per second): ~50 for 7B on M-series, ~200 on A100.
Prefill speed: prompt processing speed.
Generation speed: output token rate.

Batching dramatically improves throughput (vLLM, TGI).

Embedding models

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")
embeddings = model.encode(texts, batch_size=32)

LLaVA / Qwen-VL run locally. Same Ollama API:

ollama run llava "describe this image" --image ./pic.jpg

Fine-tuning

LoRA / QLoRA: cheap fine-tunes (run on consumer GPU).
Axolotl: framework.
Unsloth: 2x faster training.

pip install unsloth
# scripts at github.com/unslothai/unsloth

When to fine-tune vs prompt

Fine-tune when:

Specific output format / style not achievable via prompt.
Need to compress long prompts.
High-volume use case (cost win).

Otherwise: prompts + few-shot are easier.

Inference behind nginx

location / {
    proxy_pass http://localhost:8000;
    proxy_buffering off;
    proxy_read_timeout 300s;
}

Multi-tenant

vLLM handles multiple concurrent requests well. For tight isolation: separate processes / GPU partitions.

Cost vs API

Self-hosted breakeven varies. Rough: if spending $1k+/month on API for a workload that a 70B model can serve → consider self-host on RunPod / cloud GPU.

Common mistakes

Running 70B on 8GB GPU → OOM.
No batching → 1x throughput vs 10-100x with vLLM.
Quantize too aggressive → quality cliff.
Forgetting GPU monitoring → silent throttling.
Treating local as drop-in for GPT-5 quality.

Why self-host#

Why NOT#

Ollama#

vLLM (production)#

llama.cpp#

Hugging Face TGI#

Quantization#

Hardware#

Models worth trying (2026)#

Speed#

Embedding models#

Multi-modal#

Fine-tuning#

When to fine-tune vs prompt#

Inference behind nginx#

Multi-tenant#

Cost vs API#

Common mistakes#

Read this next#