Fine-Tuning LLMs in 2026 — LoRA, QLoRA, and the Cheap Path to Specialized Models
How to actually fine-tune LLMs in 2026 — LoRA / QLoRA mechanics, training data discipline, evaluation, and the patterns that make fine-tunes ship.
Production agent error handling. Per-tool retries vs whole-agent retries, fallback paths, step caps, escalation, human-in-the-loop, and the patterns from real agent deployments.
Picking a frontier LLM provider in 2026. Model quality across reasoning / coding / extraction, pricing, latency, ecosystem maturity, and which fits which workload.
Document AI in 2026: vision LLMs (Claude, GPT-4o, Gemini), classical OCR (Tesseract / Textract), layout models, and the production patterns for invoices, receipts, contracts.
How prompt caching actually works at Anthropic and OpenAI, where to place breakpoints for max hit rate, measuring cache effectiveness, and the patterns that compound across calls.
Comparison of LLM eval frameworks: Braintrust (ship-eval-with-code), LangSmith (LangChain-native), Ragas (RAG-specific), DeepEval (Pytest-style). Which to pick by team.
Tool design for agents — names, descriptions as prompts, input schemas, error handling, idempotency, and the patterns that make agents call them correctly.
Context engineering — what goes in the context window, in what order, and why. The patterns that separate working agents from confused ones.
Production LLM streaming with cancellation. SSE plus abort, client cancel propagating to provider, partial-response handling, and the patterns that save real tokens.
Most LLM apps run every query on the most expensive model. Routing with a small classifier sends easy queries to Haiku and reserves Opus for hard ones. The pattern, the math, and the implementation.