AI Engineering

Posts on AI engineering — the discipline of building real products on top of LLMs. Practical writing on RAG, agents, prompt engineering, vector databases, evaluations, and the production realities of shipping AI features that don’t fall apart in week three.

Self-Hosting LLMs in 2026 — When the Math Actually Works

Practical LLM self-hosting math: GPU pricing, throughput per GPU, sustained load break-even, vLLM tuning, and when API still wins.

Anthropic API Best Practices in 2026 — Caching, Tool Use, Streaming, and Production Patterns

Practical Anthropic API: prompt caching tactics, tool use loops, streaming, batch API, retries, and pitfalls from real production deployments.

Evaluating AI Coding Tools in 2026 — Benchmarks That Matter and Ones That Don't

Practical AI coding eval: SWE-bench / live benchmarks, internal benchmarks on your codebase, productivity metrics, and what to ignore.

Synthetic Data with LLMs in 2026 — Use Cases, Risks, and the Patterns That Work

Practical synthetic data: fine-tuning data, eval set generation, edge case enumeration, and the model-collapse / quality risks to watch.

Voice Agents in 2026 — STT, LLM, TTS, and Latency That Doesn't Hurt

Practical voice agent architecture: streaming Deepgram/AssemblyAI → LLM → ElevenLabs/OpenAI TTS, latency budgeting, barge-in, and patterns from production calls.

Model Context Protocol (MCP) in 2026 — What It Solved, What It Didn't

Practical MCP: building an MCP server, integrating with Claude / Cursor, when MCP wins, and the security pitfalls of remote tool access.

LLM Tool Use Patterns in 2026 — Schemas, Validation, and the Loop

Practical LLM tool use: schema design, parallel tool calls, error handling and retries on bad inputs, tool result formatting, and patterns that scale beyond 5 tools.

Agentic Coding in 2026 — Claude Code, Cursor, and the Real Workflow

Honest take on AI coding agents: where Claude Code / Cursor shine, when they hurt, the discipline of using them well, and what stays human.

LLM Batch Processing in 2026 — Anthropic / OpenAI Batch API for 50% Off

Practical LLM batch processing: when 24-hour latency is fine, queueing patterns, retry logic, error handling, and integrating batches with online apps.

LLM Deployment Patterns in 2026 — Inference Servers, Routing, and Production Architectures

Practical LLM deployment: vLLM / TGI for self-hosted serving, hybrid setups (API + local), routing layers, GPU autoscaling, fallbacks, and serving cost economics.