Posts on AI engineering — the discipline of building real products on top of LLMs. Practical writing on RAG, agents, prompt engineering, vector databases, evaluations, and the production realities of shipping AI features that don’t fall apart in week three.
Practical LLM self-hosting math: GPU pricing, throughput per GPU, sustained load break-even, vLLM tuning, and when API still wins.
Practical Anthropic API: prompt caching tactics, tool use loops, streaming, batch API, retries, and pitfalls from real production deployments.
Practical AI coding evaluation: SWE-bench / live benchmarks, internal benchmarks on your codebase, productivity metrics, and what to ignore.
Practical synthetic data: fine-tuning training data, eval set generation, edge case enumeration, and the model-collapse / quality risks to watch.
Practical voice agent architecture: streaming Deepgram/AssemblyAI → LLM → ElevenLabs/OpenAI TTS, latency budgeting, barge-in, and patterns from production calls.
Practical MCP: building an MCP server, integrating with Claude / Cursor, when MCP wins, and the security pitfalls of remote tool access.
Practical LLM tool use: schema design, parallel tool calls, error/retry on bad inputs, tool result formatting, and patterns that scale beyond 5 tools.
Honest take on AI coding agents: where Claude Code / Cursor shine, when they hurt, the discipline of using them well, and what stays human.
Practical LLM batch processing: when 24-hour latency is fine, queueing patterns, retry logic, error handling, and integrating batches with online apps.
Practical LLM deployment: vLLM / TGI for self-hosting, hybrid (API + local) setups, routing layers, GPU autoscaling, fallbacks, and serving cost economics.