AI/LLM Cheatsheet 05 — RAG Patterns
Cheatsheet: RAG pipeline, chunking, retrieval, rerank, citations.
How to actually evaluate RAG: retrieval recall and MRR, answer faithfulness and relevance, golden datasets, automated eval pipelines, and Ragas.
Production hybrid search with Postgres alone: pgvector for semantic, tsvector for lexical, RRF fusion for combining, optional reranker. Performance, tuning, and patterns.
Why agents need memory beyond the context window, the 2026 tools (Mem0, Zep, custom layers), summary vs episodic memory, retrieval, and the patterns from production agents.
How to actually use 1M-token context windows. The ‘just put it all in context’ temptation, when it works, when RAG still wins, prompt caching, and cost.
Why agentic RAG often beats one-shot RAG. Tool-based retrieval, decomposition, query rewriting, self-reflection, citations, and the production patterns that ship in 2026.
Rerankers turn ‘pretty good RAG’ into ‘great RAG’ for one extra API call. Cross-encoders explained, Cohere Rerank vs BGE-Reranker vs Jina, two-stage retrieval architecture, latency, cost, and implementation.
A practical 2026 guide to picking an embedding model. OpenAI text-embedding-3 vs Voyage vs Cohere vs open BGE / Nomic. Quality on MTEB, cost, dimensions, multilingual, and how to evaluate on your own data.
A practical 2026 decision guide for LLM teams. When fine-tuning earns its cost, when RAG is right, when prompting is enough, the hybrid patterns, and the ops realities that change which one fits.
A complete, end-to-end RAG backend built on PostgreSQL + pgvector and FastAPI. Real chunking, real embeddings, hybrid (vector + BM25) retrieval, prompt assembly, citations, and production gotchas.
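The hybrid search entry above names RRF (Reciprocal Rank Fusion) as the step that combines the semantic and lexical result lists. A minimal sketch of that step, assuming each retriever returns an ordered list of document IDs. The function name `rrf_fuse`, the sample IDs, and the constant `k=60` (a commonly used default) are illustrative assumptions, not taken from the linked guides.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, tune it on your data.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a vector search and a lexical search.
semantic_hits = ["a", "b", "c"]
lexical_hits = ["b", "d", "a"]
fused = rrf_fuse([semantic_hits, lexical_hits])
print(fused)  # ['b', 'a', 'd', 'c']
```

RRF uses only rank positions, so the vector distances and lexical scores never need to be normalized against each other. An optional reranker would then run on the top of the fused list.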