AI/LLM Cheatsheet 05 — RAG Patterns

Quick reference for retrieval-augmented generation (RAG) patterns.

Basic RAG pipeline

ingest:   documents → chunk → embed → store

query:    question → embed → search → top-k → prompt → LLM → answer

Minimal implementation

def rag(question, k=5):
    # Embed the question with the same model used for the corpus
    q_vec = embed(question)
    chunks = vector_db.search(q_vec, k=k)

    # Number chunks from 1 so the model can cite them as [1], [2], ...
    context = "\n\n".join(f"[{i}] {c.text}" for i, c in enumerate(chunks, start=1))

    response = llm(
        system="Answer using ONLY the context. If unknown, say 'not in context'.",
        user=f"Context:\n{context}\n\nQuestion: {question}",
    )
    return response, chunks
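
The function above takes `embed`, `vector_db`, and `llm` as given. As a rough, dependency-free sketch of the ingest and search steps, the following uses a toy hashed bag-of-words embedding and an in-memory list as the store. `toy_embed`, `InMemoryStore`, and `Chunk` are illustrative names invented here, not part of any library; a real system would use a trained embedding model and a vector database.

```python
import math
import re
import zlib
from dataclasses import dataclass


@dataclass
class Chunk:
    id: int
    text: str
    vector: list


def toy_embed(text, dim=1024):
    """Toy embedding: hashed bag-of-words, L2-normalised. Not a real model."""
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        vec[zlib.crc32(token.encode()) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class InMemoryStore:
    """Hypothetical stand-in for a vector database."""

    def __init__(self):
        self.chunks = []

    def add(self, text):
        self.chunks.append(Chunk(len(self.chunks), text, toy_embed(text)))

    def search(self, q_vec, k=5):
        # Vectors are unit length, so the dot product is the cosine similarity.
        def score(c):
            return sum(a * b for a, b in zip(q_vec, c.vector))
        return sorted(self.chunks, key=score, reverse=True)[:k]


# ingest: here each document is short enough to be a single chunk
store = InMemoryStore()
for doc in ["Earth orbits the sun.", "Mars has two moons.", "Python is a programming language."]:
    store.add(doc)

# query: embed the question and take the nearest chunk
hits = store.search(toy_embed("How many moons does Mars have?"), k=1)
print(hits[0].text)
```

The linear scan in `search` is fine for a few thousand chunks; beyond that an approximate nearest-neighbour index is the usual choice.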

Chunking strategies

# Fixed-size
def chunk(text, size=500, overlap=50): ...

# Semantic (paragraph/section based)
import re
sections = re.split(r'\n##? ', text)

# Recursive (LangChain); pip install langchain-text-splitters
# (older releases expose the same classes under langchain.text_splitter)
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(text)

# Code chunks (language-aware separators)
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter.from_language(Language.PYTHON, chunk_size=500)
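
The fixed-size strategy listed first is only given as a signature. One possible way to fill it in is sketched below, assuming sizes are measured in characters (production pipelines often count tokens instead). `chunk_fixed` is a name chosen for this sketch.

```python
def chunk_fixed(text, size=500, overlap=50):
    """Split text into windows of `size` characters; consecutive windows share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


pieces = chunk_fixed("abcdefghij", size=4, overlap=1)
print(pieces)  # ['abcd', 'defg', 'ghij']
```

The overlap means a sentence cut at a window boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.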

Metadata filters

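Store metadata such as doc_id, date, and source with each chunk, then filter at query time:

from qdrant_client import models

# Newer qdrant-client releases offer query_points() as the successor to search()
results = qdrant.search(
    collection_name="docs",
    query_vector=q_vec,
    query_filter=models.Filter(
        must=[models.FieldCondition(key="source", match=models.MatchValue(value="docs.example.com"))]
    ),
    limit=5,
)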

Hybrid search (vector + keyword)

vector_hits = vector_search(q_vec, k=20)
keyword_hits = bm25_search(query, k=20)

# Reciprocal rank fusion
def rrf(lists, k=60):
    scores = {}
    for lst in lists:
        for rank, item in enumerate(lst):
            scores[item.id] = scores.get(item.id, 0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

hybrid = rrf([vector_hits, keyword_hits])[:10]
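
To see how the fusion behaves, here is a small worked example on toy data. The `Hit` tuple and the IDs are made up for illustration; `rrf` is repeated so the snippet runs on its own.

```python
from collections import namedtuple

Hit = namedtuple("Hit", ["id"])


def rrf(lists, k=60):
    scores = {}
    for lst in lists:
        for rank, item in enumerate(lst):
            # Each list contributes 1 / (k + position), with positions starting at 1
            scores[item.id] = scores.get(item.id, 0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = [Hit("a"), Hit("b"), Hit("c")]
keyword_hits = [Hit("b"), Hit("d"), Hit("a")]

fused = rrf([vector_hits, keyword_hits])
print(fused)  # ['b', 'a', 'd', 'c']
```

Items found by both retrievers ("b", "a") rise to the top, and "d" (rank 2 in one list) edges out "c" (rank 3 in one list). RRF uses only ranks, so the raw vector and BM25 scores never need to be put on a common scale.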

Reranking

from sentence_transformers import CrossEncoder
reranker = CrossEncoder("BAAI/bge-reranker-large")

candidates = hybrid_search(query, k=50)
pairs = [(query, c.text) for c in candidates]
scores = reranker.predict(pairs)
top = sorted(zip(candidates, scores), key=lambda x: -x[1])[:5]

Alternatively, use a hosted reranker such as the Cohere Rerank or Voyage rerank API.

HyDE (hypothetical document embedding)

Generate a hypothetical answer, then search for similar real docs:

hypothetical = llm(f"Answer briefly: {question}")
docs = vector_search(embed(hypothetical), k=5)

Works well when the query differs stylistically from the corpus, for example short questions against long-form documentation.

Query expansion

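# llm() returns text, so ask for one variation per line and split
raw = llm(f"Generate 3 query variations, one per line: {question}")
expansions = [line.strip() for line in raw.splitlines() if line.strip()]

all_results = []
for q in [question] + expansions:
    all_results.extend(vector_search(embed(q), k=5))

# Dedup by chunk ID, then rerank
unique = list({r.id: r for r in all_results}.values())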

Multi-query (RAG-Fusion)

Multiple queries → results merged via RRF → context for final answer.
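
A minimal sketch of that flow, assuming two caller-supplied functions: `generate_queries` (for instance an LLM call) and `search` (a retriever returning chunk IDs, best first). Both names, and the stubbed data, are hypothetical.

```python
def rag_fusion(question, generate_queries, search, top_n=5, rrf_k=60):
    """Run several query variants and fuse their rankings with reciprocal rank fusion.

    generate_queries(question) -> list[str]          (hypothetical, e.g. an LLM call)
    search(query) -> list of chunk IDs, best first   (hypothetical retriever)
    """
    queries = [question] + list(generate_queries(question))
    scores = {}
    for q in queries:
        for rank, chunk_id in enumerate(search(q)):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (rrf_k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Stubbed components so the sketch runs without an LLM or a vector store.
fake_index = {
    "how many moons does mars have": ["c2", "c5"],
    "mars moons": ["c2", "c7"],
    "satellites of mars": ["c7", "c2"],
}
fused = rag_fusion(
    "how many moons does mars have",
    generate_queries=lambda q: ["mars moons", "satellites of mars"],
    search=lambda q: fake_index.get(q, []),
)
print(fused)  # ['c2', 'c7', 'c5']
```

The fused IDs are then looked up and passed as context to the final answer prompt.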

Citations

Have the LLM cite chunks by ID:

Context:
[1] Earth orbits the sun.
[2] Mars has two moons.

Question: How many moons does Mars have?
Answer: Mars has two moons [2].

Parse [N] from response → link to original chunks. Show in UI.
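
The parsing step might look like the sketch below, assuming chunks are numbered from 1 in the prompt as in the example. `extract_citations` is an illustrative helper, not a library function.

```python
import re


def extract_citations(answer, chunks):
    """Map [N] markers in an answer to the chunks they refer to (1-based)."""
    cited = {}
    for match in re.finditer(r"\[(\d+)\]", answer):
        n = int(match.group(1))
        if 1 <= n <= len(chunks):  # ignore markers that point at no chunk
            cited[n] = chunks[n - 1]
    return cited


chunks = ["Earth orbits the sun.", "Mars has two moons."]
cited = extract_citations("Mars has two moons [2].", chunks)
print(cited)  # {2: 'Mars has two moons.'}
```

Out-of-range markers are dropped silently here; logging them is a cheap way to notice when the model invents citations.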

Anti-hallucination

You are a helpful assistant.
Use ONLY the context provided.
If the context doesn't contain the answer, respond exactly: "I don't know based on the provided context."
Always cite sources using [N] format.

Context length budgeting

def select_chunks(chunks, max_tokens=8000):
    # chunks should already be sorted best-first; stop at the first one that overflows
    selected = []
    used = 0
    for c in chunks:
        t = count_tokens(c.text)
        if used + t > max_tokens:
            break
        selected.append(c)
        used += t
    return selected
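
`count_tokens` is left undefined above. The sketch below plugs in a crude stand-in, assuming roughly 4 characters per token for English text; this is a rule of thumb, not a tokenizer, and real budgets should use the tokenizer that matches the model. The `Chunk` tuple is illustrative.

```python
from collections import namedtuple

Chunk = namedtuple("Chunk", ["text"])


def count_tokens(text):
    # Crude approximation (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


def select_chunks(chunks, max_tokens=8000):
    selected, used = [], 0
    for c in chunks:
        t = count_tokens(c.text)
        if used + t > max_tokens:
            break
        selected.append(c)
        used += t
    return selected


# Three chunks of 10 "tokens" each under the heuristic, best first
ranked = [Chunk("a" * 40), Chunk("b" * 40), Chunk("c" * 40)]
kept = select_chunks(ranked, max_tokens=25)
print(len(kept))  # 2
```

Remember to leave room in the budget for the system prompt, the question, and the model's answer.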

Conversational RAG

Track chat history, and rewrite each follow-up into a standalone search query before retrieval:

search_query = llm(f"Given history: {history}\nQuestion: {q}\nWrite a standalone search query.")
chunks = retrieve(search_query)
context = "\n\n".join(c.text for c in chunks)
answer = llm(history + [{"role": "user", "content": f"Context:\n{context}\n\n{q}"}])

Agentic RAG

The LLM decides when to call a retrieval tool:

tools = [{
    "name": "search_docs",
    "description": "Search documentation for relevant info",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

The LLM may call the tool several times with refined queries.
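
The control loop around such a tool can be sketched as follows. The reply shape returned by `call_model` is a simplification invented for this example; real provider SDKs have their own message and tool-call formats, so treat this as the outline of the loop rather than working client code.

```python
def run_agent(question, call_model, search_docs, max_steps=5):
    """Loop until the model answers instead of requesting a search.

    call_model(messages) -> {"type": "tool_call", "query": str}
                         or {"type": "answer", "text": str}   (hypothetical shape)
    search_docs(query)   -> list[str] of retrieved passages    (hypothetical retriever)
    """
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = call_model(messages)
        if reply["type"] == "answer":
            return reply["text"]
        # The model asked for retrieval: run the tool and feed the result back.
        results = search_docs(reply["query"])
        messages.append({"role": "tool", "name": "search_docs", "content": "\n".join(results)})
    return "Stopped: step limit reached without a final answer."


# Scripted fake model: searches once, then answers from the tool result.
def fake_model(messages):
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        return {"type": "tool_call", "query": "mars moons"}
    return {"type": "answer", "text": f"From the docs: {tool_results[-1]['content']}"}


answer = run_agent("How many moons does Mars have?", fake_model, lambda q: ["Mars has two moons."])
print(answer)  # From the docs: Mars has two moons.
```

The `max_steps` cap keeps a model that never stops searching from looping indefinitely.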

GraphRAG

Build a knowledge graph from the documents and query it via graph traversal plus an LLM. Implementations include Microsoft GraphRAG and LightRAG. Useful for cross-document reasoning.

Evaluation

# Faithfulness: is every claim in the answer supported by the retrieved context?
# Answer relevance: does the answer address the question?
# Context recall: did retrieval find the relevant chunks?
# Context precision: how much of what was retrieved is useful?

# Ragas library
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall])
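
When hand-labelled relevant chunk IDs are available, the two retrieval metrics can also be approximated without any library. The sketch below is a plain set-based version for a single question; it is a simplification and not the definition Ragas uses.

```python
def context_precision_recall(retrieved_ids, relevant_ids):
    """ID-based retrieval metrics for one question.

    precision = fraction of retrieved chunks that are relevant
    recall    = fraction of relevant chunks that were retrieved
    """
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


p, r = context_precision_recall(["c1", "c2", "c3", "c4"], ["c2", "c4", "c9"])
print(p, r)  # 0.5 0.6666666666666666
```

Averaging these over a small labelled question set gives a quick regression check when changing chunk size, k, or the embedding model.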

Caching

Cache:

Embeddings of queries (LRU).
Retrieval results.
LLM responses (semantic cache).

from functools import lru_cache

@lru_cache(maxsize=10000)
def cached_embed(text): return embed(text)
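
`lru_cache` only helps with exact repeats. A semantic cache for LLM responses could be sketched as below, assuming any `embed` function that maps text to a list of floats. `SemanticCache`, the similarity threshold, and the toy embedding are all illustrative choices.

```python
import math


class SemanticCache:
    """Return a cached response when a new query embeds close to an earlier one."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # any function: str -> list[float]
        self.threshold = threshold  # minimum cosine similarity for a hit (tune per model)
        self.entries = []           # list of (vector, response)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def get(self, query):
        q = self.embed(query)
        best = max(self.entries, key=lambda e: self._cosine(q, e[0]), default=None)
        if best is not None and self._cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))


# Toy embedding for the demo only: counts of a few letters.
def toy_embed(text):
    return [float(text.lower().count(ch)) for ch in "aeioumrs"]


cache = SemanticCache(toy_embed, threshold=0.95)
cache.put("How many moons does Mars have?", "Two.")
print(cache.get("how many moons does mars have"))  # Two.
print(cache.get("zzz"))  # None
```

The lookup is a linear scan, which is fine for small caches; a larger one would store the vectors in the same vector database used for retrieval. A threshold set too low returns answers to questions that were never asked, so err on the strict side.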

Tool integration

Don’t use RAG when:

The answer lives in structured data → SQL tool.
The question needs real-time data (weather, prices) → API tool.
The question needs computation → calculator tool.

Combine RAG with tools instead of forcing every question through retrieval.

Document update strategy

def upsert_doc(doc_id, content):
    # Delete old chunks for doc_id
    vector_db.delete({"doc_id": doc_id})
    # Re-chunk and embed
    chunks = chunk(content)
    embeds = embed_batch([c.text for c in chunks])
    vector_db.upsert(...)

Common mistakes

Chunks too large → diluted similarity.
No reranking → wrong chunks reach LLM.
LLM ignores citations → hallucination unchecked.
Embedding queries with a different model than the corpus → vectors are not comparable.
Ignoring metadata filters when applicable.
