RAG cheatsheet.
Basic RAG pipeline
ingest: documents → chunk → embed → store
query: question → embed → search → top-k → prompt → LLM → answer
Minimal implementation
def rag(question, k=5):
    q_vec = embed(question)
    chunks = vector_db.search(q_vec, k=k)
    # Number chunks from 1 so [N] markers match the citation format
    context = "\n\n".join(f"[{i}] {c.text}" for i, c in enumerate(chunks, start=1))
    response = llm(
        system="Answer using ONLY the context. If unknown, say 'not in context'.",
        user=f"Context:\n{context}\n\nQuestion: {question}",
    )
    return response, chunks
Chunking strategies
# Fixed-size
def chunk(text, size=500, overlap=50): ...
# Semantic (paragraph/section based)
import re
sections = re.split(r'\n##? ', text)
# Recursive (langchain)
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(text)
# Code chunks
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter.from_language(Language.PYTHON, chunk_size=500)
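The fixed-size `chunk(...)` signature above leaves its body elided. A minimal sketch of one possible body, assuming `size` and `overlap` count characters (the original does not say whether they are characters or tokens); the name `chunk_fixed` is hypothetical:

```python
def chunk_fixed(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters.

    Each window shares `overlap` characters with the previous one.
    Assumption: sizes are measured in characters, not tokens.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some text twice.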
Metadata filters
Store doc_id, date, source, etc; filter at query time:
results = qdrant.search(
    "docs",
    query_vector=q_vec,
    query_filter={"must": [{"key": "source", "match": {"value": "docs.example.com"}}]},
    limit=5,
)
Hybrid search (vector + keyword)
vector_hits = vector_search(q_vec, k=20)
keyword_hits = bm25_search(query, k=20)
# Reciprocal rank fusion
def rrf(lists, k=60):
    scores = {}
    for lst in lists:
        for rank, item in enumerate(lst):
            scores[item.id] = scores.get(item.id, 0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
hybrid = rrf([vector_hits, keyword_hits])[:10]
Reranking
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("BAAI/bge-reranker-large")
candidates = hybrid_search(query, k=50)
pairs = [(query, c.text) for c in candidates]
scores = reranker.predict(pairs)
top = sorted(zip(candidates, scores), key=lambda x: -x[1])[:5]
Alternatively, use a hosted reranker such as the Cohere Rerank or Voyage Rerank API.
HyDE (hypothetical document embedding)
Generate a hypothetical answer, then search for similar real docs:
hypothetical = llm(f"Answer briefly: {question}")
docs = vector_search(embed(hypothetical), k=5)
Works well when the query differs stylistically from the corpus (e.g. a short question against long-form documentation).
Query expansion
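# llm() returns text, so ask for one variation per line and split it into a list
raw = llm(f"Generate 3 query variations, one per line: {question}")
expansions = [line.strip() for line in raw.splitlines() if line.strip()]
all_results = []
for q in [question] + expansions:
    all_results.extend(vector_search(embed(q), k=5))
# Dedup and rerank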
Multi-query (RAG-Fusion)
Multiple queries → results merged via RRF → context for final answer.
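A minimal sketch of that flow, assuming a hypothetical `search` callable that returns a ranked list of chunk IDs for one query. `rrf_ids` and `rag_fusion` are illustrative names, not part of any library:

```python
from collections import defaultdict
from typing import Callable, Iterable


def rrf_ids(ranked_lists: Iterable[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion over lists of chunk IDs (best first)."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked):
            scores[chunk_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


def rag_fusion(queries: list[str],
               search: Callable[[str], list[str]],
               top_n: int = 5) -> list[str]:
    """Run every query, fuse the rankings, and keep the top_n chunk IDs."""
    return rrf_ids([search(q) for q in queries])[:top_n]
```

Because scores are keyed by chunk ID, fusion also deduplicates: a chunk found by several queries appears once, with a higher score.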
Citations
Have LLM cite chunks by ID:
Context:
[1] Earth orbits the sun.
[2] Mars has two moons.
Question: How many moons does Mars have?
Answer: Mars has two moons [2].
Parse [N] from response → link to original chunks. Show in UI.
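A minimal sketch of the parsing step, assuming 1-based `[N]` markers as in the example above and chunks held as a plain list of strings. `extract_citations` is a hypothetical helper name:

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def extract_citations(answer: str, chunks: list[str]) -> dict[int, str]:
    """Map each valid [N] marker in the answer to the chunk it cites (1-based)."""
    cited: dict[int, str] = {}
    for match in CITATION.finditer(answer):
        n = int(match.group(1))
        if 1 <= n <= len(chunks):  # skip markers that point at no chunk
            cited[n] = chunks[n - 1]
    return cited
```

Markers that fall outside the chunk range are dropped here. In practice they are worth logging, since they usually mean the model invented a source.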
Anti-hallucination
You are a helpful assistant.
Use ONLY the context provided.
If the context doesn't contain the answer, respond exactly: "I don't know based on the provided context."
Always cite sources using [N] format.
Context length budgeting
def select_chunks(chunks, max_tokens=8000):
    selected = []
    used = 0
    for c in chunks:
        t = count_tokens(c.text)
        if used + t > max_tokens:
            break
        selected.append(c)
        used += t
    return selected
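`count_tokens` is not defined above. A rough stand-in, assuming about 4 characters per token for English text; this is a heuristic, not a tokenizer, so use the model's own tokenizer when the budget is tight:

```python
import math


def count_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Estimate the token count from character length (heuristic only)."""
    if not text:
        return 0
    return max(1, math.ceil(len(text) / chars_per_token))
```

Rounding up errs toward overestimating, which keeps the selected context under the budget rather than over it.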
Conversational RAG
Track chat history, and rewrite the latest question into a standalone search query before retrieval:
search_query = llm(f"Given history: {history}\nQuestion: {q}\nWrite a standalone search query.")
chunks = retrieve(search_query)
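# Join chunk text so the prompt does not contain the repr of chunk objects
context = "\n\n".join(c.text for c in chunks)
answer = llm(history + [{"role": "user", "content": f"Context: {context}\n\n{q}"}])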
Agentic RAG
LLM decides when to call retrieval tool:
tools = [{
    "name": "search_docs",
    "description": "Search documentation for relevant info",
    "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
}]
LLM may call multiple times with refined queries.
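A minimal sketch of the loop around that tool, independent of any vendor SDK. It assumes a hypothetical `llm_step` callable that returns either `{"query": ...}` (a retrieval request) or `{"answer": ...}` (the final reply); real APIs return richer tool-call objects:

```python
from typing import Callable


def agentic_rag(question: str,
                llm_step: Callable[[list[dict]], dict],
                search_docs: Callable[[str], list[str]],
                max_steps: int = 4) -> str:
    """Let the model alternate between searching and answering."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        action = llm_step(messages)
        if "answer" in action:
            return action["answer"]
        # The model asked for retrieval: run it and feed the results back
        results = search_docs(action["query"])
        messages.append({"role": "tool", "content": "\n".join(results)})
    # Step budget exhausted without a final answer
    return "I don't know based on the provided context."
```

The `max_steps` cap is a design choice: it bounds cost and latency if the model keeps refining queries without converging.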
GraphRAG
Build a knowledge graph from the docs, then answer queries with graph traversal plus an LLM. Implementations include Microsoft GraphRAG and LightRAG. Good for cross-document reasoning.
Evaluation
# Faithfulness: does answer use retrieved context?
# Answer relevance: does answer address question?
# Context recall: did retrieval find relevant chunks?
# Context precision: how much of retrieved is useful?
# Ragas library
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall])
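For a quick sanity check without an LLM judge, the two retrieval metrics can be approximated from labeled chunk IDs. This is a simplified, set-based sketch and not how Ragas computes them (its versions are LLM-judged); `context_precision_recall` is a hypothetical helper:

```python
def context_precision_recall(retrieved: list[str],
                             relevant: set[str]) -> tuple[float, float]:
    """Set-based precision and recall of retrieved chunk IDs."""
    unique = list(dict.fromkeys(retrieved))  # drop duplicates, keep order
    hits = sum(1 for chunk_id in unique if chunk_id in relevant)
    precision = hits / len(unique) if unique else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Low recall points at the retriever (chunking, embeddings, k); low precision with good recall suggests adding a reranker or lowering k.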
Caching
Cache:
- Embeddings of queries (LRU).
- Retrieval results.
- LLM responses (semantic cache).
from functools import lru_cache
@lru_cache(maxsize=10000)
def cached_embed(text): return embed(text)
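`lru_cache` only hits on exact string matches. A semantic cache instead reuses a response when a new query's embedding is close enough to a cached one. A minimal in-memory sketch, assuming query vectors are plain lists of floats and a 0.95 cosine threshold (an illustrative value to tune); `SemanticCache` is a hypothetical class:

```python
import math
from typing import Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def put(self, query_vec: Sequence[float], response: str) -> None:
        self.entries.append((list(query_vec), response))

    def get(self, query_vec: Sequence[float]) -> Optional[str]:
        """Return the response of the most similar cached query, if close enough."""
        best_score, best_response = -1.0, None
        for vec, response in self.entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

The linear scan is fine for a small cache; a larger one would typically live in the vector store itself.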
Tool integration
Don’t use RAG when:
- The question has a structured answer → SQL query.
- Real-time data (weather, prices) → API tool.
- Computation → calculator tool.
Combine RAG + tools. Don’t shoehorn.
Document update strategy
def upsert_doc(doc_id, content):
    # Delete old chunks for doc_id
    vector_db.delete({"doc_id": doc_id})
    # Re-chunk and embed
    chunks = chunk(content)
    embeds = embed_batch([c.text for c in chunks])
    vector_db.upsert(...)
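Why delete first: if a document shrinks, upserting only the new chunks would leave stale ones behind. A toy in-memory illustration of the delete-then-insert pattern (embeddings omitted); `InMemoryStore` is a hypothetical stand-in for a vector database, not a real client:

```python
class InMemoryStore:
    def __init__(self) -> None:
        # (doc_id, chunk_index) -> chunk text
        self.rows: dict[tuple[str, int], str] = {}

    def delete_doc(self, doc_id: str) -> None:
        """Remove every chunk that belongs to doc_id."""
        self.rows = {key: text for key, text in self.rows.items() if key[0] != doc_id}

    def upsert_doc(self, doc_id: str, chunks: list[str]) -> None:
        """Replace all chunks of doc_id with the new ones."""
        self.delete_doc(doc_id)
        for i, text in enumerate(chunks):
            self.rows[(doc_id, i)] = text
```

Storing `doc_id` with every chunk is what makes the targeted delete possible.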
Common mistakes
- Chunks too large → diluted similarity.
- No reranking → wrong chunks reach LLM.
- LLM ignores citations → hallucination unchecked.
- Embedding the query with a different model than the one used for the corpus.
- Ignoring metadata filters when applicable.
Read this next
If you want my RAG stack (Qdrant + reranker + citations), it’s at rajpoot.dev.