Naive RAG works on simple questions: one query, one retrieval, one answer. The moment a real user asks “compare how the docs say to handle errors versus what the code actually does,” it falls apart. Agentic RAG is the response — let the LLM drive retrieval as a tool, with multi-step reasoning, decomposition, and self-reflection.
What changes
| | Naive RAG | Agentic RAG |
|---|---|---|
| Retrievals per query | 1 | 1–10 |
| LLM calls per query | 1 | 2–10 |
| Decision-maker | Pipeline | LLM |
| Best for | Direct lookups | Multi-step, comparative |
| Cost | 1× | 2–5× |
| Latency | 1× | 3–6× |
The extra cost buys correct answers to questions that naive RAG simply gets wrong.
The shape
User question
↓
LLM with retrieve() tool
↓
Plan / decompose
↓
Loop: retrieve(sub-q) → think → retrieve(next) ...
↓
Synthesize from results
↓
Self-reflect: did I answer fully?
↓
Final answer with citations
The LLM is the orchestrator. Retrieval is one tool among others.
LangGraph implementation
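from typing import Annotated, TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode, tools_condition

# pgvector_search and tavily_search are your own async helpers, defined elsewhere.

class AgentState(TypedDict):
    # add_messages appends new messages instead of overwriting the list.
    messages: Annotated[list, add_messages]

@tool
async def retrieve(query: str, k: int = 5) -> str:
    """Search the internal knowledge base; return the top-k chunks prefixed with [id]."""
    docs = await pgvector_search(query, k=k)
    return "\n\n".join(f"[{d.id}] {d.content}" for d in docs)

@tool
async def web_search(query: str) -> str:
    """Search the web for information missing from the knowledge base."""
    return await tavily_search(query, max_results=5)

tools = [retrieve, web_search]
llm = ChatAnthropic(model="claude-sonnet-4-6", temperature=0).bind_tools(tools)

async def call_model(state: AgentState):
    # One LLM turn: the reply either requests tool calls or is the final answer.
    msg = await llm.ainvoke(state["messages"])
    return {"messages": [msg]}

graph = StateGraph(AgentState)
graph.add_node("agent", call_model)
graph.add_node("tools", ToolNode(tools))
graph.set_entry_point("agent")
# tools_condition routes to "tools" when the last message has tool calls, else to END.
graph.add_conditional_edges("agent", tools_condition, {"tools": "tools", END: END})
graph.add_edge("tools", "agent")
agent = graph.compile()  # the tools are async, so run it with: await agent.ainvoke(...)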
The agent decides whether to retrieve, what to retrieve, and when it has enough. See AI Agents with LangGraph for the LangGraph mechanics.
Patterns that ship
1. Query rewriting
The user’s “what about that bug from yesterday?” doesn’t retrieve well alone. The agent rewrites with conversation context.
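A minimal sketch of the rewrite step, assuming a hypothetical `complete` callable that wraps whatever LLM you use (prompt in, text out). The prompt wording and the function names are illustrative, not part of LangChain or LangGraph.

```python
from typing import Callable, Sequence

# Illustrative instruction text; tune it for your own model.
REWRITE_INSTRUCTIONS = (
    "Rewrite the user's last question as a standalone search query. "
    "Resolve pronouns and vague references using the conversation. "
    "Return only the rewritten query."
)


def build_rewrite_prompt(history: Sequence[tuple[str, str]], question: str) -> str:
    """Format (role, text) turns plus the new question into one prompt string."""
    turns = "\n".join(f"{role}: {text}" for role, text in history)
    return f"{REWRITE_INSTRUCTIONS}\n\nConversation:\n{turns}\n\nLast question: {question}"


def rewrite_query(
    history: Sequence[tuple[str, str]],
    question: str,
    complete: Callable[[str], str],  # hypothetical LLM wrapper: prompt -> completion
) -> str:
    """Return a standalone query, falling back to the raw question."""
    if not history:
        return question  # no context to resolve against
    rewritten = complete(build_rewrite_prompt(history, question)).strip()
    return rewritten or question  # an empty completion keeps the original question
```

The fallback matters: if the rewrite call returns nothing useful, retrieval still runs on the user's original words instead of an empty query.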
2. Decomposition
The user asks “how do A, B, and C compare?” The agent issues three retrievals, one per sub-question, then synthesizes.
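In the LangGraph agent the model issues these retrievals itself through tool calls. The sketch below shows the same fan-out done explicitly, assuming a hypothetical async `retrieve` callable and sub-questions that were already produced by the model; the helper names are made up for illustration.

```python
import asyncio
from typing import Awaitable, Callable


async def retrieve_for_each(
    sub_questions: list[str],
    retrieve: Callable[[str], Awaitable[str]],  # hypothetical async retriever
) -> dict[str, str]:
    """Run one retrieval per sub-question concurrently, keyed by sub-question."""
    results = await asyncio.gather(*(retrieve(q) for q in sub_questions))
    return dict(zip(sub_questions, results))


def build_synthesis_context(results: dict[str, str]) -> str:
    """Label each result with its sub-question so the model can compare them."""
    return "\n\n".join(f"### {q}\n{text}" for q, text in results.items())
```

Running the retrievals concurrently keeps latency close to that of the slowest single retrieval rather than the sum of all three.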
3. Self-reflection
After answering, the agent asks itself: “Did I cover everything?” If not, it retrieves more. Add this to the system prompt:
After your initial answer, check whether the user’s question is fully addressed. If not, identify what’s missing and call retrieve() for it.
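The prompt leaves the loop to the model. If you would rather control it in code, the same idea can be sketched as a bounded loop. The three callables (`draft`, `find_gap`, `retrieve`) are hypothetical stand-ins for your own LLM and retriever wrappers, and `max_rounds=2` is an arbitrary default.

```python
from typing import Callable, Optional


def answer_with_reflection(
    question: str,
    draft: Callable[[str, list[str]], str],         # (question, context) -> answer
    find_gap: Callable[[str, str], Optional[str]],  # (question, answer) -> missing topic, or None
    retrieve: Callable[[str], str],                 # query -> retrieved text
    max_rounds: int = 2,
) -> str:
    """Draft an answer, then retrieve for each reported gap, at most max_rounds times."""
    context = [retrieve(question)]
    answer = draft(question, context)
    for _ in range(max_rounds):
        gap = find_gap(question, answer)
        if gap is None:
            break  # the critique found nothing missing
        context.append(retrieve(gap))
        answer = draft(question, context)
    return answer
```

The cap on rounds is what keeps a model that is never satisfied with its own answer from looping indefinitely.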
4. Citations
Every claim cites the chunk that supported it. The model emits [id] markers; the wrapper looks up URLs server-side. See Build a RAG App with pgvector.
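A sketch of that server-side lookup, assuming chunk ids made of letters, digits, `_`, and `-`, and a hypothetical `urls_by_id` mapping loaded from your own store. It renumbers known markers in order of first appearance and leaves unknown ones untouched so hallucinated ids stay visible.

```python
import re

# Assumed id format: [c12], [doc_7], [a-1]. Adjust to match your chunk ids.
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def resolve_citations(answer: str, urls_by_id: dict[str, str]) -> tuple[str, list[str]]:
    """Replace known [id] markers with [n]; return the text and the URLs in that order."""
    order: list[str] = []

    def number(match: re.Match) -> str:
        chunk_id = match.group(1)
        if chunk_id not in urls_by_id:
            return match.group(0)  # unknown id: keep the raw marker for inspection
        if chunk_id not in order:
            order.append(chunk_id)
        return f"[{order.index(chunk_id) + 1}]"

    text = CITATION.sub(number, answer)
    return text, [urls_by_id[chunk_id] for chunk_id in order]
```

Because the model only ever sees ids, it cannot invent a URL; anything it cites either maps to a real chunk or is left as an unresolved marker you can log.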
When NOT to use agentic RAG
- Single-fact questions. Naive RAG is cheaper.
- Latency budget < 1s. Agentic loops blow that.
- Strict per-query cost cap.
A pragmatic 2026 setup is to route each query: easy questions get naive RAG; hard ones (detected by a classifier or by a self-signal from the model) get agentic RAG.
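As a rough illustration of the routing step, here is a keyword heuristic standing in for a real classifier. The cue words and the 25-word threshold are assumptions picked for the example, not tuned values; in practice you would likely replace `choose_pipeline` with a small model call.

```python
import re

# Hypothetical cue words that hint at a comparative or multi-part question.
MULTI_STEP_CUES = re.compile(
    r"\b(compare|versus|vs|difference|differences|between|trade-?offs?)\b",
    re.IGNORECASE,
)


def choose_pipeline(question: str, max_naive_words: int = 25) -> str:
    """Return "agentic" for questions that look multi-step, otherwise "naive"."""
    if MULTI_STEP_CUES.search(question):
        return "agentic"
    if question.count("?") > 1:  # several questions packed into one message
        return "agentic"
    if len(question.split()) > max_naive_words:  # long questions tend to be compound
        return "agentic"
    return "naive"
```

A router like this errs toward the cheap path; misrouted hard questions can still be escalated if the naive answer comes back with low confidence.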
Cost discipline
Bound the loop:
- Max retrievals per query (e.g., 5).
- Max LLM calls per query (e.g., 8).
- Hard timeout.
Without bounds, an off-by-one in your prompt becomes a $50 question.
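The three bounds can be sketched as a small per-query budget object. `QueryBudget` and `BudgetExceeded` are hypothetical names, and the 30-second default timeout is an assumption; the retrieval and LLM-call defaults mirror the examples in the list above.

```python
import time


class BudgetExceeded(RuntimeError):
    """Raised when a query goes past one of its limits."""


class QueryBudget:
    """Tracks retrievals, LLM calls, and a wall-clock deadline for one query."""

    def __init__(self, max_retrievals: int = 5, max_llm_calls: int = 8, timeout_s: float = 30.0):
        self.max_retrievals = max_retrievals
        self.max_llm_calls = max_llm_calls
        self.deadline = time.monotonic() + timeout_s
        self.retrievals = 0
        self.llm_calls = 0

    def _check_deadline(self) -> None:
        if time.monotonic() > self.deadline:
            raise BudgetExceeded("hard timeout reached")

    def charge_retrieval(self) -> None:
        """Call before each retrieval; raises once the cap or deadline is hit."""
        self._check_deadline()
        if self.retrievals >= self.max_retrievals:
            raise BudgetExceeded("max retrievals reached")
        self.retrievals += 1

    def charge_llm_call(self) -> None:
        """Call before each LLM call; raises once the cap or deadline is hit."""
        self._check_deadline()
        if self.llm_calls >= self.max_llm_calls:
            raise BudgetExceeded("max LLM calls reached")
        self.llm_calls += 1
```

One way to use it is to create a budget per request, charge it at the top of the retrieval tool and the model node, and catch `BudgetExceeded` to return the best partial answer. LangGraph also accepts a `recursion_limit` in the run config that caps graph steps; check the current documentation for its exact semantics before relying on it as your only bound.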
For cost tactics see LLM Cost Optimization in 2026.
Read this next
- Build a Production RAG App with pgvector
- AI Agents with LangGraph
- Rerankers in RAG
- Multi-Agent Systems Production Patterns
If you want a working LangGraph + pgvector + reranker agentic-RAG starter, it’s at rajpoot.dev.