Naive RAG works on simple questions: one query, one retrieval, one answer. The moment a real user asks “compare how the docs say to handle errors versus what the code actually does,” it falls apart. Agentic RAG is the response: the LLM drives retrieval as a tool, with multi-step reasoning, decomposition, and self-reflection.

What changes

                        Naive RAG         Agentic RAG
Retrievals per query    1                 1–10
LLM calls per query     1                 2–10
Decision-maker          Pipeline          LLM
Best for                Direct lookups    Multi-step, comparative
Cost                    baseline          2–5×
Latency                 baseline          3–6×

The extra cost buys you correct answers to questions that naive RAG simply gets wrong.

The shape

User question
  ↓
LLM with retrieve() tool
  ↓
Plan / decompose
  ↓
Loop: retrieve(sub-q) → think → retrieve(next) → ...
  ↓
Synthesize from results
  ↓
Self-reflect: did I answer fully?
  ↓
Final answer with citations

The LLM is the orchestrator. Retrieval is one tool among others.

LangGraph implementation

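from typing import Annotated, TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode, tools_condition


class AgentState(TypedDict):
    # add_messages appends new messages instead of overwriting the list
    messages: Annotated[list, add_messages]


# pgvector_search and tavily_search are assumed to be async helpers defined elsewhere.

@tool
async def retrieve(query: str, k: int = 5) -> str:
    """Search the internal knowledge base and return the top-k chunks, each prefixed with its id."""
    docs = await pgvector_search(query, k=k)
    return "\n\n".join(f"[{d.id}] {d.content}" for d in docs)


@tool
async def web_search(query: str) -> str:
    """Search the web for information that is not in the knowledge base."""
    return await tavily_search(query, max_results=5)


tools = [retrieve, web_search]
llm = ChatAnthropic(model="claude-sonnet-4-6", temperature=0).bind_tools(tools)


async def call_model(state: AgentState) -> dict:
    # The tools are async, so call the model async too and run the graph with agent.ainvoke().
    msg = await llm.ainvoke(state["messages"])
    return {"messages": [msg]}


graph = StateGraph(AgentState)
graph.add_node("agent", call_model)
graph.add_node("tools", ToolNode(tools))
graph.set_entry_point("agent")
# tools_condition routes to "tools" when the last message contains tool calls, otherwise to END
graph.add_conditional_edges("agent", tools_condition, {"tools": "tools", END: END})
graph.add_edge("tools", "agent")
agent = graph.compile()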

The agent decides whether to retrieve, what to retrieve, and when it has enough. See AI Agents with LangGraph for the LangGraph mechanics.

Patterns that ship

1. Query rewriting

The user’s “what about that bug from yesterday?” doesn’t retrieve well alone. The agent rewrites with conversation context.
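A minimal sketch of the rewriting step, assuming the chat history is available as a list of (role, text) pairs. `build_rewrite_prompt` and `REWRITE_INSTRUCTION` are hypothetical names, not part of LangChain or LangGraph; the function only assembles the prompt that a rewriting LLM call would receive.

```python
from typing import Sequence

# Hypothetical instruction text; tune the wording for your own model.
REWRITE_INSTRUCTION = (
    "Rewrite the final user message as a standalone search query. "
    "Resolve pronouns and vague references using the conversation. "
    "Return only the rewritten query."
)


def build_rewrite_prompt(
    history: Sequence[tuple[str, str]], question: str, max_turns: int = 6
) -> str:
    """Assemble a prompt asking an LLM to make `question` self-contained."""
    recent = list(history)[-max_turns:]  # keep only the most recent turns
    transcript = "\n".join(f"{role}: {text}" for role, text in recent)
    return (
        f"{REWRITE_INSTRUCTION}\n\n"
        f"Conversation:\n{transcript}\n\n"
        f"Final user message: {question}"
    )
```

The model's reply to this prompt, rather than the raw user message, is what gets passed to retrieve().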

2. Decomposition

The user asks “how do A, B, and C compare?” The agent issues three retrievals, one per sub-question, then synthesizes the results.
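The fan-out can be sketched outside the graph as follows. This assumes the sub-questions have already been produced (by the LLM, in practice) and that the retriever is any async callable; `retrieve_for_subquestions` and `fake_retriever` are illustrative names, and the fake retriever stands in for a real vector search.

```python
import asyncio
from typing import Awaitable, Callable

Retriever = Callable[[str], Awaitable[str]]


async def retrieve_for_subquestions(
    sub_questions: list[str], retriever: Retriever
) -> dict[str, str]:
    """Run one retrieval per sub-question concurrently, keyed by sub-question."""
    results = await asyncio.gather(*(retriever(q) for q in sub_questions))
    return dict(zip(sub_questions, results))


async def fake_retriever(query: str) -> str:
    # Stand-in for a real vector search; returns a canned string.
    return f"context for: {query}"


if __name__ == "__main__":
    subs = [
        "How does A handle errors?",
        "How does B handle errors?",
        "How does C handle errors?",
    ]
    print(asyncio.run(retrieve_for_subquestions(subs, fake_retriever)))
```

Keeping results keyed by sub-question makes it easier for the synthesis step to attribute each piece of context to the part of the comparison it supports.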

3. Self-reflection

After answering, the agent asks itself: “Did I cover everything?” If not, it retrieves more. Add this to the system prompt:

After your initial answer, check whether the user’s question is fully addressed. If not, identify what’s missing and call retrieve() for it.
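The same check can also be enforced in code rather than left entirely to the prompt. Below is a rough sketch under stated assumptions: `draft`, `find_gap`, and `retrieve` are hypothetical callables you would back with LLM and retrieval calls, and `max_rounds` caps how many times the loop may go back for more context.

```python
from typing import Callable, Optional


def answer_with_reflection(
    question: str,
    draft: Callable[[str, list[str]], str],
    find_gap: Callable[[str, str], Optional[str]],
    retrieve: Callable[[str], str],
    max_rounds: int = 2,
) -> str:
    """Draft an answer, then retrieve for reported gaps, at most max_rounds times."""
    context: list[str] = [retrieve(question)]
    answer = draft(question, context)
    for _ in range(max_rounds):
        gap = find_gap(question, answer)  # None means "fully addressed"
        if gap is None:
            break
        context.append(retrieve(gap))  # fetch context for the missing piece
        answer = draft(question, context)
    return answer
```

The explicit round limit matters: without it, a critic that never reports "fully addressed" would keep the loop running.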

4. Citations

Every claim cites the chunk that supported it. The model emits [id] markers; the wrapper looks up URLs server-side. See Build a RAG App with pgvector.
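One way the server-side lookup might work, as a sketch: it assumes chunk ids are made of letters, digits, underscores, or hyphens, and that the wrapper already holds a mapping from id to URL. `resolve_citations` and `CITATION_RE` are illustrative names, and the id format is an assumption rather than something fixed by the retrieve() tool.

```python
import re

# Assumed id format: letters, digits, underscore, hyphen.
CITATION_RE = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def resolve_citations(
    answer: str, url_by_id: dict[str, str]
) -> tuple[str, list[str]]:
    """Replace known [id] markers with numbered references; return text and URLs."""
    order: list[str] = []  # chunk ids in order of first appearance

    def _sub(match: re.Match) -> str:
        chunk_id = match.group(1)
        if chunk_id not in url_by_id:
            return match.group(0)  # leave unknown markers untouched
        if chunk_id not in order:
            order.append(chunk_id)
        return f"[{order.index(chunk_id) + 1}]"

    text = CITATION_RE.sub(_sub, answer)
    return text, [url_by_id[i] for i in order]
```

Markers that do not match a retrieved chunk are left as-is here; a stricter wrapper could strip them or reject the answer, since an unknown id usually means the model invented a source.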

When NOT to use agentic RAG

  • Single-fact questions. Naive RAG is cheaper.
  • Latency budget < 1s. Agentic loops blow that.
  • Strict per-query cost cap.

A pragmatic 2026 setup is to route each query: easy questions get naive RAG; hard ones (detected by a classifier or by a self-signal from the model) get agentic RAG.
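As an illustration of the routing idea above, here is a deliberately crude router. The keyword and length heuristics are placeholders for a trained classifier and are assumptions of this sketch, as are the names `choose_pipeline` and `COMPARATIVE_HINTS`.

```python
import re

# Placeholder signal for "multi-step" questions; a real system would use a classifier.
COMPARATIVE_HINTS = re.compile(
    r"\b(compare|versus|vs\.?|difference between|trade-?offs?)\b", re.IGNORECASE
)


def choose_pipeline(question: str, max_simple_words: int = 12) -> str:
    """Return "agentic" for questions that look multi-step, else "naive"."""
    if COMPARATIVE_HINTS.search(question):
        return "agentic"
    # Several question marks or a long question suggest more than one lookup.
    if question.count("?") > 1 or len(question.split()) > max_simple_words:
        return "agentic"
    return "naive"
```

A heuristic like this will misroute some queries, so it is worth logging the routing decision next to answer quality before trusting it.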

Cost discipline

Bound the loop:

  • Max retrievals per query (e.g., 5).
  • Max LLM calls per query (e.g., 8).
  • Hard timeout.

Without bounds, an off-by-one in your prompt becomes a $50 question.
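The three bounds can live in one small object that the tool wrappers and the model node charge against. This is a sketch, not a LangGraph feature: `QueryBudget` and `BudgetExceeded` are hypothetical names, the 5 and 8 defaults mirror the examples above, and the 30-second timeout is an assumed value.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a query goes over one of its limits."""


@dataclass
class QueryBudget:
    max_retrievals: int = 5
    max_llm_calls: int = 8
    timeout_s: float = 30.0  # assumed default; pick one that fits your latency budget
    retrievals: int = 0
    llm_calls: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def _check_time(self) -> None:
        if time.monotonic() - self.started_at > self.timeout_s:
            raise BudgetExceeded("timeout")

    def charge_retrieval(self) -> None:
        """Call before each retrieval; raises once the limit is reached."""
        self._check_time()
        if self.retrievals >= self.max_retrievals:
            raise BudgetExceeded("max retrievals reached")
        self.retrievals += 1

    def charge_llm_call(self) -> None:
        """Call before each LLM call; raises once the limit is reached."""
        self._check_time()
        if self.llm_calls >= self.max_llm_calls:
            raise BudgetExceeded("max LLM calls reached")
        self.llm_calls += 1
```

The timeout here is cooperative: it is only checked when something is charged, so a single slow call can still overrun it. Wrapping the whole run in asyncio.wait_for gives a harder stop, and LangGraph's recursion_limit run config is another backstop on the number of graph steps.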

For cost tactics, see LLM Cost Optimization in 2026.

Read this next

If you want a working LangGraph + pgvector + reranker agentic-RAG starter, it’s at rajpoot.dev.

