Does 1M context replace RAG?

Mostly no. A 1M-token context is great for one-shot analysis (one big PDF). For repeated queries over a knowledge base, RAG is cheaper, faster, and easier to update. Long context complements RAG; it doesn't replace it.

Why does my answer get worse with more context?

Lost-in-the-middle. Models attend more to the start and end of the context, so facts in the middle get neglected. Put the most important context near the end (or the start).

LLM Context Windows in 2026 — Long Context, Cache, and the Limits of 'Just Add More'

By 2026, frontier models offer context windows of 200k+ tokens (some 1M). The temptation is to dump everything in. The reality: more context is not always better, and it’s never free. This post is the working set: what long context is good for, what it costs, and how to use it well.

What long context is good at

Single-document analysis: a 500-page PDF, a long meeting transcript.
Codebase Q&A: dump a whole repo (within reason).
Initial onboarding: load all background once, then answer many questions.
Few-shot with rich examples: 30 examples instead of 5.

What it’s NOT good at

Repeated queries over a knowledge base: RAG is cheaper.
Real-time low-latency: longer context = higher TTFT.
Cost-sensitive volume: every query pays full input cost (without caching).
Long-tail facts in the middle of a haystack.

Lost in the middle

Empirical finding: models recall facts at the start and end of context better than the middle.

Doc: [50k tokens of distraction] ... [important fact A] ... [50k tokens of distraction] ... [important fact B]
Query: "What's fact A and B?"
Result: A often missed; B usually recalled.

Mitigations:

Reorder by relevance before sending (retrieve relevant chunks; place at end).
A smaller context built by retrieval often beats a huge one.
Repeat critical facts at the end / beginning.
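A minimal sketch of the reordering mitigation, assuming you already have a relevance score per chunk from some retriever. The name `order_for_recall` and the `(score, text)` tuple shape are illustrative, not from any library. It sorts ascending so the highest-scoring chunk lands last, right before the question.

```python
from typing import List, Tuple

def order_for_recall(scored_chunks: List[Tuple[float, str]]) -> List[str]:
    """Return chunk texts ordered so the most relevant chunk comes last."""
    # sorted() is stable, so chunks with equal scores keep their input order.
    ranked = sorted(scored_chunks, key=lambda pair: pair[0])
    return [text for _, text in ranked]

# Hypothetical scores; in practice they come from your retriever or reranker.
chunks = [
    (0.91, "Refund window is 30 days."),
    (0.12, "Company history."),
    (0.55, "Shipping policy."),
]
ordered = order_for_recall(chunks)
print(ordered[-1])  # the chunk placed closest to the question
```

If the start of the context matters as much as the end for your model, a variant could alternate top chunks between both edges; measure before committing to either.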

Prompt caching changes the math

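import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# HUGE_DOC and query are placeholders: your document text and the user's question.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    system=[
        # cache_control marks the end of the prefix to cache
        {"type": "text", "text": HUGE_DOC, "cache_control": {"type": "ephemeral"}}
    ],
    messages=[{"role": "user", "content": query}],
)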

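200k-token document: $0.60 for the first call at the base input price (cache writes are billed at a premium, 1.25x for Anthropic's 5-minute cache, so closer to $0.75). Cached: $0.06 per subsequent call, as long as each call lands within the 5-minute window.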

For interactive sessions over a doc: cache once, ask many times. See LLM Cost Optimization.

Retrieval still wins for most apps

	Long context	RAG
Cost per query	High	Low
Latency	Higher	Lower
Knowledge update	Re-send	Re-index
Cite specific source	Hard	Easy
Scale	Limited	Unlimited

For a chatbot over your docs: RAG. For “summarize this 200-page report”: long context + cache.

See RAG Patterns.

When to use each

Long context:

Single-doc analysis.
“Read this whole codebase.”
One-off complex synthesis.

RAG:

Knowledge base over many docs.
Frequently updated content.
Citation requirements.
Cost / latency sensitivity.

Hybrid:

RAG to retrieve top-k → put in long context with extra metadata.
Long context for the conversation; RAG for fresh lookups.
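The first hybrid pattern can be sketched as below. The retriever is a toy word-overlap ranker standing in for a real vector or BM25 search, and the `source` / `updated` metadata fields are assumptions chosen for illustration. The point is the shape: retrieve top-k, then label each chunk with metadata before it goes into the prompt.

```python
import re
from typing import Dict, List

def retrieve_top_k(query: str, docs: List[Dict[str, str]], k: int = 2) -> List[Dict[str, str]]:
    """Toy lexical retriever: rank docs by how many query words they share."""
    q_words = set(re.findall(r"\w+", query.lower()))

    def overlap(doc: Dict[str, str]) -> int:
        return len(q_words & set(re.findall(r"\w+", doc["text"].lower())))

    return sorted(docs, key=overlap, reverse=True)[:k]

def build_context(hits: List[Dict[str, str]]) -> str:
    """Prefix each hit with its metadata so the model can cite the source."""
    return "\n\n".join(
        f'[source: {d["source"]} | updated: {d["updated"]}]\n{d["text"]}' for d in hits
    )

# Hypothetical knowledge-base entries.
docs = [
    {"source": "billing.md", "updated": "2026-01-10", "text": "Refunds are issued within 30 days."},
    {"source": "about.md", "updated": "2025-06-01", "text": "We were founded in Berlin."},
    {"source": "shipping.md", "updated": "2026-02-02", "text": "Orders ship within 2 days."},
]
hits = retrieve_top_k("How many days do refunds take?", docs, k=2)
context = build_context(hits)
print(context)
```

Keeping the metadata line inside the context is what makes "cite specific source" feasible once the chunks are merged into one prompt.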

Cost reality

For Claude Sonnet 4.6 at $3/MTok input:

200k context, no cache: $0.60 per query.
1M context, no cache: $3 per query.
200k context, cached: $0.06 per query.

For 100 queries against the same doc:

No cache: $60.
With cache: about $6.54 (1 full call at $0.60 + 99 cached calls at $0.06), before any cache-write surcharge on the first call.

Caching is the difference between “expensive” and “cheap.”
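A small calculator for this kind of estimate, as a sketch under stated assumptions: input tokens only, a cache-write multiplier of 1.25 and a cache-read multiplier of 0.10 (check your provider's current pricing), and every follow-up call hitting the cache. The function names are illustrative.

```python
def query_cost(tokens: int, price_per_mtok: float, multiplier: float = 1.0) -> float:
    """Input cost in dollars for a single call."""
    return tokens / 1_000_000 * price_per_mtok * multiplier

def session_cost(tokens: int, n_queries: int, price_per_mtok: float, cached: bool,
                 write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Total input cost for n_queries calls over the same context."""
    if n_queries <= 0:
        return 0.0
    if not cached:
        return n_queries * query_cost(tokens, price_per_mtok)
    # First call writes the cache (assumed premium); the rest read from it.
    first = query_cost(tokens, price_per_mtok, write_mult)
    rest = (n_queries - 1) * query_cost(tokens, price_per_mtok, read_mult)
    return first + rest

no_cache = session_cost(200_000, 100, 3.0, cached=False)
with_cache = session_cost(200_000, 100, 3.0, cached=True)
print(f"no cache: ${no_cache:.2f}, with cache: ${with_cache:.2f}")
```

Pass `write_mult=1.0` to ignore the write premium. Output tokens and cache misses after the window expires are left out, so treat the result as a lower bound.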

Latency

Long context increases TTFT (time-to-first-token). Rough figures, which vary by model and provider:

10k tokens: ~500ms TTFT.
100k tokens: ~3s TTFT.
1M tokens: 10s+ TTFT.

For chat: bad UX. For batch: fine.

Prompt caching reduces TTFT for the cached portion of the prompt; only the uncached tokens need to be prefilled.

Working with long context

1. Structure matters

SYSTEM: ...
DOCUMENT: <big-doc>
INSTRUCTION: Answer the question. Cite verbatim quotes from the document.
QUESTION: <user query>

An instruction placed near the end, after the document, sits where the model attends best.
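One way to assemble that layout in code, as a sketch: the helper name `build_prompt` is an assumption, and the system part is left out because it is usually passed separately. The bulky document goes first and the instruction and question go last.

```python
def build_prompt(document: str, question: str) -> str:
    """Put the bulky document first and the instruction and question last."""
    return (
        f"DOCUMENT:\n{document}\n\n"
        "INSTRUCTION: Answer the question. Cite verbatim quotes from the document.\n"
        f"QUESTION: {question}"
    )

prompt = build_prompt("Q3 revenue rose 12%.", "How did revenue change in Q3?")
print(prompt)
```

Keeping the document as a stable prefix also helps caching: the part that changes per query sits after the part that does not.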

2. Use anchors

Tag sections of the doc:

<chapter id="intro">...</chapter>
<chapter id="architecture">...</chapter>

Now you can ask “What does the architecture chapter say about X?” and the model can find it more reliably.
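A sketch of generating those anchors from a list of sections. The `tag_sections` helper is illustrative; it uses the standard library's `xml.sax.saxutils` to escape text so stray `<` or `&` characters in the document cannot break the tags.

```python
from typing import List, Tuple
from xml.sax.saxutils import escape, quoteattr

def tag_sections(sections: List[Tuple[str, str]]) -> str:
    """Wrap each (id, text) pair in a <chapter> tag the model can refer to by id."""
    return "\n".join(
        f"<chapter id={quoteattr(sec_id)}>{escape(text)}</chapter>"
        for sec_id, text in sections
    )

doc = tag_sections([
    ("intro", "Why we built this."),
    ("architecture", "Services talk over gRPC & HTTP."),
])
print(doc)
```

Escaping is a design choice: if you need the model to quote the text back verbatim, you may prefer to skip it and pick tag names that cannot collide with the content.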

3. Cite-and-quote pattern

“Answer the question. Quote the exact text from the document that supports your answer.” Forces the model to look in the doc, reducing hallucination.

4. Verify with retrieval

Get the answer from long context; then RAG-retrieve from the doc to verify the quote actually exists. Detects hallucination.
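The simplest form of this check is sketched below. It swaps retrieval for a plain substring match after whitespace and case normalization, which is enough to catch quotes that are not in the document at all. The function names are illustrative, and a fuzzy or embedding-based match would be needed to tolerate paraphrase.

```python
import re
from typing import Dict, List

def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line wraps do not break matching."""
    return re.sub(r"\s+", " ", text).strip().lower()

def verify_quotes(quotes: List[str], document: str) -> Dict[str, bool]:
    """Map each quote to True if it appears in the document after normalization."""
    haystack = normalize(document)
    result = {}
    for quote in quotes:
        needle = normalize(quote)
        # An empty quote is never treated as verified.
        result[quote] = bool(needle) and needle in haystack
    return result

document = "The service retries failed jobs\nup to three times before alerting."
result = verify_quotes(
    ["retries failed jobs up to three times", "retries failed jobs up to five times"],
    document,
)
print(result)
```

Any quote that maps to `False` is a candidate hallucination and can be flagged or trigger a retry.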

Code-context use case

Dumping a 50k-LOC codebase:

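import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# walk_repo() is a placeholder: it should yield (path, content) pairs.
files = walk_repo()
context = "\n\n".join(
    f"=== {path} ===\n{content}" for path, content in files
)
# ~200k tokens of code

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    # Cache the repo dump so follow-up questions reuse it.
    system=[{"type": "text", "text": context, "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "Where is auth handled?"}],
)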

Initial query: $0.60. Subsequent queries: $0.06. Excellent for IDE integration / code Q&A.

For huge repos: filter by path prefix or relevance before dumping.
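A sketch of the path-prefix filter with a rough token budget. The helper name `select_files` and the 4-characters-per-token estimate are assumptions; use a real tokenizer count if you need accuracy.

```python
from typing import Iterable, List, Tuple

def select_files(files: Iterable[Tuple[str, str]], prefixes: List[str],
                 max_tokens: int, chars_per_token: int = 4):
    """Keep files under the given path prefixes until the rough budget is spent."""
    selected, used = [], 0
    for path, content in files:
        if not path.startswith(tuple(prefixes)):
            continue
        est = len(content) // chars_per_token + 1  # crude estimate, not a tokenizer
        if used + est > max_tokens:
            continue  # skip files that do not fit; smaller later files may still fit
        selected.append((path, content))
        used += est
    return selected, used

# Hypothetical repo contents.
files = [
    ("src/auth/login.py", "def login(): ..." * 10),
    ("docs/README.md", "# Readme"),
    ("src/auth/tokens.py", "x = 1\n" * 1000),
    ("src/api/routes.py", "routes = []"),
]
selected, used = select_files(files, prefixes=["src/auth/", "src/api/"], max_tokens=100)
print([path for path, _ in selected], used)
```

Files are taken in input order, so sort them by relevance first if some matter more than others.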

Common mistakes

1. Throwing everything in

100k of irrelevant context dilutes signal. Curate.

2. No caching

Same big context every query at full price.

3. Trusting recall blindly

Long-context models still hallucinate. Verify or cite.

4. Ignoring latency

10s TTFT in a chat UI is a regression. Test perceived performance.

5. Long context where RAG would do

Repeated queries over the same KB → RAG by default.

What I’d ship today

For knowledge-app questions:

RAG by default.
Long context + cache for “analyze this big doc” features.
Hybrid: RAG retrieves; long context synthesizes.
Cite-and-quote patterns to limit hallucination.
Track cache hit rate; alert if low.
Latency monitoring; cap context size when interactive.
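Tracking cache hit rate can be sketched as below. The key names mirror the usage counters the Anthropic Messages API reports (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`); treat them as assumptions and verify against your SDK version. The `CacheStats` class and the 0.8 alert threshold are illustrative.

```python
class CacheStats:
    """Accumulate input-token counts across calls and report the cached share."""

    def __init__(self) -> None:
        self.read = 0     # tokens served from the cache
        self.written = 0  # tokens written to the cache
        self.fresh = 0    # uncached input tokens

    def record(self, usage: dict) -> None:
        # Missing or None counters are treated as zero.
        self.read += usage.get("cache_read_input_tokens") or 0
        self.written += usage.get("cache_creation_input_tokens") or 0
        self.fresh += usage.get("input_tokens") or 0

    def hit_rate(self) -> float:
        total = self.read + self.written + self.fresh
        return self.read / total if total else 0.0

stats = CacheStats()
# First call writes the cache, second call reads it (made-up numbers).
stats.record({"input_tokens": 50, "cache_creation_input_tokens": 200_000, "cache_read_input_tokens": 0})
stats.record({"input_tokens": 60, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 200_000})
rate = stats.hit_rate()
if rate < 0.8:  # hypothetical alert threshold
    print(f"low cache hit rate: {rate:.2%}")
```

A rate that stays low usually means the cached prefix changes between calls or that queries arrive further apart than the cache lifetime.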

Read this next

If you want my long-context + cache reference patterns, they’re at rajpoot.dev.

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.
