Should I use 1M context instead of RAG?

For small-to-medium corpora that fit comfortably in the window (under ~500k tokens) and are reused often enough to cache, often yes. For large or frequently-updated corpora, RAG still wins on cost, freshness, and citation precision.

Is the model good at finding info in 1M tokens?

On clean needle-in-haystack tests, modern frontier models score 95%+ up to 1M. On real messy contexts, performance dips. Test on your data.

Does prompt caching make long context affordable?

It makes it possible. Cached input tokens cost ~10% of normal — turning a $0.30 query into $0.03 when context is reused. Without caching, 1M-token queries are usually too expensive at scale.

1M-Token Context Windows in 2026 — When They Help, When They Hurt

Frontier models in 2026 ship with 1M-token contexts: Claude Opus 4.7, Gemini 2.5 Pro, several open-source options. The temptation is real — forget RAG, just put the whole corpus in. This post is the honest answer to “should I?”

What 1M tokens buys you

~750k words of text.
~3000 pages of code.
A medium-sized book + all notes.
A 100k-LoC repo with comments.
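
To check whether a corpus is even in range, a back-of-the-envelope estimate is usually enough. The sketch below uses the ~750k words per 1M tokens ratio from the list above; the function names and the 50k-token reserve are my own assumptions, and code or non-English text tokenizes less favorably, so use a real tokenizer before committing.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb from above: 1M tokens is roughly 750k words


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count (not a real tokenizer)."""
    return round(len(text.split()) / WORDS_PER_TOKEN)


def fits_in_window(texts: list[str], window: int = 1_000_000, reserve: int = 50_000) -> bool:
    """True if the estimated total leaves `reserve` tokens free for the question and answer."""
    total = sum(estimate_tokens(t) for t in texts)
    return total <= window - reserve
```

If the estimate lands near the limit, treat it as "does not fit".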

Real fits:

Reading a long contract end-to-end.
Q&A over a single book.
Code review across an entire repo.
Summarizing a meeting + 20 referenced docs.

Real misfits:

50,000-document knowledge bases.
Frequently updated data.
Multi-tenant where cross-tenant leakage would be catastrophic.

Performance reality

Modern frontier models perform well on long-context retrieval — but not perfectly:

Needle-in-haystack: 95%+ recall up to 1M tokens.
Multi-step reasoning across long contexts: accuracy tends to drop once the context exceeds ~200k tokens.
Position bias: middle of context is weaker than start/end.

Benchmark on your data, not headline numbers.
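
One way to benchmark on your own data is a small needle-in-haystack harness: plant a known fact at different depths in your real documents and see where recall falls off. This is a minimal sketch, not a full eval. The `ask` callable is a hypothetical stand-in for whatever function sends context plus question to your model, and substring matching is a deliberately crude scorer.

```python
from typing import Callable


def build_haystack(filler: list[str], needle: str, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) among filler paragraphs."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    index = round(depth * len(filler))
    return "\n\n".join(filler[:index] + [needle] + filler[index:])


def recall_by_depth(
    ask: Callable[[str, str], str],  # hypothetical: (context, question) -> model answer
    filler: list[str],
    needle: str,
    question: str,
    expected: str,
    depths: tuple[float, ...] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> dict[float, bool]:
    """Return, per depth, whether the expected string showed up in the model's answer."""
    results: dict[float, bool] = {}
    for depth in depths:
        context = build_haystack(filler, needle, depth)
        answer = ask(context, question)
        results[depth] = expected.lower() in answer.lower()
    return results
```

Use paragraphs from your actual corpus as `filler`, and run several needles per depth before trusting the result.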

Cost reality

Without caching: 1M tokens × $15/MTok input = $15 per query. Not viable.

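With caching: roughly $1.50–$2 per query after the first call, since cached reads cost about 10% of the $15 base price and each new question adds only a small amount of uncached content on top of the cached corpus. This is what makes it affordable. Mechanics: Anthropic Claude API + Tool Use Guide.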
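
The arithmetic behind those numbers, as a small calculator. It assumes the $15/MTok input price and the ~10% cached-read rate quoted in this post; the function name and parameters are illustrative, and it deliberately ignores output tokens and any surcharge for writing the cache on the first call.

```python
def query_cost(
    corpus_tokens: int,
    question_tokens: int,
    input_price_per_mtok: float = 15.0,  # price used in this post; check current pricing
    cached_read_ratio: float = 0.10,     # cached reads at ~10% of the base input price
    cached: bool = True,
) -> float:
    """Input-side cost in dollars for one query; output tokens are not included."""
    corpus_rate = input_price_per_mtok * (cached_read_ratio if cached else 1.0)
    corpus_cost = corpus_tokens / 1_000_000 * corpus_rate
    # The new question is never cached, so it is billed at the full input price.
    question_cost = question_tokens / 1_000_000 * input_price_per_mtok
    return corpus_cost + question_cost
```

For example, `query_cost(1_000_000, 2_000)` gives the cached cost for a full-window corpus plus a 2k-token question, and passing `cached=False` gives the uncached figure.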

Long context vs RAG decision rule

Scenario	Pick
Single doc Q&A (book, contract)	Long context
Multi-doc, large corpus (>200 docs)	RAG
Frequently changing data	RAG
Strict citation traceability	RAG (chunk IDs)
One-shot analysis (review this PR)	Long context
Multi-tenant isolation needed	RAG with per-tenant scope

A common 2026 hybrid: RAG for the broad corpus + long context for top-N synthesis. Best of both.
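
The table can be written down as a tiny routing function, which is handy if you want the rule in code review rather than in someone's head. This is a sketch: the `Workload` fields are my own naming, the order of the checks is a judgment call, and treating the table's 200 threshold as a document count is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    doc_count: int = 1
    changes_frequently: bool = False
    needs_chunk_citations: bool = False
    multi_tenant: bool = False


def pick_approach(w: Workload, large_corpus_threshold: int = 200) -> str:
    """Map a workload to a row of the decision table; hard constraints are checked first."""
    if w.multi_tenant:
        return "RAG with per-tenant scope"
    if w.needs_chunk_citations:
        return "RAG (chunk IDs)"
    if w.changes_frequently or w.doc_count > large_corpus_threshold:
        return "RAG"
    # Single-document Q&A and one-shot analysis both land here.
    return "Long context"
```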

Patterns

Document analysis

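from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
SYSTEM = "You are a careful contract analyst."  # placeholder system prompt

async def analyze_contract(contract: str, q: str) -> str:
    response = await client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,  # required by the Messages API; size it to your answers
        system=[{"type": "text", "text": SYSTEM, "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": [
            # The contract is the large, stable part, so it carries the cache marker.
            {"type": "text", "text": contract, "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": q},
        ]}],
    )
    # The API returns a Message object; return the text of its first content block.
    return response.content[0].text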

The contract is cached. Each follow-up question is cheap.

Code repo review

Load the repo (markdown-formatted, file headers) into context, then ask cross-file questions. Surprisingly effective.
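
A minimal sketch of the packing step, assuming a plain local checkout. The function name, the extension list, and the `## path` header format are my choices for illustration, not a standard; adjust the filter for your stack and check the token count before sending.

```python
from pathlib import Path


def pack_repo(root: str, extensions: tuple[str, ...] = (".py", ".md", ".toml")) -> str:
    """Concatenate matching files into one markdown-style string with a header per file."""
    root_path = Path(root)
    sections = []
    for path in sorted(root_path.rglob("*")):
        if not path.is_file() or path.suffix not in extensions:
            continue
        relative = path.relative_to(root_path)
        if any(part.startswith(".") for part in relative.parts):
            continue  # skip hidden files and directories such as .git
        body = path.read_text(encoding="utf-8", errors="replace")
        sections.append(f"## {relative.as_posix()}\n\n{body}")
    return "\n\n".join(sections)
```

Sorting the paths keeps the output stable between runs, which matters if you want the packed repo to hit the prompt cache.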

Meeting + references

Transcript + linked docs + Slack threads → one big prompt → “summarize action items.”

Pitfalls

Position bias

Critical info buried deep in the context (say, around token 800,000) might not be retrieved. Place important facts near the beginning or the end, or repeat them at both ends.
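
The "repeat at both ends" trick is easy to bake into prompt assembly. A minimal sketch, with a hypothetical `bookend_prompt` helper and section labels I made up; whether it actually helps is something to measure on your own data.

```python
def bookend_prompt(key_facts: list[str], corpus: str, question: str) -> str:
    """Place the key facts before and after the corpus, then ask the question last."""
    facts = "\n".join(f"- {fact}" for fact in key_facts)
    return (
        f"Key facts:\n{facts}\n\n"
        f"{corpus}\n\n"
        f"Key facts (repeated):\n{facts}\n\n"
        f"Question: {question}"
    )
```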

No reranking inside long context

If you stuff 1M tokens, the model can’t filter precisely. RAG + rerank (Rerankers in RAG) gives precision; long context relies on attention.

Latency

A 1M-token prompt: 30–60s on first call. Cached: time to first token (TTFT) improves dramatically. For interactive use, cache aggressively.

Cost without caching

Forgetting the cache marker once = 10× cost spike on that call. Audit.
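
A cheap way to audit is to check the usage block on every response and alert when the cached share is low. The sketch below assumes the usage object exposes `input_tokens`, `cache_read_input_tokens`, and `cache_creation_input_tokens`, which is my understanding of the Anthropic response shape; verify against the current API reference. The helper names and the 0.9 threshold are illustrative.

```python
def cache_report(usage) -> dict:
    """Summarize how many input tokens were read from cache, written to it, or uncached."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    uncached = getattr(usage, "input_tokens", 0) or 0
    total = read + written + uncached
    return {
        "cache_read": read,
        "cache_written": written,
        "uncached": uncached,
        "hit_ratio": read / total if total else 0.0,
    }


def is_cache_miss(usage, min_hit_ratio: float = 0.9) -> bool:
    """True when less than `min_hit_ratio` of the input came from the cache."""
    return cache_report(usage)["hit_ratio"] < min_hit_ratio
```

The first call on a new corpus will always look like a miss, since it writes the cache; only alert on follow-up calls.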

What I’d do today

Single document or small corpus? Try long context first. Often “good enough” with zero RAG infrastructure.

Many users on different data? RAG, with long-context synthesis as a within-RAG step.

Maximum quality? Hybrid retrieval → top-30 chunks → long-context synthesis with caching.
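
The hybrid flow, reduced to its shape. To keep this runnable without dependencies, the `score` function is a toy keyword-overlap ranker standing in for real hybrid retrieval (BM25 plus embeddings plus a reranker); every name here is hypothetical, and only the retrieve, select top-N, synthesize structure is the point.

```python
import re


def score(query: str, chunk: str) -> int:
    """Toy relevance score: number of distinct query terms that appear in the chunk."""
    terms = set(re.findall(r"\w+", query.lower()))
    words = set(re.findall(r"\w+", chunk.lower()))
    return len(terms & words)


def top_n_chunks(query: str, chunks: dict[str, str], n: int = 30) -> list[tuple[str, str]]:
    """Return the n highest-scoring (chunk_id, text) pairs; ties keep insertion order."""
    ranked = sorted(chunks.items(), key=lambda item: score(query, item[1]), reverse=True)
    return ranked[:n]


def synthesis_prompt(query: str, chunks: dict[str, str], n: int = 30) -> str:
    """Build one long-context prompt from the selected chunks, keeping IDs for citations."""
    selected = top_n_chunks(query, chunks, n)
    context = "\n\n".join(f"[{chunk_id}]\n{text}" for chunk_id, text in selected)
    return (
        f"{context}\n\n"
        "Answer using only the chunks above and cite chunk IDs.\n"
        f"Question: {query}"
    )
```

Keeping the chunk IDs in the prompt is what preserves citation traceability once the synthesis step takes over.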

Read this next

If you want a long-context “ask my repo” agent template, it’s at rajpoot.dev.

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work. See my projects and ways to reach me at rajpoot.dev.
