Frontier models in 2026 ship with 1M-token contexts: Claude Opus 4.7, Gemini 2.5 Pro, several open-source options. The temptation is real — forget RAG, just put the whole corpus in. This post is the honest answer to “should I?”

What 1M tokens buys you

  • ~750k words of text.
  • ~3000 pages of code.
  • A medium-sized book + all notes.
  • A 100k-LoC repo with comments.
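
Before committing to long context, it helps to check whether the material fits at all. The sketch below is a rough estimate only: it assumes the ~0.75 words-per-token ratio implied by the 750k-word figure above, and the names `estimate_tokens`, `fits_in_context`, and the 8,000-token output reserve are illustrative choices, not part of any SDK. Real tokenizers vary, and code is usually denser than prose.

```python
# Rough fit check: will a body of text fit in a 1M-token window?
# Assumption: ~0.75 words per token (1M tokens ~ 750k words).
# This is a heuristic, not a tokenizer; use the provider's token counter for real budgets.

CONTEXT_LIMIT = 1_000_000
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Estimate the token count from a whitespace word count."""
    return round(len(text.split()) / WORDS_PER_TOKEN)


def fits_in_context(texts: list[str], reserve_for_output: int = 8_000) -> bool:
    """True if the combined estimate leaves room for the model's reply."""
    total = sum(estimate_tokens(t) for t in texts)
    return total + reserve_for_output <= CONTEXT_LIMIT
```

If the estimate lands anywhere near the limit, count tokens properly before relying on it.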

Real fits:

  • Reading a long contract end-to-end.
  • Q&A over a single book.
  • Code review across an entire repo.
  • Summarizing a meeting + 20 referenced docs.

Real misfits:

  • 50,000-document knowledge bases.
  • Frequently updated data.
  • Multi-tenant where cross-tenant leakage would be catastrophic.

Performance reality

Modern frontier models perform well on long-context retrieval — but not perfectly:

  • Needle-in-haystack: 95%+ recall up to 1M tokens.
  • Multi-step reasoning across long contexts: accuracy drops above ~200k tokens.
  • Position bias: facts in the middle of the context are recalled less reliably than facts near the start or end.

Benchmark on your data, not headline numbers.
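
One way to benchmark on your own data is a small needle-in-haystack probe that moves a known fact through the context and checks whether the model finds it. This is a minimal sketch, not a full eval harness: `ask` is a hypothetical callable you supply (it would wrap your model client), and substring matching is a deliberately crude scoring rule.

```python
from typing import Callable


def build_haystack(filler: list[str], needle: str, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler lines."""
    idx = round(depth * len(filler))
    return "\n".join(filler[:idx] + [needle] + filler[idx:])


def recall_by_depth(
    ask: Callable[[str, str], str],  # hypothetical: (context, question) -> answer text
    filler: list[str],
    needle: str,
    question: str,
    expected: str,
    depths: tuple[float, ...] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> dict[float, bool]:
    """Run one probe per depth and record whether the answer contains `expected`."""
    results: dict[float, bool] = {}
    for depth in depths:
        context = build_haystack(filler, needle, depth)
        answer = ask(context, question)
        results[depth] = expected.lower() in answer.lower()
    return results
```

Use your real documents as `filler` and repeat each depth several times; a single probe per depth is too noisy to conclude much.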

Cost reality

Without caching: 1M tokens × $15/MTok input = $15 per query. Not viable.

With caching: $1.50–$2 per query after the first, because each new question is a small dynamic suffix on top of a cached corpus that is billed at a fraction of the base input rate. This is what makes it affordable. Mechanics: Anthropic Claude API + Tool Use Guide.
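
The arithmetic behind those numbers can be sketched as follows. The $15/MTok base rate is the figure used above. The two multipliers are assumptions based on Anthropic's published prompt-caching pricing at the time of writing (cache writes at 1.25× base for the default 5-minute cache, cache reads at 0.1× base), so check current pricing before budgeting. `query_cost` is an illustrative helper and ignores output tokens.

```python
BASE_INPUT_PER_MTOK = 15.00  # dollars per million input tokens, from the figure above
CACHE_WRITE_MULT = 1.25      # assumption: premium for writing the cache on the first call
CACHE_READ_MULT = 0.10       # assumption: cached tokens billed at 10% of base


def query_cost(corpus_tokens: int, question_tokens: int,
               cached: bool, first_call: bool = False) -> float:
    """Input cost in dollars for one query (output tokens excluded)."""
    per_token = BASE_INPUT_PER_MTOK / 1_000_000
    if not cached:
        corpus_mult = 1.0
    elif first_call:
        corpus_mult = CACHE_WRITE_MULT
    else:
        corpus_mult = CACHE_READ_MULT
    # The question is never cached, so it is always billed at the base rate.
    return round((corpus_tokens * corpus_mult + question_tokens) * per_token, 4)
```

Under these assumptions the first cached call costs more than an uncached one, so caching pays off from the second question onward.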

Long context vs RAG decision rule

| Scenario | Pick |
| --- | --- |
| Single doc Q&A (book, contract) | Long context |
| Multi-doc, large corpus (>200) | RAG |
| Frequently changing data | RAG |
| Strict citation traceability | RAG (chunk IDs) |
| One-shot analysis (review this PR) | Long context |
| Multi-tenant isolation needed | RAG with per-tenant scope |
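
The table can also be written as a small function, which is handy as a checklist in design reviews. This is one possible encoding, not a definitive rule: the `Workload` fields are hypothetical, the ">200" threshold is read here as a document count, and the precedence order (isolation first, then citations, then freshness and size) is my assumption.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    doc_count: int
    changes_often: bool = False
    needs_chunk_citations: bool = False
    multi_tenant: bool = False


def pick_approach(w: Workload) -> str:
    """Map a workload onto the decision table; stricter requirements win first."""
    if w.multi_tenant:
        return "RAG with per-tenant scope"
    if w.needs_chunk_citations:
        return "RAG (chunk IDs)"
    if w.changes_often or w.doc_count > 200:  # assumption: 200 means documents
        return "RAG"
    return "Long context"
```

For example, `pick_approach(Workload(doc_count=1))` returns "Long context", while the same single document in a multi-tenant product still routes to RAG with per-tenant scope.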

A common 2026 hybrid: RAG for the broad corpus + long context for top-N synthesis. Best of both.

Patterns

Document analysis

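from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
SYSTEM = "You are a careful contract analyst."  # placeholder system prompt

async def analyze_contract(contract: str, q: str) -> str:
    resp = await client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,  # required by the Messages API
        system=[{"type": "text", "text": SYSTEM, "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": [
            # Cache breakpoint: everything up to and including the contract is cached.
            {"type": "text", "text": contract, "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": q},  # only the question changes between calls
        ]}],
    )
    return resp.content[0].text  # return the answer text, not the Message object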

The contract is cached, so each follow-up question that arrives while the cache entry is still alive is cheap.

Code repo review

Load the repo (markdown-formatted, file headers) into context, then ask cross-file questions. Surprisingly effective.
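
A minimal sketch of the loading step, using only the standard library. The `## File:` header format, the extension filter, and the size cap are illustrative choices rather than a required layout. Files are emitted in sorted order so the prompt prefix stays byte-identical between calls, which matters if you cache it.

```python
from pathlib import Path

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fenced block


def pack_repo(root: str, exts: tuple[str, ...] = (".py", ".md"),
              max_bytes: int = 200_000) -> str:
    """Concatenate source files under `root` into one markdown document, one header per file."""
    root_path = Path(root)
    parts = []
    for path in sorted(root_path.rglob("*")):
        if not path.is_file() or path.suffix not in exts:
            continue
        rel = path.relative_to(root_path)
        if any(part.startswith(".") for part in rel.parts):
            continue  # skip hidden directories and files such as .git or .venv
        if path.stat().st_size > max_bytes:
            continue  # skip very large files (likely generated or vendored)
        body = path.read_text(encoding="utf-8", errors="replace").rstrip()
        parts.append(f"## File: {rel.as_posix()}\n\n{FENCE}\n{body}\n{FENCE}\n")
    return "\n".join(parts)
```

The returned string goes in the cached content block, with the cross-file question in a separate uncached block after it.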

Meeting + references

Transcript + linked docs + Slack threads → one big prompt → “summarize action items.”

Pitfalls

Position bias

Critical info buried hundreds of thousands of tokens from either end of the prompt might not be retrieved. Place important facts near the beginning or the end, or repeat them at both ends.
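
The "repeat at both ends" option can be sketched as a prompt builder. The section labels and the `sandwich_prompt` name are illustrative, and whether repetition helps is something to measure on your own model and data rather than assume.

```python
def sandwich_prompt(key_facts: list[str], corpus: str, question: str) -> str:
    """Place key facts before and after the corpus so they sit near both ends of the context."""
    facts = "\n".join(f"- {fact}" for fact in key_facts)
    return (
        f"KEY FACTS:\n{facts}\n\n"
        f"DOCUMENTS:\n{corpus}\n\n"
        f"KEY FACTS (repeated):\n{facts}\n\n"
        f"QUESTION: {question}"
    )
```

The question stays last, so the part that changes between calls sits after everything that can be cached.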

No reranking inside long context

If you stuff 1M tokens, the model can’t filter precisely. RAG + rerank (Rerankers in RAG) gives precision; long context relies on attention.

Latency

A 1M-token prompt: 30–60s on first call. Cached: time to first token (TTFT) improves dramatically. For interactive use, cache aggressively.

Cost without caching

Forgetting the cache marker once means a 10× cost spike on that call ($15 instead of about $1.50). Audit your calls for cache hits.
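
A simple audit is to classify every response by its usage counters. The sketch below assumes the `cache_read_input_tokens` and `cache_creation_input_tokens` fields that the Anthropic Messages API reports on `response.usage`; `cache_status` itself is a hypothetical helper, and what you do with a "miss" (log, alert, fail a test) is up to you.

```python
def cache_status(usage) -> str:
    """Classify one response's usage as a cache 'hit', a cache 'write', or a 'miss'."""
    # getattr with a default keeps this working on objects that lack the cache fields;
    # `or 0` handles fields that are present but None.
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    if read > 0:
        return "hit"    # the cached prefix was reused
    if written > 0:
        return "write"  # first call: the prefix was stored for later calls
    return "miss"       # nothing cached: the full prompt was billed at the base rate
```

Calling `cache_status(resp.usage)` after each request and counting the "miss" results on calls that should be cached catches a dropped marker quickly.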

What I’d do today

Single document or small corpus? Try long context first. Often “good enough” with zero RAG infrastructure.

Many users on different data? RAG, with long-context synthesis as a within-RAG step.

Maximum quality? Hybrid retrieval → top-30 chunks → long-context synthesis with caching.
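
The shape of that hybrid pipeline, as a rough sketch. The term-overlap scorer here is a toy stand-in for real hybrid retrieval and reranking, used only so the example runs without dependencies; `top_n_chunks`, `synthesis_prompt`, and the prompt wording are all illustrative.

```python
import re


def _terms(text: str) -> set[str]:
    """Lowercased alphanumeric terms in `text`."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_n_chunks(query: str, chunks: dict[str, str], n: int = 30) -> list[tuple[str, str]]:
    """Rank chunks by term overlap with the query (toy scorer; swap in a real retriever)."""
    q = _terms(query)
    ranked = sorted(chunks.items(), key=lambda kv: (-len(q & _terms(kv[1])), kv[0]))
    # Drop chunks that share no terms with the query at all.
    return [(cid, text) for cid, text in ranked[:n] if q & _terms(text)]


def synthesis_prompt(query: str, chunks: dict[str, str], n: int = 30) -> str:
    """Build a long-context synthesis prompt from the top-n chunks, keeping chunk IDs for citation."""
    picked = top_n_chunks(query, chunks, n)
    body = "\n\n".join(f"[{cid}]\n{text}" for cid, text in picked)
    return (
        "Answer using only the chunks below and cite chunk IDs in brackets.\n\n"
        f"{body}\n\nQuestion: {query}"
    )
```

Keeping the chunk IDs in the prompt preserves the citation traceability that plain long-context stuffing lacks.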

Read this next

If you want a long-context “ask my repo” agent template, it’s at rajpoot.dev.

