Frontier models in 2026 ship with 1M-token contexts: Claude Opus 4.7, Gemini 2.5 Pro, several open-source options. The temptation is real — forget RAG, just put the whole corpus in. This post is the honest answer to “should I?”
What 1M tokens buys you
- ~750k words of text.
- ~3000 pages of code.
- A medium-sized book + all notes.
- A 100k-LoC repo with comments.
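A quick way to sanity-check whether your material fits is a character-based estimate. This is a rough sketch, not a tokenizer: the 4-characters-per-token ratio and the output reserve are assumptions, and real counts vary by model and content (code usually tokenizes worse than prose).

```python
CONTEXT_LIMIT = 1_000_000  # tokens, the 1M window discussed in this post
CHARS_PER_TOKEN = 4        # assumption: rough heuristic for English prose


def estimate_tokens(text: str) -> int:
    """Very rough token estimate derived from character count."""
    return len(text) // CHARS_PER_TOKEN + 1


def fits_in_context(texts: list[str], reserve_for_output: int = 8_000) -> bool:
    """True if all texts plus an output reserve fit under the context limit."""
    total = sum(estimate_tokens(t) for t in texts)
    return total + reserve_for_output <= CONTEXT_LIMIT
```

Treat the result as a first filter only. Before committing to a design, count tokens with the provider's own tokenizer.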
Real fits:
- Reading a long contract end-to-end.
- Q&A over a single book.
- Code review across an entire repo.
- Summarizing a meeting + 20 referenced docs.
Real misfits:
- 50,000-document knowledge bases.
- Frequently updated data.
- Multi-tenant where cross-tenant leakage would be catastrophic.
Performance reality
Modern frontier models perform well on long-context retrieval — but not perfectly:
- Needle-in-haystack: 95%+ recall up to 1M tokens.
- Multi-step reasoning across long contexts: accuracy drops above roughly 200k tokens.
- Position bias: middle of context is weaker than start/end.
Benchmark on your data, not headline numbers.
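Benchmarking on your own data can start as a small needle-in-haystack harness: plant a known fact at different depths in your real documents, ask for it, and track recall per depth. The sketch below covers only prompt construction and aggregation. The function names are mine, and the model call plus answer grading are left to you.

```python
def build_haystack(filler_paragraphs: list[str], needle: str, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(filler_paragraphs))
    parts = filler_paragraphs[:index] + [needle] + filler_paragraphs[index:]
    return "\n\n".join(parts)


def recall_by_depth(results: list[tuple[float, bool]]) -> dict[float, float]:
    """Aggregate (depth, answered_correctly) pairs into recall per depth."""
    totals: dict[float, list[int]] = {}
    for depth, correct in results:
        hits_and_runs = totals.setdefault(depth, [0, 0])
        hits_and_runs[0] += int(correct)  # count correct answers
        hits_and_runs[1] += 1             # count total runs at this depth
    return {d: hits / runs for d, (hits, runs) in sorted(totals.items())}
```

Running this at depths like 0.0, 0.25, 0.5, 0.75, and 1.0 is usually enough to see whether the middle of your context is noticeably weaker.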
Cost reality
Without caching: 1M tokens × $15/MTok input = $15 per query. Not viable.
With caching: $1.50–$2 per query after the first call. Each new question is a small dynamic suffix on top of the cached corpus, and that is what makes long context affordable. For the mechanics, see Anthropic Claude API + Tool Use Guide.
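The arithmetic behind those numbers, as a small sketch. The $15/MTok base price is the figure from this post. The cache multipliers (reads at about 10% of base input, writes at a 25% premium) are my assumptions about typical prompt-caching pricing, so check your provider's current price sheet. Output tokens are ignored here.

```python
BASE_INPUT_PER_MTOK = 15.00    # USD per million input tokens (figure used in this post)
CACHE_WRITE_MULTIPLIER = 1.25  # assumption: first call pays a premium to write the cache
CACHE_READ_MULTIPLIER = 0.10   # assumption: cached tokens cost ~10% of base input


def input_cost(corpus_tokens: int, question_tokens: int,
               cached: bool, first_call: bool = False) -> float:
    """Input-side cost in USD for one query over a (possibly cached) corpus."""
    corpus_mtok = corpus_tokens / 1_000_000
    question_mtok = question_tokens / 1_000_000
    if not cached:
        corpus_rate = BASE_INPUT_PER_MTOK
    elif first_call:
        corpus_rate = BASE_INPUT_PER_MTOK * CACHE_WRITE_MULTIPLIER
    else:
        corpus_rate = BASE_INPUT_PER_MTOK * CACHE_READ_MULTIPLIER
    # The question itself is never cached, so it always pays the base rate.
    return corpus_mtok * corpus_rate + question_mtok * BASE_INPUT_PER_MTOK
```

Under these assumptions the first cached call costs more than an uncached one, so caching only pays off from the second question onward.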
Long context vs RAG decision rule
| Scenario | Pick |
|---|---|
| Single doc Q&A (book, contract) | Long context |
| Multi-doc, large corpus (>200) | RAG |
| Frequently changing data | RAG |
| Strict citation traceability | RAG (chunk IDs) |
| One-shot analysis (review this PR) | Long context |
| Multi-tenant isolation needed | RAG with per-tenant scope |
A common 2026 hybrid: RAG for the broad corpus + long context for top-N synthesis. Best of both.
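The table can be encoded as a tiny helper if you want the rule in code. This is an illustrative sketch: the `Workload` fields are hypothetical names, and reading the table's ">200" as a document count is my assumption.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    doc_count: int
    changes_frequently: bool = False
    needs_chunk_citations: bool = False
    multi_tenant: bool = False


def pick_approach(w: Workload, large_corpus_threshold: int = 200) -> str:
    """Apply the decision table, checking the hard constraints first."""
    if w.multi_tenant:
        return "RAG with per-tenant scope"
    if w.changes_frequently or w.needs_chunk_citations:
        return "RAG"
    if w.doc_count > large_corpus_threshold:  # assumption: threshold counts documents
        return "RAG"
    return "long context"
```

Tenant isolation is checked first because it is the one row where a wrong pick is a security problem rather than a cost problem.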
Patterns
Document analysis
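```python
from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
SYSTEM = "You are a careful contract analyst."  # placeholder system prompt

async def analyze_contract(contract: str, q: str) -> str:
    response = await client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,  # required by the Messages API
        system=[{"type": "text", "text": SYSTEM, "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": [
            # Cache breakpoint: everything up to and including the contract is cached.
            {"type": "text", "text": contract, "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": q},  # only the question changes between calls
        ]}],
    )
    return response.content[0].text  # return the answer text, not the Message object
```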
The contract is cached after the first call, so each follow-up question within the cache lifetime is cheap.
Code repo review
Load the repo (markdown-formatted, file headers) into context, then ask cross-file questions. Surprisingly effective.
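A minimal sketch of the packing step, assuming a local checkout and a short allow-list of file extensions (both the function names and the defaults are mine). Each file gets a markdown header with its relative path so the model can refer to files by name.

```python
from pathlib import Path

FENCE = "`" * 3  # built at runtime so this example nests cleanly in markdown


def render_files(files: list[tuple[str, str]]) -> str:
    """Render (relative_path, content) pairs as markdown with file headers."""
    sections = [f"## {rel}\n\n{FENCE}\n{body}\n{FENCE}" for rel, body in files]
    return "\n\n".join(sections)


def pack_repo(root: str, extensions: tuple[str, ...] = (".py", ".md")) -> str:
    """Walk `root` and pack matching files into one markdown string."""
    root_path = Path(root)
    files = []
    for path in sorted(root_path.rglob("*")):
        if not path.is_file() or path.suffix not in extensions:
            continue
        if ".git" in path.parts:  # skip version-control internals
            continue
        rel = path.relative_to(root_path).as_posix()
        files.append((rel, path.read_text(encoding="utf-8", errors="replace")))
    return render_files(files)
```

Sorting the paths keeps the packed text identical between calls, which matters if you want the cached prefix to keep matching.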
Meeting + references
Transcript + linked docs + Slack threads → one big prompt → “summarize action items.”
Pitfalls
Position bias
Critical info buried deep in the context, far from either end, might not be retrieved. Place important facts near the beginning or end, or repeat them at both ends.
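The "repeat at both ends" trick can be as simple as a prompt template. A sketch, with hypothetical section labels; the layout is one reasonable choice, not a tested optimum.

```python
def sandwich_prompt(key_facts: list[str], corpus: str, question: str) -> str:
    """Place key facts before and after the corpus, with the question last."""
    facts = "\n".join(f"- {fact}" for fact in key_facts)
    return (
        f"Key facts:\n{facts}\n\n"
        f"Documents:\n{corpus}\n\n"
        f"Key facts (repeated):\n{facts}\n\n"
        f"Question: {question}"
    )
```

Keeping the question at the very end also leaves everything before it unchanged across questions, which is what a cached prefix needs.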
No reranking inside long context
If you stuff 1M tokens, the model can’t filter precisely. RAG + rerank (see Rerankers in RAG) gives precision; long context relies on attention.
Latency
A 1M-token prompt takes 30–60s on the first call. Once cached, time to first token (TTFT) improves dramatically. For interactive use, cache aggressively.
Cost without caching
Forgetting the cache marker once means a roughly 10× cost spike on that call. Audit the cache usage reported on every response.
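One way to audit is to classify every response by its cache counters. The sketch below assumes you have copied the usage counts into a plain dict. The key names follow the cache fields the Anthropic Messages API reports in `usage`, while the threshold is an arbitrary value you should tune to your corpus size.

```python
def cache_status(usage: dict[str, int], min_expected_cached: int = 100_000) -> str:
    """Classify a response as a cache 'hit', a cache 'write', or a 'miss'."""
    read = usage.get("cache_read_input_tokens", 0) or 0
    written = usage.get("cache_creation_input_tokens", 0) or 0
    if read >= min_expected_cached:
        return "hit"    # corpus was served from cache
    if written >= min_expected_cached:
        return "write"  # corpus was cached on this call (expected once)
    return "miss"       # corpus was billed at the full input rate
```

Logging this per call and alerting on repeated "miss" results would catch a dropped cache marker before the bill does.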
What I’d do today
Single document or small corpus? Try long context first. Often “good enough” with zero RAG infrastructure.
Many users on different data? RAG, with long-context synthesis as a within-RAG step.
Maximum quality? Hybrid retrieval → top-30 chunks → long-context synthesis with caching.
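The last step of that hybrid, sketched under assumptions: retrieval has already produced scored chunks, and the `Chunk` type, the token budget, and the 4-characters-per-token estimate are all mine. Only the top-30 figure comes from the recipe above.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    text: str
    score: float  # relevance score from your retriever or reranker


def select_chunks(chunks: list[Chunk], top_n: int = 30,
                  token_budget: int = 200_000) -> list[Chunk]:
    """Keep the highest-scoring chunks that fit within a rough token budget."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True)[:top_n]:
        cost = len(chunk.text) // 4 + 1  # assumption: ~4 characters per token
        if used + cost > token_budget:
            continue  # skip chunks that would overflow the budget
        selected.append(chunk)
        used += cost
    return selected


def build_synthesis_prompt(chunks: list[Chunk], question: str) -> str:
    """Label each chunk with its ID so the answer can cite sources."""
    blocks = [f"[{c.chunk_id}]\n{c.text}" for c in chunks]
    return "\n\n".join(blocks) + f"\n\nQuestion: {question}\nCite chunk IDs in brackets."
```

Keeping the chunk IDs in the prompt preserves the citation traceability that is RAG's main advantage in the decision table.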
Read this next
- Build a Production RAG App with pgvector
- Anthropic Claude API + Tool Use Guide
- LLM Cost Optimization in 2026
- Agentic RAG in 2026
If you want a long-context “ask my repo” agent template, it’s at rajpoot.dev.