By 2026 “prompt engineering” is the small problem. The bigger one is context engineering — what goes into the LLM’s context window, in what order, with what compression. This post is the working playbook.
Layers of context
A typical agentic call has 4–6 layers:
- System prompt (stable, behavior-defining).
- Tool definitions (mostly stable).
- Long-term memory (per-user facts).
- Retrieved knowledge (RAG chunks).
- Conversation history (recent turns).
- Current user message.
Each layer has different characteristics — caching, freshness, position. Treating them as one big prompt is the beginner mistake.
Position matters
Modern frontier models have position bias: they attend more to the start and end of context, less to the middle.
Practical rules:
- Critical “do this” instructions: top of system prompt.
- Long boilerplate (style guides, format docs): middle of system prompt where it’s compressed.
- The current user message: end of context. Always.
- Retrieved chunks: just before the user message; recent chunks last.
Caching breakpoints
Anthropic and OpenAI both cache stable prefixes for ~10% of normal cost. Order to maximize:
[1] Stable system prompt ← cache marker 1
[2] Tool definitions ← cache marker 2
[3] Long-term user memory ← cache marker 3 (if stable per session)
[4] Conversation prefix ← cache marker 4
[5] Latest message + retrieved ← dynamic, full price
For caching mechanics see Anthropic Claude API + Tool Use Guide and LLM Cost Optimization .
Compression
When context is tight:
- Summarize old turns. Replace 30 conversation turns with “Earlier the user asked about X; we agreed to Y.”
- Drop redundancy. If the system prompt and tool description say the same thing, pick one.
- Shorter examples. 3 tight examples > 10 verbose ones.
- References, not duplicates. “See [tool result above]” beats re-pasting it.
Selective retrieval
The naive retrieval: top-k chunks based on cosine similarity. Better:
- Multi-query retrieval: rephrase the question 3 ways; union the chunks.
- Recency weighting: newer chunks rank higher.
- Source diversity: don’t return 5 chunks from the same doc.
- Reranking: see Rerankers in RAG .
Memory injection
Per-user memory (Agent Memory ) goes into context as compact facts:
Relevant memories about user:
- prefers brief, direct responses
- works in Mumbai
- previously worked on Project Alpha
Not raw transcripts. Salient facts only.
When to skip layers
Not every call needs all layers. A simple classification:
- System prompt + user message. Done. No tools, no memory, no RAG.
The art is including only what helps.
Common mistakes
1. Stuffing context “just in case”
Every irrelevant token costs money and dilutes attention. Be ruthless.
2. Putting the user query at the start
Position bias hides it. Always at the end.
3. Bouncing between caching layouts
Today’s order: A B C. Tomorrow’s: B A C. Cache breaks. Pin order.
4. Re-injecting the same memory every turn
If a fact is in the memory layer, don’t repeat it in retrieved chunks. Wastes tokens.
5. Silent context overflow
Hit the model’s context limit; oldest content drops; you don’t notice until quality dips. Track input tokens; alert.
Read this next
- Prompt Engineering Patterns That Survive Production
- Agent Memory in 2026
- 1M-Token Context Windows in 2026
- LLM Cost Optimization in 2026
If you want my context-engineering checklist + token-budget tracker, it’s at rajpoot.dev .
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .