Prompt engineering has matured. The cargo-cult phrases of 2023 (“you are an expert”, “think step by step”) matter less; clear specification matters more. This post collects the working set.
What still works
- Specify the task precisely. “Summarize” is ambiguous; “Summarize in 3 bullet points, max 20 words each” is not.
- Constrain the output format. Tool calling beats “respond as JSON”.
- Examples for non-obvious formats (1–3 is plenty).
- Reasoning prompts still help on hard tasks: “Think step by step before answering” or extended-thinking modes.
- Role tags for untrusted input: <user_input>...</user_input>.
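A precise spec has a side benefit: it can be checked mechanically. Here is a minimal sketch for the "3 bullet points, max 20 words each" example, assuming the model returns bullets as lines starting with "- ". The function name `meets_summary_spec` is hypothetical.

```python
def meets_summary_spec(text: str, bullets: int = 3, max_words: int = 20) -> bool:
    """Check 'N bullet points, max M words each' on a model response."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Every non-empty line must be a bullet, and there must be exactly `bullets` of them.
    if len(lines) != bullets or not all(ln.startswith("- ") for ln in lines):
        return False
    # Count words after the "- " marker.
    return all(len(ln[2:].split()) <= max_words for ln in lines)
```

A check like this can gate a retry or feed an eval score. "Summarize" alone gives you nothing to check.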
What’s obsolete
- “You are an expert…” — the model’s expertise comes from training, not flattery.
- “Take a deep breath” — got worse with newer models.
- Verbose formatting instructions when tool calling exists — schema enforces shape.
- Repeating the same instruction many times — say it once, clearly.
The structure that works
SYSTEM:
  Role / persona (one paragraph).
  Capabilities and limits.
  Output format requirements.
USER:
  Context (data, retrieval results).
  Specific question.
Concise. Predictable. Easy to debug when it fails.
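As a rough sketch, the layout above maps onto a small builder. The function name and parameters are my own illustration, not part of any SDK; the return shape (a system string plus a list of role/content messages) is the one most chat APIs accept.

```python
def build_messages(role: str, limits: str, output_format: str,
                   context: str, question: str) -> tuple[str, list[dict]]:
    """Return (system_prompt, messages) following the SYSTEM / USER layout."""
    system = "\n\n".join([role, limits, output_format])
    # Context first, question last, so the question sits at the end of the prompt.
    user = f"{context}\n\n{question}"
    return system, [{"role": "user", "content": user}]
```

Keeping each part a separate argument makes failures easier to bisect: change one part, re-run, compare.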
Tool calling for shape
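import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# ResponseSchema (a Pydantic model) and prompt are defined elsewhere.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    tools=[{"name": "respond", "input_schema": ResponseSchema.model_json_schema()}],
    tool_choice={"type": "tool", "name": "respond"},  # force the model to call this tool
    messages=[{"role": "user", "content": prompt}],
)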
Schema-bound output. No “please return JSON” prayers. See Structured Output.
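With a forced tool call, the structured payload arrives in a content block of type `tool_use` instead of as text. The sketch below pulls it out, assuming the response content has already been converted to plain dicts (for instance by dumping the SDK response object). The sample blocks are hand-written to show the shape and are not real API output.

```python
def extract_tool_input(content_blocks: list[dict], tool_name: str = "respond") -> dict:
    """Return the `input` payload of the first matching tool_use block."""
    for block in content_blocks:
        if block.get("type") == "tool_use" and block.get("name") == tool_name:
            return block["input"]
    raise ValueError(f"model did not call tool {tool_name!r}")


# Hand-written example of the block shape, not real API output.
blocks = [
    {"type": "text", "text": "Classifying the ticket."},
    {"type": "tool_use", "id": "toolu_example", "name": "respond",
     "input": {"category": "billing", "confidence": 0.9}},
]
payload = extract_tool_input(blocks)
```

Validating `payload` against the same schema you sent (for example by constructing the Pydantic model from it) is still worth doing before you trust the fields.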
Few-shot
User input: "I was charged twice for May."
Expected category: billing
User input: "How do I export my data?"
Expected category: how_to
User input: "Account is locked."
Expected category: account
Then append the real input. Three examples is the sweet spot for most classification tasks.
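Keeping examples as data instead of pasted text makes them easy to swap during evals. A small sketch that renders the format above; `few_shot_prompt` is a hypothetical helper name.

```python
EXAMPLES = [
    ("I was charged twice for May.", "billing"),
    ("How do I export my data?", "how_to"),
    ("Account is locked.", "account"),
]


def few_shot_prompt(examples: list[tuple[str, str]], user_input: str) -> str:
    """Render labeled examples, then the real input with the label left open."""
    parts = [f'User input: "{text}"\nExpected category: {label}' for text, label in examples]
    # The final block ends at the label so the model completes it.
    parts.append(f'User input: "{user_input}"\nExpected category:')
    return "\n\n".join(parts)


prompt = few_shot_prompt(EXAMPLES, "Where is my invoice?")
```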
Chain-of-thought
Question: <hard math/logic problem>
Think step by step. Show your reasoning, then give the final answer at the end as 'Answer: <X>'.
Or use models with extended thinking enabled — they reason internally without bloating the visible output.
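The fixed 'Answer: <X>' suffix is what makes visible reasoning usable downstream, because you can parse the result and ignore everything before it. A minimal parsing sketch, which takes the last match in case the model revises itself:

```python
import re
from typing import Optional

# Matches a line of the form "Answer: <X>", trimming spaces and tabs around X.
ANSWER_RE = re.compile(r"^Answer:[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def parse_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' line, or None if it is missing."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1] if matches else None
```

A `None` result is a signal in its own right: the model ignored the format, which is worth counting in your eval.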
Self-consistency
For high-stakes answers: sample N completions, take the majority. Costs N× but improves accuracy.
import asyncio
from collections import Counter

async def consistent(prompt, n=5):
    answers = await asyncio.gather(*[llm.complete(prompt, temperature=0.7) for _ in range(n)])
    return Counter(a.strip() for a in answers).most_common(1)[0][0]  # majority vote
Use sparingly — usually overkill.
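One caveat: majority voting only works if equal answers compare equal, so vote on short normalized answers and not on full completions that include reasoning. The sketch below also returns the agreement ratio, which is a cheap confidence signal. The `sample` callable is a hypothetical stand-in for your async model call, and the stub exists only so the example runs offline.

```python
import asyncio
from collections import Counter
from typing import Awaitable, Callable


def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Return the most common normalized answer and the share of samples that agree."""
    normalized = [a.strip().lower() for a in answers]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)


async def self_consistent(sample: Callable[[], Awaitable[str]], n: int = 5) -> tuple[str, float]:
    """Draw n samples concurrently and vote. `sample` is a hypothetical async model call."""
    answers = await asyncio.gather(*[sample() for _ in range(n)])
    return majority_vote(list(answers))


# Stub sampler standing in for a real model call, so the sketch runs offline.
_canned = iter(["12", "12 ", "13", "12", "11"])


async def fake_sample() -> str:
    return next(_canned)


winner, agreement = asyncio.run(self_consistent(fake_sample))
```

Low agreement (say, a winner with under half the votes) is a reasonable trigger for escalating to a human or a stronger model.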
Anchoring with tags
Here is the document:
<doc>
{doc}
</doc>
Here is the user question:
<question>
{question}
</question>
Answer the question using only information from the document.
XML-style tags help models locate parts of the prompt. Particularly useful for long-context.
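If the tagged content is untrusted, it helps to make sure it cannot close its own tag. The sketch below builds the prompt above and defuses a stray closing tag. The escaping rule is my own simple assumption and is not a complete defense against prompt injection.

```python
def wrap(tag: str, content: str) -> str:
    """Wrap content in <tag>...</tag>, defusing any closing tag hidden in the content."""
    # Simple mitigation: untrusted text cannot end the block early.
    safe = content.replace(f"</{tag}>", f"&lt;/{tag}&gt;")
    return f"<{tag}>\n{safe}\n</{tag}>"


def build_prompt(doc: str, question: str) -> str:
    return (
        "Here is the document:\n"
        f"{wrap('doc', doc)}\n"
        "Here is the user question:\n"
        f"{wrap('question', question)}\n"
        "Answer the question using only information from the document."
    )
```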
Negative instructions
“Don’t include disclaimers” works. “Don’t say ‘as an AI’” mostly works. Combining many negatives confuses models.
Better: positive specification of what you DO want.
Refusal-on-uncertainty
If the document does not contain enough information to answer, say
'I don't have that information' instead of guessing.
Reduces hallucination. Critical for RAG. See LLM Guardrails.
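A fixed refusal sentence is also easy to detect in code, so the application can fall back to search or a human instead of showing a dead end. A sketch under that assumption; the helper names are hypothetical.

```python
REFUSAL = "I don't have that information"


def rag_prompt(document: str, question: str) -> str:
    return (
        f"<doc>\n{document}\n</doc>\n"
        f"<question>\n{question}\n</question>\n"
        "Answer using only the document. If the document does not contain enough "
        f"information to answer, say '{REFUSAL}' instead of guessing."
    )


def is_refusal(answer: str) -> bool:
    """True when the model used the agreed refusal sentence (case-insensitive)."""
    # Normalize curly apostrophes, which models sometimes emit.
    return REFUSAL.lower() in answer.replace("\u2019", "'").lower()
```

Tracking the refusal rate over time also tells you whether retrieval is the real problem.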
Citations
Answer the question. After each claim, cite the source like [doc-3].
Quote exact text in quotes when claiming specific facts.
Models comply well with this; users trust answers more; you can verify.
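The "you can verify" part can be automated. The sketch below finds each quoted span followed by a `[doc-N]` marker and checks that the quote appears verbatim in the cited document. It assumes straight double quotes and the exact citation format from the prompt above; the sample documents and answer are made up for illustration.

```python
import re

# A quoted span immediately followed by a citation marker like [doc-3].
CITATION_RE = re.compile(r'"([^"]+)"\s*\[(doc-\d+)\]')


def unsupported_quotes(answer: str, docs: dict[str, str]) -> list[tuple[str, str]]:
    """Return (quote, doc_id) pairs whose quote is not found verbatim in the cited doc."""
    problems = []
    for quote, doc_id in CITATION_RE.findall(answer):
        # A missing doc id counts as unsupported.
        if quote not in docs.get(doc_id, ""):
            problems.append((quote, doc_id))
    return problems


docs = {"doc-1": "Refunds are processed within 5 business days.",
        "doc-3": "Plans renew on the first of each month."}
answer = ('Refunds take "within 5 business days" [doc-1]. '
          'Plans renew "every Friday" [doc-3].')
problems = unsupported_quotes(answer, docs)
```

Here the second quote does not appear in doc-3, so it is flagged.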
Multi-step tasks
For complex tasks, chain steps:
Step 1: Extract entities.
Step 2: Classify each.
Step 3: Format output.
This beats trying to do all three in one prompt with a giant schema. Smaller LLM calls compose better.
But each chained step adds latency, so balance decomposition against response time.
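A sketch of the three steps as a chain. `LLM` is a hypothetical prompt-in, text-out callable, and `fake_llm` is a stub so the example runs offline. In this sketch the formatting step is plain code, which is one way to claw back latency when a step is deterministic.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical: prompt in, completion text out


def run_chain(text: str, llm: LLM) -> str:
    # Step 1: extract entities, one per line.
    entities = llm(f"List the entities in this text, one per line:\n{text}").splitlines()
    # Step 2: classify each entity with its own small call.
    labeled = [
        (e, llm(f"Classify this entity as person, org, or place: {e}"))
        for e in entities if e.strip()
    ]
    # Step 3: formatting is deterministic here, so it needs no model call.
    return "\n".join(f"{entity}: {label}" for entity, label in labeled)


def fake_llm(prompt: str) -> str:
    """Stub so the sketch runs offline; replace with a real client call."""
    if prompt.startswith("List the entities"):
        return "Ada Lovelace\nLondon"
    return "person" if "Ada" in prompt else "place"


result = run_chain("Ada Lovelace lived in London.", fake_llm)
```

Note the cost shape: one call for extraction plus one per entity. Running the per-entity calls concurrently is the usual next step if latency matters.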
Common mistakes
1. Vague instructions
“Make it sound professional.” What does that mean to the model? Be specific: “Use third person; avoid contractions; max 100 words per paragraph.”
2. Conflicting instructions
“Be concise. Provide complete details. Use formal language. Be friendly.” Prioritize.
3. Putting critical info in the middle
Models recall content in the middle of a long prompt less reliably (the lost-in-the-middle effect). Place the question and key context near the end. See LLM Context Windows.
4. No format spec
Free-form text where structured output would do. Use tool calling.
5. Praying instead of testing
“This prompt works.” Have you evaluated on 50 cases? See LLM Evaluation.
Iteration loop
- Start simple — clear instruction + maybe one example.
- Run on eval set.
- Inspect failures — what’s the model misunderstanding?
- Tighten the prompt — add example, constraint, or clarification specifically for that failure.
- Re-run the eval. Keep the change only if the score improves; don’t add more until it does.
This iterative loop beats “add 200 lines of guidance” every time.
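The loop needs very little machinery. Below is a minimal harness sketch: it scores a prediction function against labeled cases and returns the failures to inspect. `keyword_baseline` stands in for "prompt plus model call" so the example runs offline. A real eval set would be closer to the 50 cases mentioned earlier.

```python
from typing import Callable


def run_eval(predict: Callable[[str], str], cases: list[tuple[str, str]]):
    """Return (accuracy, failures) for a prediction function over labeled cases."""
    failures = []
    for text, expected in cases:
        got = predict(text)
        if got != expected:
            failures.append({"input": text, "expected": expected, "got": got})
    return 1 - len(failures) / len(cases), failures


CASES = [
    ("I was charged twice for May.", "billing"),
    ("How do I export my data?", "how_to"),
    ("Account is locked.", "account"),
]


def keyword_baseline(text: str) -> str:
    """Stand-in for prompt + model call, so the harness runs offline."""
    return "billing" if "charged" in text else "how_to"


accuracy, failures = run_eval(keyword_baseline, CASES)
```

The `failures` list is the part that drives the next prompt change: each entry shows what the model got wrong and what it should have said.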
Prompt versioning
PROMPT_V = "answer-v3"
prompt = registry.get(PROMPT_V).compile(question=q)
Track which prompt generated which output. Compare across versions. See LLM Observability.
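The `registry` object in the snippet above is not defined in this post. Here is one minimal in-memory shape it could take, purely as an illustration: versions are immutable, and `compile` fills in template variables using the standard library's `string.Template`.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class Prompt:
    version: str
    template: str

    def compile(self, **variables: str) -> str:
        # Template.substitute raises KeyError if a variable is missing.
        return Template(self.template).substitute(**variables)


class PromptRegistry:
    def __init__(self) -> None:
        self._prompts: dict[str, Prompt] = {}

    def register(self, version: str, template: str) -> None:
        if version in self._prompts:
            raise ValueError(f"{version} already registered; versions are immutable")
        self._prompts[version] = Prompt(version, template)

    def get(self, version: str) -> Prompt:
        return self._prompts[version]


registry = PromptRegistry()
registry.register("answer-v3", "Answer in one sentence.\n<question>\n$question\n</question>")

PROMPT_V = "answer-v3"
prompt = registry.get(PROMPT_V).compile(question="What is our refund window?")
record = {"prompt_version": PROMPT_V, "prompt": prompt}  # log alongside the model output
```

Refusing to overwrite a version is the important design choice: if "answer-v3" can change silently, logged outputs can no longer be traced to the prompt that produced them.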
What I’d ship today
- Concise system prompt (< 200 words usually).
- Tool calling for structured outputs.
- 1–3 examples when format isn’t obvious.
- Tags for untrusted input.
- Refuse-on-uncertainty for RAG.
- Eval set before changing prompts.
- Versioning in production.
Read this next
If you want my prompt template library + eval harness, it’s at rajpoot.dev.