Prompt engineering matured. The cargo-cult phrases of 2023 (“you are an expert”, “think step by step”) matter less; clear specification matters more. This post is the working set.

What still works

  • Specify the task precisely. “Summarize” is ambiguous; “Summarize in 3 bullet points, max 20 words each” is not.
  • Constrain output format. Tool calling > “respond as JSON”.
  • Examples for non-obvious formats (1–3 is plenty).
  • Reasoning prompts still help on hard tasks: “Think step by step before answering” or extended-thinking modes.
  • Role tags for untrusted input: <user_input>...</user_input>.

What’s obsolete

  • “You are an expert…” — the model’s expertise comes from training, not flattery.
  • “Take a deep breath” — got worse with newer models.
  • Verbose formatting instructions when tool calling exists — schema enforces shape.
  • Repeating the same instruction many times — say it once, clearly.

The structure that works

SYSTEM:
Role / persona (one paragraph).
Capabilities and limits.
Output format requirements.

USER:
Context (data, retrieval results).
Specific question.

Concise. Predictable. Easy to debug when it fails.

Tool calling for shape

client.messages.create(
    model="claude-sonnet-4-6",
    tools=[{"name": "respond", "input_schema": ResponseSchema.model_json_schema()}],
    tool_choice={"type": "tool", "name": "respond"},
    messages=[{"role": "user", "content": prompt}],
)

Schema-bound output. No “please return JSON” prayers. See Structured Output .

Few-shot

User input: "I was charged twice for May."
Expected category: billing

User input: "How do I export my data?"
Expected category: how_to

User input: "Account is locked."
Expected category: account

Then the real input. 3 examples is the sweet spot for most classification tasks.

Chain-of-thought

Question: <hard math/logic problem>

Think step by step. Show your reasoning, then give the final answer at the end as 'Answer: <X>'.

Or use models with extended thinking enabled — they reason internally without bloating the visible output.

Self-consistency

For high-stakes answers: sample N completions, take the majority. Costs N× but improves accuracy.

async def consistent(prompt, n=5):
    answers = await asyncio.gather(*[llm.complete(prompt, temperature=0.7) for _ in range(n)])
    return Counter(answers).most_common(1)[0][0]

Use sparingly — usually overkill.

Anchoring with tags

Here is the document:
<doc>
{doc}
</doc>

Here is the user question:
<question>
{question}
</question>

Answer the question using only information from the document.

XML-style tags help models locate parts of the prompt. Particularly useful for long-context.

Negative instructions

“Don’t include disclaimers” works. “Don’t say ‘as an AI’” mostly works. Combining many negatives confuses models.

Better: positive specification of what you DO want.

Refusal-on-uncertainty

If the document does not contain enough information to answer, say
'I don't have that information' instead of guessing.

Reduces hallucination. Critical for RAG. See LLM Guardrails .

Citations

Answer the question. After each claim, cite the source like [doc-3].
Quote exact text in quotes when claiming specific facts.

Models comply well with this; users trust answers more; you can verify.

Multi-step tasks

For complex tasks, chain steps:

Step 1: Extract entities.
Step 2: Classify each.
Step 3: Format output.

vs trying to do all three in one prompt with a giant schema. Smaller LLM calls compose better.

But: each chain step costs latency. Balance.

Common mistakes

1. Vague instructions

“Make it sound professional.” What does that mean to the model? Be specific: “Use third person; avoid contractions; max 100 words per paragraph.”

2. Conflicting instructions

“Be concise. Provide complete details. Use formal language. Be friendly.” Prioritize.

3. Putting critical info in the middle

Lost-in-the-middle. Place the question / key context near the end. See LLM Context Windows .

4. No format spec

Free-form text where structured output would do. Use tool calling.

5. Praying instead of testing

“This prompt works.” Have you evaluated on 50 cases? See LLM Evaluation .

Iteration loop

  1. Start simple — clear instruction + maybe one example.
  2. Run on eval set.
  3. Inspect failures — what’s the model misunderstanding?
  4. Tighten the prompt — add example, constraint, or clarification specifically for that failure.
  5. Re-run eval. Don’t add until eval improves.

This iterative loop beats “add 200 lines of guidance” every time.

Prompt versioning

PROMPT_V = "answer-v3"
prompt = registry.get(PROMPT_V).compile(question=q)

Track which prompt generated which output. Compare across versions. See LLM Observability .

What I’d ship today

  • Concise system prompt (< 200 words usually).
  • Tool calling for structured outputs.
  • 1–3 examples when format isn’t obvious.
  • Tags for untrusted input.
  • Refuse-on-uncertainty for RAG.
  • Eval set before changing prompts.
  • Versioning in production.

Read this next

If you want my prompt template library + eval harness, it’s at rajpoot.dev .


Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .