In 2026 you should not be parsing JSON out of LLM strings. The provider APIs do it for you, the open-source libraries do it for you, and the patterns are stable. This post is the working guide to structured output — what’s available, when to use what, and the validation-retry pattern that takes you from “mostly works” to “reliable.”
What “structured output” means
You define a schema. The LLM produces a value matching the schema. The framework validates it. If validation fails, retry with the error message until it passes (or give up after N attempts).
from pydantic import BaseModel

class Triage(BaseModel):
    category: str
    confidence: float
    reason: str

# the LLM returns a Triage instance, not a string
result: Triage = await classify(ticket_text)
No json.loads. No regex. No “is the model going to wrap this in markdown again?” The framework handles it.
Native provider APIs (use these first)
Anthropic — tool calling with tool_choice
from anthropic import Anthropic
from pydantic import BaseModel

client = Anthropic()

class Triage(BaseModel):
    category: str
    confidence: float
    reason: str

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=400,
    tools=[{
        "name": "triage",
        "description": "Return a structured triage decision.",
        "input_schema": Triage.model_json_schema(),
    }],
    tool_choice={"type": "tool", "name": "triage"},
    messages=[{"role": "user", "content": ticket_text}],
)
block = next(b for b in resp.content if b.type == "tool_use")
result = Triage.model_validate(block.input)
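Setting tool_choice={"type": "tool", "name": "triage"} forces the model to call your tool, and input_schema is plain JSON Schema. Forced tool use steers the model strongly toward that schema, but on its own it is not a hard guarantee of conformance, which is why the last line runs Triage.model_validate: that call is what gives you a typed, checked Pydantic instance.
This is the canonical pattern for Anthropic. Covered in Anthropic Claude API + Tool Use Guide.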
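It helps to see what actually gets sent as input_schema. The sketch below only inspects the schema Pydantic generates for the Triage model above and makes no API call; field_types and required are illustrative variable names.

```python
from pydantic import BaseModel


class Triage(BaseModel):
    category: str
    confidence: float
    reason: str


# This dict is what gets passed as the tool's "input_schema".
schema = Triage.model_json_schema()

# Map each field to its JSON Schema type, e.g. float becomes "number".
field_types = {name: spec["type"] for name, spec in schema["properties"].items()}
required = schema["required"]
```

Every field without a default ends up in required, so the model is told it has to fill all three.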
OpenAI — structured outputs
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class Triage(BaseModel):
    category: str
    confidence: float
    reason: str

# older SDK releases expose this as client.beta.chat.completions.parse
resp = client.chat.completions.parse(
    model="gpt-5-mini",
    messages=[{"role": "user", "content": ticket_text}],
    response_format=Triage,
)
result = resp.choices[0].message.parsed  # already a Triage instance
.parse() (instead of .create()) takes a Pydantic class and returns a validated instance. OpenAI’s structured outputs are constrained-decoding-based — the model literally cannot produce off-schema tokens.
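Constrained decoding fixes the shape of what the model writes, but the message can still carry a refusal instead of a parsed value. A small guard like the one below makes that case explicit. unwrap_parsed is a hypothetical helper, and the SimpleNamespace objects are stand-ins for resp.choices[0].message so the sketch runs without an API call.

```python
from types import SimpleNamespace


def unwrap_parsed(message):
    """Return message.parsed, or raise if the model refused or nothing was parsed."""
    refusal = getattr(message, "refusal", None)
    if refusal:
        raise RuntimeError(f"Model refused: {refusal}")
    if message.parsed is None:
        raise RuntimeError("No parsed output on the message")
    return message.parsed


# Stand-ins for resp.choices[0].message (assumed shape: .parsed and .refusal).
parsed_message = SimpleNamespace(parsed={"category": "billing"}, refusal=None)
refused_message = SimpleNamespace(parsed=None, refusal="I can't help with that.")

value = unwrap_parsed(parsed_message)
try:
    unwrap_parsed(refused_message)
    refusal_error = None
except RuntimeError as exc:
    refusal_error = str(exc)
```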
Gemini — response_schema
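Same idea, different API. Gemini takes a response_schema (paired with response_mime_type="application/json") in the request's generation config rather than as a tool definition.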
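As a rough sketch of the Gemini call shape, assuming the google-genai Python SDK: the function below expects a genai.Client and passes the Pydantic class as the response schema. The model name is a placeholder, classify_with_gemini is a hypothetical helper, and the stub client exists only so the example runs offline.

```python
from types import SimpleNamespace

from pydantic import BaseModel


class Triage(BaseModel):
    category: str
    confidence: float
    reason: str


def classify_with_gemini(client, ticket_text: str, model: str = "gemini-2.5-flash"):
    """Ask Gemini for JSON matching Triage; `client` is expected to be a genai.Client."""
    response = client.models.generate_content(
        model=model,
        contents=ticket_text,
        config={
            "response_mime_type": "application/json",
            "response_schema": Triage,
        },
    )
    # The SDK exposes the validated object as `parsed`; the raw JSON is in `text`.
    return response.parsed


# Stub so the sketch runs offline; swap in a real genai.Client() for live calls.
class _StubModels:
    def generate_content(self, *, model, contents, config):
        self.last_config = config
        schema_cls = config["response_schema"]
        return SimpleNamespace(
            parsed=schema_cls(category="billing", confidence=0.9, reason="stubbed reply")
        )


stub_client = SimpleNamespace(models=_StubModels())
result = classify_with_gemini(stub_client, "I was charged twice this month.")
```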
For all three providers, the native API is your default. They’re fast, reliable, and don’t cost extra.
Pydantic AI — the framework
Pydantic AI (the official Pydantic-team agent framework) wraps these provider APIs in a clean, multi-provider interface:
from pydantic_ai import Agent
from pydantic import BaseModel

class Triage(BaseModel):
    category: str
    confidence: float
    reason: str

agent = Agent(
    "anthropic:claude-sonnet-4-6",
    output_type=Triage,  # named result_type in early releases
    system_prompt="Classify customer support tickets.",
)
result = await agent.run(ticket_text)
print(result.output)  # Triage instance (result.data in early releases)
What you get:
- Provider-agnostic. Switch from anthropic:claude-sonnet-4-6 to openai:gpt-5-mini by changing one string.
- Type-safe. result.output is typed Triage end-to-end.
- Tool calling. Add tools=[my_tool] and the agent can call them.
- Streaming. agent.run_stream(...) for SSE-style consumption.
- Validation retry. Built-in.
For full agents, Pydantic AI is competitive with LangGraph. For just typed extraction, it’s overkill but very pleasant. See AI Agents with LangGraph for the alternative.
Instructor — the focused library
If you want one library to give you typed JSON from any provider:
import instructor
from openai import OpenAI
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())

class Triage(BaseModel):
    category: str
    confidence: float
    reason: str

result = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[{"role": "user", "content": ticket_text}],
    response_model=Triage,
)
# result is a Triage instance
Same shape works with Anthropic, Gemini, Cohere, Groq, Mistral, Ollama. Instructor abstracts the differences.
What’s nice:
- Drop-in. Wraps your existing OpenAI client.
- Validation retry. Set max_retries=3 and Instructor handles “model returned wrong shape, ask again.”
- Streaming structured output. Yields partial Pydantic instances as fields fill in.
When to pick Instructor over Pydantic AI:
- You have existing OpenAI/Anthropic SDK code and don’t want a new framework.
- You only want structured output (no agent loops, tools, multi-step).
- You want the least amount of new abstractions.
When to pick Pydantic AI:
- You want a full agent framework.
- You’re building from scratch.
- You want first-class streaming, tool use, and provider switching.
The validation-retry pattern
Native APIs work ~98% of the time. Sometimes the model still produces invalid output (very rare with constrained decoding, more common when you ask for free-form generation that contains structure). The pattern:
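from anthropic import AsyncAnthropic
from pydantic import ValidationError

client = AsyncAnthropic()  # async client, so the await below is valid

async def extract_with_retry(text: str, max_retries: int = 3) -> Triage:
    last_error = None
    messages = [{"role": "user", "content": text}]
    for attempt in range(max_retries):
        resp = await client.messages.create(
            model="claude-sonnet-4-6", max_tokens=400,
            tools=[{"name": "triage", "description": "Return a structured triage decision.",
                    "input_schema": Triage.model_json_schema()}],
            tool_choice={"type": "tool", "name": "triage"},
            messages=messages,  # grows with each failed attempt
        )
        block = next(b for b in resp.content if b.type == "tool_use")
        try:
            return Triage.model_validate(block.input)
        except ValidationError as e:
            last_error = e
            # every tool_use must be answered by a tool_result, so the error goes there
            messages.append({"role": "assistant", "content": resp.content})
            messages.append({"role": "user", "content": [{
                "type": "tool_result", "tool_use_id": block.id, "is_error": True,
                "content": f"Validation error: {e.json()}\nReturn a valid response.",
            }]})
    raise RuntimeError(f"Failed after {max_retries}: {last_error}")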
Pass the validation error back to the model. It usually fixes itself in one retry.
Pydantic AI and Instructor do this for you. Worth knowing the pattern when you build your own.
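If you do build your own, the loop is easy to test without a provider. The sketch below is a provider-agnostic version of the same pattern: call_model is any function that takes the message list and returns raw JSON text, and fake_model is a hypothetical stand-in that gets it wrong once and corrects itself after seeing the error.

```python
import json
from typing import Callable

from pydantic import BaseModel, ValidationError


class Triage(BaseModel):
    category: str
    confidence: float
    reason: str


def extract_with_feedback(
    call_model: Callable[[list], str], text: str, max_attempts: int = 3
) -> Triage:
    """Call the model, validate, and feed validation errors back until it passes."""
    messages = [{"role": "user", "content": text}]
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(messages)
        try:
            return Triage.model_validate_json(raw)
        except ValidationError as exc:
            last_error = exc
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": f"Validation error: {exc.json()}\nReturn a valid response.",
            })
    raise RuntimeError(f"Failed after {max_attempts} attempts: {last_error}")


calls = []  # records the conversation length seen on each call


def fake_model(messages: list) -> str:
    """Hypothetical provider stub: off-schema first, valid after the error is shown."""
    calls.append(len(messages))
    if len(messages) == 1:
        return json.dumps({"category": "billing", "confidence": "high"})
    return json.dumps({"category": "billing", "confidence": 0.9, "reason": "refund"})


result = extract_with_feedback(fake_model, "I was charged twice.")
```

The second call sees three messages: the original ticket, the failed reply, and the validation error.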
Real-world schemas
Some schema shapes that come up constantly:
Extraction with optionals
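from datetime import date
from decimal import Decimal
from pydantic import BaseModel, Field

# LineItem is assumed to be another BaseModel defined elsewhere
class Invoice(BaseModel):
    invoice_number: str = Field(..., description="The invoice number, e.g. 'INV-2026-0042'")
    total: Decimal = Field(..., description="Total amount including tax")
    currency: str = Field(..., pattern=r"^[A-Z]{3}$")
    due_date: date | None = None
    line_items: list[LineItem]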
The Field(..., description=...) is gold — descriptions become part of the prompt the model sees. Be specific.
Enums for categorical
from enum import Enum

class Severity(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"
    critical = "critical"

class Issue(BaseModel):
    severity: Severity
    title: str
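Where the provider uses constrained decoding, enums are enforced at the token level and the model can’t pick something off the list. Where it doesn’t, Pydantic rejects the off-list value at validation time and the retry loop takes over.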
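Both halves of that are easy to confirm locally. The sketch below reads the allowed values out of the generated schema and then shows Pydantic rejecting a value that is not on the list; the "urgent" input is a made-up example.

```python
from enum import Enum

from pydantic import BaseModel, ValidationError


class Severity(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"
    critical = "critical"


class Issue(BaseModel):
    severity: Severity
    title: str


# The allowed values travel with the schema the provider receives.
allowed = Issue.model_json_schema()["$defs"]["Severity"]["enum"]

# An off-list value is rejected at validation time.
try:
    Issue.model_validate({"severity": "urgent", "title": "Login page is down"})
    rejected = False
except ValidationError:
    rejected = True
```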
Discriminated unions
from typing import Annotated, Literal, Union
from pydantic import BaseModel, Field

class TextResponse(BaseModel):
    type: Literal["text"]
    content: str

class ToolCallResponse(BaseModel):
    type: Literal["tool_call"]
    tool: str
    args: dict

Response = Annotated[
    Union[TextResponse, ToolCallResponse],
    Field(discriminator="type"),
]
The model picks one variant; Pydantic validates the right shape. Useful when an LLM might return one of several shapes.
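Because Response is a type alias rather than a model, validating it takes one extra step. A minimal sketch using Pydantic's TypeAdapter, with made-up payloads standing in for model output:

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter


class TextResponse(BaseModel):
    type: Literal["text"]
    content: str


class ToolCallResponse(BaseModel):
    type: Literal["tool_call"]
    tool: str
    args: dict


Response = Annotated[
    Union[TextResponse, ToolCallResponse],
    Field(discriminator="type"),
]

# A union is not a model, so validation goes through a TypeAdapter.
adapter = TypeAdapter(Response)

text = adapter.validate_python({"type": "text", "content": "Hello"})
call = adapter.validate_python(
    {"type": "tool_call", "tool": "search", "args": {"query": "refund policy"}}
)
```

The "type" field decides which class is instantiated, so downstream code can branch with isinstance.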
Refining with validators
from pydantic import BaseModel, field_validator

class Email(BaseModel):
    address: str

    @field_validator("address")
    @classmethod
    def must_be_company_email(cls, v: str) -> str:
        if not v.endswith("@example.com"):
            raise ValueError("must be a company email")
        return v
Pydantic validators run after the LLM produces output. If the value is structurally valid but semantically wrong, you’ll catch it and retry.
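The message you raise in a validator is what ends up in front of the model on retry, so it is worth seeing how it surfaces. A small sketch with made-up addresses:

```python
from pydantic import BaseModel, ValidationError, field_validator


class Email(BaseModel):
    address: str

    @field_validator("address")
    @classmethod
    def must_be_company_email(cls, v: str) -> str:
        if not v.endswith("@example.com"):
            raise ValueError("must be a company email")
        return v


try:
    Email(address="someone@gmail.com")
    feedback = None
except ValidationError as exc:
    # This is the text a retry loop would send back to the model.
    feedback = exc.errors()[0]["msg"]

valid = Email(address="ana@example.com")
```

Write validator messages as instructions the model can act on, since that is how they will be read.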
Cost notes
Structured output isn’t free. Two costs to know:
- Schema description tokens. Every field’s description goes into the prompt. Long descriptions = more input tokens. Be concise but specific.
- Validation retries. When validation fails, you spend tokens on the retry. With max_retries=3 the worst case is about 4× tokens (the first attempt plus three retries); in practice it is <1.05× because retries are rare.
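The retry overhead is simple arithmetic. The sketch below assumes each attempt fails independently with the same probability and costs about the same; the 2% failure rate is an assumption taken from the "~98%" figure earlier in this post, and expected_attempts is a hypothetical helper.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Expected number of calls when each attempt fails independently.

    Attempt k only happens if the previous k attempts all failed, so the
    expectation is 1 + p + p^2 + ... + p^max_retries.
    """
    return sum(failure_rate ** k for k in range(max_retries + 1))


worst_case = 1 + 3                    # first attempt plus three retries
typical = expected_attempts(0.02, 3)  # assuming 2% of attempts fail validation
```

Under those assumptions the typical multiplier is about 1.02×. Real retries cost slightly more than that, because each one resends the growing conversation.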
Cache the system prompt + schema (Anthropic prompt caching, OpenAI’s automatic prompt caching). The schema definition is stable across calls, so it is an ideal cache prefix.
See Anthropic Claude API + Tool Use Guide for caching mechanics.
Common mistakes
1. Asking for JSON in the prompt
prompt = "Return JSON with these fields..." # ⛔ unreliable
Use the native structured-output API. Always.
2. Over-nesting
A schema 6 levels deep is hard for the model to fill consistently. Flatten where possible. The model’s accuracy drops with structural complexity.
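One way to flatten without giving up your domain model is to extract into a flat schema and rebuild the nested object in code. The models below are hypothetical examples, not from any library.

```python
from pydantic import BaseModel


# Nested domain model: three levels the LLM would otherwise have to fill.
class Address(BaseModel):
    city: str
    country: str


class Customer(BaseModel):
    name: str
    address: Address


class Order(BaseModel):
    order_id: str
    customer: Customer


# Flat extraction schema: the same facts as top-level fields.
class FlatOrder(BaseModel):
    order_id: str
    customer_name: str
    customer_city: str
    customer_country: str


def to_nested(flat: FlatOrder) -> Order:
    """Rebuild the nested domain object in code, after extraction."""
    return Order(
        order_id=flat.order_id,
        customer=Customer(
            name=flat.customer_name,
            address=Address(city=flat.customer_city, country=flat.customer_country),
        ),
    )


flat = FlatOrder(
    order_id="A-17", customer_name="Ana", customer_city="Lisbon", customer_country="PT"
)
order = to_nested(flat)
```

The model only ever sees FlatOrder; the nesting is deterministic code you control.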
3. No description on fields
total: Decimal # ❌ no hint
total: Decimal = Field(..., description="The total amount, including taxes") # ✅
The description is half the prompt. Treat it like documentation for the model.
4. Returning a dict instead of a model
return resp.model_dump()  # ❌ loses types downstream
return resp  # ✅ keep it typed
The whole point is end-to-end types. Don’t unpack.
5. No eval set
Structured-output reliability is measurable. Build a 30-row eval set. Score on every model upgrade. See LLM Evaluations.
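An eval harness for this can be very small. The sketch below scores any classifier function against labeled rows. The three rows and the keyword classifier are hypothetical placeholders so it runs offline; a real classifier would call the model and return result.category.

```python
from typing import Callable

EVAL_SET = [
    # (ticket text, expected category); hypothetical rows for illustration
    ("I was charged twice this month.", "billing"),
    ("The app crashes when I open settings.", "bug"),
    ("How do I export my data?", "how_to"),
]


def score(classify: Callable[[str], str], rows=EVAL_SET) -> float:
    """Fraction of rows where the predicted category matches the expected one."""
    hits = sum(1 for text, expected in rows if classify(text) == expected)
    return hits / len(rows)


def keyword_classifier(text: str) -> str:
    """Offline stand-in for a model-backed classifier."""
    lowered = text.lower()
    if "charged" in lowered:
        return "billing"
    if "crash" in lowered:
        return "bug"
    return "other"


accuracy = score(keyword_classifier)
```

Keep the rows in version control and rerun the score whenever the model, prompt, or schema changes.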
Read this next
- Anthropic Claude API + Tool Use Guide — the API foundation.
- Prompt Engineering Patterns That Survive Production — when to add prompt structure.
- LLM Evaluations — measure reliability.
- AI Agents with LangGraph in 2026
If you want a starter Python module with Pydantic AI + Instructor + retry + eval harness, it’s at rajpoot.dev.