In 2026 you should not be parsing JSON out of LLM strings. The provider APIs do it for you, the open-source libraries do it for you, and the patterns are stable. This post is the working guide to structured output — what’s available, when to use what, and the validation-retry pattern that takes you from “mostly works” to “reliable.”

What “structured output” means

You define a schema. The LLM produces a value that should match it. The framework validates that value and, if validation fails, sends the error message back and retries until it passes (or gives up after N attempts).

from pydantic import BaseModel

class Triage(BaseModel):
    category: str
    confidence: float
    reason: str

# classify() is a placeholder for any of the approaches below;
# it returns a Triage instance, not a string
result: Triage = await classify(ticket_text)

No json.loads. No regex. No “is the model going to wrap this in markdown again?” The framework handles it.
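To make the "framework validates it" step concrete, here is a minimal sketch of what happens under the hood, using Pydantic directly. The two raw payloads are made up for illustration; in a real call they would come from the model.

```python
from pydantic import BaseModel, ValidationError


class Triage(BaseModel):
    category: str
    confidence: float
    reason: str


# Illustrative payload that matches the schema.
raw_ok = '{"category": "billing", "confidence": 0.9, "reason": "refund request"}'
triage = Triage.model_validate_json(raw_ok)

# Illustrative payload with two required fields missing.
raw_bad = '{"category": "billing"}'
try:
    Triage.model_validate_json(raw_bad)
    missing = []
except ValidationError as e:
    # Each error records the field it applies to in "loc".
    missing = [err["loc"][0] for err in e.errors()]

print(triage)
print("missing fields:", missing)
```

The list of failing fields is exactly the kind of detail a framework feeds back to the model on a retry.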

Native provider APIs (use these first)

Anthropic — tool calling with tool_choice

from anthropic import Anthropic
from pydantic import BaseModel

client = Anthropic()

class Triage(BaseModel):
    category: str
    confidence: float
    reason: str

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=400,
    tools=[{
        "name": "triage",
        "description": "Return a structured triage decision.",
        "input_schema": Triage.model_json_schema(),
    }],
    tool_choice={"type": "tool", "name": "triage"},
    messages=[{"role": "user", "content": ticket_text}],
)

block = next(b for b in resp.content if b.type == "tool_use")
result = Triage.model_validate(block.input)

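The tool_choice={"type": "tool", "name": "triage"} setting forces the model to call your tool. The input_schema is JSON Schema: the model is steered toward it, but plain tool use does not guarantee that the input conforms, which is why the last line runs Triage.model_validate. That call is what gives you a typed, validated Pydantic instance.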

This is the canonical pattern for Anthropic, covered in the Anthropic Claude API + Tool Use Guide.
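If you are curious what actually goes over the wire as input_schema, the sketch below prints it and then runs model_validate on a tool input with the wrong type. The bad payload is invented for illustration.

```python
import json

from pydantic import BaseModel, ValidationError


class Triage(BaseModel):
    category: str
    confidence: float
    reason: str


# This dict is what gets sent as the tool's input_schema.
schema = Triage.model_json_schema()
print(json.dumps(schema, indent=2))

# Illustrative tool input where confidence is not a number.
bad_input = {"category": "billing", "confidence": "high", "reason": "refund request"}
try:
    Triage.model_validate(bad_input)
    rejected = False
except ValidationError:
    rejected = True

print("rejected:", rejected)
```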

OpenAI — structured outputs

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class Triage(BaseModel):
    category: str
    confidence: float
    reason: str

resp = client.beta.chat.completions.parse(
    model="gpt-5-mini",
    messages=[{"role": "user", "content": ticket_text}],
    response_format=Triage,
)

result = resp.choices[0].message.parsed   # already a Triage instance

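.parse() (instead of .create()) takes a Pydantic class and returns a validated instance. OpenAI’s structured outputs use constrained decoding, so the model cannot emit tokens that violate the schema. Recent SDK releases also expose the same method as client.chat.completions.parse, without the beta prefix.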
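One edge case is worth guarding: when the model refuses, message.parsed comes back as None and message.refusal holds the reason. A minimal sketch of that guard follows. The unwrap_parsed helper is a hypothetical name, and the SimpleNamespace objects are stand-ins for real SDK messages so the example runs offline.

```python
from types import SimpleNamespace


def unwrap_parsed(message):
    """Return message.parsed, or raise if the model refused or nothing was parsed."""
    refusal = getattr(message, "refusal", None)
    if refusal:
        raise RuntimeError(f"Model refused: {refusal}")
    if message.parsed is None:
        raise RuntimeError("No parsed output on the message")
    return message.parsed


# Stand-ins for resp.choices[0].message (not real SDK objects).
ok_message = SimpleNamespace(parsed={"category": "billing"}, refusal=None)
refused_message = SimpleNamespace(parsed=None, refusal="I can't help with that.")

value = unwrap_parsed(ok_message)

try:
    unwrap_parsed(refused_message)
    refusal_raised = False
except RuntimeError:
    refusal_raised = True
```

In real code you would call unwrap_parsed(resp.choices[0].message) instead of reading .parsed directly.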

Gemini — response_schema

Same idea, different API: Gemini takes the schema as a response_schema parameter in the request's generation config, together with response_mime_type="application/json".
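For symmetry with the other two providers, here is a rough sketch using the google-genai SDK. Assumptions: the function name classify_with_gemini is made up, the model name is left to the caller, and the API key is expected in the environment. The result is validated locally with Pydantic rather than trusted as-is.

```python
from pydantic import BaseModel


class Triage(BaseModel):
    category: str
    confidence: float
    reason: str


def classify_with_gemini(ticket_text: str, model: str) -> Triage:
    """Ask Gemini for JSON matching Triage, then validate it locally."""
    # Imported lazily so this module loads even without the SDK installed.
    from google import genai

    client = genai.Client()  # picks up the API key from the environment
    response = client.models.generate_content(
        model=model,  # caller-supplied model name; none is assumed here
        contents=ticket_text,
        config={
            "response_mime_type": "application/json",
            "response_schema": Triage,
        },
    )
    # response.text is the raw JSON string; Pydantic does the final check.
    return Triage.model_validate_json(response.text)
```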

For all three providers, the native API is your default. They’re fast, reliable, and don’t cost extra.

Pydantic AI — the framework

Pydantic AI (the official Pydantic-team agent framework) wraps these provider APIs in a clean, multi-provider interface:

from pydantic_ai import Agent
from pydantic import BaseModel

class Triage(BaseModel):
    category: str
    confidence: float
    reason: str

agent = Agent(
    "anthropic:claude-sonnet-4-6",
    output_type=Triage,          # called result_type in older releases
    system_prompt="Classify customer support tickets.",
)

result = await agent.run(ticket_text)
print(result.output)            # Triage instance (result.data in older releases)

What you get:

  • Provider-agnostic. Switch from anthropic:claude-sonnet-4-6 to openai:gpt-5-mini by changing one string.
  • Type-safe. result.output is typed as Triage end-to-end.
  • Tool calling. Add tools=[my_tool] and the agent can call them.
  • Streaming. agent.run_stream(...) for SSE-style consumption.
  • Validation retry. Built-in.

For full agents, Pydantic AI is competitive with LangGraph. For just typed extraction, it’s overkill but very pleasant. See AI Agents with LangGraph for the alternative.

Instructor — the focused library

If you want one library to give you typed JSON from any provider:

import instructor
from openai import OpenAI
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())

class Triage(BaseModel):
    category: str
    confidence: float
    reason: str

result = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[{"role": "user", "content": ticket_text}],
    response_model=Triage,
)
# result is a Triage instance

Same shape works with Anthropic, Gemini, Cohere, Groq, Mistral, Ollama. Instructor abstracts the differences.

What’s nice:

  • Drop-in. Wraps your existing OpenAI client.
  • Validation retry. Set max_retries=3 and Instructor handles “model returned wrong shape, ask again.”
  • Streaming structured output. Yields partial Pydantic instances as fields fill in.

When to pick Instructor over Pydantic AI:

  • You have existing OpenAI/Anthropic SDK code and don’t want a new framework.
  • You only want structured output (no agent loops, tools, multi-step).
  • You want the fewest new abstractions.

When to pick Pydantic AI:

  • You want a full agent framework.
  • You’re building from scratch.
  • You want first-class streaming, tool use, and provider switching.

The validation-retry pattern

Native APIs work ~98% of the time. Sometimes the model still produces invalid output (very rare with constrained decoding, more common when you ask for free-form generation that contains structure). The pattern:

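from anthropic import AsyncAnthropic
from pydantic import ValidationError

client = AsyncAnthropic()  # async client, so the call below can be awaited
TOOL = {"name": "triage", "description": "Return a structured triage decision.",
        "input_schema": Triage.model_json_schema()}

async def extract_with_retry(text: str, max_retries: int = 3) -> Triage:
    last_error = None
    messages = [{"role": "user", "content": text}]

    for attempt in range(max_retries):
        resp = await client.messages.create(
            model="claude-sonnet-4-6", max_tokens=400, tools=[TOOL],
            tool_choice={"type": "tool", "name": "triage"}, messages=messages,
        )
        block = next(b for b in resp.content if b.type == "tool_use")
        try:
            return Triage.model_validate(block.input)
        except ValidationError as e:
            last_error = e
            # Replay the failed call, then report the error as its tool_result
            # (a tool_use block must be answered by a matching tool_result).
            messages.append({"role": "assistant", "content": resp.content})
            messages.append({
                "role": "user",
                "content": [{
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "is_error": True,
                    "content": f"Validation error: {e.json()}\nCall the tool again with valid input.",
                }],
            })

    raise RuntimeError(f"Failed after {max_retries} attempts: {last_error}")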

Pass the validation error back to the model. It usually fixes itself in one retry.

Pydantic AI and Instructor do this for you. Worth knowing the pattern when you build your own.
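If you do build your own, the loop itself is provider-independent. Below is a minimal sketch that runs offline: validate_with_retry and fake_model are hypothetical names, and fake_model stands in for a real provider call by returning a bad payload first and a good one once it receives feedback.

```python
from typing import Callable, Optional

from pydantic import BaseModel, ValidationError


class Triage(BaseModel):
    category: str
    confidence: float
    reason: str


def validate_with_retry(
    call_model: Callable[[Optional[str]], dict],
    max_attempts: int = 3,
) -> Triage:
    """Call the model, validate, and feed validation errors back until it passes."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        payload = call_model(feedback)
        try:
            return Triage.model_validate(payload)
        except ValidationError as e:
            # The error text becomes the feedback for the next attempt.
            feedback = str(e)
    raise RuntimeError(f"Failed after {max_attempts} attempts: {feedback}")


# Records the feedback passed on each call so the retry is observable.
calls: list = []


def fake_model(feedback: Optional[str]) -> dict:
    """Stand-in for a provider call: wrong on the first try, right after feedback."""
    calls.append(feedback)
    if feedback is None:
        return {"category": "billing", "confidence": "very sure", "reason": "refund"}
    return {"category": "billing", "confidence": 0.9, "reason": "refund"}


result = validate_with_retry(fake_model)
print(result, "after", len(calls), "calls")
```

Keeping the provider call behind a plain callable makes the loop easy to unit test without network access.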

Real-world schemas

Some schema shapes that come up constantly:

Extraction with optionals

from datetime import date
from decimal import Decimal
from pydantic import BaseModel, Field

class Invoice(BaseModel):
    invoice_number: str = Field(..., description="The invoice number, e.g. 'INV-2026-0042'")
    total: Decimal = Field(..., description="Total amount including tax")
    currency: str = Field(..., pattern=r"^[A-Z]{3}$")
    due_date: date | None = None
    line_items: list[LineItem]  # LineItem is another BaseModel, defined elsewhere

The Field(..., description=...) is gold — descriptions become part of the prompt the model sees. Be specific.
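You can check what the model will see by dumping the schema. The sketch below uses a trimmed-down Invoice (two fields only); the currency description is added here for illustration and is not part of the schema above.

```python
from pydantic import BaseModel, Field, ValidationError


class Invoice(BaseModel):
    invoice_number: str = Field(..., description="The invoice number, e.g. 'INV-2026-0042'")
    # The description below is an illustrative addition.
    currency: str = Field(..., pattern=r"^[A-Z]{3}$", description="ISO 4217 code, e.g. 'USD'")


# Descriptions and patterns travel inside the JSON Schema sent to the provider.
props = Invoice.model_json_schema()["properties"]
print(props["invoice_number"]["description"])
print(props["currency"]["pattern"])

# The same pattern is enforced locally when the response is validated.
try:
    Invoice.model_validate({"invoice_number": "INV-2026-0042", "currency": "usd"})
    pattern_enforced = False
except ValidationError:
    pattern_enforced = True
```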

Enums for categorical

from enum import Enum

class Severity(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"
    critical = "critical"

class Issue(BaseModel):
    severity: Severity
    title: str

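With constrained decoding (OpenAI’s structured outputs, for example), the enum is enforced at generation time, so the model can’t pick something off the list. With providers that don’t constrain decoding, Pydantic rejects an unknown value at validation and the retry loop takes over.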
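Either way, the allowed values end up in the schema and Pydantic checks them on the way back. A short sketch, with made-up payloads:

```python
from enum import Enum

from pydantic import BaseModel, ValidationError


class Severity(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"
    critical = "critical"


class Issue(BaseModel):
    severity: Severity
    title: str


# The enum values appear in the schema under $defs.
allowed = Issue.model_json_schema()["$defs"]["Severity"]["enum"]

# A value on the list parses into the enum member.
issue = Issue.model_validate({"severity": "high", "title": "Login fails"})

# A value off the list is rejected.
try:
    Issue.model_validate({"severity": "urgent", "title": "Login fails"})
    rejected = False
except ValidationError:
    rejected = True
```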

Discriminated unions

from typing import Annotated, Literal, Union
from pydantic import BaseModel, Field

class TextResponse(BaseModel):
    type: Literal["text"]
    content: str

class ToolCallResponse(BaseModel):
    type: Literal["tool_call"]
    tool: str
    args: dict

Response = Annotated[
    Union[TextResponse, ToolCallResponse],
    Field(discriminator="type"),
]

The model picks one variant; Pydantic validates the right shape. Useful when an LLM might return one of several shapes.
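Because Response is an annotated union rather than a BaseModel, validating it takes a TypeAdapter. A minimal sketch with invented payloads:

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter, ValidationError


class TextResponse(BaseModel):
    type: Literal["text"]
    content: str


class ToolCallResponse(BaseModel):
    type: Literal["tool_call"]
    tool: str
    args: dict


Response = Annotated[
    Union[TextResponse, ToolCallResponse],
    Field(discriminator="type"),
]

adapter = TypeAdapter(Response)

# The "type" field decides which variant is validated.
text = adapter.validate_python({"type": "text", "content": "Hello"})
call = adapter.validate_python(
    {"type": "tool_call", "tool": "search", "args": {"q": "refund policy"}}
)

# An unknown discriminator value is rejected outright.
try:
    adapter.validate_python({"type": "image", "content": "..."})
    unknown_rejected = False
except ValidationError:
    unknown_rejected = True
```

The discriminator also keeps error messages focused on the one variant the model chose rather than listing failures for every variant.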

Refining with validators

from pydantic import BaseModel, field_validator

class Email(BaseModel):
    address: str

    @field_validator("address")
    @classmethod
    def must_be_company_email(cls, v: str) -> str:
        if not v.endswith("@example.com"):
            raise ValueError("must be a company email")
        return v

Pydantic validators run after the LLM produces output. If the value is structurally valid but semantically wrong, you’ll catch it and retry.
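Here is a quick sketch of what that looks like at runtime. The message raised by the validator surfaces in the ValidationError, which makes it usable as retry feedback. The sample addresses are made up.

```python
from pydantic import BaseModel, ValidationError, field_validator


class Email(BaseModel):
    address: str

    @field_validator("address")
    @classmethod
    def must_be_company_email(cls, v: str) -> str:
        if not v.endswith("@example.com"):
            raise ValueError("must be a company email")
        return v


# Structurally valid (it is a string) but semantically wrong.
try:
    Email(address="bob@gmail.com")
    feedback = None
except ValidationError as e:
    # The validator's message is carried in the first error's "msg".
    feedback = e.errors()[0]["msg"]

valid = Email(address="alice@example.com")
print(feedback)
```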

Cost notes

Structured output isn’t free. Two costs to know:

  • Schema description tokens. Every field’s description goes into the prompt. Long descriptions = more input tokens. Be concise but specific.
  • Validation retries. When validation fails, you spend tokens on the retry. With max_retries=3 the worst case is about 4× the tokens (the first call plus three retries); in practice it is under 1.05× because retries are rare.
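The retry overhead is easy to estimate. The sketch below assumes each attempt fails independently with the same probability and ignores the extra tokens from the growing conversation, so treat it as a rough lower bound. The 2% failure rate is an assumption taken from the "~98%" figure earlier in this post.

```python
def expected_calls(failure_rate: float, max_retries: int) -> float:
    """Expected API calls per request when each attempt fails independently."""
    # Attempt k (0-based) happens only if the previous k attempts all failed.
    return sum(failure_rate ** k for k in range(max_retries + 1))


typical = expected_calls(0.02, 3)  # assumed 2% failure rate
worst = expected_calls(1.0, 3)     # every attempt fails

print(f"typical: {typical:.4f}x, worst case: {worst:.0f}x")
```

At a 2% failure rate the expected overhead is about 2%, which is where the "under 1.05×" figure comes from.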

Cache the system prompt + schema (Anthropic prompt caching, OpenAI’s automatic prompt caching). The schema definition is stable across calls, so it makes a natural cache prefix.

See Anthropic Claude API + Tool Use Guide for caching mechanics.
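As a rough sketch of the Anthropic side: a cache_control marker on the tool definition asks the API to cache the prompt prefix up to and including that tool. Note that prompts shorter than the provider's minimum cacheable length are not cached, so a tiny schema on its own may not benefit.

```python
from pydantic import BaseModel


class Triage(BaseModel):
    category: str
    confidence: float
    reason: str


triage_tool = {
    "name": "triage",
    "description": "Return a structured triage decision.",
    "input_schema": Triage.model_json_schema(),
    # Marks the prefix ending at this tool as cacheable across calls.
    "cache_control": {"type": "ephemeral"},
}

# Pass it as tools=[triage_tool] in client.messages.create(...).
```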

Common mistakes

1. Asking for JSON in the prompt

prompt = "Return JSON with these fields..."     # ⛔ unreliable

Use the native structured-output API. Always.

2. Over-nesting

A schema 6 levels deep is hard for the model to fill consistently. Flatten where possible. The model’s accuracy drops with structural complexity.

3. No description on fields

total: Decimal                                  # ❌ no hint
total: Decimal = Field(..., description="The total amount, including taxes")  # ✅

The description is half the prompt. Treat it like documentation for the model.

4. Returning a dict instead of a model

return resp.model_dump()                        # loses types downstream
return resp                                     # keep it typed

The whole point is end-to-end types. Don’t unpack.

5. No eval set

Structured-output reliability is measurable. Build a 30-row eval set. Score on every model upgrade. See LLM Evaluations.
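An eval harness does not need to be fancy. Below is a minimal sketch: the three rows and the keyword_classifier are placeholders so the example runs offline. In practice the rows come from real tickets and the classifier wraps your LLM call.

```python
from typing import Callable, List, Tuple

# Placeholder rows: (ticket text, expected category).
EVAL_SET: List[Tuple[str, str]] = [
    ("I was charged twice this month", "billing"),
    ("The app crashes when I upload a photo", "bug"),
    ("How do I change my password?", "account"),
]


def accuracy(classify: Callable[[str], str], rows: List[Tuple[str, str]]) -> float:
    """Fraction of rows where the predicted category equals the expected one."""
    hits = sum(1 for text, expected in rows if classify(text) == expected)
    return hits / len(rows)


def keyword_classifier(text: str) -> str:
    """Stand-in for an LLM-backed classifier, used only to exercise the harness."""
    lowered = text.lower()
    if "charged" in lowered:
        return "billing"
    if "crash" in lowered:
        return "bug"
    return "other"


score = accuracy(keyword_classifier, EVAL_SET)
print(f"accuracy: {score:.2f}")
```

Run the same function against each candidate model and compare the scores before switching.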

Read this next

If you want a starter Python module with Pydantic AI + Instructor + retry + eval harness, it’s at rajpoot.dev.

