LLM streaming cheatsheet.

OpenAI streaming

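from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-5",
    messages=[...],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:  # some chunks (e.g. usage-only) carry no choices
        continue
    text = chunk.choices[0].delta.content or ""
    print(text, end="", flush=True)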

Anthropic streaming

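import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with client.messages.stream(model="claude-opus-4-7", max_tokens=1024, messages=[...]) as s:
    for text in s.text_stream:
        print(text, end="", flush=True)
    # Once the text stream is exhausted, fetch the assembled final message
    msg = s.get_final_message()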

FastAPI SSE endpoint

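import json

from anthropic import AsyncAnthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
client = AsyncAnthropic()  # async client, needed for the async with / async for below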

@app.post("/chat")
async def chat(req: dict):
    async def gen():
        async with client.messages.stream(
            model="claude-opus-4-7",
            max_tokens=1024,
            messages=req["messages"],
        ) as stream:
            async for text in stream.text_stream:
                yield f"data: {json.dumps({'text': text})}\n\n"
            yield "data: [DONE]\n\n"
    
    return StreamingResponse(
        gen(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "X-Accel-Buffering": "no",     # disable nginx buffering
        },
    )
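
The endpoint above builds each SSE frame by hand with an f-string. A minimal sketch of a helper that does the same framing is below. `format_sse` is a hypothetical name for illustration, not part of FastAPI or any SDK. It JSON-encodes the payload so that newlines inside the text cannot break the frame, and it optionally adds an `event:` field.

```python
import json
from typing import Any, Optional


def format_sse(payload: Any, event: Optional[str] = None) -> str:
    """Return one Server-Sent Events frame for a JSON-serializable payload.

    Hypothetical helper. json.dumps escapes newlines inside strings, so the
    whole payload fits on a single "data:" line.
    """
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(payload)}")
    # A blank line terminates the frame.
    return "\n".join(lines) + "\n\n"


print(format_sse({"text": "Hello"}), end="")
```

If you pass `event`, browsers deliver the frame to `addEventListener(name, ...)` handlers rather than to `onmessage`, so leave it unset for the simple client shown in this cheatsheet.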

Frontend EventSource

const es = new EventSource("/chat");
es.onmessage = (e) => {
    if (e.data === "[DONE]") {
        es.close();
        return;
    }
    const { text } = JSON.parse(e.data);
    setResponse(prev => prev + text);
};

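EventSource can only issue GET requests and cannot send a body, so it needs a GET route on the server. To call a POST endpoint such as the FastAPI one above, use fetch + ReadableStream.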

Fetch streaming (POST + body)

const res = await fetch("/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages }),
});

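// Fail early on HTTP errors instead of trying to parse an error page as SSE
if (!res.ok || !res.body) throw new Error(`Request failed: ${res.status}`);

const reader = res.body.getReader();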
const decoder = new TextDecoder();
let buffer = "";

while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    
    // Parse SSE lines
    const lines = buffer.split("\n");
    buffer = lines.pop();
    
    for (const line of lines) {
        if (line.startsWith("data: ")) {
            const data = line.slice(6);
            if (data === "[DONE]") return;
            const { text } = JSON.parse(data);
            setResponse(prev => prev + text);
        }
    }
}

Next.js route handler

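import OpenAI from "openai";

const openai = new OpenAI();  // reads OPENAI_API_KEY from the environment

export async function POST(req: Request) {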
    const { messages } = await req.json();
    
    const stream = new ReadableStream({
        async start(controller) {
            const encoder = new TextEncoder();
            
            const openaiStream = await openai.chat.completions.create({
                model: "gpt-5",
                messages,
                stream: true,
            });
            
            for await (const chunk of openaiStream) {
                const text = chunk.choices[0]?.delta?.content || "";
                if (text) {
                    controller.enqueue(encoder.encode(`data: ${JSON.stringify({text})}\n\n`));
                }
            }
            controller.enqueue(encoder.encode("data: [DONE]\n\n"));
            controller.close();
        },
    });
    
    return new Response(stream, {
        headers: {
            "Content-Type": "text/event-stream",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
        },
    });
}

export const runtime = "edge";

Vercel AI SDK (simplest)

import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";

export async function POST(req: Request) {
    const { messages } = await req.json();
    
    const result = streamText({
        model: openai("gpt-5"),
        messages,
    });
    
    return result.toDataStreamResponse();
}

Client:

import { useChat } from "ai/react";

function Chat() {
    const { messages, input, handleSubmit, handleInputChange } = useChat();
    
    return (
        <>
            {messages.map(m => <div key={m.id}>{m.content}</div>)}
            <form onSubmit={handleSubmit}>
                <input value={input} onChange={handleInputChange} />
            </form>
        </>
    );
}

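useChat handles the streaming transport and the message state for you. Highly recommended. Import paths and helper names have changed between AI SDK major versions, so check them against the version you have installed (recent releases ship the React hooks as @ai-sdk/react).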

Streaming JSON

For structured output that you want to consume while it is still streaming:

import json_stream

# Parse partial JSON as it streams
# Or use Anthropic's structured streaming events

Alternatively, buffer the streamed text and parse it once it forms valid JSON.
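
A minimal sketch of the buffering approach, using only the standard library. `parse_when_complete` is a hypothetical helper name, and the chunk list stands in for text deltas coming from a model.

```python
import json
from typing import Any, Iterable, Optional


def parse_when_complete(chunks: Iterable[str]) -> Optional[Any]:
    """Accumulate streamed text and return the first buffer that parses as JSON.

    Returns None if the stream ends before the buffer is valid JSON.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        try:
            return json.loads(buffer)
        except json.JSONDecodeError:
            continue  # incomplete so far; wait for more text
    return None


chunks = ['{"city": "Par', 'is", "temp', '": 21}']
print(parse_when_complete(chunks))
```

This is reliable for objects and arrays, because no proper prefix of a complete object or array is itself valid JSON. A bare number could parse early (`12` before `123`), and re-parsing the whole buffer on every chunk is quadratic, so for large payloads an incremental parser is the better fit.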

Cancellation

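import asyncio

# Client disconnects → the server cancels gen() → the LLM stream is closed
async def gen():
    try:
        async with client.messages.stream(...) as s:
            async for text in s.text_stream:
                yield text
    except asyncio.CancelledError:
        # Client gone. Leaving the async with block closes the upstream
        # stream, so no more tokens are pulled. Re-raise so the
        # cancellation is not swallowed.
        raise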
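
To see why cancellation reaches the upstream stream, here is a small self-contained sketch that uses only asyncio. `fake_llm_stream` is a hypothetical stand-in for a real SDK stream, and `task.cancel()` plays the role of the web framework reacting to a client disconnect.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_llm_stream(log: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for an upstream LLM stream."""
    try:
        while True:
            await asyncio.sleep(0.01)
            yield "tok"
    finally:
        # A real stream would close its HTTP connection here.
        log.append("upstream closed")


async def main() -> List[str]:
    log: List[str] = []

    async def consume() -> None:
        async for _ in fake_llm_stream(log):
            pass

    task = asyncio.create_task(consume())
    await asyncio.sleep(0.05)
    task.cancel()  # simulates the client disconnecting
    try:
        await task
    except asyncio.CancelledError:
        pass
    return log


print(asyncio.run(main()))
```

The cancellation is raised at the pending `await` inside the generator, so its `finally` block (or an enclosing `async with`) runs before the task finishes.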

Nginx buffering

Disable nginx buffering for SSE:

location /chat {
    proxy_pass http://app;
    proxy_buffering off;
    proxy_cache off;
    proxy_read_timeout 24h;
    proxy_set_header Connection '';
    proxy_http_version 1.1;
    chunked_transfer_encoding off;
}

Backpressure

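If the client reads slowly, chunks that pile up in an unbounded buffer consume server memory. Yield each chunk as it arrives, flush regularly, and bound any intermediate queue so the producer has to wait for the consumer.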
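
One way to picture backpressure is a bounded `asyncio.Queue` between the code reading from the model and the code writing to the client. The sketch below is illustrative only: `producer`, `slow_consumer`, and the sizes are hypothetical, and a real server would replace the sleep with the actual socket write.

```python
import asyncio
from typing import Dict, List, Tuple


async def producer(queue: asyncio.Queue, n: int, stats: Dict[str, int]) -> None:
    for i in range(n):
        await queue.put(f"chunk-{i}")  # waits here while the queue is full
        stats["max_buffered"] = max(stats["max_buffered"], queue.qsize())
    await queue.put(None)  # sentinel: end of stream


async def slow_consumer(queue: asyncio.Queue, received: List[str]) -> None:
    while True:
        item = await queue.get()
        if item is None:
            break
        await asyncio.sleep(0.001)  # simulate a slow client
        received.append(item)


async def main(n: int = 50, maxsize: int = 4) -> Tuple[int, int]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    stats = {"max_buffered": 0}
    received: List[str] = []
    await asyncio.gather(producer(queue, n, stats), slow_consumer(queue, received))
    return len(received), stats["max_buffered"]


print(asyncio.run(main()))
```

Because `put` blocks when the queue is full, memory use stays bounded by `maxsize` no matter how slowly the consumer drains it.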

Token-by-token vs chunk

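OpenAI and Anthropic stream small text deltas, roughly a token or a few tokens at a time, and the browser re-renders on every chunk it receives. For a smoother UX, batch a few tokens before each UI update.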
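
A minimal sketch of count-based batching, written as a plain generator so it can wrap any token iterator. `batch_tokens` and the batch size of 4 are hypothetical choices, not taken from any SDK.

```python
from typing import Iterable, Iterator, List


def batch_tokens(tokens: Iterable[str], size: int = 4) -> Iterator[str]:
    """Join every `size` tokens into one string; flush the remainder at the end."""
    pending: List[str] = []
    for token in tokens:
        pending.append(token)
        if len(pending) >= size:
            yield "".join(pending)
            pending.clear()
    if pending:
        yield "".join(pending)


tokens = ["Str", "eam", "ing", " is", " fun", "!"]
print(list(batch_tokens(tokens, size=4)))  # ['Streaming is', ' fun!']
```

Count-based batching alone can stall the display when the model pauses mid-batch, so in practice it is often paired with a short time-based flush.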

Showing tool calls during stream

switch (event.type) {
    case "tool_call_start": showToolUI(event.name); break;
    case "tool_call_result": updateToolUI(event.result); break;
    case "text_delta": append(event.text); break;
}

UX: show “Calling search_web…” while tool runs.

Common mistakes

  • Forgetting X-Accel-Buffering: no → nginx buffers full response.
  • Not flushing → chunks held in proxy buffer.
  • Memory leak from unclosed streams.
  • No timeout → hung connections.
  • Race condition: client cancels but server keeps billing.
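
For the missing-timeout case, one option is an idle timeout applied to each chunk rather than to the whole response. The sketch below assumes a generic async iterator of strings. `with_idle_timeout` and `slow_stream` are hypothetical names used only for illustration.

```python
import asyncio
from typing import AsyncIterator, List


async def with_idle_timeout(
    stream: AsyncIterator[str], seconds: float
) -> AsyncIterator[str]:
    """Yield from `stream`; raise asyncio.TimeoutError if it stalls for `seconds`."""
    iterator = stream.__aiter__()
    while True:
        try:
            item = await asyncio.wait_for(iterator.__anext__(), timeout=seconds)
        except StopAsyncIteration:
            return  # upstream finished normally
        yield item


async def slow_stream() -> AsyncIterator[str]:
    yield "a"
    await asyncio.sleep(1)  # stall longer than the timeout used below
    yield "b"


async def main() -> List[str]:
    received: List[str] = []
    try:
        async for item in with_idle_timeout(slow_stream(), seconds=0.05):
            received.append(item)
    except asyncio.TimeoutError:
        received.append("<timeout>")
    return received


print(asyncio.run(main()))  # ['a', '<timeout>']
```

An idle timeout lets long but healthy responses continue while still cutting off a connection that has stopped producing output.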

Read this next

If you want my streaming chat template, it’s at rajpoot.dev.

