LLM streaming cheatsheet.
OpenAI streaming
stream = client.chat.completions.create(
    model="gpt-5",
    messages=[...],
    stream=True,
)
for chunk in stream:
    text = chunk.choices[0].delta.content or ""
    print(text, end="", flush=True)
Anthropic streaming
with client.messages.stream(model="claude-opus-4-7", max_tokens=1024, messages=[...]) as s:
    for text in s.text_stream:
        print(text, end="", flush=True)
    # Get the final message once the stream is exhausted
    msg = s.get_final_message()
FastAPI SSE endpoint
import json

from anthropic import AsyncAnthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

client = AsyncAnthropic()  # async client; reads ANTHROPIC_API_KEY from the environment
app = FastAPI()

@app.post("/chat")
async def chat(req: dict):
    async def gen():
        async with client.messages.stream(
            model="claude-opus-4-7",
            max_tokens=1024,
            messages=req["messages"],
        ) as stream:
            async for text in stream.text_stream:
                # One SSE frame per text delta
                yield f"data: {json.dumps({'text': text})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(
        gen(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "X-Accel-Buffering": "no",  # disable nginx buffering
        },
    )
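The endpoint above builds each SSE frame by hand with an f-string. If you emit more than one kind of event, a small formatter keeps the framing rules in one place. This is a minimal sketch; `sse_event` is a hypothetical helper name, not part of FastAPI or any SDK.

```python
import json
from typing import Any, Optional


def sse_event(data: Any, event: Optional[str] = None) -> str:
    """Format one Server-Sent Events frame (hypothetical helper)."""
    # Strings pass through unchanged; everything else is JSON-encoded.
    payload = data if isinstance(data, str) else json.dumps(data)
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # Each line of a multi-line payload needs its own "data:" field.
    for part in payload.split("\n"):
        lines.append(f"data: {part}")
    # A blank line terminates the frame.
    return "\n".join(lines) + "\n\n"


print(sse_event({"text": "hi"}), end="")
print(sse_event("[DONE]"), end="")
```

Note that a frame with an `event:` field is not delivered to `onmessage` in the browser; the client has to subscribe with `addEventListener` using that event name.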
Frontend EventSource
const es = new EventSource("/chat");
es.onmessage = (e) => {
  if (e.data === "[DONE]") {
    es.close();
    return;
  }
  const { text } = JSON.parse(e.data);
  setResponse(prev => prev + text);
};
EventSource is GET-only, so the snippet above works only if /chat also accepts GET. To send a POST body, use fetch + ReadableStream.
Fetch streaming (POST + body)
const res = await fetch("/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ messages }),
});
const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  // Parse SSE lines; keep the trailing partial line in the buffer
  const lines = buffer.split("\n");
  buffer = lines.pop();
  for (const line of lines) {
    if (line.startsWith("data: ")) {
      const data = line.slice(6);
      if (data === "[DONE]") return;
      const { text } = JSON.parse(data);
      setResponse(prev => prev + text);
    }
  }
}
Next.js route handler
export async function POST(req: Request) {
  const { messages } = await req.json();
  const stream = new ReadableStream({
    async start(controller) {
      const encoder = new TextEncoder();
      const openaiStream = await openai.chat.completions.create({
        model: "gpt-5",
        messages,
        stream: true,
      });
      for await (const chunk of openaiStream) {
        const text = chunk.choices[0]?.delta?.content || "";
        if (text) {
          controller.enqueue(encoder.encode(`data: ${JSON.stringify({text})}\n\n`));
        }
      }
      controller.enqueue(encoder.encode("data: [DONE]\n\n"));
      controller.close();
    },
  });
  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
      "Connection": "keep-alive",
    },
  });
}
export const runtime = "edge";
Vercel AI SDK (simplest)
import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";
export async function POST(req: Request) {
  const { messages } = await req.json();
  const result = streamText({
    model: openai("gpt-5"),
    messages,
  });
  return result.toDataStreamResponse();
}
Client:
import { useChat } from "ai/react";
function Chat() {
  const { messages, input, handleSubmit, handleInputChange } = useChat();
  return (
    <>
      {messages.map(m => <div key={m.id}>{m.content}</div>)}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} />
      </form>
    </>
  );
}
useChat handles streaming and message state for you. Highly recommended.
Streaming JSON
For structured output that you want to consume while it is still streaming:
import json_stream
# Parse partial JSON as it streams
# Or use Anthropic's structured streaming events
Or buffer the text until it parses as valid JSON.
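The buffering option can be sketched with the standard library alone. This assumes the model emits a single top-level JSON object or array; a bare number or string could parse "successfully" before it is complete. `parse_when_complete` is a hypothetical helper name.

```python
import json
from typing import Any, Iterable, Iterator


def parse_when_complete(chunks: Iterable[str]) -> Iterator[Any]:
    """Accumulate streamed text and yield the value once the buffer parses."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        try:
            value = json.loads(buffer)
        except json.JSONDecodeError:
            continue  # still incomplete; wait for more text
        yield value
        buffer = ""  # start fresh in case another document follows


# Simulated deltas from a model that is writing one JSON object
deltas = ['{"city": "Par', 'is", "temp"', ': 21}']
for obj in parse_when_complete(deltas):
    print(obj["city"], obj["temp"])
```

This re-parses the whole buffer on every chunk, which is fine for small payloads. For large ones, an incremental parser is the better fit.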
Cancellation
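import asyncio

# Client disconnects → cancel LLM stream
async def gen():
    try:
        async with client.messages.stream(...) as s:
            async for text in s.text_stream:
                yield text
    except asyncio.CancelledError:
        # Client gone; leaving the `async with` closes the upstream stream (stop billing)
        raise  # re-raise so the server still sees the cancellation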
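To see how a cancellation reaches the upstream stream, here is a self-contained simulation using only asyncio. `fake_llm_stream` is a stand-in for the provider stream, and `task.cancel()` plays the role of the client disconnecting; neither reflects a real SDK.

```python
import asyncio

events = []


async def fake_llm_stream():
    """Stand-in for an upstream LLM stream (hypothetical)."""
    try:
        for i in range(1000):
            await asyncio.sleep(0.01)
            yield f"tok{i} "
    finally:
        # Runs when the generator is cancelled or closed,
        # much like leaving an `async with` block.
        events.append("upstream closed")


async def consume():
    async for text in fake_llm_stream():
        events.append(text)


async def main():
    task = asyncio.create_task(consume())
    await asyncio.sleep(0.05)
    task.cancel()  # simulates the client disconnecting
    try:
        await task
    except asyncio.CancelledError:
        events.append("request cancelled")


asyncio.run(main())
print(events[-2:])
```

The cleanup runs before the cancellation reaches the caller, so the upstream is released even though the loop never finished.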
Nginx buffering
Disable nginx buffering for SSE:
location /chat {
    proxy_pass http://app;
    proxy_buffering off;
    proxy_cache off;
    proxy_read_timeout 24h;
    proxy_set_header Connection '';
    proxy_http_version 1.1;
    chunked_transfer_encoding off;
}
Backpressure
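If the client reads slowly, chunks that pile up on the server burn memory. Yield and flush each chunk as it arrives instead of accumulating them, and put a bound on any buffer you do keep.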
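One way to bound memory is a fixed-size queue between whatever reads the upstream and whatever writes to the client. A minimal sketch using only asyncio; `producer` and `slow_consumer` are stand-ins, and the sizes are arbitrary.

```python
import asyncio


async def producer(queue: asyncio.Queue, n: int) -> None:
    for i in range(n):
        # put() suspends while the queue is full, so the producer
        # cannot run ahead of a slow consumer.
        await queue.put(f"tok{i}")
    await queue.put(None)  # sentinel: end of stream


async def slow_consumer(queue: asyncio.Queue, sizes: list) -> list:
    received = []
    while True:
        sizes.append(queue.qsize())  # record how much is buffered
        item = await queue.get()
        if item is None:
            return received
        received.append(item)
        await asyncio.sleep(0.001)  # simulate a slow client write


async def main(n: int = 50, maxsize: int = 8):
    queue = asyncio.Queue(maxsize=maxsize)
    sizes = []
    prod = asyncio.create_task(producer(queue, n))
    received = await slow_consumer(queue, sizes)
    await prod
    return received, max(sizes)


received, peak = asyncio.run(main())
print(len(received), "chunks delivered")
```

However slow the consumer is, the number of buffered chunks never exceeds `maxsize`.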
Token-by-token vs chunk
OpenAI and Anthropic stream small text deltas, often a token or a few at a time. The browser re-renders on every chunk it receives. For super-smooth UX, batch every few tokens into one update.
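Batching can happen on the server before the SSE frame is written. A minimal sketch that groups deltas by size; `batch_tokens` and the `min_chars` threshold are hypothetical, and a time-based flush would be a reasonable alternative.

```python
from typing import Iterable, Iterator


def batch_tokens(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Group small deltas into chunks of at least min_chars characters."""
    pending = ""
    for tok in tokens:
        pending += tok
        if len(pending) >= min_chars:
            yield pending
            pending = ""
    if pending:
        yield pending  # flush the tail so no text is lost


deltas = ["Hel", "lo", ", ", "wor", "ld", "!"]
print(list(batch_tokens(deltas, min_chars=5)))
# → ['Hello', ', wor', 'ld!']
```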
Showing tool calls during stream
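// Assumes type, name, result, and text were destructured from the event
switch (type) {
  case "tool_call_start": showToolUI(name); break;
  case "tool_call_result": updateToolUI(result); break;
  case "text_delta": append(text); break;
}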
UX: show “Calling search_web…” while the tool runs.
Common mistakes
- Forgetting X-Accel-Buffering: no → nginx buffers the full response.
- Not flushing → chunks held in proxy buffer.
- Memory leak from unclosed streams.
- No timeout → hung connections.
- Race condition: client cancels but server keeps billing.
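For the missing-timeout case, one option is an idle timeout between chunks rather than a deadline on the whole response. A minimal sketch with `asyncio.wait_for`; `with_idle_timeout` and `stalling_stream` are hypothetical names, and the 0.05 s limit is only for the demo.

```python
import asyncio
from typing import AsyncIterator


async def with_idle_timeout(stream: AsyncIterator[str], seconds: float):
    """Yield from stream; raise a timeout error if it stalls between chunks."""
    it = stream.__aiter__()
    while True:
        try:
            item = await asyncio.wait_for(it.__anext__(), timeout=seconds)
        except StopAsyncIteration:
            return  # upstream finished normally
        yield item


async def stalling_stream():
    yield "a"
    yield "b"
    await asyncio.sleep(10)  # simulates a hung upstream
    yield "never"


async def main():
    got = []
    try:
        async for text in with_idle_timeout(stalling_stream(), 0.05):
            got.append(text)
    except asyncio.TimeoutError:
        got.append("<timeout>")
    return got


result = asyncio.run(main())
print(result)
```

An idle timeout lets long answers run to completion while still cutting off a connection that has gone quiet.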
Read this next
If you want my streaming chat template, it’s at rajpoot.dev.