A user clicks “Stop generating” and you want the LLM to stop. Implemented naively, the model keeps generating; you keep paying. This post is the working pattern.
Cancellation flow
Client clicks stop
↓
AbortController.abort() → fetch closes the connection (RST_STREAM on HTTP/2)
↓
Server's request handler detects the client disconnect
↓
Cancels the LLM streaming call
↓
Anthropic/OpenAI stops generation
↓
Provider stops billing for further output tokens
Each link must propagate the cancel.
Server-side (FastAPI)
import json

import anthropic
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()
client = anthropic.AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

class ChatIn(BaseModel):  # assumed request shape: {"message": "..."}
    message: str

@app.post("/chat")
async def chat(req: Request, payload: ChatIn):
    async def generate():
        async with client.messages.stream(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            messages=[{"role": "user", "content": payload.message}],
        ) as stream:
            async for text in stream.text_stream:
                if await req.is_disconnected():
                    break  # client cancelled; leaving the block closes the stream
                yield f"data: {json.dumps({'text': text})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
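is_disconnected() checks whether the client has closed the connection. The check runs once per token, so a cancel takes effect when the next token arrives. Breaking out of the loop exits the async with block, which closes the SDK's HTTP stream to Anthropic; that closed connection is the signal to stop, so no more tokens are generated.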
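To see why leaving the block matters, here is a minimal sketch with no network calls. FakeStream is a hypothetical stand-in for the SDK's stream manager (it is not the real Anthropic API), and the disconnected_after parameter stands in for req.is_disconnected(). It only illustrates that a break inside async with still runs the exit hook, and that the producer stops being asked for tokens.

```python
import asyncio


class FakeStream:
    """Hypothetical stand-in for an SDK stream manager; not the real Anthropic API."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.closed = False
        self.produced = 0

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # A real SDK would close its HTTP connection to the provider here.
        self.closed = True
        return False

    async def text_stream(self):
        for tok in self.tokens:
            await asyncio.sleep(0)  # simulate waiting on the network
            self.produced += 1
            yield tok


async def consume(stream_mgr, disconnected_after):
    """Collect tokens until the simulated client disconnects."""
    sent = []
    async with stream_mgr as stream:
        async for text in stream.text_stream():
            # Stands in for `await req.is_disconnected()`.
            if len(sent) >= disconnected_after:
                break
            sent.append(text)
    return sent


def demo():
    mgr = FakeStream(["a", "b", "c", "d", "e"])
    sent = asyncio.run(consume(mgr, disconnected_after=2))
    return sent, mgr.closed, mgr.produced
```

Calling demo() returns the tokens forwarded before the simulated disconnect, whether the exit hook ran, and how many tokens the fake producer generated. The producer count stays below the full five because nothing pulls from it after the break.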
Client-side
const ac = new AbortController();
document.getElementById("stop")!.addEventListener("click", () => ac.abort());

try {
  // fetch itself rejects on abort, so it belongs inside the try block too
  const resp = await fetch("/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" }, // FastAPI expects a JSON body
    body: JSON.stringify({ message }),
    signal: ac.signal, // propagates abort to fetch
  });
  const reader = resp.body!.getReader();
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    process(value); // value is a Uint8Array chunk
  }
} catch (e) {
  if (ac.signal.aborted) {
    // user cancelled
  } else {
    throw e;
  }
}
AbortController.abort() rejects the pending fetch or read and closes the connection, which triggers the cleanup chain on the server.
With WebSockets
For interactive agents where users send mid-stream signals:
ws.send(JSON.stringify({ type: "cancel", message_id: currentId }));
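# Server side
import asyncio
import anthropic

client = anthropic.AsyncAnthropic()

async def handle_ws(ws):
    cancel = asyncio.Event()
    tasks = set()  # hold references so running tasks are not garbage-collected
    async for msg in ws.iter_json():
        if msg["type"] == "cancel":
            cancel.set()
        elif msg["type"] == "prompt":
            cancel.clear()
            # Run in a task so this loop stays free to receive "cancel".
            task = asyncio.create_task(stream_response(msg["content"], cancel, ws))
            tasks.add(task)
            task.add_done_callback(tasks.discard)

async def stream_response(content, cancel, ws):
    # Pass model, max_tokens, and messages built from `content`, as in the /chat handler.
    async with client.messages.stream(...) as stream:
        async for text in stream.text_stream:
            if cancel.is_set():
                break
            await ws.send_json({"type": "token", "text": text})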
For help choosing between SSE and WebSockets, see SSE vs WebSockets in 2026.
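One detail of the WebSocket pattern is easy to miss: the receive loop can only see a cancel message if it is not blocked waiting for the stream to finish. The sketch below simulates this without a network. An asyncio.Queue stands in for the socket, and the "close" message type and "tokens" field are hypothetical, added only so the demo can run and terminate.

```python
import asyncio


async def stream_response(tokens, cancel, out):
    """Forward tokens until `cancel` is set (stand-in for the LLM stream loop)."""
    for tok in tokens:
        await asyncio.sleep(0.01)  # simulate per-token latency
        if cancel.is_set():
            break
        out.append(tok)


async def handle_messages(inbox, out):
    """Receive loop: streaming runs in a task so 'cancel' can still be read."""
    cancel = asyncio.Event()
    task = None
    while True:
        msg = await inbox.get()
        if msg["type"] == "cancel":
            cancel.set()
        elif msg["type"] == "prompt":
            cancel.clear()
            task = asyncio.create_task(stream_response(msg["tokens"], cancel, out))
        elif msg["type"] == "close":  # hypothetical message to end the demo
            break
    if task is not None:
        await task


async def demo():
    inbox, out = asyncio.Queue(), []
    handler = asyncio.create_task(handle_messages(inbox, out))
    await inbox.put({"type": "prompt", "tokens": [str(i) for i in range(100)]})
    await asyncio.sleep(0.05)  # let a few tokens through
    await inbox.put({"type": "cancel"})
    await inbox.put({"type": "close"})
    await handler
    return out
```

Running asyncio.run(demo()) yields only the first few of the 100 tokens. If the handler awaited stream_response directly instead of creating a task, the cancel message would sit unread until all 100 had been sent.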
Partial response handling
A response cancelled mid-stream leaves you with partial output. Your options:
- Show what generated: useful for chat (user sees what got produced).
- Discard: for structured outputs (a half-JSON is useless).
- Store and resume: keep the partial output and continue from it on retry. (Mostly impractical with current APIs.)
For chat, store-and-show is the default.
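The show-or-discard decision can be sketched as a small helper. The name handle_partial and its structured flag are illustrative assumptions, and it treats JSON as the structured format because that is the example used above.

```python
import json


def handle_partial(partial_text, structured):
    """Decide what to do with output that was cut off by a cancel.

    Returns the text to store and show, or None if it should be discarded.
    """
    if not structured:
        # Chat: keep whatever was generated so the user still sees it.
        return partial_text
    try:
        json.loads(partial_text)
    except json.JSONDecodeError:
        # Half-finished JSON is useless downstream; discard it.
        return None
    # The cancel happened to land after a complete JSON document.
    return partial_text
```

For example, handle_partial("Hello, wor", structured=False) keeps the text, while handle_partial('{"a": 1, "b":', structured=True) returns None.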
Cost savings
A user who frequently stops mid-response saves real money:
- Average response: 500 output tokens at $15/MTok = $0.0075.
- Cancelled at 100 tokens: $0.0015.
- Save: $0.006 per cancel.
At 1M cancels/month: $6k saved. Not nothing.
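The arithmetic above is easy to rerun with your own numbers. This is a minimal sketch using the $15/MTok output price from the example; the function names are illustrative, and input-token cost is ignored because it is paid whether or not the user cancels.

```python
PRICE_PER_MTOK = 15.00  # USD per million output tokens, from the example above


def output_cost(tokens, price_per_mtok=PRICE_PER_MTOK):
    """Cost in USD of generating `tokens` output tokens."""
    return tokens * price_per_mtok / 1_000_000


def monthly_savings(full_tokens, cancelled_tokens, cancels_per_month):
    """USD saved per month when cancelled responses stop early."""
    per_cancel = output_cost(full_tokens) - output_cost(cancelled_tokens)
    return per_cancel * cancels_per_month
```

With the figures above, monthly_savings(500, 100, 1_000_000) comes out at about 6,000 dollars, matching the $6k estimate.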
Common mistakes
1. Not propagating cancel to provider
The backend stops sending bytes to the client but keeps getting billed because it never told Anthropic to stop. Use the SDK’s streaming context manager (async with ... as stream:) and break out of it when you detect the cancel.
2. Browser fetch without AbortController
Without a signal, the UI has no way to abort the in-flight request, so it keeps running on the server. Always pass signal: ac.signal to fetch.
3. No client UI for cancel
Long generations with no stop button → user closes tab → server keeps going. UX bug AND cost bug.
4. Forgetting to break after cancel detection
The loop keeps yielding; bytes accumulate; cancel doesn’t actually stop.
5. SSE without keepalive
Some proxies time out idle connections. Send : keepalive\n\n every 15s during slow generation.
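For the keepalive in mistake 5, one possible approach is to wrap the SSE generator so that it emits a comment line whenever the source has been quiet for too long. This is a sketch under assumptions: with_keepalive is a hypothetical helper, not part of FastAPI or the Anthropic SDK, and it expects the source to yield already-formatted SSE strings.

```python
import asyncio
from typing import AsyncIterator

KEEPALIVE = ": keepalive\n\n"  # SSE comment line; clients ignore it


async def with_keepalive(
    source: AsyncIterator[str], interval: float = 15.0
) -> AsyncIterator[str]:
    """Yield chunks from `source`, inserting a keepalive after `interval` idle seconds."""
    it = source.__aiter__()
    pending = asyncio.ensure_future(it.__anext__())
    try:
        while True:
            # Wait for the next chunk, but no longer than `interval`.
            done, _ = await asyncio.wait({pending}, timeout=interval)
            if not done:
                yield KEEPALIVE  # source is idle; keep the connection warm
                continue
            try:
                chunk = pending.result()
            except StopAsyncIteration:
                return  # source finished
            yield chunk
            pending = asyncio.ensure_future(it.__anext__())
    finally:
        # If the consumer stops early, do not leave the fetch task dangling.
        pending.cancel()
```

In the FastAPI handler this would be used as StreamingResponse(with_keepalive(generate()), media_type="text/event-stream"). The pending fetch is kept across timeouts rather than cancelled, so no chunk is lost when a keepalive is sent.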
Read this next
- SSE vs WebSockets in 2026
- Anthropic Claude API + Tool Use Guide
- LLM Cost Optimization in 2026
- Voice Agents and Realtime LLM APIs in 2026
If you want a working FastAPI + browser stream-with-cancel template, it’s at rajpoot.dev.