A health check that returns “OK” while the service can’t actually serve users is worse than no health check. This post covers how to design probes that don’t lie.
Liveness vs Readiness
Liveness: “Am I broken in a way restart would fix?”
- Returns 200 unless the process is genuinely stuck (deadlock, panic state).
- Failure → kubelet restarts the container.
- Should NOT depend on external services. A DB outage shouldn’t kill every pod.
Readiness: “Am I ready to serve traffic right now?”
- Returns 200 only when DB connections work, caches are warm, and dependencies are reachable.
- Failure → the pod is marked not Ready and removed from Service endpoints.
- Pod stays running; will be reinstated when ready.
# k8s deployment
livenessProbe:
  httpGet: { path: /livez, port: 8080 }
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet: { path: /readyz, port: 8080 }
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 2
Mixing the two is a classic anti-pattern. A slow-starting app whose liveness probe waits on readiness conditions fails liveness, gets restarted, fails again, and ends up in CrashLoopBackOff.
Implementation
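from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

# db_reachable() and redis_reachable() are the app's own async dependency checks.

@app.get("/livez")
async def livez():
    # Local-only: answers 200 whenever the process can handle a request at all.
    return {"status": "ok"}

@app.get("/readyz")
async def readyz():
    # Return 503 as soon as a required dependency is unreachable.
    if not await db_reachable():
        return JSONResponse({"status": "db_unreachable"}, status_code=503)
    if not await redis_reachable():
        return JSONResponse({"status": "redis_unreachable"}, status_code=503)
    return {"status": "ok"}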
Liveness is dirt-simple. Readiness checks dependencies that, if unreachable, mean the pod can’t serve.
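A readiness handler is only as fast as its slowest dependency check, so each check needs its own bound. The following is a minimal sketch, assuming the checks are async callables that return a bool; `check_with_timeout`, `fast_ping`, and `hung_ping` are hypothetical names used for illustration, not part of any library.

```python
import asyncio
from typing import Awaitable, Callable


async def check_with_timeout(
    check: Callable[[], Awaitable[bool]], timeout_s: float = 1.0
) -> bool:
    """Run a dependency check; a timeout or any exception counts as unreachable."""
    try:
        return bool(await asyncio.wait_for(check(), timeout=timeout_s))
    except Exception:
        # Timeouts, connection errors, etc. all mean "not reachable right now".
        return False


# Hypothetical stand-ins for real pings (e.g. SELECT 1 or a Redis PING).
async def fast_ping() -> bool:
    return True


async def hung_ping() -> bool:
    await asyncio.sleep(10)  # simulates a dependency that never answers in time
    return True


if __name__ == "__main__":
    print(asyncio.run(check_with_timeout(fast_ping)))
    print(asyncio.run(check_with_timeout(hung_ping, timeout_s=0.05)))
```

Inside `/readyz`, the per-check timeouts should add up to less than the probe's `timeoutSeconds` (3 in the config above), so the handler answers with a 503 instead of letting the probe time out.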
Startup probes
For slow-starting apps:
startupProbe:
  httpGet: { path: /readyz, port: 8080 }
  failureThreshold: 30 # 30 × 5s = 150s grace
  periodSeconds: 5
Until the startup probe succeeds, liveness and readiness probes don’t run. Once it succeeds, the normal cadence kicks in. This prevents “slow boot triggers liveness restart” loops.
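The grace window is just `periodSeconds × failureThreshold`, and the same arithmetic can be run backwards to pick a threshold from a measured boot time. A small sketch, with hypothetical helper names, that ignores `initialDelaySeconds` and per-probe timeouts:

```python
import math


def startup_grace_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Approximate time the startup probe may keep failing before a restart."""
    return period_seconds * failure_threshold


def min_failure_threshold(worst_boot_seconds: float, period_seconds: int) -> int:
    """Smallest failureThreshold whose grace window covers the worst observed boot."""
    return math.ceil(worst_boot_seconds / period_seconds)


if __name__ == "__main__":
    print(startup_grace_seconds(period_seconds=5, failure_threshold=30))
    print(min_failure_threshold(worst_boot_seconds=140, period_seconds=5))
```

In practice it is probably worth adding headroom on top of the computed minimum, since boot time tends to grow as the app does.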
Dependency checks done right
Readiness should reflect what your service genuinely needs to serve traffic. Common dependencies:
- DB connection.
- Redis / cache.
- Critical downstream APIs.
But: a partial outage shouldn’t drop all your pods. Strategies:
- Cached health: probe deps every N seconds; cache result; readiness reads cache.
- Soft fail: degrade gracefully — readiness returns OK but a header indicates degraded mode.
- Explicit per-feature: /readyz/orders checks order-flow deps; /readyz/payments checks payment-flow deps.
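The cached-health strategy from the list above can be sketched roughly as follows. This is one possible shape, not a definitive implementation: `CachedHealth`, its parameters, and the 1-second per-check timeout are all assumptions made for illustration.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, List, Optional


class CachedHealth:
    """Polls dependency checks in the background; readiness only reads the cache."""

    def __init__(
        self,
        checks: Dict[str, Callable[[], Awaitable[bool]]],
        interval_s: float = 5.0,
        max_age_s: float = 15.0,
    ) -> None:
        self._checks = checks
        self._interval_s = interval_s
        self._max_age_s = max_age_s
        self._results: Dict[str, bool] = {}
        self._updated_at: Optional[float] = None  # monotonic time of last refresh

    async def refresh_once(self) -> None:
        results: Dict[str, bool] = {}
        for name, check in self._checks.items():
            try:
                results[name] = bool(await asyncio.wait_for(check(), timeout=1.0))
            except Exception:
                results[name] = False  # timeout or error counts as failing
        self._results = results
        self._updated_at = time.monotonic()

    async def run_forever(self) -> None:
        # Intended to run as a background task for the life of the process.
        while True:
            await self.refresh_once()
            await asyncio.sleep(self._interval_s)

    def failing(self) -> List[str]:
        """Names of failing deps; a missing or stale cache counts as failing."""
        if self._updated_at is None:
            return ["stale_cache"]
        if time.monotonic() - self._updated_at > self._max_age_s:
            return ["stale_cache"]
        return [name for name, ok in self._results.items() if not ok]
```

With this shape, `/readyz` returns 503 whenever `failing()` is non-empty and never touches a dependency itself, so a slow database cannot slow down the probe. The staleness check matters: if the background task dies, the cache should not keep reporting the last good result.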
SLO-aligned health
A pod that’s serving but failing 50% of its requests is unhealthy. Readiness can be based on real metrics:
@app.get("/readyz")
async def readyz():
    if recent_error_rate() > 0.10:
        return JSONResponse({"status": "high_error_rate"}, status_code=503)
    return {"status": "ok"}
Combined with SLOs and Error Budgets, readiness becomes self-correcting under partial failures.
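The post doesn’t show how `recent_error_rate()` is computed. One plausible way to back it is a sliding time window over recent requests; the sketch below assumes that approach, and `ErrorRateWindow`, `window_s`, and `min_requests` are hypothetical names.

```python
import time
from collections import deque
from typing import Deque, Optional, Tuple


class ErrorRateWindow:
    """Tracks the fraction of failed requests over the last window_s seconds."""

    def __init__(self, window_s: float = 60.0, min_requests: int = 20) -> None:
        self._window_s = window_s
        self._min_requests = min_requests
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, is_error)

    def record(self, is_error: bool, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._events.append((now, is_error))
        self._evict(now)

    def rate(self, now: Optional[float] = None) -> float:
        now = time.monotonic() if now is None else now
        self._evict(now)
        total = len(self._events)
        if total < self._min_requests:
            return 0.0  # too few samples to judge; treat as healthy
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / total

    def _evict(self, now: float) -> None:
        # Drop events that have aged out of the window.
        cutoff = now - self._window_s
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
```

A request middleware would call `record()` once per response, and `recent_error_rate()` would return `window.rate()`. The `min_requests` floor is a design choice: a pod pulled from endpoints stops receiving traffic, its window drains, and it becomes ready again, which is what makes the check self-correcting but can also cause flapping if the underlying fault persists.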
Common mistakes
1. Liveness checks DB
DB blip → all pods restart → service down. Catastrophic. Liveness is local-only.
2. Readiness too aggressive
A 100ms blip → pod removed from endpoints → traffic shifted → another blip → another shift. Cascading. Use failureThreshold > 1.
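To see why the threshold matters, here is a simplified model of how consecutive failures map to the Ready state. It is a sketch of the counting logic only, not kubelet’s actual implementation, and `readiness_states` is a hypothetical helper.

```python
from typing import Iterable, List


def readiness_states(
    probe_results: Iterable[bool],
    failure_threshold: int = 2,
    success_threshold: int = 1,
) -> List[bool]:
    """Replay probe results and return the Ready state after each one."""
    ready = True
    failures = 0
    successes = 0
    states: List[bool] = []
    for ok in probe_results:
        if ok:
            successes += 1
            failures = 0  # any success resets the failure streak
            if not ready and successes >= success_threshold:
                ready = True
        else:
            failures += 1
            successes = 0  # any failure resets the success streak
            if ready and failures >= failure_threshold:
                ready = False
        states.append(ready)
    return states


if __name__ == "__main__":
    blips = [True, False, True, False, False, True]
    print(readiness_states(blips, failure_threshold=1))
    print(readiness_states(blips, failure_threshold=2))
```

With a threshold of 1, the isolated failure at the second probe already pulls the pod out of rotation; with a threshold of 2, only the two back-to-back failures do.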
3. Same endpoint for both
Liveness and readiness answer different questions, so a shared endpoint ends up wrong for one of them. Use different paths.
4. No timeout
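Kubernetes defaults timeoutSeconds to 1 second, so the probe itself won’t hang forever, but a dependency check inside the handler can. Set timeoutSeconds explicitly and give each dependency check its own shorter timeout.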
5. Probe too cheap to detect issues
A handler that does nothing but return {"status": "ok"} reports OK regardless of the service’s state. Make readiness reflect actual readiness.
What I’d ship today
For every service:
startupProbe: periodSeconds=5, failureThreshold=30 → /readyz
livenessProbe: periodSeconds=30, failureThreshold=3 → /livez (always 200 unless broken)
readinessProbe: periodSeconds=10, failureThreshold=2 → /readyz (deps + recent-error-rate)
Three probes. Each does one thing. No restart loops. No accidental traffic to broken pods.
Read this next
- SLOs and Error Budgets for App Developers
- Circuit Breakers, Bulkheads, and Backpressure
- Kubernetes for App Developers
- Zero-Downtime Deployments in 2026
If you want my probe-template across FastAPI / Hono / Axum, it’s at rajpoot.dev.