A health check that returns “OK” while the service can’t actually serve users is worse than no health check. This post covers how to design probes that don’t lie.

Liveness vs Readiness

Liveness: “Am I broken in a way restart would fix?”

  • Returns 200 unless the process is genuinely stuck (deadlock, panic state).
  • Failure → kubelet restarts the pod.
  • Should NOT depend on external services. A DB outage shouldn’t kill every pod.

Readiness: “Am I ready to serve traffic right now?”

  • Returns 200 only when DB connections work, caches are warm, dependencies reachable.
  • Failure → kubelet marks the pod not ready, and it is removed from service endpoints.
  • Pod stays running; will be reinstated when ready.

# k8s deployment
livenessProbe:
  httpGet: { path: /livez, port: 8080 }
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet: { path: /readyz, port: 8080 }
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 2

Mixing the two is a classic anti-pattern. If liveness waits on readiness conditions, a slow-starting app fails liveness, gets restarted, fails again, and ends up in CrashLoopBackOff.

Implementation

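from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

# db_reachable() and redis_reachable() are assumed async helpers that return a bool.

@app.get("/livez")
async def livez():
    return {"status": "ok"}        # local-only: 200 whenever the process can answer

@app.get("/readyz")
async def readyz():
    if not await db_reachable():
        return JSONResponse({"status": "db_unreachable"}, status_code=503)
    if not await redis_reachable():
        return JSONResponse({"status": "redis_unreachable"}, status_code=503)
    return {"status": "ok"}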

Liveness is dirt-simple. Readiness checks dependencies that, if unreachable, mean the pod can’t serve.
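An always-200 liveness handler proves only that the HTTP server can answer. To catch the “genuinely stuck” case, one option is a heartbeat that the main work loop updates and that /livez inspects. The sketch below is a minimal, standard-library-only illustration; the `Heartbeat` class, the `max_age` parameter, and `livez_status` are hypothetical names, not part of any framework.

```python
import time


class Heartbeat:
    """Tracks when the main work loop last made progress (hypothetical helper)."""

    def __init__(self, max_age: float, clock=time.monotonic):
        self.max_age = max_age      # seconds of silence before we call it stuck
        self._clock = clock         # injectable so the logic can be tested
        self._last_beat = clock()

    def beat(self) -> None:
        """Call from the work loop each time it completes an iteration."""
        self._last_beat = self._clock()

    def is_alive(self) -> bool:
        """True while the last beat is recent enough."""
        return (self._clock() - self._last_beat) <= self.max_age


def livez_status(hb: Heartbeat) -> int:
    """HTTP status a /livez handler could return: local state only, no dependencies."""
    return 200 if hb.is_alive() else 503
```

The check still touches nothing outside the process, so a DB outage cannot trigger a restart. Pick `max_age` well above the longest legitimate pause in the loop, or the probe will restart healthy pods.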

Startup probes

For slow-starting apps:

startupProbe:
  httpGet: { path: /readyz, port: 8080 }
  failureThreshold: 30        # 30 × 5s = 150s grace
  periodSeconds: 5

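While the startup probe is failing, liveness and readiness probes aren’t run. Once it succeeds, the normal cadence kicks in; if it never succeeds within failureThreshold × periodSeconds, the container is killed and restarted. This prevents “slow boot triggers liveness restart” loops.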
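The grace period in the comment above is plain arithmetic, and it can be run in reverse to pick a failureThreshold from an observed boot time. A small sketch; both function names and the 1.5× safety margin are assumptions for illustration, not Kubernetes defaults.

```python
import math


def startup_budget_seconds(failure_threshold: int, period_seconds: int,
                           initial_delay_seconds: int = 0) -> int:
    """Approximate time a container gets to pass its startup probe."""
    return initial_delay_seconds + failure_threshold * period_seconds


def failure_threshold_for(worst_boot_seconds: float, period_seconds: int,
                          margin: float = 1.5) -> int:
    """Smallest failureThreshold that covers the worst observed boot, with headroom."""
    return math.ceil(worst_boot_seconds * margin / period_seconds)


print(startup_budget_seconds(30, 5))   # 150, matching the config above
print(failure_threshold_for(90, 5))    # 27
```

A budget that is too small reintroduces the restart loop; one that is too generous only delays detection of a pod that will never start.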

Dependency checks done right

Readiness should reflect what your service genuinely needs to serve traffic. Common dependencies:

  • DB connection.
  • Redis / cache.
  • Critical downstream APIs.

But: a partial outage shouldn’t drop all your pods. Strategies:

  • Cached health: probe deps every N seconds; cache result; readiness reads cache.
  • Soft fail: degrade gracefully — readiness returns OK but a header indicates degraded mode.
  • Explicit per-feature: /readyz/orders checks order-flow deps; /readyz/payments checks payment-flow deps.
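The cached-health strategy can be sketched with a background task that refreshes results and a readiness handler that only reads them. This is a minimal illustration using asyncio; `CachedHealth`, `max_age`, and the check names are hypothetical, and the checks themselves are whatever async callables your service already has.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, List, Optional


class CachedHealth:
    """Runs dependency checks in the background; readiness reads the cached result."""

    def __init__(self, checks: Dict[str, Callable[[], Awaitable[bool]]],
                 max_age: float = 15.0):
        self._checks = checks
        self._max_age = max_age                  # older results are treated as failing
        self._results: Dict[str, bool] = {}
        self._checked_at: Optional[float] = None

    async def refresh(self) -> None:
        """Run every check once; an exception counts as a failure."""
        results: Dict[str, bool] = {}
        for name, check in self._checks.items():
            try:
                results[name] = bool(await check())
            except Exception:
                results[name] = False
        self._results = results
        self._checked_at = time.monotonic()

    async def run_forever(self, interval: float) -> None:
        """Background loop: start it once with asyncio.create_task at app startup."""
        while True:
            await self.refresh()
            await asyncio.sleep(interval)

    def failing(self) -> List[str]:
        """Names of failing dependencies, or ["stale"] if the cache is missing or too old."""
        if self._checked_at is None:
            return ["stale"]
        if time.monotonic() - self._checked_at > self._max_age:
            return ["stale"]
        return [name for name, ok in self._results.items() if not ok]
```

A /readyz handler would return 503 when `failing()` is non-empty and 200 otherwise. The staleness guard matters: if the refresh loop itself hangs or dies, the cache stops being trusted instead of reporting the last good result forever.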

SLO-aligned health

A pod that’s serving but failing 50% of its requests is unhealthy. Readiness can be based on real metrics:

@app.get("/readyz")
async def readyz():
    if recent_error_rate() > 0.10:
        return JSONResponse({"status": "high_error_rate"}, 503)
    return {"status": "ok"}
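The handler above leaves `recent_error_rate()` undefined. One way to back it is a sliding window over recent requests, sketched below with the standard library only. `ErrorRateWindow`, `min_requests`, and `readyz_status` are hypothetical names, and the 60-second window is an assumption; request middleware would call `record()` once per response.

```python
import time
from collections import deque


class ErrorRateWindow:
    """Sliding-window error rate over the last `window_seconds` of requests."""

    def __init__(self, window_seconds: float = 60.0, min_requests: int = 20,
                 clock=time.monotonic):
        self.window_seconds = window_seconds
        self.min_requests = min_requests     # below this, the sample is too small to judge
        self._clock = clock
        self._events = deque()               # (timestamp, is_error) pairs, oldest first

    def record(self, is_error: bool) -> None:
        """Call once per finished request."""
        self._events.append((self._clock(), is_error))
        self._evict()

    def _evict(self) -> None:
        """Drop events that have aged out of the window."""
        cutoff = self._clock() - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def rate(self) -> float:
        """Fraction of errors in the window; 0.0 when there is too little traffic."""
        self._evict()
        total = len(self._events)
        if total < self.min_requests:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / total


def readyz_status(window: ErrorRateWindow, threshold: float = 0.10) -> int:
    """HTTP status for /readyz based on the recent error rate."""
    return 503 if window.rate() > threshold else 200
```

The `min_requests` floor does double duty: it stops a single failed request on a quiet pod from flipping readiness, and it lets a pod that was pulled from rotation rejoin once its window has drained, since a pod with no traffic records no new errors.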

Combined with SLOs and Error Budgets, readiness becomes self-correcting under partial failures.

Common mistakes

1. Liveness checks DB

DB blip → all pods restart → service down. Catastrophic. Liveness is local-only.

2. Readiness too aggressive

A 100ms blip → pod removed from endpoints → traffic shifted → another blip → another shift. Cascading. Use failureThreshold > 1.
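The effect of failureThreshold is easy to see in a small simulation. The function below is a simplified model of consecutive-result counting, not kubelet's actual code; `readiness_states` is a hypothetical name, and it assumes the pod starts out ready.

```python
from typing import Iterable, List


def readiness_states(results: Iterable[bool], failure_threshold: int = 3,
                     success_threshold: int = 1) -> List[bool]:
    """Ready/not-ready state after each probe result, counting consecutive outcomes."""
    ready = True          # assumption: the pod is already serving when we start
    failures = 0
    successes = 0
    states: List[bool] = []
    for ok in results:
        if ok:
            successes += 1
            failures = 0
            if not ready and successes >= success_threshold:
                ready = True
        else:
            failures += 1
            successes = 0
            if ready and failures >= failure_threshold:
                ready = False
        states.append(ready)
    return states


blips = [True, False, True, False, True]
print(readiness_states(blips, failure_threshold=1))  # [True, False, True, False, True]
print(readiness_states(blips, failure_threshold=2))  # [True, True, True, True, True]
```

With a threshold of 1 the pod flaps in and out of rotation on every blip; with 2 the same sequence never removes it, while two failures in a row still do.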

3. Same endpoint for both

Liveness and readiness need different behavior, so a shared endpoint forces one of them to be wrong, and bugs follow. Use different paths.

4. No timeout

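Without a bound, a stalled dependency call can hang the handler until the kubelet gives up on the probe (timeoutSeconds defaults to 1). Set timeoutSeconds explicitly, and put a timeout on each dependency check.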
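Inside the handler, each dependency check can be bounded with `asyncio.wait_for` so that one stalled call cannot hold up the whole probe. A minimal sketch; `bounded_check` and `all_ready` are hypothetical helpers, and the checks are assumed to be async callables returning a bool.

```python
import asyncio
from typing import Awaitable, Callable, Dict


async def bounded_check(check: Callable[[], Awaitable[bool]], timeout: float) -> bool:
    """Run one dependency check; too slow or raising both count as unreachable."""
    try:
        return bool(await asyncio.wait_for(check(), timeout=timeout))
    except asyncio.TimeoutError:
        return False
    except Exception:
        return False


async def all_ready(checks: Dict[str, Callable[[], Awaitable[bool]]],
                    timeout: float) -> Dict[str, bool]:
    """Run all checks concurrently so total latency is about one timeout, not the sum."""
    results = await asyncio.gather(
        *(bounded_check(check, timeout) for check in checks.values())
    )
    return dict(zip(checks.keys(), results))
```

Keep the per-check timeout below the probe's timeoutSeconds, so the handler can answer with a 503 that names the slow dependency rather than being cut off by the kubelet.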

5. Probe too cheap to detect issues

A readiness handler that just does return {"status": "ok"} reports OK regardless of the service’s state. Make readiness reflect actual readiness.

What I’d ship today

For every service:

startupProbe:    periodSeconds=5,  failureThreshold=30  → /readyz
livenessProbe:   periodSeconds=30, failureThreshold=3   → /livez (always 200 unless broken)
readinessProbe:  periodSeconds=10, failureThreshold=2   → /readyz (deps + recent-error-rate)

Three probes. Each does one thing. No restart loops. No accidental traffic to broken pods.

Read this next

If you want my probe-template across FastAPI / Hono / Axum, it’s at rajpoot.dev.

