Zero-downtime deployments aren’t magic. They’re a small set of patterns applied consistently. This post is the working playbook.
The three shapes
Rolling
Replace pods one (or N) at a time. Default for Kubernetes Deployments.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%
    maxUnavailable: 0
Simple. Works. Slightly slower to fully roll out than blue/green.
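To make the two knobs concrete, here is a rough Python sketch of how the percentages translate into pod counts. It assumes the rounding Kubernetes documents for Deployments (maxSurge rounds up, maxUnavailable rounds down). The function name `rolling_bounds` and the replica count of 10 are illustrative only.

```python
import math
from typing import Tuple

def rolling_bounds(replicas: int, max_surge_pct: int, max_unavailable_pct: int) -> Tuple[int, int]:
    """Return (max total pods, min available pods) during a rolling update."""
    surge = math.ceil(replicas * max_surge_pct / 100)               # rounds up
    unavailable = math.floor(replicas * max_unavailable_pct / 100)  # rounds down
    return replicas + surge, replicas - unavailable

max_total, min_available = rolling_bounds(10, 25, 0)
print(max_total, min_available)  # 13 10
```

With maxUnavailable at 0, capacity never dips below the desired replica count; the cost is a few extra pods while the rollout is in progress.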
Blue/Green
Two complete environments. Switch traffic from blue to green instantly.
- Pros: Atomic cutover. Easy rollback (point traffic back).
- Cons: 2× resources during deploy.
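One common way to implement the switch on Kubernetes is to flip a label selector on the Service that fronts both environments. The sketch below only builds the patch body. The label key `version`, the values `blue` and `green`, and the helper name `cutover_patch` are assumptions for illustration, and applying the patch (for example with `kubectl patch`) is left out.

```python
import json

COLORS = ("blue", "green")

def cutover_patch(live: str) -> dict:
    """Build a Service patch that points the selector at the idle color."""
    if live not in COLORS:
        raise ValueError(f"unknown color: {live!r}")
    target = "green" if live == "blue" else "blue"
    return {"spec": {"selector": {"version": target}}}

print(json.dumps(cutover_patch("blue")))
# {"spec": {"selector": {"version": "green"}}}
```

Rollback is the same call with the colors swapped, which is why the cutover is easy to reverse.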
Canary
Send a small % of traffic to the new version. Watch metrics. Ramp up if healthy.
- Pros: Catches regressions on real traffic before full rollout.
- Cons: More complex; needs metrics-driven decision making.
For user-facing services in 2026, canary is the default. Tools like Argo Rollouts and Flagger automate the ramp.
Argo Rollouts canary
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata: { name: api }
spec:
  replicas: 10
  # selector and pod template omitted for brevity
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - setWeight: 25
        - pause: { duration: 5m }
        - analysis:
            templates: [{ templateName: error-rate }]
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100
Combined with an AnalysisTemplate that watches Prometheus error rate / latency, the rollout pauses or aborts automatically on regressions.
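The decision an analysis step makes can be sketched in a few lines of Python. This is not how Argo Rollouts is implemented; it only illustrates the gate, assuming you already have request and error counts for the canary (in practice they would come from a Prometheus query). The 1% threshold, the minimum sample size, and the names `CanaryStats` and `judge` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class CanaryStats:
    requests: int
    errors: int

def judge(stats: CanaryStats, max_error_rate: float = 0.01, min_requests: int = 100) -> str:
    """Return 'promote', 'abort', or 'wait' for one analysis interval."""
    if stats.requests < min_requests:
        return "wait"  # too little traffic to decide either way
    error_rate = stats.errors / stats.requests
    return "abort" if error_rate > max_error_rate else "promote"

print(judge(CanaryStats(requests=5000, errors=12)))   # promote
print(judge(CanaryStats(requests=5000, errors=200)))  # abort
print(judge(CanaryStats(requests=20, errors=5)))      # wait
```

The "wait" branch matters at low canary weights, where a handful of failed requests can otherwise look like a large error rate.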
For SLOs and Error Budgets wired to deploys, this is the production pattern.
Feature flags for app changes
Code change without traffic shift:
def checkout(user):
    # Route to the new flow only for users the flag selects.
    if flag("new-checkout-flow", user):
        return new_checkout()
    return old_checkout()
Deploy the code dark, then ramp via the flag. No infrastructure change; controlled exposure. See Feature Flags and Progressive Delivery.
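A percentage ramp behind a `flag()` call can be sketched with stable hashing, so a given user gets the same answer on every request and stays enabled as the percentage grows. This is a minimal illustration, not any particular flag vendor's API: the in-memory `ROLLOUT` table is hypothetical, and this version takes a user ID string rather than a user object.

```python
import hashlib

# Hypothetical in-memory flag table: flag name -> rollout percentage (0-100).
ROLLOUT = {"new-checkout-flow": 10}

def flag(name: str, user_id: str) -> bool:
    """Return True if this user falls inside the flag's rollout percentage."""
    percent = ROLLOUT.get(name, 0)  # unknown flags are off
    digest = hashlib.sha256(f"{name}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < percent

enabled = sum(flag("new-checkout-flow", f"user-{i}") for i in range(10_000))
print(enabled)  # roughly a tenth of the users at 10%
```

Hashing the flag name together with the user ID keeps buckets independent across flags, so the same users are not always first in line for every experiment.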
Schema migrations — the multi-step pattern
The classic mistake: dropping a column in the same release that stops using it. During a rollout both versions are live, so new code may run against the old schema and old pods against the new one. Bugs.
The right pattern, expand-and-contract:
- Expand. Add the new shape (nullable). Old code ignores it; new code writes both shapes and uses the new one if present.
- Backfill. Copy old rows into the new shape.
- Switch reads. Both shapes work; new code reads the new shape.
- Switch writes. New writes go only to the new shape; from here on old code is incompatible, so make sure none is still running.
- Contract. Remove the old shape.
Each step is its own deploy. None block.
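The sequence can be walked through end to end with an in-memory SQLite database standing in for the real one. The table and column names (`users`, `fullname`, `display_name`) are made up for the example, and a production backfill would run in batches rather than as a single UPDATE.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
db.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# 1. Expand: add the new column as nullable; old code never notices it.
db.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# New code writes both shapes while old pods are still around.
db.execute(
    "INSERT INTO users (fullname, display_name) VALUES (?, ?)",
    ("Grace Hopper", "Grace Hopper"),
)

# 2. Backfill: copy old rows into the new shape.
db.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

# 3. Switch reads: new code reads the new column only.
names = [row[0] for row in db.execute("SELECT display_name FROM users ORDER BY id")]
print(names)  # ['Ada Lovelace', 'Grace Hopper']

# 4-5. Switch writes, then contract (drop fullname) in later deploys,
# once no running code references the old column.
```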
For the Postgres-side patterns, see Postgres CDC and PostgreSQL 18 Features.
Connection draining
When a pod is removed:
- Remove it from the Service endpoints: stop sending it new requests.
- Send SIGTERM; the app starts shutting down gracefully.
- Allow in-flight requests to complete, within terminationGracePeriodSeconds.
- SIGKILL if it hasn’t exited by then.
Set terminationGracePeriodSeconds: 60 (or your worst-case request length).
App must:
- Trap SIGTERM.
- Drain in-flight requests.
- Close DB connections.
- Exit.
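The app-side checklist above can be sketched as a small in-flight tracker. This is a framework-agnostic illustration, not a drop-in for any particular web server: the class name `Drainer` and its methods are hypothetical, and a real server would call `start_request` and `finish_request` from its request middleware.

```python
import signal
import threading

class Drainer:
    """Tracks in-flight requests and waits for them after SIGTERM."""

    def __init__(self) -> None:
        self.shutting_down = False
        self._in_flight = 0
        self._cond = threading.Condition()

    def handle_sigterm(self, signum, frame) -> None:
        # Only flip a flag here; signal handlers should do as little as possible.
        self.shutting_down = True

    def start_request(self) -> bool:
        with self._cond:
            if self.shutting_down:
                return False  # caller should reject, e.g. with a 503
            self._in_flight += 1
            return True

    def finish_request(self) -> None:
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def wait_drained(self, timeout: float) -> bool:
        """Block until nothing is in flight; returns False if the timeout hit."""
        with self._cond:
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout)

drainer = Drainer()
signal.signal(signal.SIGTERM, drainer.handle_sigterm)
# On shutdown: call drainer.wait_drained(...) with a timeout shorter than the
# pod's grace period, then close DB connections and exit.
```

Keeping the drain timeout below terminationGracePeriodSeconds leaves room to close connections before SIGKILL arrives.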
Rollback strategy
Always have one. Fast.
- Argo CD: revert the Git commit; auto-sync deploys old version.
- Plain Deployment: kubectl rollout undo deployment/api.
- Feature flag: turn off the new behavior.
Practice rollback in non-prod. Speed matters in incidents.
CI gating
Before deploy:
- Type checks pass.
- Unit + integration tests pass.
- Build image succeeds.
- Image scanned (no high CVEs).
- Image signed.
- For agents: evals pass.
For supply-chain security, see Software Supply Chain Security.
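As one example of a gate, the image-scan check can be reduced to a small filter over scanner findings. The findings format here (a list of dicts with `id` and `severity`) is an assumption rather than any specific scanner's output, the CVE IDs are placeholders, and treating CRITICAL as blocking alongside HIGH is also an assumption.

```python
from typing import Dict, List

BLOCKING = {"HIGH", "CRITICAL"}  # assumption: critical findings block too

def blocking_cves(findings: List[Dict[str, str]]) -> List[str]:
    """Return IDs of findings severe enough to fail the pipeline."""
    return [f["id"] for f in findings if f["severity"].upper() in BLOCKING]

findings = [
    {"id": "CVE-0000-0001", "severity": "LOW"},   # placeholder ID
    {"id": "CVE-0000-0002", "severity": "HIGH"},  # placeholder ID
]
blocked = blocking_cves(findings)
print(blocked)  # ['CVE-0000-0002']
```

A CI step would exit non-zero whenever the returned list is non-empty, which stops the deploy before the image is signed or shipped.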
What I’d ship today
For a typical 2026 web service:
- Argo CD for GitOps.
- Argo Rollouts for canary.
- Feature flags for app-level rollouts.
- Expand-and-contract for schema migrations.
- Connection draining with realistic grace periods.
- One-click rollback documented and tested.
Boring. Reliable.
Read this next
- GitOps with Argo CD and Flux Explained
- Feature Flags and Progressive Delivery
- SLOs and Error Budgets for App Developers
- Incident Response and Blameless Postmortems
If you want my Argo Rollouts + AnalysisTemplate templates, they’re at rajpoot.dev.