Deploy strategies are means, not ends. The question is “what failure mode do you want to mitigate?” Rolling is fine; canary catches mistakes earlier; blue/green enables instant cutover. This post is the working playbook.
Rolling deploy
K8s native; default for Deployments.
Replicas: [v1, v1, v1, v1]
Step 1: [v1, v1, v1, v2] # one v1 replaced by v2
Step 2: [v1, v1, v2, v2] # another v1 replaced by v2
...
Final: [v2, v2, v2, v2]
Pros: simple, no extra infra, no traffic split logic. Cons: bad version slowly takes over before metrics catch up; rollback time same as deploy time.
For most services: rolling is enough.
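The replica sequence above can be sketched as a small simulation. This is a simplified model, not how the Kubernetes controller is implemented: the function name `rolling_update` and the `max_surge` parameter are illustrative, and each step is assumed to bring new pods up before removing the same number of old ones.

```python
from typing import Iterator, List


def rolling_update(replicas: int, max_surge: int = 1) -> Iterator[List[str]]:
    """Yield the pod list after each step of a simplified rolling update.

    Assumption: each step starts up to `max_surge` new pods, waits for them
    to be ready, then removes the same number of old pods.
    """
    if replicas < 1 or max_surge < 1:
        raise ValueError("replicas and max_surge must be >= 1")
    old, new = replicas, 0
    yield ["v1"] * old
    while old > 0:
        batch = min(max_surge, old)
        new += batch  # surge: new pods come up first
        old -= batch  # then the same number of old pods are terminated
        yield ["v1"] * old + ["v2"] * new
```

Calling `list(rolling_update(4))` walks through the same states as the diagram. A larger `max_surge` finishes in fewer steps, which also means a bad version takes over faster.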
Canary deploy
A small % of traffic gets the new version; the rest stays on the old. Watch metrics; promote or roll back.
v1: 95% traffic
v2: 5% traffic
[wait + analyze]
v1: 80%, v2: 20%
[wait + analyze]
v1: 50%, v2: 50%
... promote to 100%
Pros: bad versions affect fewer users; metrics-driven promotion. Cons: needs traffic-splitting infra; analysis logic; longer total deploy time.
For high-traffic public APIs: canary.
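The promote-or-roll-back loop can be sketched in a few lines. This is a minimal sketch, not a real controller: `run_canary` and the `is_healthy` callback are hypothetical names, and the callback stands in for the "wait + analyze" step.

```python
from typing import Callable, Sequence


def run_canary(weights: Sequence[int], is_healthy: Callable[[int], bool]) -> int:
    """Walk the canary through `weights`; return the final v2 traffic percentage.

    `is_healthy(weight)` should return False when metrics observed at that
    weight breach the threshold.
    """
    for weight in weights:
        # In a real system: set the traffic split here, then wait for metrics.
        if not is_healthy(weight):
            return 0  # roll back: all traffic returns to v1
    return 100  # every step passed: promote v2 to all traffic
```

For example, `run_canary([5, 20, 50], check)` stops at the first failing step, so a regression that shows up at 5% never reaches the 20% cohort.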
Blue/Green
Two full environments; switch traffic atomically.
Blue (current):     ────┐
                        ├──> [LB] → users
Green (deploying):  ────┘
Deploy to Green; verify; switch LB to Green; Blue becomes standby.
Pros: instant cutover; instant rollback (switch back). Cons: 2× infra during deploy; stateful migrations need care.
For low-frequency big releases / DB schema cuts: blue/green.
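The cutover and the rollback are the same operation: flipping one pointer. A minimal sketch, assuming a hypothetical `BlueGreenRouter` class that stands in for the load balancer configuration:

```python
from dataclasses import dataclass


@dataclass
class BlueGreenRouter:
    """Tracks which of the two environments receives live traffic."""

    live: str = "blue"

    @property
    def standby(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def cutover(self, standby_healthy: bool) -> str:
        """Point traffic at the standby environment if it passed verification."""
        if not standby_healthy:
            raise RuntimeError(
                f"{self.standby} failed verification; staying on {self.live}"
            )
        self.live = self.standby
        return self.live

    def rollback(self) -> str:
        """Rolling back is the same pointer flip in the other direction."""
        self.live = self.standby
        return self.live
```

The sketch deliberately ignores state: if both environments share a database, flipping the pointer back does not undo writes made by the new version.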
Argo Rollouts
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata: { name: api }
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100
      analysis:
        templates: [{ templateName: success-rate }]
        startingStep: 2
Automated canary with metric analysis. Rolls back automatically when the analysis fails, i.e. when an SLO is breached.
# AnalysisTemplate
metrics:
  - name: success-rate
    successCondition: result[0] >= 0.99
    provider:
      prometheus:
        query: |
          sum(rate(http_requests_total{status!~"5.."}[5m])) /
          sum(rate(http_requests_total[5m]))
Bad version → metric drops → auto-rollback. Powerful when wired to real SLIs.
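The check itself is simple arithmetic: non-5xx requests divided by all requests, compared against 0.99. A minimal sketch of that calculation follows. The function names are illustrative, and how a zero-traffic window should be treated is an assumption here, not something the template above specifies.

```python
def success_rate(total_requests: float, server_errors: float) -> float:
    """Fraction of requests that did not return a 5xx status."""
    if total_requests <= 0:
        # Assumption: with no traffic the ratio is undefined, so refuse to
        # treat the window as a pass.
        raise ValueError("no traffic in the window; success rate is undefined")
    return (total_requests - server_errors) / total_requests


def passes_slo(
    total_requests: float, server_errors: float, threshold: float = 0.99
) -> bool:
    """Mirror of the `result[0] >= 0.99` condition."""
    return success_rate(total_requests, server_errors) >= threshold
```

At 1,000 requests, 5 server errors passes and 20 fails. At 5% canary weight on a low-traffic service the window may hold only a handful of requests, so a single error can flip the result.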
Flagger
Similar concept to Argo Rollouts. Mesh- and ingress-aware (Istio, Linkerd, Traefik). Pick between the two based on the mesh or ingress you already run.
Service mesh integration
For canary, traffic splitting needs:
- Istio / Linkerd: native traffic split.
- Argo Rollouts + nginx-ingress: weight- or header-based, via canary annotations.
- Cloud LB: weighted target groups.
Without traffic-splitting capability, canary degrades to “small replica %” — different (and weaker) signal.
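A quick calculation shows why the replica-based fallback is coarser. This sketch assumes requests are spread evenly across ready pods; the function names are illustrative.

```python
import math


def min_canary_percent(total_replicas: int) -> float:
    """Smallest non-zero traffic share (in %) when one replica runs v2."""
    if total_replicas < 1:
        raise ValueError("total_replicas must be >= 1")
    return 100 / total_replicas


def replicas_needed(target_percent: int) -> int:
    """Total replicas required so one v2 pod gets at most `target_percent`."""
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in (0, 100]")
    return math.ceil(100 / target_percent)
```

With 4 replicas the smallest possible canary is 25% of traffic, and a 5% canary needs 20 replicas. The share also drifts whenever the autoscaler changes the replica count.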
Header-based canary
Header: X-Canary: 1 → routed to v2.
No header → v1.
Internal users / specific cohort gets v2 first; promote based on feedback.
Useful for testing under real prod load with internal-only blast radius.
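The routing rule can be sketched as a pure function. In practice this decision lives in the ingress or mesh configuration; `pick_backend` is a hypothetical name used only to illustrate the rule.

```python
from typing import Mapping


def pick_backend(headers: Mapping[str, str]) -> str:
    """Route requests carrying `X-Canary: 1` to v2, everything else to v1."""
    # HTTP header names are case-insensitive, so normalize before comparing.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    return "v2" if normalized.get("x-canary") == "1" else "v1"
```

Anything other than the exact value `1` falls through to v1, so a malformed header never lands a user on the canary by accident.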
Database changes
The hardest part of deploys. Strategies:
- Backwards-compatible migrations (expand-contract). See DB Migrations.
- Feature flags decoupling code from schema.
- Two-phase rollouts: deploy migration first; deploy code that uses it second.
Blue/green doesn’t help if schema is shared between blue and green (which it usually is).
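The expand and backfill phases of an expand-contract migration can be sketched with an in-memory SQLite database. The table and column names (`users`, `full_name`, `display_name`) are made up for illustration, and the contract step is left as a comment because it belongs in a later, separate deploy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)")
conn.execute("INSERT INTO users (full_name) VALUES ('Ada Lovelace')")

# Expand: add the new column as nullable so old code, which does not know
# about it, can keep inserting rows.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Old code is still running during the rollout and writes only full_name.
conn.execute("INSERT INTO users (full_name) VALUES ('Grace Hopper')")

# New code writes both columns while the two versions overlap.
conn.execute(
    "INSERT INTO users (full_name, display_name) VALUES (?, ?)",
    ("Alan Turing", "Alan Turing"),
)

# Backfill rows written before or during the overlap.
conn.execute("UPDATE users SET display_name = full_name WHERE display_name IS NULL")

missing = conn.execute(
    "SELECT COUNT(*) FROM users WHERE display_name IS NULL"
).fetchone()[0]

# Contract (a later, separate deploy): once no running code reads full_name
# and `missing` is 0, drop the old column.
```

At every point in this sequence both the old and the new code can run against the schema, which is what makes rollback of the code safe.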
Rollback
| Strategy | Rollback time |
|---|---|
| Rolling | Minutes (re-deploy old) |
| Canary | Seconds (route 100% to old) |
| Blue/Green | Seconds (LB switch) |
Plan for rollback before deploying. The first 10 minutes after deploy are the most dangerous; have rollback ready.
Feature flags
Orthogonal to deploy strategy. Deploy code; toggle on at runtime.
if flags.is_enabled("new_search", user_id):
    return new_search(...)
return old_search(...)
Gradual rollout per user/segment. Decouples release from deploy. See Feature Flags 2026.
Combine: rolling deploy + feature flag = deploy safely, enable gradually.
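One common way to do a gradual per-user rollout is stable hash bucketing. This is a minimal sketch, not the API of any particular flag service; `bucket`, this `is_enabled` signature, and the `rollout_percent` parameter are assumptions.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """True when the user's bucket falls inside the rollout percentage."""
    return bucket(flag, user_id) < rollout_percent
```

Because the bucket depends only on the flag name and user id, a user who sees the feature at 10% keeps seeing it at 50%. Including the flag name in the hash keeps different flags from always hitting the same early cohort.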
When each fits
| Strategy | Use when |
|---|---|
| Rolling | Stateless service; default risk; default tooling |
| Canary | High-traffic public API; metrics-driven promotion |
| Blue/Green | Stateful; rare big releases; instant rollback critical |
| Recreate | Stateful single-instance; downtime acceptable |
| A/B | Product experiment, not deploy strategy |
Multi-region deploys
US-East: v1 → v2
US-West: v1 → v2 (after East stable)
EU: v1 → v2 (after both US stable)
APAC: v1 → v2 (after all stable)
Region-by-region. Issues caught in one region don’t propagate.
For very critical services: allow a bake time of hours to a day per region before moving to the next.
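The region-by-region gate can be sketched as a loop that stops at the first unstable region. The `deploy` and `is_stable` callbacks are hypothetical placeholders for your deploy tooling and health checks.

```python
from typing import Callable, List, Sequence


def progressive_rollout(
    regions: Sequence[str],
    deploy: Callable[[str], None],
    is_stable: Callable[[str], bool],
) -> List[str]:
    """Deploy region by region; stop at the first region that is not stable.

    Returns the regions that were deployed and confirmed stable.
    """
    done: List[str] = []
    for region in regions:
        deploy(region)
        # In practice this check runs after the bake time has elapsed.
        if not is_stable(region):
            break  # later regions stay on v1
        done.append(region)
    return done
```

If the second region turns out unstable, only the first is returned as done and the remaining regions are never touched. The unstable region itself still needs a rollback, which this sketch leaves to the caller.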
Deploy frequency vs strategy
- Continuous deploys (many per day): rolling + flags + tests.
- Daily / weekly: rolling + canary on critical services.
- Quarterly big releases: blue/green or extensive canary.
Match strategy to cadence. Continuous deploys with full canary every time = paralysis.
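The cadence guidance above can be written down as a lookup, which is handy if a pipeline picks the strategy per service. The `Cadence` enum and `pick_strategy` function are illustrative names; the mapping simply restates the list.

```python
from enum import Enum


class Cadence(Enum):
    CONTINUOUS = "continuous"
    DAILY_WEEKLY = "daily_weekly"
    QUARTERLY = "quarterly"


def pick_strategy(cadence: Cadence, critical: bool = False) -> str:
    """Encode the cadence-to-strategy guidance as a lookup."""
    if cadence is Cadence.CONTINUOUS:
        return "rolling + flags + tests"
    if cadence is Cadence.DAILY_WEEKLY:
        return "canary" if critical else "rolling"
    return "blue/green or extensive canary"
```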
Common mistakes
1. Canary without metrics
You set up traffic split but don’t analyze; “deploy + pray with extra steps.”
2. No automated rollback
A deploy fails at 3am and on-call has to roll back by hand. Wire automatic rollback to SLI breaches instead.
3. DB migration mid-deploy
Old code runs against the new schema and you get subtle bugs. Ship migrations as separate, expand-contract steps, with a window where both old and new code work.
4. Long bake times for routine changes
Every deploy gets a 4-hour canary, which slows shipping. Tier by risk instead: canary for high-risk changes; rolling for routine ones.
5. Different strategy per environment
Staging blue/green; prod rolling. Test what you ship. Mirror strategy between envs.
What I’d ship today
For new K8s apps:
- Rolling by default.
- Argo Rollouts / Flagger canary for customer-facing critical services.
- Blue/green for rare stateful migrations.
- Feature flags for product launches independent of deploys.
- Auto-rollback wired to SLIs.
- Multi-region progressive rollout for global services.
Read this next
If you want my Argo Rollouts canary template, it’s at rajpoot.dev.