Deploy strategies are means, not ends. The question is “what failure mode do you want to mitigate?” Rolling is fine; canary catches mistakes earlier; blue/green enables instant cutover. This post is the working playbook.

Rolling deploy

K8s native; default for Deployments.

Replicas: [v1, v1, v1, v1]
Step 1:   [v1, v1, v1, v2]    # one v1 replaced by v2
Step 2:   [v1, v1, v2, v2]    # another v1 replaced by v2
...
Final:    [v2, v2, v2, v2]

Pros: simple, no extra infra, no traffic split logic. Cons: bad version slowly takes over before metrics catch up; rollback time same as deploy time.

For most services: rolling is enough.
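The replica diagram above can be reproduced with a small simulation. This is a sketch, not how the Kubernetes Deployment controller works: `rolling_update` and `batch_size` are hypothetical names, and it assumes every new pod passes readiness immediately.

```python
from typing import Iterator, List


def rolling_update(replicas: int, old: str = "v1", new: str = "v2",
                   batch_size: int = 1) -> Iterator[List[str]]:
    """Yield the replica set after each step of a simulated rolling update.

    Each step replaces up to `batch_size` old pods with new ones, which
    assumes every new pod becomes ready right away.
    """
    if replicas < 1 or batch_size < 1:
        raise ValueError("replicas and batch_size must be >= 1")
    remaining = replicas
    yield [old] * replicas
    while remaining > 0:
        remaining -= min(batch_size, remaining)
        # Old pods first, new pods last, matching the diagram above.
        yield [old] * remaining + [new] * (replicas - remaining)


if __name__ == "__main__":
    for step, pods in enumerate(rolling_update(4)):
        print(f"Step {step}: {pods}")
```

A larger `batch_size` finishes in fewer steps but puts more of the fleet on an unproven version at once, which is the trade-off in the cons above.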

Canary deploy

A small % of traffic gets the new version; the rest stays on the old one. Watch metrics; promote or roll back.

v1: 95% traffic
v2: 5% traffic
[wait + analyze]
v1: 80%, v2: 20%
[wait + analyze]
v1: 50%, v2: 50%
... promote to 100%

Pros: bad versions affect fewer users; metrics-driven promotion. Cons: needs traffic-splitting infra; analysis logic; longer total deploy time.

For high-traffic public APIs: canary.
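The promote-or-roll-back loop above fits in a few lines. A minimal sketch, assuming a caller-supplied `is_healthy` check; `run_canary` and `CanaryResult` are hypothetical names, and a real controller would shift traffic and wait between steps.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class CanaryResult:
    promoted: bool
    failed_at: Optional[int]  # v2 weight (%) at which analysis failed, if any


def run_canary(weights: Sequence[int],
               is_healthy: Callable[[int], bool]) -> CanaryResult:
    """Walk v2 through `weights`; roll back on the first failed check."""
    for weight in weights:
        if not 0 < weight <= 100:
            raise ValueError("weights must be in (0, 100]")
        # A real controller would route `weight` percent of traffic to v2
        # here and wait before analyzing.
        if not is_healthy(weight):
            return CanaryResult(promoted=False, failed_at=weight)
    return CanaryResult(promoted=True, failed_at=None)


if __name__ == "__main__":
    print(run_canary([5, 20, 50], lambda w: True))
    print(run_canary([5, 20, 50], lambda w: w < 50))
```

The failure weight is returned so you can see how far a bad version got before it was caught.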

Blue/Green

Two full environments; switch traffic atomically.

Blue (current):  ─────┐
                       ├──> [LB] → users
Green (deploying): ───┘

Deploy to Green; verify; switch LB to Green; Blue becomes standby.

Pros: instant cutover; instant rollback (switch back). Cons: 2× infra during deploy; stateful migrations need care.

For low-frequency big releases / DB schema cuts: blue/green.
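The deploy, verify, switch flow above comes down to a pointer flip. A sketch under stated assumptions: `BlueGreenSwitch` is a hypothetical stand-in for the load balancer, and `verify` is whatever smoke test you run against the idle environment.

```python
from typing import Callable


class BlueGreenSwitch:
    """Tracks which of two environments the load balancer points at."""

    def __init__(self, live: str = "blue") -> None:
        if live not in ("blue", "green"):
            raise ValueError("live must be 'blue' or 'green'")
        self.live = live

    @property
    def standby(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def cutover(self, verify: Callable[[str], bool]) -> bool:
        """Switch traffic to the standby only if verify(standby) passes."""
        target = self.standby
        if not verify(target):
            return False  # traffic stays where it was
        self.live = target
        return True

    def rollback(self) -> None:
        """Point traffic back at the previous environment."""
        self.live = self.standby


if __name__ == "__main__":
    lb = BlueGreenSwitch()
    print(lb.cutover(lambda env: True), lb.live)
    lb.rollback()
    print(lb.live)
```

Rollback is the same operation as cutover, which is why it is instant. The catch is that the old environment has to stay up until you trust the new one.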

Argo Rollouts

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata: { name: api }
spec:
  replicas: 10
  # selector and pod template omitted for brevity
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100
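      analysis:
        # Background analysis runs alongside the steps above.
        templates: [{ templateName: success-rate }]
        startingStep: 2   # zero-based step index: starts at setWeight: 25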

Automated canary with metric analysis. Rolls back automatically if the analysis fails, for example on an SLO breach.

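apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata: { name: success-rate }
spec:
  metrics:
    - name: success-rate
      # Pass while at least 99% of requests are non-5xx.
      successCondition: result[0] >= 0.99
      provider:
        prometheus:
          address: http://prometheus.example:9090   # placeholder: your Prometheus URL
          query: |
            sum(rate(http_requests_total{status!~"5.."}[5m])) /
            sum(rate(http_requests_total[5m]))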

Bad version → metric drops → auto-rollback. Powerful when wired to real SLIs.
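To make the gate concrete, here is the same check in plain Python. It is a sketch of the idea, not Argo's implementation: `success_rate`, `analyze`, and `failure_limit` are hypothetical names, and treating zero traffic as healthy is my assumption.

```python
from typing import Iterable, Tuple


def success_rate(total: float, errors_5xx: float) -> float:
    """Share of non-5xx requests; mirrors the PromQL ratio above."""
    if total <= 0:
        return 1.0  # assumption: no traffic is treated as healthy
    return (total - errors_5xx) / total


def analyze(samples: Iterable[Tuple[float, float]],
            threshold: float = 0.99,
            failure_limit: int = 0) -> str:
    """Return "rollback" once more than `failure_limit` samples fall
    below `threshold`; otherwise return "promote"."""
    failures = 0
    for total, errors in samples:
        if success_rate(total, errors) < threshold:
            failures += 1
            if failures > failure_limit:
                return "rollback"
    return "promote"


if __name__ == "__main__":
    # Each sample is (total requests, 5xx responses) for one window.
    print(analyze([(1000, 5), (1000, 8)]))
    print(analyze([(1000, 5), (1000, 20)]))
```

A `failure_limit` above zero tolerates a noisy sample or two, at the cost of slower detection.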

Flagger

Similar concept to Argo Rollouts. Mesh- and ingress-aware (Istio, Linkerd, Traefik). Pick between the two based on the mesh or ingress you already run.

Service mesh integration

For canary, traffic splitting needs:

  • Istio / Linkerd: native traffic split.
  • Argo Rollouts + nginx-ingress: weighted or header-based, via canary annotations.
  • Cloud LB: weighted target groups.

Without traffic-splitting capability, canary degrades to “small replica %”: v2's share of traffic is tied to its share of pods, which is a different (and weaker) signal.

Header-based canary

Header: X-Canary: 1 → routed to v2.
No header → v1.

Internal users / specific cohort gets v2 first; promote based on feedback.

Useful for testing in the real prod environment with an internal-only blast radius.
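The routing rule is small enough to write out. A minimal sketch, assuming the header value must be exactly "1"; `pick_backend` is a hypothetical name, and in practice this logic lives in your ingress or mesh config rather than in application code.

```python
from typing import Mapping

CANARY_HEADER = "X-Canary"


def pick_backend(headers: Mapping[str, str]) -> str:
    """Route to v2 only when the X-Canary header is exactly "1"."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    value = normalized.get(CANARY_HEADER.lower(), "").strip()
    return "v2" if value == "1" else "v1"


if __name__ == "__main__":
    print(pick_backend({"X-Canary": "1"}))
    print(pick_backend({"Accept": "application/json"}))
```

Anything other than an explicit opt-in falls through to v1, so a malformed header never lands a regular user on the canary.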

Database changes

The hardest part of deploys. Strategies:

  • Backwards-compatible migrations (expand-contract). See DB Migrations.
  • Feature flags decoupling code from schema.
  • Two-phase rollouts: deploy migration first; deploy code that uses it second.

Blue/green doesn’t help if schema is shared between blue and green (which it usually is).
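Here is what the expand and backfill phases of expand-contract look like against an in-memory SQLite database. It is a sketch: the `users` table and the move from `name` to `display_name` are made up for illustration, and the contract phase is left as a comment.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    """Expand: add the new column as nullable so old code keeps working."""
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def backfill(conn: sqlite3.Connection) -> None:
    """Migrate: copy existing values into the new column."""
    conn.execute(
        "UPDATE users SET display_name = name WHERE display_name IS NULL"
    )


def demo() -> list:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO users (name) VALUES ('Ada')")

    expand(conn)
    # Old code, unaware of display_name, still inserts successfully.
    conn.execute("INSERT INTO users (name) VALUES ('Alan')")
    # New code writes both columns during the overlap window.
    conn.execute(
        "INSERT INTO users (name, display_name) VALUES ('Grace', 'Grace')"
    )
    backfill(conn)
    # Contract (dropping `name`) comes later, once no deployed version
    # reads it. It is left out of this sketch.
    rows = conn.execute(
        "SELECT name, display_name FROM users ORDER BY id"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(demo())
```

The point is the overlap: after `expand`, both the old and the new code path write valid rows, so the code deploy can use any strategy without caring which version is live.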

Rollback

Strategy     Rollback time
Rolling      Minutes (re-deploy old)
Canary       Seconds (route 100% to old)
Blue/Green   Seconds (LB switch)

Plan for rollback before deploying. The first 10 minutes after deploy are the most dangerous; have rollback ready.

Feature flags

Orthogonal to deploy strategy. Deploy code; toggle on at runtime.

if flags.is_enabled("new_search", user_id):
    return new_search(...)
return old_search(...)

Gradual rollout per user/segment. Decouples release from deploy. See Feature Flags 2026.

Combine: rolling deploy + feature flag = deploy safely, enable gradually.
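One way to back the `flags.is_enabled` call above with a gradual rollout is stable hashing. A minimal sketch, not any particular flag vendor's API: the `Flags` class and `set_rollout` are hypothetical, and flags live in memory here.

```python
import hashlib
from typing import Dict


class Flags:
    """In-memory flag store; each flag is on for a percentage of users."""

    def __init__(self) -> None:
        self._rollout: Dict[str, int] = {}

    def set_rollout(self, flag: str, percent: int) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._rollout[flag] = percent

    def is_enabled(self, flag: str, user_id: str) -> bool:
        percent = self._rollout.get(flag, 0)  # unknown flags are off
        # Hash flag + user so each flag gets its own stable bucket per user.
        digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return bucket < percent


if __name__ == "__main__":
    flags = Flags()
    flags.set_rollout("new_search", 25)
    users = [f"user-{i}" for i in range(10_000)]
    share = sum(flags.is_enabled("new_search", u) for u in users) / len(users)
    print(f"new_search enabled for {share:.1%} of users")
```

Because the bucket is derived from the user ID, a user who has the feature at 25% still has it at 50%, so raising the percentage never flips anyone back.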

When each fits

Strategy     Use when
Rolling      Stateless service; default risk; default tooling
Canary       High-traffic public API; metrics-driven promotion
Blue/Green   Stateful; rare big releases; instant rollback critical
Recreate     Stateful single-instance; downtime acceptable
A/B          Product experiment, not deploy strategy

Multi-region deploys

US-East:  v1 → v2
US-West:  v1 → v2 (after East stable)
EU:       v1 → v2 (after both US stable)
APAC:     v1 → v2 (after all stable)

Region-by-region. Issues caught in one region don’t propagate.

For very critical services: bake time of hours to a day per region.
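The wave ordering above can be expressed as a loop that halts at the first unstable region. A sketch only: `progressive_rollout`, `deploy`, and `is_stable` are hypothetical, and a real pipeline would wait out the bake time before each stability check.

```python
from typing import Callable, List, Sequence


def progressive_rollout(waves: Sequence[str],
                        deploy: Callable[[str], None],
                        is_stable: Callable[[str], bool]) -> List[str]:
    """Deploy region by region; stop at the first region that is unstable.

    Returns the regions that were deployed and judged stable.
    """
    done: List[str] = []
    for region in waves:
        deploy(region)
        # Bake time would elapse here before the stability check.
        if not is_stable(region):
            break  # later regions are never touched
        done.append(region)
    return done


if __name__ == "__main__":
    touched: List[str] = []
    stable = progressive_rollout(
        ["us-east", "us-west", "eu", "apac"],
        deploy=touched.append,
        is_stable=lambda region: region != "eu",
    )
    print("stable:", stable)
    print("touched:", touched)
```

In the example, a failure in the EU wave means APAC never receives v2, which is the containment the region-by-region order buys you.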

Deploy frequency vs strategy

  • Continuous deploys (many per day): rolling + flags + tests.
  • Daily / weekly: rolling + canary on critical services.
  • Quarterly big releases: blue/green or extensive canary.

Match strategy to cadence. Continuous deploys with full canary every time = paralysis.

Common mistakes

1. Canary without metrics

You set up traffic split but don’t analyze; “deploy + pray with extra steps.”

2. No automated rollback

Deploy fails at 3am; on-call manually rolls back. Fix: auto-rollback on SLI breach.

3. DB migration mid-deploy

Old code + new schema; subtle bugs. Fix: run migrations as separate, expand-contract steps, with an overlap window where both code versions work.

4. Long bake times for routine changes

Every deploy a 4-hour canary. Slows shipping. Fix: tier by risk; canary for high-risk changes, rolling for routine ones.

5. Different strategy per environment

Staging blue/green; prod rolling. Test what you ship. Mirror strategy between envs.

What I’d ship today

For new K8s apps:

  • Rolling by default.
  • Argo Rollouts / Flagger canary for customer-facing critical services.
  • Blue/green for rare stateful migrations.
  • Feature flags for product launches independent of deploys.
  • Auto-rollback wired to SLIs.
  • Multi-region progressive rollout for global services.

Read this next

If you want my Argo Rollouts canary template, it’s at rajpoot.dev.

