Nginx Cheatsheet 16 — High Availability

Nginx HA cheatsheet.

Approaches

Cloud LB in front of nginx: simplest. Cloud handles VIP + health checks.
keepalived (VRRP): shared VIP between 2+ nginx hosts.
DNS round-robin / weighted: lightweight, slow failover.
Anycast BGP: global, complex.

Cloud LB (preferred)

[Cloud LB] → [nginx-1, nginx-2, nginx-3] → [backends]

Run the nginx instances in an auto-scaling group. The LB's health check takes unhealthy instances out of rotation.

Health endpoint:

location = /health {
    access_log off;
    return 200 "ok";
}

keepalived (Linux)

/etc/keepalived/keepalived.conf (primary):

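# Define the check before the instance that tracks it.
vrrp_script chk_nginx {
    script "pidof nginx"
    interval 2
    weight -60      # 150 - 60 = 90, below the secondary's 100
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 150
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass secret
    }
    virtual_ipaddress {
        192.168.1.100
    }
    track_script {
        chk_nginx
    }
}

Secondary: same, but with state BACKUP and priority 100.

If the primary’s nginx dies, the check fails and the primary’s priority drops by 60 to 90, below the secondary’s 100, so the secondary takes the VIP. The penalty must exceed the gap between the two priorities (50 here), or the primary keeps the VIP.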

Active-passive vs active-active

Active-passive: VIP only on one host at a time. Simple.
Active-active: VIP load-balanced (anycast, ECMP, cloud LB). More complex.

Session sharing

Multi-instance nginx + cookies/sessions:

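Use sticky sessions: ip_hash (weak: clients behind one NAT all land on the same backend, and a client whose IP changes loses affinity) or the NGINX Plus sticky directive.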
Or use Redis-backed sessions in app — no sticky needed.
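The sticky options above can be sketched in open-source nginx roughly as follows. This is a minimal illustration, not a recommended production layout: the upstream name app_backend, the backend addresses, and the cookie name sessionid are all assumptions.

```nginx
# Goes inside the http { } context.
upstream app_backend {
    # Option A: affinity by client address. For IPv4 only the first three
    # octets are hashed, so clients in the same /24 share a backend.
    ip_hash;

    # Option B (use instead of ip_hash): affinity by a session cookie.
    # "sessionid" is an assumed cookie name; adjust to the app's cookie.
    # hash $cookie_sessionid consistent;

    server 10.0.0.11:8080;   # placeholder backend addresses
    server 10.0.0.12:8080;
}

server {
    listen 80;

    location / {
        proxy_pass http://app_backend;
    }
}
```

With option B, requests that carry no cookie all hash to the same empty key and reach the same backend, so it works best when the app sets the cookie on the first response.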

DNS-based HA

example.com.  60  IN  A  1.2.3.4
example.com.  60  IN  A  5.6.7.8

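Resolvers rotate the record order, and clients usually try the first address. Failover is slow: a dead address keeps being handed out until the TTL expires, and some resolvers and clients cache for longer than that.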

Better: health-checked DNS (Route53, NS1, Cloudflare).

Blue-green deploy

Two nginx pools, blue and green. The cloud LB or DNS points at the active one.

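# Blue is live, green is idle (dns-or-lb and update-app are placeholder commands)
update-app green                    # upgrade the idle pool first
dns-or-lb set example.com → green   # switch traffic to green
sleep 60                            # let blue drain
update-app blue                     # blue is now idle: upgrade it, or keep it for rollback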

Rolling upgrade (single host)

nginx -s reload                    # picks up new config without dropping conns

Or binary upgrade:

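# $NGINX_PID is the old master's PID, captured before sending USR2
kill -USR2 $NGINX_PID              # start a new master and workers on the new binary
kill -WINCH $NGINX_PID             # gracefully stop the old workers
kill -QUIT $NGINX_PID              # stop the old master

This gives a zero-downtime upgrade: old and new processes share the listening sockets, so no connections are refused. Until QUIT is sent, the old master can be sent HUP to restart its workers and roll back.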

Configuration sync across hosts

Use a config management tool: Ansible, Puppet, Terraform, GitOps.

# Ansible playbook (snippet); assumes a "reload nginx" handler is defined elsewhere
- name: deploy nginx config
  copy:
    src: nginx.conf
    dest: /etc/nginx/nginx.conf
  notify: reload nginx

Anycast (global)

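The same IP is advertised from multiple POPs via BGP, and routers deliver each packet to the “nearest” POP in routing terms. Used by CDNs and DNS providers.

Requires your own IP block (at least a /24 for IPv4, the smallest prefix most networks accept) plus BGP peering.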

Multi-region (active-active)

DNS / GeoDNS / global LB
       ↓
[US region nginx → US backends]
[EU region nginx → EU backends]
[AP region nginx → AP backends]

Each region serves locally. Stateful tier replicates (cross-region DB).

Cross-region failover

DNS health checks (Route53 latency + failover policy).
Cloud global LB (GCP, AWS Global Accelerator).
Application-level (CDN with origin failover).

Backups

Nginx is mostly stateless. Things to back up:

/etc/nginx/.
TLS certs (/etc/letsencrypt/).
Custom Lua / configs.

Disaster recovery

Treat nginx hosts as cattle. Rebuild from config repo + cert manager + auto-renewal. Time to recovery should be minutes.

Monitoring

HTTP-level: 4xx/5xx rates, latency.
nginx-level: active connections, accepts, handled.
Host-level: CPU, mem, network, fds.
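The nginx-level counters (active connections, accepts, handled) come from the stub_status module. A minimal sketch of exposing them to a local metrics agent follows; the port 8080 and the path /nginx_status are assumptions, and the module must be compiled in (most distribution packages include it).

```nginx
# Goes inside the http { } context.
server {
    listen 127.0.0.1:8080;       # assumed local-only port

    location = /nginx_status {   # assumed path
        stub_status;             # needs ngx_http_stub_status_module
        access_log off;
        allow 127.0.0.1;         # only the local metrics agent may read it
        deny all;
    }
}
```

The response is a few lines of plain text with the active connection count, the cumulative accepts, handled, and requests counters, and the current Reading, Writing, and Waiting counts. A gap between accepts and handled suggests connections are being dropped, for example because a resource limit was reached.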

Alert when:

Any nginx instance unreachable.
5xx > 1%.
Latency p95 > X.
TLS cert expiring < 14 days.

Graceful drain on shutdown

Before stopping nginx, deregister the host from the LB so new traffic stops arriving:

# LB API: deregister
# Wait 30s
# Then: systemctl stop nginx
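Where the LB has no deregistration API, one possible alternative is to make the health endpoint fail on demand so the LB's own health check removes the host. The sketch below extends the /health location from the Cloud LB section; the flag file path /etc/nginx/drain is an assumption.

```nginx
location = /health {
    access_log off;

    # Fail the check while the flag file exists (created with
    # `touch /etc/nginx/drain`). The test runs per request, so no reload
    # is needed. The path is a hypothetical choice.
    if (-f /etc/nginx/drain) {
        return 503 "draining";
    }

    return 200 "ok";
}
```

After creating the flag file, wait for the LB to mark the host unhealthy and for in-flight requests to finish, then stop nginx. Remove the file before the host goes back into rotation.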

Common mistakes

Single nginx without LB / VIP → SPOF.
Stale nginx.conf across hosts.
Cert renewal on one host but not synced.
Sticky sessions hiding broken session-sharing logic.
DNS TTL = 86400 → failover takes a day.
