Nginx HA cheatsheet.
Approaches
- Cloud LB in front of nginx: simplest. Cloud handles VIP + health checks.
- keepalived (VRRP): shared VIP between 2+ nginx hosts.
- DNS round-robin / weighted: lightweight, slow failover.
- Anycast BGP: global, complex.
Cloud LB (preferred)
[Cloud LB] → [nginx-1, nginx-2, nginx-3] → [backends]
Auto-scaling group of nginx instances. The LB's health check removes unhealthy instances from rotation.
Health endpoint:
location = /health {
    access_log off;
    return 200 "ok";
}
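On the LB side, the check against /health can be sketched in Python. `probe` does one GET. `HealthState` applies rise/fall thresholds so that a single dropped request does not eject an instance. The thresholds (rise=2, fall=3) and any instance URLs are assumptions. Cloud LBs expose their own equivalents (interval, healthy/unhealthy threshold).

```python
import urllib.request


def probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if GET url answers HTTP 200 within timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError (non-2xx) and socket timeouts all subclass OSError.
        return False


class HealthState:
    """Track one nginx instance like a typical LB health check: mark it down
    after `fall` consecutive failures, back up after `rise` consecutive
    successes (hypothetical defaults)."""

    def __init__(self, rise: int = 2, fall: int = 3) -> None:
        self.rise, self.fall = rise, fall
        self.healthy = True
        self._streak = 0  # consecutive results that disagree with `healthy`

    def record(self, ok: bool) -> bool:
        if ok == self.healthy:
            self._streak = 0
        else:
            self._streak += 1
            needed = self.rise if ok else self.fall
            if self._streak >= needed:
                self.healthy = ok
                self._streak = 0
        return self.healthy
```

A typical loop would call `state.record(probe("http://nginx-1/health"))` every few seconds and add or remove the host from the pool when `healthy` flips.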
keepalived (Linux)
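/etc/keepalived/keepalived.conf (primary):
vrrp_script chk_nginx {
    # exits non-zero when no nginx process is running
    script "pidof nginx"
    interval 2
    # must be larger than the priority gap (150 - 100 = 50)
    weight -60
}
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 150
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass secret
    }
    virtual_ipaddress {
        192.168.1.100
    }
    track_script {
        chk_nginx
    }
}
Secondary: same but state BACKUP and priority 100.
If the primary's nginx dies, chk_nginx fails and the primary's priority drops to 150 - 60 = 90. That is below the secondary's 100, so the secondary takes the VIP. (With weight -20 it would only drop to 130 and keep the VIP.)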
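A small Python model of the election shows why the weight matters. The model assumes keepalived's rule: a negative weight is applied while the tracked script fails, a positive one is added while it succeeds. Among nodes, the highest effective priority wins, and ties go to the higher IP, per VRRP. Preemption, `nopreempt`, and advert timing are not modeled, and the node IPs are hypothetical. With priorities 150/100 and weight -20, a failed primary still advertises 130 and keeps the VIP. The negative weight has to exceed the 50-point gap.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    ip: str
    priority: int                                # base `priority` from keepalived.conf
    checks: list = field(default_factory=list)   # (weight, script_ok) pairs

    def effective_priority(self) -> int:
        p = self.priority
        for weight, ok in self.checks:
            # Positive weight counts while the script succeeds,
            # negative weight counts while it fails.
            if (weight > 0 and ok) or (weight < 0 and not ok):
                p += weight
        return max(1, min(254, p))  # simplification: clamp to the 1..254 range


def elect_master(nodes):
    """Highest effective priority wins; ties go to the higher IP address."""
    return max(nodes, key=lambda n: (n.effective_priority(),
                                     ipaddress.ip_address(n.ip)))
```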
Active-passive vs active-active
- Active-passive: VIP only on one host at a time. Simple.
- Active-active: all hosts serve traffic at the same time, spread by anycast, ECMP, or a cloud LB. More complex.
Session sharing
Multi-instance nginx + cookies/sessions:
- Sticky sessions: ip_hash (weak, keyed on client IP), or the Nginx Plus sticky directive.
- Or use Redis-backed sessions in the app, so no stickiness is needed.
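The reason ip_hash counts as "weak" becomes clearer with a Python sketch of an ip_hash-like choice. It keys on the IPv4 /24, as nginx's ip_hash does with the first three octets (the whole address for IPv6). The mapping is deterministic, so every nginx instance with the same upstream list agrees without shared state. On the other hand, a whole NAT'd office lands on one backend, and changing the list remaps clients. The SHA-256 hash is a stand-in, not nginx's actual hash function, and the backend names are hypothetical.

```python
import hashlib
import ipaddress


def pick_backend(client_ip: str, backends: list[str]) -> str:
    """Map a client to a backend by hashing its /24 (IPv4) or full IPv6 address."""
    addr = ipaddress.ip_address(client_ip)
    if addr.version == 4:
        key = str(ipaddress.ip_network(f"{client_ip}/24", strict=False))
    else:
        key = str(addr)
    digest = hashlib.sha256(key.encode()).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]
```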
DNS-based HA
A example.com 1.2.3.4 ttl=60
A example.com 5.6.7.8 ttl=60
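Resolvers usually rotate the record order, so clients spread across the IPs. Failover is slow: a dead IP stays in answers until someone removes it, and clients keep cached answers until the TTL (plus any extra resolver caching) expires.
Better: health-checked DNS that stops returning a failed IP (Route53, NS1, Cloudflare).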
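The worst case for health-checked DNS failover can be estimated roughly: detection time (check interval × failure threshold) plus one full TTL of caching. The 30 s interval and threshold of 3 used below are assumptions in the spirit of common managed-DNS defaults, not any provider's guaranteed values. Clients that cache past the TTL can make it worse.

```python
def worst_case_dns_failover(ttl_s: int, check_interval_s: int = 30,
                            failure_threshold: int = 3) -> int:
    """Upper-bound estimate, in seconds, before clients stop using a dead IP."""
    detection = check_interval_s * failure_threshold  # time to mark it unhealthy
    return detection + ttl_s                          # then cached answers expire


print(worst_case_dns_failover(60))      # 150 s, about 2.5 minutes
print(worst_case_dns_failover(86400))   # 86490 s, just over a day
```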
Blue-green deploy
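Two nginx pools, blue and green. The cloud LB or DNS points at the live one; deploy to the idle one, then switch.
# Switch (dns-or-lb and update-app stand in for your own tooling)
# Blue is live, green is idle:
update-app green                  # deploy to the idle pool first
# smoke-test green directly before it takes traffic
dns-or-lb set example.com → green
sleep 60                          # let connections and DNS caches drain off blue
# Blue keeps the old version as the rollback target:
#   dns-or-lb set example.com → blue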
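Here is a blue-green release as a small in-memory Python model. Deploy to the idle pool, smoke-test it, and only then flip the live pointer, so rollback is just flipping back. The class, its fields, and the `smoke_test` callable are hypothetical. A real switch would call your LB or DNS API instead of assigning a field.

```python
from dataclasses import dataclass, field


@dataclass
class BlueGreen:
    live: str = "blue"  # pool the LB / DNS currently points at
    versions: dict = field(default_factory=lambda: {"blue": "v1", "green": "v1"})

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def release(self, version: str, smoke_test) -> bool:
        target = self.idle
        self.versions[target] = version  # deploy only to the pool with no traffic
        if not smoke_test(target):
            return False                 # live pool untouched; nothing to undo
        self.live = target               # the actual cutover
        return True

    def rollback(self) -> None:
        self.live = self.idle            # previous pool still runs the old version
```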
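Zero-downtime reload and upgrade (single host)
nginx -t && nginx -s reload   # validate, then pick up new config without dropping conns
Or binary upgrade (after installing the new binary):
OLD_PID=$(cat /run/nginx.pid)   # pid file path varies by distro/build
kill -USR2 $OLD_PID    # spawn new master; old pid file is renamed to nginx.pid.oldbin
kill -WINCH $OLD_PID   # gracefully stop old workers
kill -QUIT $OLD_PID    # stop old master once the new one is confirmed healthy
Zero downtime upgrade.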
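The binary-upgrade signal sequence can be scripted in Python with `os.kill`. The sketch includes the rollback described in nginx's docs: if the new master is bad, HUP the old master so it respawns workers, then QUIT the new master. The pid file path, the `verify` callable, and the injectable `kill` (used for testing) are assumptions.

```python
import os
import signal
import time
from pathlib import Path


def binary_upgrade(pid_file="/run/nginx.pid", verify=lambda new_pid: True,
                   timeout=10.0, kill=os.kill):
    """USR2 -> WINCH -> QUIT against the old master. Assumes the new binary
    is already installed. Returns the new master's PID."""
    path = Path(pid_file)
    old_pid = int(path.read_text().strip())
    kill(old_pid, signal.SIGUSR2)  # old master renames its pid file to *.oldbin

    # Wait for the new master to write a fresh pid file with a different PID.
    deadline = time.monotonic() + timeout
    while True:
        try:
            new_pid = int(path.read_text().strip())
        except (FileNotFoundError, ValueError):
            new_pid = old_pid  # not written yet, or half-written
        if new_pid != old_pid:
            break
        if time.monotonic() > deadline:
            raise TimeoutError("new master never wrote its pid file")
        time.sleep(0.1)

    kill(old_pid, signal.SIGWINCH)      # old workers finish requests and exit
    if not verify(new_pid):
        kill(old_pid, signal.SIGHUP)    # old master starts workers again
        kill(new_pid, signal.SIGQUIT)   # new master shuts down gracefully
        raise RuntimeError("new binary failed verification; rolled back")
    kill(old_pid, signal.SIGQUIT)       # old master exits; upgrade complete
    return new_pid
```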
Configuration sync across hosts
Use a config management tool: Ansible, Puppet, Terraform, GitOps.
# Ansible playbook (snippet)
- name: deploy nginx config
  copy:
    src: nginx.conf
    dest: /etc/nginx/nginx.conf
  notify: reload nginx   # needs a handler named "reload nginx"
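A drift check in Python can catch a stale nginx.conf: hash each host's copy and flag hosts that differ from the most common version. Collecting the files (SSH, Ansible fetch, an agent) is left out. The `configs` mapping and the host names are assumed inputs.

```python
import hashlib
from collections import Counter


def config_drift(configs: dict[str, bytes]) -> list[str]:
    """Given host -> nginx.conf contents, return hosts that differ from the majority."""
    if not configs:
        return []
    digests = {host: hashlib.sha256(body).hexdigest() for host, body in configs.items()}
    majority, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(host for host, d in digests.items() if d != majority)
```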
Anycast (global)
Same IP advertised from multiple POPs via BGP. Routers send packets to “nearest.” Used by CDNs / DNS providers.
Requires your own IP prefix (in practice at least an IPv4 /24, the smallest most networks accept) plus BGP peering at each POP.
Multi-region (active-active)
DNS / GeoDNS / global LB
↓
[US region nginx → US backends]
[EU region nginx → EU backends]
[AP region nginx → AP backends]
Each region serves locally. Stateful tier replicates (cross-region DB).
Cross-region failover
- DNS health checks (Route53 latency + failover policy).
- Cloud global LB (GCP, AWS Global Accelerator).
- Application-level (CDN with origin failover).
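The decision a GeoDNS failover policy or global LB makes can be sketched in Python. A client is served by its own region if that region is healthy, otherwise by the next healthy region in its preference list. The region names and failover order are hypothetical.

```python
PREFERENCE = {  # hypothetical failover order per client region
    "us": ["us", "eu", "ap"],
    "eu": ["eu", "us", "ap"],
    "ap": ["ap", "us", "eu"],
}


def route(client_region: str, healthy: set[str]) -> str:
    """Return the region that should serve this client."""
    for region in PREFERENCE.get(client_region, ["us", "eu", "ap"]):
        if region in healthy:
            return region
    raise RuntimeError("no healthy region")
```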
Backups
Nginx is mostly stateless. Things to back up:
- /etc/nginx/.
- TLS certs (/etc/letsencrypt/).
- Custom Lua / configs.
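The backup list can be turned into a timestamped tarball with Python's `tarfile`. Symlinks are stored as symlinks by default, which keeps certbot's live/ to archive/ layout intact when the whole /etc/letsencrypt tree is included. The destination directory and the extra paths are assumptions. The archive contains private keys, so store it encrypted.

```python
import tarfile
import time
from pathlib import Path

BACKUP_DIRS = ("/etc/nginx", "/etc/letsencrypt")  # add custom Lua paths as needed


def backup(dest_dir: str, sources=BACKUP_DIRS) -> Path:
    """Write nginx-backup-<timestamp>.tar.gz containing the existing sources."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    out = Path(dest_dir) / f"nginx-backup-{stamp}.tar.gz"
    with tarfile.open(out, "w:gz") as tar:
        for src in sources:
            if Path(src).exists():  # skip paths this host doesn't have
                tar.add(src, arcname=src.lstrip("/"))
    return out
```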
Disaster recovery
Treat nginx hosts as cattle. Rebuild from config repo + cert manager + auto-renewal. Time to recovery should be minutes.
Monitoring
- HTTP-level: 4xx/5xx rates, latency.
- nginx-level: active connections, accepts, handled.
- Host-level: CPU, mem, network, fds.
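The nginx-level counters come from the stub_status page. This assumes nginx was built with ngx_http_stub_status_module (most distro packages are) and exposes it at a hypothetical `location = /nginx_status { stub_status; }`. A Python parser for its plain-text output is sketched below. If accepts is greater than handled, nginx has usually hit a resource limit such as worker_connections.

```python
import re


def parse_stub_status(text: str) -> dict[str, int]:
    """Parse nginx stub_status output into integer counters."""
    active = int(re.search(r"Active connections:\s*(\d+)", text).group(1))
    accepts, handled, requests = map(
        int, re.search(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s*$", text, re.M).groups())
    rww = {k.lower(): int(v)
           for k, v in re.findall(r"(Reading|Writing|Waiting):\s*(\d+)", text)}
    return {"active": active, "accepts": accepts, "handled": handled,
            "requests": requests, **rww}
```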
Alert when:
- Any nginx instance unreachable.
- 5xx > 1%.
- Latency p95 > X.
- TLS cert expiring < 14 days.
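The 5xx, latency and certificate rules can be evaluated per window with a short Python sketch. The p95 limit X stays a parameter, p95 uses the nearest-rank method, and `cert_not_after` is the OpenSSL text form that `ssl.cert_time_to_seconds` parses. The "instance unreachable" rule is what an LB-style health check covers, so it is left out here. The window inputs are assumed to come from your metrics pipeline.

```python
import math
import ssl
import time


def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(latencies_ms)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]


def alerts(total, errors_5xx, latencies_ms, cert_not_after, p95_limit_ms,
           now=None):
    """Return the names of the rules that fire for one window.
    cert_not_after looks like "Jun  1 12:00:00 2030 GMT"."""
    current = time.time() if now is None else now
    fired = []
    if total and errors_5xx / total > 0.01:
        fired.append("5xx rate above 1%")
    if latencies_ms and p95(latencies_ms) > p95_limit_ms:
        fired.append("p95 latency above limit")
    days_left = (ssl.cert_time_to_seconds(cert_not_after) - current) / 86400
    if days_left < 14:
        fired.append("TLS cert expires in < 14 days")
    return fired
```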
Graceful drain on shutdown
Before killing nginx, deregister from LB so traffic stops:
# LB API: deregister
# Wait 30s
# Then: systemctl stop nginx
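The drain steps can be sketched in Python. The function deregisters, polls until no connections remain or the timeout passes, then stops nginx either way. All three callables are hypothetical stand-ins for your LB API, a connection counter (for example one built on stub_status), and `systemctl stop nginx`. The clock and sleep are injectable for testing.

```python
import time


def drain_and_stop(deregister, active_connections, stop,
                   timeout_s=30.0, poll_s=1.0,
                   sleep=time.sleep, clock=time.monotonic) -> bool:
    """Return True if connections reached zero before the timeout."""
    deregister()  # LB stops sending new traffic
    deadline = clock() + timeout_s
    drained = False
    while clock() < deadline:
        if active_connections() == 0:
            drained = True
            break
        sleep(poll_s)
    stop()  # stop regardless; remaining connections get cut after the timeout
    return drained
```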
Common mistakes
- Single nginx without LB / VIP → SPOF.
- Stale nginx.conf across hosts.
- Cert renewal on one host but not synced.
- Sticky sessions hiding broken session-sharing logic.
- DNS TTL = 86400 → failover takes a day.
Read this next
If you want my HA + keepalived setup, it's at rajpoot.dev.