A load balancer is one of those infrastructure pieces that sits between your users and your app, and that most app developers never think about until something goes wrong. “Why is one of my pods getting all the traffic?” “Why did the deploy take down all my websocket connections?” “Should I use ALB or NLB?” These questions all live in load-balancer land.
This post explains what load balancers actually do, the algorithms that distribute traffic, the difference between L4 and L7 (and when it matters), and the tools that are worth knowing in 2026.
What a load balancer does
At its simplest: take incoming traffic and send each connection (or request) to one of N backends. That’s it. The interesting stuff is how it picks which backend.
In a typical setup:
[ Client ] → [ Load Balancer ] → [ Backend 1 ]
                               → [ Backend 2 ]
                               → [ Backend 3 ]
This gives you four big wins:
- Horizontal scale — add more backends to handle more traffic.
- Fault tolerance — one backend dies, the others keep serving.
- Zero-downtime deploys — drain traffic from one pod at a time.
- A single stable address for clients (DNS A record, IP, or hostname) — backends can come and go.
L4 vs L7: the layer matters
The biggest distinction in load balancers is which OSI layer they operate at:
Layer 4 (transport)
Operates on TCP/UDP. Sees connections, ports, source/dest IPs. Doesn’t look at the request itself.
- Forwards the raw connection to a backend.
- Very fast (no parsing, no decryption).
- Can handle any protocol on top of TCP/UDP — HTTP, gRPC, MySQL, Redis, your custom binary protocol.
Examples: AWS NLB, GCP Network Load Balancer, HAProxy in tcp mode, Envoy at L4.
Layer 7 (application)
Understands HTTP. Can route based on path, host, headers, cookies. Can terminate TLS. Can rewrite requests.
- Slower (has to parse and sometimes decrypt).
- Can do things L4 can’t: route /api/v1/* to one fleet and /static/* to another; sticky sessions via cookies; rate limit by header; respond with 503 directly when backends are bad.
Examples: AWS ALB, GCP HTTP(S) LB, Nginx, Envoy at L7, HAProxy in http mode, Cloudflare.
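To make the path-based routing idea concrete, here is a minimal Python sketch of an L7 rule list. The pool names (`api-fleet`, `static-fleet`, `web-fleet`) and the `route` helper are made up for illustration; real load balancers express this as configuration rather than application code.

```python
# Hypothetical routing table: (path prefix, backend pool). First match wins.
ROUTES = [
    ("/api/v1/", "api-fleet"),
    ("/static/", "static-fleet"),
]
DEFAULT_POOL = "web-fleet"  # assumed catch-all pool


def route(path: str) -> str:
    """Return the pool for a request path, the way an L7 rule list would."""
    for prefix, pool in ROUTES:
        if path.startswith(prefix):
            return pool
    return DEFAULT_POOL
```

An L4 balancer cannot make this decision because it never parses the request, so it never sees the path.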
Which to use?
- For HTTP APIs and websites: L7. The flexibility (routing rules, header manipulation, observability) is worth it.
- For non-HTTP TCP services (databases, message queues, custom protocols): L4.
- For raw throughput (millions of connections, low latency required): L4.
- For mTLS or end-to-end encryption that you don’t want the LB to terminate: L4.
Most app traffic in 2026 wants L7. AWS ALB, Cloudflare, GCP HTTPS LB — all L7.
The distribution algorithms
Once a request arrives, how does the LB pick a backend?
Round-robin
Cycle through backends in order. Backend 1, 2, 3, 1, 2, 3, …
Pros: simple, predictable, no state. Cons: treats all backends and all requests the same. A slow backend gets the same load as a fast one.
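A minimal sketch of round-robin in Python, assuming a fixed backend list. The `RoundRobin` class and backend names are illustrative only.

```python
from itertools import cycle


class RoundRobin:
    """Cycle through backends in a fixed order."""

    def __init__(self, backends):
        backends = list(backends)
        if not backends:
            raise ValueError("need at least one backend")
        self._next = cycle(backends)

    def pick(self):
        # Every call advances the cursor by one, regardless of backend load.
        return next(self._next)


lb = RoundRobin(["backend-1", "backend-2", "backend-3"])
picks = [lb.pick() for _ in range(6)]
```

The only thing the balancer remembers is its position in the cycle, which is why it cannot notice that one backend is slower than the others.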
Weighted round-robin
Like round-robin but each backend has a weight. A backend with weight 2 gets twice the traffic of one with weight 1.
When to use: mixed instance sizes, canary deployments (5% to the new version, 95% to old).
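One way to implement the weights is the "smooth" weighted round-robin variant, which interleaves picks instead of sending bursts to the heavy backend. The sketch below is an illustration under that assumption, not any particular product's implementation, and the `stable`/`canary` names are hypothetical.

```python
class SmoothWeightedRoundRobin:
    """Weighted round-robin that spreads each backend's picks evenly."""

    def __init__(self, weights):
        # weights: dict of backend name -> positive integer weight
        if not weights or any(w <= 0 for w in weights.values()):
            raise ValueError("weights must be a non-empty dict of positive numbers")
        self._weights = dict(weights)
        self._current = {name: 0 for name in weights}
        self._total = sum(weights.values())

    def pick(self):
        # 1. Every backend earns credit equal to its weight.
        for name, weight in self._weights.items():
            self._current[name] += weight
        # 2. The backend with the most credit wins this round...
        chosen = max(self._current, key=self._current.get)
        # 3. ...and pays back the total weight, so others catch up.
        self._current[chosen] -= self._total
        return chosen


# A 5% canary: weight 1 out of a total of 20.
canary_lb = SmoothWeightedRoundRobin({"stable": 19, "canary": 1})
```

Over any full cycle of 20 picks, `canary` is chosen once and `stable` 19 times, which matches the 5%/95% split described above.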
Least connections
Send the next request to whichever backend has the fewest active connections.
Pros: naturally balances when backends process requests at different speeds. Cons: more state to track; not great for very short connections.
This is a sensible default for most HTTP APIs.
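A minimal sketch of least-connections bookkeeping, assuming the balancer is told when a connection opens and closes. The `LeastConnections` class and its `acquire`/`release` method names are hypothetical.

```python
class LeastConnections:
    """Track active connections and pick the least busy backend."""

    def __init__(self, backends):
        self._active = {b: 0 for b in backends}
        if not self._active:
            raise ValueError("need at least one backend")

    def acquire(self):
        # Ties go to the backend listed first.
        backend = min(self._active, key=self._active.get)
        self._active[backend] += 1
        return backend

    def release(self, backend):
        # Call when the connection (or request) finishes.
        if self._active[backend] == 0:
            raise ValueError(f"{backend} has no active connections")
        self._active[backend] -= 1
```

The per-backend counter is the extra state mentioned above: a slow backend holds its connections longer, so its count stays high and it naturally receives less new work.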
IP hash / consistent hash
Hash the client’s IP (or another key) and map to a backend. Same client always lands on the same backend.
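Pros: session affinity without cookies; cache locality. Cons: uneven if a few clients send most traffic; with a plain modulo hash, adding or removing a backend remaps most clients, whereas consistent hashing limits the reshuffle to roughly the share owned by the changed backend.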
Useful for caching servers where you want the same key to consistently hit the same node.
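Here is a rough sketch of a consistent hash ring with virtual nodes. The `HashRing` class, the choice of SHA-256, and the 100 virtual nodes per backend are assumptions made for illustration; production LBs use their own tuned variants.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit point on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, backends, vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (point, backend)
        for backend in backends:
            self.add(backend)

    def add(self, backend):
        # Several points per backend smooth out the key distribution.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{backend}#{i}"), backend))

    def remove(self, backend):
        self._ring = [(p, b) for p, b in self._ring if b != backend]

    def pick(self, key: str) -> str:
        if not self._ring:
            raise LookupError("no backends on the ring")
        # First point at or after the key's hash, wrapping around at the end.
        idx = bisect.bisect_left(self._ring, (_hash(key),))
        return self._ring[idx % len(self._ring)][1]
```

When a backend is removed, only the keys that mapped to its points move; every other key keeps its backend, which is the property that makes this useful for caches.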
Random
Just pick a backend at random.
Pros: stateless and surprisingly even at scale. Cons: worse tail latency than least-connections.
Power of two choices
Pick two backends at random; send to whichever has fewer connections. Almost as good as least-connections but with much less state. Used internally by Envoy and many modern LBs.
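A minimal sketch of power of two choices, assuming the balancer keeps a per-backend connection count. The `PowerOfTwoChoices` class is hypothetical, and the injectable `rng` parameter exists only to make the behavior reproducible in tests.

```python
import random


class PowerOfTwoChoices:
    """Sample two backends at random and use the less loaded one."""

    def __init__(self, backends, rng=None):
        self._active = {b: 0 for b in backends}
        if not self._active:
            raise ValueError("need at least one backend")
        self._names = list(self._active)
        self._rng = rng or random.Random()

    def acquire(self):
        if len(self._names) == 1:
            candidates = self._names
        else:
            # Two distinct backends, chosen uniformly at random.
            candidates = self._rng.sample(self._names, 2)
        backend = min(candidates, key=self._active.get)
        self._active[backend] += 1
        return backend

    def release(self, backend):
        self._active[backend] -= 1

    def load(self):
        return dict(self._active)
```

Each decision compares only two counters instead of scanning all of them, which is why this scales well when many balancer instances each hold a slightly stale view of the backends.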
Health checks
A backend that’s down should not receive traffic. Health checks decide which backends are eligible:
- Active health checks: LB pings each backend (e.g. GET /health every 5s).
- Passive health checks: LB observes real traffic and removes backends that fail.
You almost always want both. Active catches “process is down”; passive catches “process is up but failing.”
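Both kinds of check can feed the same bookkeeping. The sketch below assumes a simple consecutive-failure rule; the `BackendHealth` class and the thresholds (3 failures to eject, 2 successes to restore) are illustrative values, not defaults of any specific LB.

```python
class BackendHealth:
    """Decide eligibility from a stream of ok/failed observations."""

    def __init__(self, backends, unhealthy_after=3, healthy_after=2):
        self._unhealthy_after = unhealthy_after  # assumed threshold
        self._healthy_after = healthy_after      # assumed threshold
        self._fails = {b: 0 for b in backends}
        self._oks = {b: 0 for b in backends}
        self._healthy = {b: True for b in backends}

    def record(self, backend, ok):
        """Feed one observation: an active probe result or a real response."""
        if ok:
            self._fails[backend] = 0
            self._oks[backend] += 1
            if self._oks[backend] >= self._healthy_after:
                self._healthy[backend] = True
        else:
            self._oks[backend] = 0
            self._fails[backend] += 1
            if self._fails[backend] >= self._unhealthy_after:
                self._healthy[backend] = False

    def eligible(self):
        return [b for b, healthy in self._healthy.items() if healthy]
```

Requiring several consecutive results in each direction keeps a single dropped probe from ejecting a backend, and keeps a flapping backend from bouncing in and out of rotation.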
A few rules of thumb for designing the health endpoint:
- Make it cheap. It runs every few seconds per LB instance, so it shouldn’t run heavy queries or do expensive work on each call.
- Make it meaningful. Returning 200 OK from /health while the DB is unreachable means the LB sends real traffic to a broken backend. Check the dependencies your app actually needs.
- Make it boring. Don’t put new code in the health endpoint. It should be the most stable endpoint in your service.
A common pattern: /healthz is a fast liveness check (process responds), /readyz checks dependencies (DB, cache, upstreams). LB uses /readyz.
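The liveness/readiness split can be sketched as two framework-agnostic handlers. The function names, the `(status, body)` return shape, and the `checks` dictionary are assumptions for illustration; wire them into whatever web framework you use.

```python
def healthz():
    """Liveness: the process is up and able to answer."""
    return 200, "ok"


def readyz(checks):
    """Readiness: every dependency check must pass.

    checks: dict of name -> zero-argument callable returning True when
    that dependency is reachable (hypothetical shape).
    """
    failed = []
    for name, check in checks.items():
        try:
            if not check():
                failed.append(name)
        except Exception:
            # A check that raises counts as a failed dependency.
            failed.append(name)
    if failed:
        return 503, "not ready: " + ", ".join(sorted(failed))
    return 200, "ready"
```

Keep each check fast and bounded by a short timeout, so a slow dependency degrades readiness instead of hanging the health endpoint itself.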
Sticky sessions (session affinity)
Some apps store per-user state in the backend’s memory (websockets, server-side sessions, in-process caches). When subsequent requests need to land on the same backend, you have two options:
- Cookie-based affinity (L7) — LB sets a cookie identifying the backend.
- IP-hash (L4 or L7) — derived from client IP.
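Cookie-based affinity boils down to "honor the pin if that backend is still healthy, otherwise pick again and re-pin". A minimal sketch, assuming a hypothetical cookie name `lb_backend` and a plain backend name as the cookie value (real LBs typically use an opaque or signed value):

```python
import random

AFFINITY_COOKIE = "lb_backend"  # hypothetical cookie name


def pick_backend(cookies, healthy, choose=random.choice):
    """Return (backend, cookie_to_set).

    cookies: dict of request cookies
    healthy: list of currently eligible backends
    cookie_to_set is None when the existing pin is still valid.
    """
    pinned = cookies.get(AFFINITY_COOKIE)
    if pinned in healthy:
        return pinned, None
    # No pin, or the pinned backend is gone: choose again and re-pin.
    backend = choose(healthy)
    return backend, (AFFINITY_COOKIE, backend)
```

Note the failure mode: when the pinned backend disappears, whatever state lived in its memory is lost anyway, which is part of why externalizing state is the better fix.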
Better solution: don’t need sticky sessions at all. Store session state in Redis or the DB; any backend can serve any request. This is the stateless backend pattern, and it’s what makes horizontal scaling actually work.
Reach for sticky sessions only when you can’t redesign the state to be external. Websockets are the most legitimate case — though even there, modern setups use a pub/sub layer (Redis, NATS) so any backend can deliver to any connection.
Connection draining and graceful deploys
When you remove a backend (deploy, scale-in), don’t just yank it. Give it time to finish in-flight requests:
- LB stops sending new connections to backend X.
- Backend X finishes existing requests.
- After the drain timeout, LB closes any remaining connections.
- Process exits.
This is connection draining (or “deregistration delay” in AWS-speak). Set it to ~30s for HTTP APIs. For longer-running requests (uploads, long-poll), longer.
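The drain sequence above can be sketched as a small loop. The `DrainingBackend` class, the `drain` function, and the polling approach are assumptions for illustration; the `clock` and `sleep` parameters are injectable only so the logic can be tested without waiting.

```python
import time


class DrainingBackend:
    def __init__(self):
        self.accepting = True  # does the LB still route new connections here?
        self.in_flight = 0     # requests currently being served


def drain(backend, timeout_s=30.0, poll_s=0.1,
          clock=time.monotonic, sleep=time.sleep):
    """Drain a backend; return how many connections were force-closed."""
    # Step 1: stop sending new connections.
    backend.accepting = False
    # Step 2: wait for in-flight requests, up to the drain timeout.
    deadline = clock() + timeout_s
    while backend.in_flight > 0 and clock() < deadline:
        sleep(poll_s)
    # Step 3: whatever is left after the timeout gets closed.
    forced = backend.in_flight
    backend.in_flight = 0
    # Step 4: the caller can now let the process exit.
    return forced
```

A non-zero return value is a signal that the timeout is too short for the requests this service actually handles.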
In Kubernetes this is terminationGracePeriodSeconds on the pod plus a preStop hook that sleeps a few seconds before SIGTERM, giving the ingress controller time to update its backend list.
TLS termination: where do you decrypt?
Three patterns:
- At the LB — LB terminates TLS, talks plain HTTP to backends. Simplest. The “TLS to LB, HTTP inside” model is fine if your private network is trusted.
- At the backend — LB just passes encrypted traffic through (L4). Backends handle TLS. Useful when end-to-end encryption is a requirement (compliance) or you need mTLS.
- Re-encrypt — LB terminates client TLS, then opens a new TLS connection to the backend. End-to-end encryption with L7 features. Costs more CPU.
For most public APIs, terminating at the LB is the right answer. Your backend-side connections (LB → backend) sit on a private network you control.
Tools worth knowing
Open source
- Nginx — the workhorse. L4 and L7. Excellent for static content, reverse proxy, and as a TLS terminator.
- HAProxy — pure load balancer; very fast L4 and L7. The classic choice for serious scale before clouds existed.
- Envoy — modern, programmable, observable. The data plane behind Istio, AWS App Mesh, and many service meshes. Steeper learning curve.
- Traefik: designed for containers. Auto-discovers backends from Docker/Kubernetes labels. An easy on-ramp for K8s.
- Caddy — automatic HTTPS, simple config. Great for small/medium use cases.
Cloud-managed
- AWS ALB (Application LB) — L7, HTTPS, target groups, route rules. Default for most AWS HTTP services.
- AWS NLB (Network LB) — L4, very high throughput, static IPs. For non-HTTP or extreme scale.
- GCP HTTP(S) LB — global L7 with anycast.
- Cloudflare — also a CDN; L7 LB with DDoS protection at the edge.
Service meshes
- Istio, Linkerd, Consul Connect: full service meshes. Per-service sidecar proxies (Envoy for Istio and Consul, Linkerd’s own lightweight proxy) handle east-west traffic between microservices, with mTLS, retries, and circuit breaking.
- Useful when you have lots of internal services. Overkill when you don’t.
Cost considerations
A surprise factor in cloud LBs: traffic volume costs money. AWS ALB charges per LCU (Load Balancer Capacity Unit) which includes connections, bandwidth, and rule evaluations. At a million requests per day this is fine; at a billion, it adds up.
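To see why volume matters, here is a back-of-the-envelope sketch of LCU arithmetic. It assumes the model where an ALB is billed on whichever dimension is used most heavily. The per-LCU capacities below are recalled from AWS documentation and may be out of date, so treat them as placeholders, and supply the LCU-hour price yourself from the current pricing page.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Dimensions:
    new_conns_per_sec: float
    active_conns: float
    gb_per_hour: float
    rule_evals_per_sec: float


# ASSUMPTION: what one LCU covers. Placeholder values, verify before use.
PER_LCU = Dimensions(
    new_conns_per_sec=25,
    active_conns=3000,
    gb_per_hour=1.0,
    rule_evals_per_sec=1000,
)


def lcus(usage: Dimensions, per_lcu: Dimensions = PER_LCU) -> float:
    """LCUs consumed: the most heavily used dimension sets the bill."""
    return max(
        getattr(usage, f.name) / getattr(per_lcu, f.name)
        for f in fields(Dimensions)
    )


def daily_cost(usage: Dimensions, price_per_lcu_hour: float) -> float:
    """Daily LCU charge, assuming steady usage for 24 hours."""
    return lcus(usage) * price_per_lcu_hour * 24
```

A billion requests per day is roughly 11,600 requests per second, so whichever dimension your traffic stresses is multiplied by a large number every hour of the day.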
For very high-volume traffic, NLB (L4) is often cheaper than ALB (L7). For traffic that’s mostly cacheable, putting a CDN (Cloudflare, CloudFront) in front of the LB cuts cost dramatically.
Common mistakes
- No health checks. Backends die; the LB keeps sending them traffic.
- Health endpoint that’s too smart. It checks the DB → DB hiccup → all backends marked unhealthy → cascade outage.
- No connection draining. Deploys cause client errors.
- Sticky sessions you didn’t need. Just store state externally.
- Round-robin on backends with very different capacity. Use weighted RR or least-connections.
- Ignoring Connection: keep-alive. Long-lived HTTP/1.1 connections can stick to one backend for thousands of requests, defeating the LB. Set sane keep-alive limits or move to HTTP/2 (multiplexed) behind an LB that balances per request.
- Single LB. The LB itself is a SPOF. Use a redundant pair, or a managed LB (which handles HA for you).
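The keep-alive mistake is easy to see in a toy simulation. The sketch below compares balancing once per connection with balancing once per request; the function names and the traffic numbers are made up for illustration.

```python
from itertools import cycle


def per_connection(conn_requests, backends):
    """Each connection is pinned to one backend for its whole lifetime."""
    load = {b: 0 for b in backends}
    rr = cycle(backends)
    for requests in conn_requests:
        load[next(rr)] += requests
    return load


def per_request(conn_requests, backends):
    """Every request is balanced independently of its connection."""
    load = {b: 0 for b in backends}
    rr = cycle(backends)
    for requests in conn_requests:
        for _ in range(requests):
            load[next(rr)] += 1
    return load


# One chatty client with a long-lived connection, two quiet ones.
traffic = [5000, 10, 10]
pinned = per_connection(traffic, ["a", "b", "c"])
spread = per_request(traffic, ["a", "b", "c"])
```

With connection-level balancing, backend `a` ends up with all 5000 requests from the chatty client, while request-level balancing keeps the three backends within one request of each other.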
Conclusion
Load balancers are simple in concept and rich in detail. For most app developers, the working knowledge you need is: pick L7 for HTTP, set up health checks that mean something, configure connection draining, and store state outside backends so you don’t need stickiness. The rest you’ll learn by debugging the day a deploy goes sideways.
For more on the layers around load balancing, see Kubernetes for App Developers (Ingress is just a load balancer in a fancy hat) and Deploying Django to Production (Nginx as the simplest possible LB).
Happy balancing!