The classic gRPC scaling story: “we added 5 more pods and load didn’t shift.” The cause is almost always wrong load balancing. This post is the working playbook.

The HTTP/2 problem

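With HTTP/1.1, a connection carries one request at a time, so busy clients open many connections and short-lived ones come and go. An L4 LB balances per TCP connection, and with that many connections the load spreads out naturally.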

HTTP/2 (used by gRPC) holds one persistent connection. Many requests multiplex over it. An L4 LB pins that connection to a server; all your traffic ends up on whichever pod the client connected to first.

Adding pods doesn’t help. Existing connections stay where they were.
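To make the pinning effect concrete, here is a small simulation. It is a sketch, not real networking code: the function name simulate and the client, pod, and request counts are made up for illustration. The L4 balancer is modelled as choosing a pod once per connection, the L7 balancer as choosing a pod per request.

```go
package main

import "fmt"

// simulate sends requestsPerClient requests from each client to a fleet of
// pods and returns how many requests each pod received.
//
// perConnection=true models an L4 balancer: a pod is chosen once, when the
// client's single HTTP/2 connection is opened, and every request follows it.
// perConnection=false models per-request (L7) balancing.
func simulate(clients, pods, requestsPerClient int, perConnection bool) []int {
	load := make([]int, pods)
	next := 0 // round-robin cursor owned by the balancer
	for c := 0; c < clients; c++ {
		pinned := -1 // pod chosen for this client's connection, if any
		for r := 0; r < requestsPerClient; r++ {
			if perConnection {
				if pinned < 0 {
					pinned = next % pods
					next++
				}
				load[pinned]++
				continue
			}
			load[next%pods]++
			next++
		}
	}
	return load
}

func main() {
	// 3 long-lived clients, 8 pods, 1000 requests per client.
	fmt.Println("L4 (per connection):", simulate(3, 8, 1000, true))
	fmt.Println("L7 (per request):   ", simulate(3, 8, 1000, false))
}
```

With these inputs the L4 line shows three pods at 1000 requests and five at zero, while the L7 line shows 375 on every pod. Adding more pods to the L4 case only adds more zeros.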

Client-side load balancing

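conn, err := grpc.NewClient(
    // Kubernetes DNS names are <service>.<namespace>.svc.cluster.local; the "default" namespace is assumed here.
    "dns:///api.default.svc.cluster.local:50051",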
    grpc.WithDefaultServiceConfig(`{
        "loadBalancingConfig": [{"round_robin":{}}]
    }`),
    grpc.WithTransportCredentials(insecure.NewCredentials()),
)

The dns:/// resolver returns all backend IPs; the client opens one connection per backend and round-robins requests across them.

Pair this with a headless Service in Kubernetes:

apiVersion: v1
kind: Service
metadata: { name: api }
spec:
  clusterIP: None     # headless — DNS returns pod IPs directly
  selector: { app: api }
  ports: [{ port: 50051 }]

Now a dns:///api target resolves to every pod IP, and the client connects to each one.

Service mesh balancing

In a service mesh, the data plane sees gRPC at L7 and balances per RPC:

  • Linkerd: built-in.
  • Cilium Service Mesh: built-in.
  • Istio: yes, with Envoy as data plane.
  • AWS App Mesh: yes.

A mesh takes the load-balancing concern away from clients. Recommended for any non-trivial gRPC fleet.

Subset balancing

For very large fleets, round-robin across 1000 backends isn’t practical (each client maintains 1000 connections). Subset balancing:

  • Each client picks a stable subset (e.g., 20 of 1000 backends).
  • Round-robins within subset.
  • Periodically rotates subset to avoid hotspots.

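Some proxies and meshes offer subsetting, but what is available and how it is switched on differs by product and version, so check the documentation for your data plane rather than assuming it happens automatically. With raw gRPC, the usual approach is to trim the address list in a custom resolver or load-balancing policy before the client connects.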
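The subset-picking step can be sketched in a few lines. This is an illustration under assumptions, not what any particular mesh or gRPC itself does: pickSubset, the clientID string, and the round counter are hypothetical names, and the approach is a plain seeded shuffle.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// pickSubset returns a stable pseudo-random subset of backends for one client.
// The shuffle is seeded from clientID and round, so the same client gets the
// same subset for a given backend list, while different clients spread out.
// Bumping round rotates the subset.
func pickSubset(backends []string, clientID string, round uint64, size int) []string {
	if size > len(backends) {
		size = len(backends)
	}
	h := fnv.New64a()
	fmt.Fprintf(h, "%s/%d", clientID, round)
	rng := rand.New(rand.NewSource(int64(h.Sum64())))

	shuffled := append([]string(nil), backends...) // copy; leave the input untouched
	rng.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	return shuffled[:size]
}

func main() {
	// 1000 made-up backend addresses.
	backends := make([]string, 1000)
	for i := range backends {
		backends[i] = fmt.Sprintf("10.0.%d.%d:50051", i/250, i%250)
	}
	subset := pickSubset(backends, "client-42", 0, 20)
	fmt.Println(len(subset), "backends, first:", subset[0])
}
```

A purely random pick like this does not guarantee that every backend ends up with the same number of clients, and a change to the backend list reshuffles most subsets. If that matters, look at deterministic subsetting schemes instead.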

Custom resolvers

For non-DNS service discovery (Consul, etcd, custom):

import "google.golang.org/grpc/resolver"

// Register a custom resolver builder. consulResolver must implement
// resolver.Builder, and its Scheme() must return "consul".
resolver.Register(&consulResolver{})

conn, _ := grpc.NewClient("consul:///api", ...)

The resolver hands gRPC a list of addresses; gRPC handles the connection and balancing.

Outbound connection pooling

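A typical service calls many backends. By default gRPC keeps one connection per backend address, so a client talking to N backends holds N connections:

// gRPC uses one connection per backend address by default.
// The service config has no field for opening several connections to the
// same backend, and max concurrent streams is a server-side setting
// (grpc.MaxConcurrentStreams). To avoid HOL blocking, create several
// ClientConns to the same target and spread RPCs across them.
conn, err := grpc.NewClient(addr,
    grpc.WithDefaultServiceConfig(`{
        "loadBalancingConfig": [{"round_robin":{}}]
    }`),
    grpc.WithTransportCredentials(insecure.NewCredentials()),
)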

For most workloads, the default per-backend connection is fine. Multiple connections per backend matter only when head-of-line blocking on a single connection becomes the bottleneck.
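If you do hit that bottleneck, one option is a small client-side pool that rotates across several separately created connections to the same target. The sketch below shows only the rotation logic; pool, newPool, and get are hypothetical names, and plain strings stand in for connections so it runs without a server. In a real client the members would be *grpc.ClientConn values.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// pool hands out its members in round-robin order. In a real client T would
// be *grpc.ClientConn, with each member created separately for the same target.
type pool[T any] struct {
	members []T
	next    atomic.Uint64
}

// newPool builds a pool from at least one member.
func newPool[T any](members ...T) *pool[T] {
	return &pool[T]{members: members}
}

// get returns the next member; it is safe for concurrent use.
func (p *pool[T]) get() T {
	n := p.next.Add(1) - 1
	return p.members[n%uint64(len(p.members))]
}

func main() {
	// Strings stand in for connections so the sketch runs without a server.
	p := newPool("conn-0", "conn-1", "conn-2")
	for i := 0; i < 5; i++ {
		fmt.Println(p.get())
	}
}
```

Each RPC would call get() and issue the request on the returned connection. Keep the pool small: every extra connection costs memory and file descriptors on both sides.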

Health checks

gRPC has a standard health check protocol (grpc.health.v1). Both K8s and meshes use it:

import (
    "google.golang.org/grpc/health"
    healthv1 "google.golang.org/grpc/health/grpc_health_v1"
)

// s is the *grpc.Server that hosts your services.
healthsvc := health.NewServer()
healthv1.RegisterHealthServer(s, healthsvc)
healthsvc.SetServingStatus("orders.OrderService", healthv1.HealthCheckResponse_SERVING)

For probe configuration see Health Checks That Don’t Lie.

Common mistakes

1. ClusterIP service for gRPC

A ClusterIP service balances at L4, which pins each connection to one pod. Use a headless service or a mesh instead.

2. No retries on transient failures

An RPC can fail with UNAVAILABLE on connection blips. Add a retry policy to the service config:

"methodConfig": [{
  "name": [{"service": "orders.OrderService"}],
  "retryPolicy": {
    "maxAttempts": 3,
    "initialBackoff": "0.1s",
    "maxBackoff": "1s",
    "backoffMultiplier": 2,
    "retryableStatusCodes": ["UNAVAILABLE"]
  }
}]
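To see what those fields imply, the nominal backoff before each retry can be worked out by hand. The helper below is a sketch of that arithmetic, not gRPC's implementation, and nominalBackoffs is a hypothetical name. gRPC also applies random jitter to each delay, and the jitter rule has varied between versions, so treat the output as nominal values rather than exact waits.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// nominalBackoffs returns, for each retry, the un-jittered delay
// min(initial * multiplier^(n-1), maxBackoff) implied by the policy fields.
// maxAttempts counts the original call, so there are maxAttempts-1 retries.
func nominalBackoffs(maxAttempts int, initial, maxBackoff time.Duration, multiplier float64) []time.Duration {
	var out []time.Duration
	for n := 1; n < maxAttempts; n++ {
		d := time.Duration(float64(initial) * math.Pow(multiplier, float64(n-1)))
		if d > maxBackoff {
			d = maxBackoff
		}
		out = append(out, d)
	}
	return out
}

func main() {
	// Values from the retry policy above: 3 attempts, 0.1s initial, 1s max, x2.
	fmt.Println(nominalBackoffs(3, 100*time.Millisecond, time.Second, 2))
}
```

With maxAttempts set to 3 there are only two retries, at nominally 100ms and 200ms, so the 1s cap never comes into play. It only matters if you raise maxAttempts.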

3. Connection lifetime forever

A connection opened at app start stays open forever and never rebalances when backends scale. Set MaxConnectionAge on the server:

// Pass to grpc.NewServer via grpc.KeepaliveParams(...).
keepalive.ServerParameters{ MaxConnectionAge: 30 * time.Minute }

This periodically forces clients to reconnect and re-resolve, so load rebalances naturally.

4. Ignoring deadlines

Without deadlines, one slow backend ties up its callers, and their callers in turn. Always set one with context.WithTimeout.
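The pattern looks like this. The sketch uses only the standard library: slowBackend and callWithDeadline are hypothetical stand-ins for a real RPC, where you would pass the same ctx as the first argument of the generated client method.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// slowBackend stands in for an RPC: it finishes after latency unless the
// context is cancelled or its deadline passes first.
func slowBackend(ctx context.Context, latency time.Duration) error {
	select {
	case <-time.After(latency):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// callWithDeadline bounds a single call with a timeout.
func callWithDeadline(timeout, latency time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel() // release the timer even when the call returns early
	return slowBackend(ctx, latency)
}

func main() {
	fmt.Println(callWithDeadline(50*time.Millisecond, time.Millisecond))     // fast backend
	fmt.Println(callWithDeadline(50*time.Millisecond, 500*time.Millisecond)) // slow backend
}
```

The first call prints <nil>; the second gives up after 50ms with "context deadline exceeded" instead of waiting out the full 500ms. With a real gRPC call the client sees the DEADLINE_EXCEEDED status code, and the deadline travels with the request, so the server can stop working on it too.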

5. No subsetting at scale

10k clients × 1000 backends = 10M connections. Use subset balancing: with 20-backend subsets, that drops to 200k.

Read this next

If you want a Go gRPC client with proper LB + retries + deadlines, it’s at rajpoot.dev.

