Multi-cluster Kubernetes cheatsheet.

Why multi-cluster

  • Regional / latency.
  • Blast radius (per env, per team).
  • Compliance (data residency).
  • HA failover.
  • Heterogeneous workloads (GPU clusters separate).

Strategies

  1. One cluster per env: simple, well-isolated.
  2. One cluster per region: latency / data sovereignty.
  3. Per team / per product: large orgs.
  4. Single huge cluster with namespaces: only when the cost and operational savings outweigh the shared blast radius.

kubeconfig multi-cluster

# ~/.kube/config
clusters:
  - name: dev
    cluster: { server: https://dev.example.com, certificate-authority-data: ... }
  - name: prod
    cluster: { server: https://prod.example.com, certificate-authority-data: ... }

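contexts:
  - name: dev-admin
    context: { cluster: dev, user: dev-admin, namespace: default }
  - name: prod-admin
    context: { cluster: prod, user: prod-admin }

users:                      # separate credentials per cluster; a ServiceAccount token from one cluster will not authenticate to another
  - name: dev-admin
    user: { token: ... }
  - name: prod-admin
    user: { token: ... }

kubectl config get-contexts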
kubectl config use-context prod-admin

Or kubectx for fast switching.

ArgoCD ApplicationSet (cluster generator)

spec:
  generators:
    - clusters: {}            # all registered
  template:
    metadata: { name: 'web-{{name}}' }
    spec:
      destination:
        server: '{{server}}'
        namespace: prod

Creates one Application per registered cluster (with an empty selector this includes the local cluster ArgoCD runs in).
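
A fuller sketch of the same idea, restricted to clusters labelled env=prod. The repo path, the label, and the argocd namespace are assumptions for illustration, not values from a real setup:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: web
  namespace: argocd            # assumes ArgoCD is installed in the argocd namespace
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            env: prod          # hypothetical label set on the cluster Secret
  template:
    metadata:
      name: 'web-{{name}}'     # {{name}} and {{server}} are filled in by the cluster generator
    spec:
      project: default
      source:
        repoURL: https://github.com/me/manifests   # placeholder repo
        targetRevision: HEAD
        path: web                                   # hypothetical path inside the repo
      destination:
        server: '{{server}}'
        namespace: prod
```

Dropping the selector targets every registered cluster, as in the short form above.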

Register cluster with ArgoCD

argocd cluster add my-context-name

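This creates an argocd-manager ServiceAccount in the target cluster's kube-system namespace, binds it to a cluster-wide admin role, and stores its bearer token in ArgoCD.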
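
Clusters can also be registered declaratively, as a labelled Secret in the ArgoCD namespace. A minimal sketch; the Secret name, the env label, and the elided credentials are placeholders:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: prod-cluster                           # hypothetical name
  namespace: argocd                            # assumes ArgoCD runs in the argocd namespace
  labels:
    argocd.argoproj.io/secret-type: cluster    # marks this Secret as a cluster definition
    env: prod                                  # hypothetical label, usable in generator selectors
type: Opaque
stringData:
  name: prod
  server: https://prod.example.com
  config: |
    {
      "bearerToken": "...",
      "tlsClientConfig": { "insecure": false, "caData": "..." }
    }
```

This form keeps cluster registration in Git alongside everything else, at the cost of having to manage the token yourself.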

Cluster API (CAPI)

Declarative cluster lifecycle (provision/upgrade/destroy clusters via K8s API):

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata: { name: my-cluster }
spec:
  clusterNetwork: { pods: { cidrBlocks: [10.0.0.0/16] } }
  infrastructureRef: { name: my-cluster, kind: AWSCluster, apiVersion: ... }
  controlPlaneRef: { name: my-cluster, kind: KubeadmControlPlane, apiVersion: ... }

Providers: AWS, GCP, Azure, vSphere, Hetzner, etc.
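
Worker nodes are usually declared next to the Cluster as a MachineDeployment, and changing its version rolls out a worker upgrade (the control plane is upgraded through its own KubeadmControlPlane object). A sketch, assuming the AWS provider and kubeadm bootstrap; the names, replica count, and version are illustrative, and the apiVersion values are left elided as above:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata: { name: my-cluster-md-0 }      # hypothetical name
spec:
  clusterName: my-cluster
  replicas: 3
  selector: { matchLabels: {} }          # left empty; CAPI defaulting fills in the labels
  template:
    spec:
      clusterName: my-cluster
      version: v1.29.0                   # illustrative; bump to trigger a rolling upgrade
      bootstrap:
        configRef: { name: my-cluster-md-0, kind: KubeadmConfigTemplate, apiVersion: ... }
      infrastructureRef: { name: my-cluster-md-0, kind: AWSMachineTemplate, apiVersion: ... }
```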

Fleet (Rancher)

GitOps for many clusters:

apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata: { name: my-app }
spec:
  repo: https://github.com/me/manifests
  targets:
    - clusterGroup: prod
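
The prod group referenced above has to exist as a ClusterGroup. A minimal sketch, assuming clusters carry a hypothetical env=prod label and are registered in Fleet's fleet-default workspace:

```yaml
apiVersion: fleet.cattle.io/v1alpha1
kind: ClusterGroup
metadata:
  name: prod
  namespace: fleet-default     # assumption: default workspace for downstream clusters
spec:
  selector:
    matchLabels:
      env: prod                # hypothetical label on the Fleet Cluster objects
```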

Multi-cluster service mesh

Cilium Cluster Mesh, Istio multi-cluster, and Linkerd multi-cluster extend services across clusters with cross-cluster service discovery, mTLS, and traffic shaping.
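
With Cilium Cluster Mesh, for example, a Service becomes global when it exists under the same name and namespace in each cluster and carries the global annotation. A sketch; the service name, selector, and ports are illustrative:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web                    # must match in every cluster
  namespace: prod              # must match in every cluster
  annotations:
    service.cilium.io/global: "true"   # load-balance across backends in all meshed clusters
spec:
  selector: { app: web }
  ports:
    - port: 80
      targetPort: 8080
```

Older Cilium releases used a different annotation key, so check the version you run.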

Submariner

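Direct pod and service connectivity across clusters through tunnels between gateway nodes (IPsec or WireGuard), with no separately managed VPN. Pods in cluster A can reach services exported from cluster B.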
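
Submariner's service discovery follows the Kubernetes Multi-Cluster Services API: a Service is exported by creating a ServiceExport with the same name and namespace. A minimal sketch, assuming a Service named web in namespace prod already exists in cluster B:

```yaml
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: web          # same name as the Service being exported
  namespace: prod    # same namespace as the Service
```

Pods in other connected clusters can then resolve it as web.prod.svc.clusterset.local.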

Global load balancing

  • DNS-based: Route53 latency/geolocation, Cloudflare, NS1.
  • Anycast: one global IP announced from many locations (e.g. GCP global load balancer).
  • Application-level: CDN with origin failover.

Secrets propagation

  • ExternalSecrets pulling from central Vault.
  • Sealed Secrets with one sealing key shared by all clusters (avoid: a leak in one cluster exposes the secrets of every cluster).
  • KMS-based encryption.
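
The ExternalSecrets option can be sketched as below: the same manifest is deployed to every cluster and each one pulls the value from the central Vault. The store name, Vault path, and key names are hypothetical, and a ClusterSecretStore pointing at Vault is assumed to exist already:

```yaml
apiVersion: external-secrets.io/v1beta1   # newer operator releases also serve v1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: prod
spec:
  refreshInterval: 1h                     # how often to re-read Vault
  secretStoreRef:
    name: central-vault                   # hypothetical ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: db-credentials                  # Kubernetes Secret to create in this cluster
  data:
    - secretKey: password
      remoteRef:
        key: prod/db                      # hypothetical path in Vault
        property: password
```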

Image distribution

Use a regional registry mirror or a registry with replication (e.g. Harbor), so each cluster pulls images from a nearby copy.

Backup across clusters (Velero)

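# Install in every cluster. Value paths assume Velero chart 5.x+; the AWS plugin and credentials are omitted.
helm install velero vmware-tanzu/velero \
  --namespace velero --create-namespace \
  --set 'configuration.backupStorageLocation[0].provider=aws' \
  --set 'configuration.backupStorageLocation[0].bucket=my-backup-bucket'
# Source cluster:
velero backup create my-backup --include-namespaces=prod
# Target cluster (pointed at the same bucket):
velero restore create --from-backup=my-backup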

Restoring into a different cluster that reads the same bucket gives disaster recovery across clusters.
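
Recovery is only as good as the latest backup, so on-demand backups are usually paired with a Schedule. A sketch, assuming Velero runs in the velero namespace; the cron expression and retention are illustrative:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: prod-daily             # hypothetical name
  namespace: velero
spec:
  schedule: "0 2 * * *"        # every day at 02:00
  template:
    includedNamespaces:
      - prod
    ttl: 720h0m0s              # keep each backup for 30 days
```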

Crossplane

Manage cloud resources (DBs, queues, IAM) as K8s resources from any cluster:

apiVersion: dynamodb.aws.upbound.io/v1beta1
kind: Table
metadata: { name: my-table }
spec:
  forProvider:
    region: us-east-1
    hashKey: id
    attribute:
      - { name: id, type: S }
    readCapacity: 5
    writeCapacity: 5
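
The Table above is reconciled with whatever credentials the provider's ProviderConfig points at. A minimal sketch, assuming AWS credentials are stored in a Secret; the Secret name and key are hypothetical:

```yaml
apiVersion: aws.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: default                # used by resources that do not set providerConfigRef
spec:
  credentials:
    source: Secret
    secretRef:
      namespace: crossplane-system
      name: aws-secret         # hypothetical Secret holding the AWS credentials file
      key: creds
```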

Monitoring federation

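One Grafana, a Prometheus in each cluster, and a global query layer on top (Thanos / Cortex / Mimir):

# Thanos Sidecar runs next to each Prometheus and uploads TSDB blocks to S3
# Thanos Querier fans out to all sidecars (recent data) and to a Store Gateway (blocks in S3)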
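
For the Querier to tell clusters apart and deduplicate HA replicas, each Prometheus needs a unique set of external labels. A sketch of the relevant prometheus.yml fragment; the label values are illustrative:

```yaml
global:
  external_labels:
    cluster: prod-eu-west-1    # illustrative; must differ per cluster
    replica: prometheus-0      # illustrative; differs per HA replica within a cluster
```

The Querier can then be told to deduplicate on the replica label (its --query.replica-label flag).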

Active-active vs active-passive

  • Active-active: load split across clusters. Requires replicating the stateful tier (e.g. cross-region DB replicas).
  • Active-passive: one cluster serves, the other stands by for failover. Simpler.

Networking patterns

  • Service-mesh global mTLS.
  • VPC peering for cross-region private links.
  • Public LBs with health checks for failover.

Cluster boundaries

What stays per-cluster:

  • Stateful workloads (DBs, Kafka): usually keep each replica set within a single cluster.
  • Logs/metrics: collect locally, ship out.
  • Secrets: rotate via central service.

What spans clusters:

  • App deployments via GitOps.
  • Global DNS.
  • Identity (OIDC).
  • Image registry.

Common mistakes

  • Single huge cluster “for simplicity”: one failure or bad change hits everything.
  • Synchronizing across regions without a latency budget.
  • Stateful workloads spanning regions naively.
  • ArgoCD on one cluster managing many: a single point of failure.
  • Mismatched K8s versions causing API drift.

Read this next

If you want my multi-cluster blueprint (ArgoCD + Velero + Crossplane), it’s at rajpoot.dev.

