Multi-cluster Kubernetes cheatsheet.
Why multi-cluster
- Regional / latency.
- Blast radius (per env, per team).
- Compliance (data residency).
- HA failover.
- Heterogeneous workloads (GPU clusters separate).
Strategies
- One cluster per env: simple, well-isolated.
- One cluster per region: latency / data sovereignty.
- Per team / per product: large orgs.
- Single huge cluster with namespaces: only when namespace-level isolation is enough and the overhead of running more clusters outweighs the larger blast radius.
kubeconfig multi-cluster
# ~/.kube/config
clusters:
- name: dev
  cluster: { server: https://dev.example.com, certificate-authority-data: ... }
- name: prod
  cluster: { server: https://prod.example.com, certificate-authority-data: ... }
contexts:
- name: dev-admin
  context: { cluster: dev, user: admin, namespace: default }
- name: prod-admin
  context: { cluster: prod, user: admin }
users:
- name: admin
  user: { token: ... }
kubectl config get-contexts
kubectl config use-context prod-admin
Or kubectx for fast switching.
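To make the link between contexts, clusters, and users concrete, here is a minimal Python sketch of how a context name resolves to an API server and namespace. The `KUBECONFIG` dict and the `resolve_context` helper are hypothetical names used only for illustration; the dict mirrors the kubeconfig above with credentials left out. It is a sketch of the lookup, not kubectl's implementation.

```python
# Hypothetical in-memory copy of the kubeconfig above (credentials omitted).
KUBECONFIG = {
    "clusters": [
        {"name": "dev", "cluster": {"server": "https://dev.example.com"}},
        {"name": "prod", "cluster": {"server": "https://prod.example.com"}},
    ],
    "contexts": [
        {"name": "dev-admin",
         "context": {"cluster": "dev", "user": "admin", "namespace": "default"}},
        {"name": "prod-admin",
         "context": {"cluster": "prod", "user": "admin"}},
    ],
}


def resolve_context(config: dict, context_name: str) -> dict:
    """Return the server, user, and namespace that a context points at."""
    contexts = {c["name"]: c["context"] for c in config["contexts"]}
    clusters = {c["name"]: c["cluster"] for c in config["clusters"]}
    if context_name not in contexts:
        raise KeyError(f"unknown context: {context_name}")
    ctx = contexts[context_name]
    return {
        "server": clusters[ctx["cluster"]]["server"],
        "user": ctx["user"],
        # A context without a namespace falls back to "default".
        "namespace": ctx.get("namespace", "default"),
    }


if __name__ == "__main__":
    print(resolve_context(KUBECONFIG, "prod-admin"))
```

A context is only a named (cluster, user, namespace) triple, which is why one user entry can be reused across several contexts.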
ArgoCD ApplicationSet (cluster generator)
spec:
  generators:
  - clusters: {} # all registered
  template:
    metadata: { name: 'web-{{name}}' }
    spec:
      destination:
        server: '{{server}}'
        namespace: prod
Creates one Application per registered cluster, filling in {{name}} and {{server}} from each cluster's registration.
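The expansion can be pictured with a small Python sketch: one template, one rendered Application per cluster. `CLUSTERS`, `TEMPLATE`, `render`, and `generate_applications` are hypothetical names, and the cluster list is placeholder data. This only illustrates the substitution idea, not the ApplicationSet controller's actual code.

```python
import re

# Hypothetical registered clusters; names and URLs are placeholders.
CLUSTERS = [
    {"name": "dev", "server": "https://dev.example.com"},
    {"name": "prod", "server": "https://prod.example.com"},
]

# Same shape as the template in the ApplicationSet above.
TEMPLATE = {
    "metadata": {"name": "web-{{name}}"},
    "spec": {"destination": {"server": "{{server}}", "namespace": "prod"}},
}

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(node, params: dict):
    """Recursively substitute {{key}} placeholders in every string."""
    if isinstance(node, dict):
        return {key: render(value, params) for key, value in node.items()}
    if isinstance(node, list):
        return [render(value, params) for value in node]
    if isinstance(node, str):
        return PLACEHOLDER.sub(lambda m: str(params[m.group(1)]), node)
    return node


def generate_applications(template: dict, clusters: list) -> list:
    """Produce one rendered Application per cluster."""
    return [render(template, cluster) for cluster in clusters]


if __name__ == "__main__":
    for app in generate_applications(TEMPLATE, CLUSTERS):
        print(app["metadata"]["name"], "->", app["spec"]["destination"]["server"])
```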
Register cluster with ArgoCD
argocd cluster add my-context-name
This creates a ServiceAccount (argocd-manager) in the target cluster and stores its bearer token in ArgoCD, which uses it to deploy to that cluster.
Cluster API (CAPI)
Declarative cluster lifecycle (provision/upgrade/destroy clusters via K8s API):
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata: { name: my-cluster }
spec:
  clusterNetwork: { pods: { cidrBlocks: [10.0.0.0/16] } }
  infrastructureRef: { name: my-cluster, kind: AWSCluster, apiVersion: ... }
  controlPlaneRef: { name: my-cluster, kind: KubeadmControlPlane, apiVersion: ... }
Providers: AWS, GCP, Azure, vSphere, Hetzner, etc.
Fleet (Rancher)
GitOps for many clusters:
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata: { name: my-app }
spec:
  repo: https://github.com/me/manifests
  targets:
  - clusterGroup: prod
Multi-cluster service mesh
Cilium Cluster Mesh, Istio multi-cluster, and Linkerd multi-cluster extend services across clusters with cross-cluster service discovery (DNS), mTLS, and traffic shaping.
Submariner
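Cross-cluster L3 connectivity without running a separate VPN: Submariner sets up encrypted tunnels (IPsec by default) between gateway nodes. Pods in cluster A can reach services exported from cluster B.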
Global load balancing
- DNS-based: Route53 latency/geolocation, Cloudflare, NS1.
- Anycast: one anycast IP fronted by a global cloud load balancer (e.g. GCP global LB).
- Application-level: CDN with origin failover.
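Whichever layer does the routing, the core decision is similar: send the client to the closest endpoint that is still passing health checks. A minimal Python sketch of that rule follows. `Endpoint` and `pick_endpoint` are hypothetical names and the latency figures are placeholders; real DNS or anycast routing is performed by the provider, not by application code like this.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Endpoint:
    cluster: str        # hypothetical cluster name
    latency_ms: float   # measured latency from the client's region
    healthy: bool       # result of the most recent health check


def pick_endpoint(endpoints: List[Endpoint]) -> Optional[Endpoint]:
    """Return the healthy endpoint with the lowest latency, or None."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None  # nothing to route to; callers must handle the outage
    return min(healthy, key=lambda e: e.latency_ms)


if __name__ == "__main__":
    candidates = [
        Endpoint("eu-west", latency_ms=20.0, healthy=False),
        Endpoint("us-east", latency_ms=95.0, healthy=True),
        Endpoint("ap-south", latency_ms=180.0, healthy=True),
    ]
    chosen = pick_endpoint(candidates)
    print(chosen.cluster if chosen else "no healthy endpoint")
```

Filtering on health before comparing latency is what turns latency-based routing into failover: the nearest cluster is skipped as soon as its checks fail.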
Secrets propagation
- ExternalSecrets pulling from central Vault.
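- Sealed Secrets with one sealing key shared across clusters (avoid: a leak in one cluster exposes the secrets of all of them).
- KMS-based encryption: secrets stay encrypted in Git and are decrypted in each cluster with a KMS key.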
Image distribution
Use a regional registry mirror or a replicated registry (e.g. Harbor with replication) so each cluster pulls images from a nearby copy.
Backup across clusters (Velero)
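helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
# Value keys vary by chart version; check `helm show values vmware-tanzu/velero`.
# The AWS object-store plugin and credentials must also be configured.
helm install velero vmware-tanzu/velero --namespace velero --create-namespace \
  --set 'configuration.backupStorageLocation[0].provider=aws' \
  --set 'configuration.backupStorageLocation[0].bucket=my-backup-bucket'
velero backup create my-backup --include-namespaces=prod
velero restore create --from-backup=my-backup
For disaster recovery across clusters, install Velero in the recovery cluster against the same bucket and run the restore there.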
Crossplane
Manage cloud resources (DBs, queues, IAM) as K8s resources from any cluster:
apiVersion: dynamodb.aws.upbound.io/v1beta1
kind: Table
metadata: { name: my-table }
spec:
  forProvider:
    region: us-east-1
    hashKey: id
    attribute:
    - { name: id, type: S }
    readCapacity: 5
    writeCapacity: 5
Monitoring federation
Single Grafana, one or more Prometheus instances in each cluster, and a federated query layer on top (Thanos / Cortex / Mimir):
# Thanos sidecar to upload to S3
# Thanos Querier aggregates from all sidecars
Active-active vs active-passive
- Active-active: load is split across clusters. Requires replicating the stateful tier (e.g. cross-region DB replicas).
- Active-passive: one cluster serves traffic, the other stands by for failover. Simpler.
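The difference between the two modes comes down to how traffic is weighted across healthy clusters. The sketch below is a hypothetical Python helper (`traffic_weights` is not part of any tool named here) that returns the share of traffic each cluster would receive. It assumes an even split for active-active and a named primary for active-passive.

```python
def traffic_weights(mode: str, health: dict, primary: str = "") -> dict:
    """Return the share of traffic (0.0 to 1.0) each cluster should receive.

    mode    -- "active-active" or "active-passive"
    health  -- cluster name -> True if healthy (dict order = standby priority)
    primary -- preferred cluster in active-passive mode
    """
    healthy = [name for name, ok in health.items() if ok]
    weights = {name: 0.0 for name in health}
    if not healthy:
        return weights  # no cluster can serve traffic
    if mode == "active-active":
        # Split evenly across every healthy cluster.
        for name in healthy:
            weights[name] = 1.0 / len(healthy)
    elif mode == "active-passive":
        # Serve from the primary while it is healthy, else the first healthy standby.
        target = primary if primary in healthy else healthy[0]
        weights[target] = 1.0
    else:
        raise ValueError(f"unknown mode: {mode}")
    return weights


if __name__ == "__main__":
    status = {"us-east": False, "eu-west": True}
    print(traffic_weights("active-passive", status, primary="us-east"))
```

Note that the weights only cover request routing. In active-active the data tier still has to accept writes in more than one place, which is where most of the extra complexity comes from.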
Networking patterns
- Service-mesh global mTLS.
- VPC peering for cross-region private links.
- Public LBs with health checks for failover.
Cluster boundaries
What stays per-cluster:
- Stateful workloads (DBs, Kafka): usually keep each replica set inside a single cluster.
- Logs/metrics: collect locally, ship out.
- Secrets: rotate via central service.
What spans clusters:
- App deployments via GitOps.
- Global DNS.
- Identity (OIDC).
- Image registry.
Common mistakes
- Single huge cluster “for simplicity”: one outage or bad change hits every workload.
- Synchronous calls or replication across regions without a latency budget.
- Stateful workloads spanning regions naively.
- A single ArgoCD instance on one cluster managing all the others: a single point of failure for deployments.
- Mismatched K8s versions causing API drift.
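Version drift is easy to check for once the server version of each cluster is collected. The Python sketch below flags clusters that lag the newest minor version in the fleet. `minor_version`, `version_drift`, and the sample fleet are hypothetical; in practice the version strings would come from each cluster's API server, for example via `kubectl version` per context.

```python
import re


def minor_version(version: str) -> tuple:
    """Parse 'v1.29.4' (or '1.29.4') into a (major, minor) tuple."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version}")
    return int(match.group(1)), int(match.group(2))


def version_drift(fleet: dict, max_minor_skew: int = 0) -> list:
    """Return clusters lagging the newest minor version by more than max_minor_skew."""
    parsed = {name: minor_version(v) for name, v in fleet.items()}
    newest = max(parsed.values())
    return sorted(
        name
        for name, (major, minor) in parsed.items()
        if major != newest[0] or newest[1] - minor > max_minor_skew
    )


if __name__ == "__main__":
    # Placeholder data, not real cluster versions.
    fleet = {"dev": "v1.30.1", "staging": "v1.29.6", "prod": "v1.27.9"}
    print(version_drift(fleet, max_minor_skew=1))
```

Running a check like this in CI before rolling out manifests catches the case where a resource's API version exists on the newest cluster but not on an older one.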