Kubernetes Cheatsheet 20 — Production Setup

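A reference Kubernetes production setup for 2026: cluster layout, day-1 add-ons, a hardened workload template, backups, alerting, and a health checklist.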

Cluster

Managed: EKS / GKE / AKS / DigitalOcean. Self-hosted: kubeadm, Talos, k3s.

Sizes:

3 control plane nodes (HA) when self-hosting; managed services run the control plane for you.
3+ worker nodes across AZs.
Karpenter for elastic worker scaling.

Day-1 install order

1. CNI                         (Calico / Cilium)
2. cert-manager                (TLS)
3. ingress-nginx               (ingress)
4. external-dns                (DNS automation)
5. external-secrets-operator   (secrets from cloud)
6. kube-prometheus-stack       (metrics + alerts)
7. loki + promtail             (logs)
8. tempo + otel collector      (traces)
9. ArgoCD                      (GitOps)
10. Karpenter / cluster-autoscaler  (autoscaling)
11. velero                     (backup)
12. kyverno / OPA              (policies)
13. metrics-server             (HPA dependency; install before the first HPA)
14. CSI drivers                (cloud disks; install before step 6 if Prometheus or Loki use persistent volumes)

ArgoCD App-of-Apps

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata: { name: root, namespace: argocd }
spec:
  project: default
  destination: { server: https://kubernetes.default.svc, namespace: argocd }
  source:
    repoURL: https://github.com/me/cluster
    path: bootstrap/
    targetRevision: main
  syncPolicy: { automated: { prune: true, selfHeal: true } }

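bootstrap/ contains one Application manifest per infrastructure component. The root Application syncs that directory, and each child Application then syncs its own component.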
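
A child Application inside bootstrap/ might look like the sketch below. The file name, chart version, and wave number are assumptions for illustration, not part of the original setup. The sync-wave annotation is one way to approximate the day-1 install order.

```yaml
# bootstrap/cert-manager.yaml (hypothetical file name)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cert-manager
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "2"    # lower waves are applied first
spec:
  project: default
  destination: { server: https://kubernetes.default.svc, namespace: cert-manager }
  source:
    repoURL: https://charts.jetstack.io
    chart: cert-manager
    targetRevision: v1.16.1              # placeholder; pin the version you have tested
    helm:
      parameters:
        - { name: crds.enabled, value: "true" }   # let the chart manage its CRDs
  syncPolicy:
    automated: { prune: true, selfHeal: true }
    syncOptions: [CreateNamespace=true]
```

Waves control apply order. In recent Argo CD versions the root app only waits for a child to become Healthy if the Application health check is enabled in argocd-cm, so treat the ordering as best-effort unless that is configured.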

Per-namespace defaults

# psa-labels per namespace
labels:
  pod-security.kubernetes.io/enforce: restricted
  istio-injection: enabled               # if using Istio; under restricted PSA use the Istio CNI plugin, since the default init container needs NET_ADMIN

# default network policy: deny-all
# default resource quota + limit range
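
The three defaults above can be sketched as plain manifests. The quota and limit values are assumptions chosen only to make the example concrete; the namespace name prod is borrowed from the workload template.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: prod
  labels:
    pod-security.kubernetes.io/enforce: restricted
---
# Deny all ingress and egress for every pod; per-app policies add the allows.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: default-deny, namespace: prod }
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
---
apiVersion: v1
kind: ResourceQuota
metadata: { name: default, namespace: prod }
spec:
  hard:                        # assumed ceilings; size them to the team
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
---
# Defaults applied to containers that omit requests or limits.
apiVersion: v1
kind: LimitRange
metadata: { name: default, namespace: prod }
spec:
  limits:
    - type: Container
      defaultRequest: { cpu: 100m, memory: 128Mi }
      default: { cpu: 500m, memory: 256Mi }
```

Because the default policy also denies egress, every workload needs an explicit DNS allow before it can resolve anything.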

Workload template

apiVersion: apps/v1
kind: Deployment
metadata: { name: web, namespace: prod, labels: { app: web } }
spec:
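  replicas: 3          # drop this field once the HPA owns scaling; ArgoCD selfHeal would otherwise reset the replica count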
  selector: { matchLabels: { app: web } }
  template:
    metadata: { labels: { app: web } }
    spec:
      serviceAccountName: web
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        seccompProfile: { type: RuntimeDefault }
      containers:
        - name: web
          image: ghcr.io/me/web:v1.2.3
          ports: [{ containerPort: 8000, name: http }]
          env:
            - name: DATABASE_URL
              valueFrom: { secretKeyRef: { name: web-secrets, key: DATABASE_URL } }
          envFrom:
            - configMapRef: { name: web-config }
          resources:
            requests: { cpu: 100m, memory: 128Mi }
            limits: { cpu: 1, memory: 512Mi }
          readinessProbe:
            httpGet: { path: /health/ready, port: http }
            periodSeconds: 5
          livenessProbe:
            httpGet: { path: /health, port: http }
            periodSeconds: 30
          startupProbe:
            httpGet: { path: /health, port: http }
            failureThreshold: 30
            periodSeconds: 5
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities: { drop: [ALL] }
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: topology.kubernetes.io/zone
                labelSelector: { matchLabels: { app: web } }
---
apiVersion: v1
kind: Service
metadata: { name: web, namespace: prod }
spec:
  selector: { app: web }
  ports: [{ port: 80, targetPort: http, name: http }]
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  namespace: prod
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls: [{ hosts: [app.example.com], secretName: web-tls }]
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend: { service: { name: web, port: { number: 80 } } }
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: web, namespace: prod }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: web }
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: web, namespace: prod }
spec:
  minAvailable: 2
  selector: { matchLabels: { app: web } }
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: web, namespace: prod }
spec:
  podSelector: { matchLabels: { app: web } }
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - namespaceSelector: { matchLabels: { kubernetes.io/metadata.name: ingress-nginx } }
      ports: [{ port: 8000 }]
  egress:
    - to:
        - podSelector: { matchLabels: { app: db } }
      ports: [{ port: 5432 }]
    - ports: [{ port: 53, protocol: UDP }, { port: 53, protocol: TCP }]   # DNS falls back to TCP for large responses
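
The template above references three objects it does not define: the web ServiceAccount, the web-secrets Secret, and the letsencrypt-prod ClusterIssuer. A minimal sketch of each follows. The secret store name, remote key, and contact email are hypothetical, and the ExternalSecret apiVersion depends on the operator release installed.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata: { name: web, namespace: prod }
automountServiceAccountToken: false   # assumes the app never calls the Kubernetes API
---
# Older external-secrets releases serve this kind as external-secrets.io/v1beta1.
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata: { name: web-secrets, namespace: prod }
spec:
  refreshInterval: 1h
  secretStoreRef: { kind: ClusterSecretStore, name: cloud-secrets }   # hypothetical store
  target: { name: web-secrets }
  data:
    - secretKey: DATABASE_URL
      remoteRef: { key: prod/web/database-url }                       # hypothetical remote key
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata: { name: letsencrypt-prod }
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com                                            # placeholder contact
    privateKeySecretRef: { name: letsencrypt-prod-account-key }
    solvers:
      - http01:
          ingress: { ingressClassName: nginx }
```

Since the container runs with readOnlyRootFilesystem: true, an app that writes temporary files would also need an emptyDir mounted at its scratch path.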

Backups

# Velero schedule
apiVersion: velero.io/v1
kind: Schedule
metadata: { name: nightly, namespace: velero }
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces: ["prod", "team-*"]   # glob support depends on the Velero version; list namespaces explicitly if unsure
    ttl: 720h0m0s        # 30 days
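
A backup only counts once a restore has been exercised. One way to run a restore drill, assuming the nightly schedule above, is a Restore object that maps prod into a scratch namespace. The restore name and the target namespace are hypothetical.

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata: { name: restore-drill, namespace: velero }   # hypothetical name
spec:
  scheduleName: nightly            # restore from the most recent successful backup of this schedule
  includedNamespaces: ["prod"]
  namespaceMapping:
    prod: prod-restore-drill       # land in a scratch namespace, not on top of prod
```

Objects are restored as they were backed up, so check what the scratch namespace exposes (Ingress hosts, for example) before leaving it running.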

Monitoring (PrometheusRule)

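apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: web
  namespace: prod
  labels: { release: kube-prometheus-stack }   # assumed Helm release name; must match the Prometheus ruleSelector
spec:
  groups:
    - name: web
      rules:
        # Page when more than 5% of requests return 5xx for 5 minutes.
        - alert: WebHighErrorRate
          expr: |
            sum(rate(http_requests_total{namespace="prod", app="web", status=~"5.."}[5m]))
              / sum(rate(http_requests_total{namespace="prod", app="web"}[5m])) > 0.05
          for: 5m
          labels: { severity: page }

        # Warn when p95 latency stays above 1s for 10 minutes.
        - alert: WebHighLatency
          expr: |
            histogram_quantile(0.95,
              sum by (le) (rate(http_request_duration_seconds_bucket{namespace="prod", app="web"}[5m]))) > 1
          for: 10m
          labels: { severity: warning }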

Health checklist

All workloads have resource requests + limits.
All have liveness + readiness + startup probes.
PDB for every multi-replica Deployment.
HPA where load varies.
NetworkPolicy default-deny + per-app allow.
PSA restricted on workload namespaces.
Backups configured + restore tested.
Monitoring + alerting wired up.
ArgoCD with App-of-Apps.
Image signing + an admission policy (Kyverno / OPA) that verifies signatures.
Secrets via external-secrets-operator.
Image scanning in CI.
Logs shipping to Loki / similar.
