K8s production setup for 2026.
Cluster
Managed: EKS / GKE / AKS / DigitalOcean. Self-hosted: kubeadm, Talos, k3s.
Sizes:
- 3 control plane nodes (HA) when self-hosting; managed services run the control plane for you.
- 3+ worker nodes across AZs.
- Karpenter for elastic worker scaling.
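As a rough sketch of what elastic worker scaling across AZs could look like, here is a Karpenter NodePool for EKS. It assumes Karpenter 1.x on AWS (the karpenter.sh/v1 API) and an existing EC2NodeClass named default; the zone names, CPU limit, and expiry are placeholder values, not recommendations.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default              # assumed to exist already
      requirements:
        # Spread capacity over three zones (hypothetical zone names).
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      expireAfter: 720h            # recycle nodes after 30 days
  limits:
    cpu: "100"                     # upper bound on total CPU this pool may provision
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```

On other clouds the node class kind and group differ, so treat the nodeClassRef as provider-specific.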
Day-1 install order
1. CNI (Calico / Cilium)
2. cert-manager (TLS)
3. ingress-nginx (ingress)
4. external-dns (DNS automation)
5. external-secrets-operator (secrets from cloud)
6. kube-prometheus-stack (metrics + alerts)
7. loki + promtail (logs; Promtail is deprecated, Grafana Alloy is its successor)
8. tempo + otel collector (traces)
9. ArgoCD (GitOps)
10. Karpenter / cluster-autoscaler (autoscaling)
11. velero (backup)
12. kyverno / OPA (policies)
13. metrics-server (HPA dependency)
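14. CSI drivers (cloud disks; move this up if Prometheus or Loki need persistent volumes and the cluster has no default StorageClass yet)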
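Once cert-manager and ingress-nginx are running, the Ingress in the workload template expects a ClusterIssuer named letsencrypt-prod. A minimal sketch of one, assuming HTTP-01 validation through the nginx ingress class; the email address and the account-key Secret name are placeholders:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com                 # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key   # hypothetical Secret name for the ACME account key
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx
```

It is usually worth testing against the Let's Encrypt staging endpoint first to avoid production rate limits.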
ArgoCD App-of-Apps
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata: { name: root, namespace: argocd }
spec:
  project: default  # required field; "default" is Argo CD's built-in project
  destination: { server: https://kubernetes.default.svc, namespace: argocd }
  source:
    repoURL: https://github.com/me/cluster
    path: bootstrap/
    targetRevision: main
  syncPolicy: { automated: { prune: true, selfHeal: true } }
bootstrap/ holds one Application manifest per infrastructure component, so syncing root installs everything else.
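To make the pattern concrete, a child Application inside bootstrap/ might look like the sketch below. The file name, the infra/cert-manager/ directory, and the choice of cert-manager as the example are all assumptions; only the repo URL comes from the root Application.

```yaml
# bootstrap/cert-manager.yaml (hypothetical file name)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cert-manager
  namespace: argocd
  finalizers:
    # Delete the managed resources when this Application is deleted.
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  destination: { server: https://kubernetes.default.svc, namespace: cert-manager }
  source:
    repoURL: https://github.com/me/cluster
    path: infra/cert-manager/      # hypothetical directory holding the manifests
    targetRevision: main
  syncPolicy:
    automated: { prune: true, selfHeal: true }
    syncOptions:
      - CreateNamespace=true
```

Adding a component then becomes a single commit that drops another file like this into bootstrap/.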
Per-namespace defaults
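# Pod Security Admission (PSA) labels, set on each Namespace
labels:
  pod-security.kubernetes.io/enforce: restricted
  istio-injection: enabled # only if using Istio; restricted PSA rejects the default istio-init container, so enable the Istio CNI plugin
# default network policy: deny-all
# default resource quota + limit range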
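The deny-all policy, quota, and limit range mentioned above could be sketched as follows for the prod namespace. All object names and numbers are illustrative assumptions; size them to the namespace.

```yaml
# Deny all ingress and egress for every pod until a per-app policy allows it.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: default-deny, namespace: prod }
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
---
# Cap the total resources the namespace can claim (placeholder values).
apiVersion: v1
kind: ResourceQuota
metadata: { name: default-quota, namespace: prod }
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
---
# Fill in requests and limits for containers that do not set their own.
apiVersion: v1
kind: LimitRange
metadata: { name: default-limits, namespace: prod }
spec:
  limits:
    - type: Container
      defaultRequest: { cpu: 100m, memory: 128Mi }
      default: { cpu: 500m, memory: 256Mi }
```

With a quota on requests and limits in place, pods that omit them are rejected unless the LimitRange supplies defaults, which is why the two are usually deployed together.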
Workload template
apiVersion: apps/v1
kind: Deployment
metadata: { name: web, namespace: prod, labels: { app: web } }
spec:
  replicas: 3  # omit if the HPA owns the replica count; otherwise Argo CD selfHeal keeps resetting it
  selector: { matchLabels: { app: web } }
  template:
    metadata: { labels: { app: web } }
    spec:
      serviceAccountName: web
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        seccompProfile: { type: RuntimeDefault }
      containers:
        - name: web
          image: ghcr.io/me/web:v1.2.3
          ports: [{ containerPort: 8000, name: http }]
          env:
            - name: DATABASE_URL
              valueFrom: { secretKeyRef: { name: web-secrets, key: DATABASE_URL } }
          envFrom:
            - configMapRef: { name: web-config }
          resources:
            requests: { cpu: 100m, memory: 128Mi }
            limits: { cpu: 1, memory: 512Mi }
          readinessProbe:
            httpGet: { path: /health/ready, port: http }
            periodSeconds: 5
          livenessProbe:
            httpGet: { path: /health, port: http }
            periodSeconds: 30
          startupProbe:
            httpGet: { path: /health, port: http }
            failureThreshold: 30
            periodSeconds: 5
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities: { drop: [ALL] }
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: topology.kubernetes.io/zone
                labelSelector: { matchLabels: { app: web } }
---
apiVersion: v1
kind: Service
metadata: { name: web, namespace: prod }
spec:
  selector: { app: web }
  ports: [{ port: 80, targetPort: http, name: http }]
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  namespace: prod
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls: [{ hosts: [app.example.com], secretName: web-tls }]
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend: { service: { name: web, port: { number: 80 } } }
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: web, namespace: prod }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: web }
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: web, namespace: prod }
spec:
  minAvailable: 2
  selector: { matchLabels: { app: web } }
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: web, namespace: prod }
spec:
  podSelector: { matchLabels: { app: web } }
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - namespaceSelector: { matchLabels: { kubernetes.io/metadata.name: ingress-nginx } }
      ports: [{ port: 8000 }]
  egress:
    - to:
        - podSelector: { matchLabels: { app: db } }
      ports: [{ port: 5432 }]
    # DNS to any destination; TCP is included because resolvers fall back to it for large responses
    - ports: [{ port: 53, protocol: UDP }, { port: 53, protocol: TCP }]
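The Deployment above references a ServiceAccount named web and a Secret named web-secrets, neither of which is defined in the template. One way to supply them, sketched under the assumption that external-secrets-operator is installed and a ClusterSecretStore exists; the store name cloud-secrets and the remote key path are hypothetical, and older operator releases serve this resource as external-secrets.io/v1beta1 instead of v1:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata: { name: web, namespace: prod }
# The app is assumed not to call the Kubernetes API, so no token is mounted.
automountServiceAccountToken: false
---
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata: { name: web-secrets, namespace: prod }
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: cloud-secrets              # hypothetical store name
  target:
    name: web-secrets                # Secret the Deployment reads DATABASE_URL from
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: prod/web/database-url   # hypothetical path in the cloud secret manager
```

The operator creates and refreshes the web-secrets Secret, so the value itself never has to live in Git.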
Backups
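# Velero schedule
apiVersion: velero.io/v1
kind: Schedule
metadata: { name: nightly, namespace: velero }
spec:
  schedule: "0 2 * * *"
  template:
    # Check that your Velero release supports glob patterns such as team-*;
    # older releases accept only exact namespace names or "*".
    includedNamespaces: ["prod", "team-*"]
    ttl: 720h0m0s # 30 days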
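A backup only counts once a restore has been exercised. As a minimal sketch, a Restore that pulls the most recent successful backup from the nightly schedule into a scratch namespace could look like this; the Restore name and the prod-restore-test target namespace are assumptions:

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata: { name: restore-test, namespace: velero }
spec:
  scheduleName: nightly            # use the latest successful backup from this schedule
  includedNamespaces: ["prod"]
  namespaceMapping:
    prod: prod-restore-test        # restore into a scratch namespace, not over prod
```

Cluster-scoped objects and external DNS or ingress hostnames may still collide with the live copies, so review what the restored namespace exposes before leaving it running.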
Monitoring (PrometheusRule)
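apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: web
  namespace: prod
  labels: { release: kube-prometheus-stack }  # assumed Helm release name; must match the Prometheus ruleSelector
spec:
  groups:
    - name: web
      rules:
        - alert: WebHighErrorRate
          expr: |
            sum(rate(http_requests_total{namespace="prod", app="web", status=~"5.."}[5m]))
              / sum(rate(http_requests_total{namespace="prod", app="web"}[5m])) > 0.05
          for: 5m
          labels: { severity: page }
        - alert: WebHighLatency
          # sum by (le) aggregates buckets across pods before taking the quantile
          expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{namespace="prod", app="web"}[5m]))) > 1
          for: 10m
          labels: { severity: warning }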
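These alerts only fire if Prometheus actually scrapes the web pods. With kube-prometheus-stack that is typically done through a ServiceMonitor, sketched below. It assumes the app serves metrics at /metrics on the http port, that the web Service carries an app: web label (the Service in the workload template has none, so it would need one), and that the Helm release is named kube-prometheus-stack:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: web
  namespace: prod
  labels: { release: kube-prometheus-stack }  # assumed release name used by the operator's selector
spec:
  selector:
    matchLabels: { app: web }     # matches labels on the Service, not on the pods
  endpoints:
    - port: http                  # named port on the Service
      path: /metrics              # assumed metrics path
      interval: 30s
```

The web NetworkPolicy admits traffic only from the ingress-nginx namespace, so it would also need an ingress rule for the namespace Prometheus runs in.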
Health checklist
- All workloads have resource requests + limits.
- All have liveness + readiness + startup probes.
- PDB for every multi-replica Deployment.
- HPA where load varies.
- NetworkPolicy default-deny + per-app allow.
- PSA restricted on workload namespaces.
- Backups configured + restore tested.
- Monitoring + alerting wired up.
- ArgoCD with App-of-Apps.
- Image signing + an admission policy that verifies signatures.
- Secrets via external-secrets-operator.
- Image scanning in CI.
- Logs shipping to Loki / similar.
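Some of these items can be enforced by the policy engine instead of checked by hand. As one example, the first item could be covered by a Kyverno ClusterPolicy along these lines. This is a sketch modeled on the common require-requests-limits pattern; it assumes Kyverno 1.13 or later, where failureAction sits on the rule (older releases use spec.validationFailureAction), and it starts in Audit mode so nothing is blocked yet:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests-limits
spec:
  background: true                 # also report on pods that already exist
  rules:
    - name: validate-resources
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        failureAction: Audit       # switch to Enforce once reports are clean
        message: "CPU and memory requests and a memory limit are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"
                    memory: "?*"
                  limits:
                    memory: "?*"
```

Kyverno applies Pod rules to Deployments and other controllers through auto-generated rules, so violations show up on the owning workload as well.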
Read this next
If you want my full prod K8s blueprint, it’s at rajpoot.dev.