Kubernetes Cheatsheet 17 — Stateful Workloads

A quick reference for running stateful workloads (databases, queues, search) on Kubernetes.

Should you run a database in K8s?

Pros:

One deployment pattern for apps and data stores.
Operators automate complex ops (failover, backups, upgrades).
GitOps covers the DB too.

Cons:

More moving parts.
Storage perf depends on CSI driver.
Backups + failover require care.

Rule of thumb: small/mid scale and a good operator → yes. Massive scale or compliance constraints → consider managed (RDS, Cloud SQL).

Postgres operators

CloudNativePG: simple, well-maintained.
Zalando postgres-operator: mature.
Crunchy Postgres Operator (PGO): full-featured.

CloudNativePG example

kubectl apply --server-side -f https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.23/releases/cnpg-1.23.0.yaml

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata: { name: pg }
spec:
  instances: 3
  primaryUpdateStrategy: unsupervised
  storage:
    size: 20Gi
    storageClass: gp3
  backup:
    barmanObjectStore:
      destinationPath: s3://my-bucket/pg
      s3Credentials: { ... }
    retentionPolicy: "30d"
  monitoring: { enablePodMonitor: true }

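This one resource gives you streaming replication, automated failover, object-store backups, and monitoring (the PodMonitor needs the Prometheus Operator installed). The s3Credentials value is elided here and must point at your own Secret.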
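The backup stanza above only says where backups go. In CloudNativePG, recurring backups are requested with a separate ScheduledBackup resource. A minimal sketch, assuming the cluster is named pg as above; the resource name pg-daily and the 02:00 schedule are illustrative choices, not from the original:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: pg-daily            # hypothetical name
spec:
  # CloudNativePG uses a six-field cron expression (seconds come first).
  schedule: "0 0 2 * * *"   # every day at 02:00:00
  backupOwnerReference: self
  cluster:
    name: pg                # must match the Cluster's metadata.name
```

Restore from one of these backups into a scratch cluster from time to time; a backup that has never been restored is only a guess.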

Redis (Bitnami chart)

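helm repo add bitnami https://charts.bitnami.com/bitnami
# With Sentinel enabled the chart runs all nodes from the replica.* settings,
# so persistence is sized there rather than under master.*.
helm install redis bitnami/redis \
  --set auth.password=x \
  --set replica.replicaCount=3 \
  --set replica.persistence.size=8Gi \
  --set sentinel.enabled=true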

Or use redis-operator for advanced setups.

Kafka

Strimzi: CNCF project, mature (not an official Apache Kafka component).

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata: { name: my-cluster }
spec:
  kafka:
    version: 3.7.0
    replicas: 3
    listeners:
      - { name: plain, port: 9092, type: internal, tls: false }
      - { name: tls, port: 9093, type: internal, tls: true }
    storage:
      type: persistent-claim
      size: 100Gi
  zookeeper:
    replicas: 3
    storage: { type: persistent-claim, size: 10Gi }

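Newer Kafka versions run in KRaft mode with no ZooKeeper, and Kafka 4.0 removes ZooKeeper support entirely. The zookeeper block above only applies to ZooKeeper-based clusters.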
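For comparison, a KRaft-mode cluster in Strimzi is described with a KafkaNodePool plus a Kafka resource that has no zookeeper block. This is a sketch under assumptions: it targets a Strimzi release with KRaft and node pool support, the pool name dual-role is made up, and the annotations and required fields have shifted between releases (older ones may still want placeholder replicas/storage/zookeeper fields, newer ones may not need the annotations), so check the docs for your version.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: dual-role                     # hypothetical pool name
  labels:
    strimzi.io/cluster: my-cluster    # binds the pool to the Kafka resource below
spec:
  replicas: 3
  roles: [controller, broker]         # combined mode; split the roles for larger clusters
  storage:
    type: jbod
    volumes:
      - id: 0
        type: persistent-claim
        size: 100Gi
        deleteClaim: false            # keep the PVCs if the pool is deleted
---
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
  annotations:
    strimzi.io/node-pools: enabled
    strimzi.io/kraft: enabled
spec:
  kafka:
    version: 3.7.0
    listeners:
      - { name: plain, port: 9092, type: internal, tls: false }
      - { name: tls, port: 9093, type: internal, tls: true }
```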

Elasticsearch / OpenSearch

ECK (Elastic Cloud on Kubernetes) is Elastic's operator for Elasticsearch; OpenSearch has its own Kubernetes operator. Both are resource-hungry; consider managed if budget allows.
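With the ECK operator installed, a cluster is declared as an Elasticsearch resource. A minimal sketch; the name es, the version, the node count, and the volume size are placeholders to adjust, not recommendations from the original:

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: es                        # hypothetical name
spec:
  version: 8.13.0                 # assumed version; use one your ECK release supports
  nodeSets:
    - name: default
      count: 3
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data   # ECK expects this claim name for the data volume
          spec:
            accessModes: [ReadWriteOnce]
            resources:
              requests:
                storage: 50Gi
```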

MongoDB

MongoDB Community Operator: free.
Percona Operator for MongoDB: open source, with built-in backups.

RabbitMQ

helm install rabbit bitnami/rabbitmq --set auth.password=x

Or rabbitmq-cluster-operator.
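With the RabbitMQ cluster operator installed, a broker cluster is a single RabbitmqCluster resource. A minimal sketch; the replica count and storage size are illustrative assumptions:

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbit
spec:
  replicas: 3           # odd numbers suit quorum queues
  persistence:
    storage: 10Gi       # one PVC of this size per node
```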

Volumes for state

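RWO (ReadWriteOnce) is fine for primary+replica DBs, because each replica gets its own PVC.
Use SSD-class block storage from your cloud provider (for example gp3 on AWS).
Set volumeBindingMode: WaitForFirstConsumer so the volume is provisioned in the zone where the pod is scheduled.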
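These three points usually come together in the StorageClass. A sketch assuming AWS with the EBS CSI driver; on other clouds the provisioner and parameters differ, and the Retain and expansion settings are suggestions rather than something the original prescribes:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com              # assumes the AWS EBS CSI driver is installed
parameters:
  type: gp3                               # SSD-backed EBS volume type
volumeBindingMode: WaitForFirstConsumer   # provision only once the pod has a node (and zone)
allowVolumeExpansion: true                # lets you grow PVCs later
reclaimPolicy: Retain                     # keep the disk if the PVC is deleted by mistake
```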

Backups

Most database operators have built-in backups (CloudNativePG → S3 via Barman). For others:

Velero: cluster-wide backup including PVs (via volume snapshots, or file-system backup with Kopia or restic).
App-level: pg_dump or mongodump to S3 via a CronJob.
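The app-level option can be sketched as a CronJob that streams a dump straight to object storage. Everything specific here is an assumption: the image (one that ships bash, pg_dump, and the AWS CLI), the Secret pg-dump-env (expected to provide DATABASE_URL and AWS credentials), and the bucket path.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pg-dump
spec:
  schedule: "0 2 * * *"          # daily at 02:00 (standard five-field cron)
  concurrencyPolicy: Forbid      # never run two dumps at once
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: dump
              image: registry.example.com/pg-tools:16   # hypothetical image
              command: ["/bin/bash", "-c"]
              args:
                - |
                  # pipefail makes the job fail if pg_dump fails mid-stream
                  set -euo pipefail
                  pg_dump --format=custom "$DATABASE_URL" \
                    | aws s3 cp - "s3://my-bucket/dumps/$(date +%F).dump"
              envFrom:
                - secretRef: { name: pg-dump-env }      # hypothetical Secret
```

A logical dump like this complements, but does not replace, operator-level backups with point-in-time recovery.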

StatefulSet basics

See cheatsheet 07. Operators usually create StatefulSets under the hood.
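As a reminder of what operators generate, here is a bare StatefulSet sketch showing the part that matters for state: volumeClaimTemplates gives each replica its own PVC (data-demo-0, data-demo-1, data-demo-2). The name demo, the image, and the mount path are hypothetical, and this is for illustration only, not a way to run a production database.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo
spec:
  serviceName: demo              # headless Service (not shown) that gives pods stable DNS names
  replicas: 3
  selector:
    matchLabels: { app: demo }
  template:
    metadata:
      labels: { app: demo }
    spec:
      containers:
        - name: app
          image: registry.example.com/demo:1.0   # hypothetical image
          volumeMounts:
            - { name: data, mountPath: /var/lib/demo }
  volumeClaimTemplates:          # one PVC per pod, kept across restarts and rescheduling
    - metadata: { name: data }
      spec:
        accessModes: [ReadWriteOnce]
        storageClassName: gp3
        resources:
          requests: { storage: 20Gi }
```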

Anti-affinity for HA

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector: { matchLabels: { app: pg } }
        topologyKey: kubernetes.io/hostname

Spread replicas across nodes (and ideally AZs).
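The hostname rule above only separates replicas by node. To also spread them across availability zones, a topology spread constraint can be added to the same pod spec. A sketch reusing the app: pg label; ScheduleAnyway is a soft preference chosen here so pods still schedule in a single-zone cluster:

```yaml
topologySpreadConstraints:
  - maxSkew: 1                                 # zones may differ by at most one replica
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway          # use DoNotSchedule to make it a hard rule
    labelSelector:
      matchLabels: { app: pg }
```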

Disruption budgets

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: pg }
spec:
  minAvailable: 2
  selector: { matchLabels: { app: pg } }

Prevents node drains and the cluster autoscaler from evicting too many replicas at once. PDBs only cover voluntary disruptions, not node failures.

Connection pooling

PgBouncer / Odyssey in front of Postgres:

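# CloudNativePG built-in: a separate Pooler resource that points at the Cluster
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata: { name: pg-pooler-rw }
spec:
  cluster: { name: pg }
  instances: 3
  type: rw
  pgbouncer: { poolMode: transaction }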

Operator pattern

Operators encode operational knowledge:

Detect primary failure → promote replica.
Schedule backups.
Rolling upgrade with consistency.

Anti-pattern: a hand-rolled Postgres StatefulSet in production.

When to use managed

Compliance: SOC 2, HIPAA, etc.
Massive scale.
Limited ops bandwidth.
High-availability SLAs.

Hybrid: app + caching in K8s, DBs managed (RDS/Cloud SQL).

Common mistakes

Sharing one PVC between primary and replica (an RWO volume mounts on only one node at a time).
No anti-affinity → all replicas on one node → SPOF.
No backups, or backups that were never test-restored.
Upgrading the operator without reading release notes (upgrades can change CRDs or trigger data migrations).
Heavy I/O on a slow storage class.
