Observability cheatsheet.

kube-prometheus-stack

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kps prometheus-community/kube-prometheus-stack -n monitoring --create-namespace

Bundles: Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics.

Expose Grafana

kubectl -n monitoring port-forward svc/kps-grafana 3000:80
# Open http://localhost:3000. Chart default login: admin / prom-operator (override with grafana.adminPassword)

ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata: { name: web, namespace: prod, labels: { release: kps } }
spec:
  selector:
    matchLabels: { app: web }
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s

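Prometheus discovers scrape targets through ServiceMonitor objects. The selector picks Services by label, and port refers to a named port on the Service. With the chart's default values, Prometheus only picks up ServiceMonitors that carry the label release: kps (the Helm release name).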
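
The selector in the ServiceMonitor above uses matchLabels, which selects a Service only when every listed key/value pair is present on it. A minimal Python sketch of that rule follows; the function name and the sample label sets are hypothetical and not part of any Kubernetes client library.

```python
def match_labels(selector: dict[str, str], labels: dict[str, str]) -> bool:
    """Return True when every selector key/value pair is present in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Selector taken from the ServiceMonitor above.
selector = {"app": "web"}

# Hypothetical Service label sets for illustration.
print(match_labels(selector, {"app": "web", "tier": "frontend"}))  # True
print(match_labels(selector, {"app": "api"}))                      # False
print(match_labels(selector, {}))                                  # False
```

Extra labels on the Service are harmless; a single missing or different value means the Service is skipped and nothing is scraped.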

PodMonitor

Same idea as ServiceMonitor, but it selects pods directly, so no Service is needed.
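
As a rough illustration, a PodMonitor equivalent of the ServiceMonitor above could be generated like this. The name web-pods and the label and port values are assumptions carried over from the ServiceMonitor example. The script only prints a JSON manifest, which kubectl accepts in place of YAML.

```python
import json

# Hypothetical PodMonitor mirroring the ServiceMonitor above; "web-pods" and the
# label/port values are assumptions for illustration.
pod_monitor = {
    "apiVersion": "monitoring.coreos.com/v1",
    "kind": "PodMonitor",
    "metadata": {"name": "web-pods", "namespace": "prod"},
    "spec": {
        "selector": {"matchLabels": {"app": "web"}},
        # PodMonitor uses podMetricsEndpoints where ServiceMonitor uses endpoints.
        # "port" is the name of a container port declared in the pod spec.
        "podMetricsEndpoints": [
            {"port": "metrics", "path": "/metrics", "interval": "30s"}
        ],
    },
}

# kubectl accepts JSON manifests, e.g. `python podmonitor.py | kubectl apply -f -`.
manifest = json.dumps(pod_monitor, indent=2)
print(manifest)
```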

PrometheusRule

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata: { name: web-rules, namespace: prod, labels: { release: kps } }
spec:
  groups:
    - name: web
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
          for: 5m
          labels: { severity: page }
          annotations:
            summary: "High 5xx error rate"
        
        - alert: HighLatency
          expr: |
            histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1.0
          for: 10m
          labels: { severity: warning }
          annotations:
            summary: "p95 latency above 1s"
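
The for: field means the expression must stay true for that long before the alert fires; until then the alert is pending, and a single false evaluation resets the clock. A simplified Python sketch of that state machine, with made-up evaluation timestamps (real Prometheus also tracks resolved alerts and keep_firing_for, which are ignored here):

```python
def alert_states(samples: list[tuple[int, bool]], for_seconds: int) -> list[str]:
    """Map (timestamp, condition_true) evaluations to alert states."""
    states = []
    active_since = None  # timestamp when the condition first became true
    for timestamp, condition_true in samples:
        if not condition_true:
            active_since = None  # any false evaluation resets the pending timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held = timestamp - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# Hypothetical evaluations every 60s: the condition holds from t=60, dips at t=240,
# then holds continuously from t=300.
evaluations = [(0, False), (60, True), (120, True), (180, True), (240, False),
               (300, True), (360, True), (420, True), (480, True), (540, True),
               (600, True)]

# With for: 5m (300s) only the last evaluation fires.
print(alert_states(evaluations, for_seconds=300))
```

The dip at t=240 shows why a flapping condition can delay a page well beyond the nominal 5 minutes.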

Common PromQL

# Rate of requests
sum(rate(http_requests_total[5m])) by (status)

# Error ratio
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# p95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

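# CPU usage in cores per pod (container!="" drops the pod-level cgroup series to avoid double counting)
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod)

# Memory usage (working set bytes) per pod
sum(container_memory_working_set_bytes{container!=""}) by (pod)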

# Pod restarts
sum(increase(kube_pod_container_status_restarts_total[1h])) by (pod)
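
The p95 query above relies on histogram_quantile, which estimates a quantile by interpolating linearly inside the bucket that contains the target rank. A simplified Python sketch of that estimate follows. It assumes sorted cumulative buckets that end in +Inf and 0 < q <= 1; the bucket values are made up, and Prometheus handles more edge cases than this.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Simplified: assumes sorted buckets ending in +Inf and 0 < q <= 1.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    # Linear interpolation inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical cumulative bucket rates for http_request_duration_seconds_bucket.
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]

print(round(histogram_quantile(0.95, buckets), 4))  # 0.8125
print(histogram_quantile(0.99, buckets))            # 1.0
```

The second result shows a practical limit: once the quantile lands in the +Inf bucket, the estimate is capped at the largest finite bucket bound, so bucket boundaries should extend past the latency you alert on.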

Alertmanager

# AlertmanagerConfig
# By default it only matches alerts whose namespace label equals this object's namespace.
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata: { name: my-config, namespace: monitoring }
spec:
  route:
    receiver: slack
    routes:
      - matchers: [{ name: severity, value: page }]
        receiver: pagerduty
  receivers:
    - name: slack
      slackConfigs:
        - apiURL:
            name: slack-url
            key: url
          channel: "#alerts"
    - name: pagerduty
      pagerdutyConfigs:
        - serviceKey:
            name: pd-key
            key: key
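
The route above sends severity=page alerts to pagerduty and everything else to slack. A minimal Python sketch of that decision follows. It is deliberately simplified to equality matchers and one level of child routes; real Alertmanager routing also supports regex matchers, nested routes, and continue.

```python
def pick_receiver(alert_labels: dict[str, str], route: dict) -> str:
    """Return the receiver for an alert: first matching child route, else the root."""
    for child in route.get("routes", []):
        matchers = child.get("matchers", [])
        if all(alert_labels.get(m["name"]) == m["value"] for m in matchers):
            return child["receiver"]
    return route["receiver"]


# Route tree copied from the AlertmanagerConfig above.
route = {
    "receiver": "slack",
    "routes": [
        {"matchers": [{"name": "severity", "value": "page"}], "receiver": "pagerduty"},
    ],
}

print(pick_receiver({"alertname": "HighErrorRate", "severity": "page"}, route))   # pagerduty
print(pick_receiver({"alertname": "HighLatency", "severity": "warning"}, route))  # slack
```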

Loki (logs)

helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack -n logging --create-namespace \
  --set promtail.enabled=true

Promtail, Vector, or Fluent Bit ships logs to Loki. Grafana queries them with LogQL.

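# Lines containing the substring "error"
{namespace="prod", app="web"} |= "error"
# Parse each line as JSON, keep entries whose level field is "error"
{namespace="prod"} | json | level="error"
# Log lines per second over the last 5 minutes
rate({namespace="prod", app="web"}[5m])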
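
The difference between the line filter and the JSON label filter is easy to miss: the first is a plain substring match, the second only matches a parsed field. A rough Python analogy over a few hypothetical log lines (this mimics the filtering logic only, not Loki itself):

```python
import json

# Hypothetical log lines for illustration.
lines = [
    '{"level": "error", "msg": "db timeout"}',
    '{"level": "info", "msg": "no error found"}',
    'plain text error without JSON',
]


def line_contains(lines: list[str], needle: str) -> list[str]:
    """Rough equivalent of the |= line filter: keep lines containing the substring."""
    return [line for line in lines if needle in line]


def json_label_equals(lines: list[str], label: str, value: str) -> list[str]:
    """Rough equivalent of | json | label="value": parse, then filter on a field."""
    kept = []
    for line in lines:
        try:
            fields = json.loads(line)
        except json.JSONDecodeError:
            continue  # unparseable lines have no extracted field, so they cannot match
        if isinstance(fields, dict) and fields.get(label) == value:
            kept.append(line)
    return kept


print(len(line_contains(lines, "error")))               # 3
print(len(json_label_equals(lines, "level", "error")))  # 1
```

The substring filter also catches the info line that merely mentions "error", which is why the structured form is usually the better basis for alerting.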

Tempo (traces)

helm install tempo grafana/tempo -n tracing --create-namespace

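The OpenTelemetry Collector receives OTLP traces and forwards them to Tempo. The resource below requires the OpenTelemetry Operator to be installed: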

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata: { name: otel }
spec:
  config: |
    receivers:
      otlp: { protocols: { grpc: {}, http: {} } }
    exporters:
      otlphttp:
        endpoint: http://tempo.tracing:4318  # Tempo Service in the tracing namespace
    service:
      pipelines:
        traces: { receivers: [otlp], exporters: [otlphttp] }

App instrumentation

Use the OpenTelemetry SDK in your app. Auto-instrumentation is available for most stacks.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Export spans in batches to the collector's OTLP/HTTP endpoint.
exporter = OTLPSpanExporter(endpoint="http://otel-collector:4318/v1/traces")
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

kubernetes-event-exporter

Ships K8s events to logs/Slack:

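config.yaml:
  route:
    routes:
      # Events carry a type (Normal or Warning), not a severity.
      - match:
          - type: "Warning"
            receiver: slack
  receivers:
    - name: slack
      # Generic webhook receiver posting to a Slack incoming-webhook URL.
      webhook:
        endpoint: "${SLACK_URL}"
        layout:
          text: "{{ .Reason }}: {{ .Message }}"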

OpenCost (cost monitoring)

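helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm install opencost opencost/opencost -n opencost --create-namespace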

Per-pod/namespace cost breakdown.

Dashboard imports

Pre-built Grafana dashboards by ID:

  • 1860 — Node Exporter Full.
  • 7249 — Kubernetes cluster.
  • 14584 — ArgoCD.
  • 12740 — Loki dashboard.

Karpenter / autoscaler events

# Karpenter pending pods
karpenter_pods_state{phase="Pending"}

# HPA current replica count (a change over time indicates a scale event)
kube_horizontalpodautoscaler_status_current_replicas

Recording rules

- record: namespace:http_requests:rate5m
  expr: sum(rate(http_requests_total[5m])) by (namespace)

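Recording rules pre-compute expensive queries at every evaluation interval, so dashboards and alerts can read the stored series (here namespace:http_requests:rate5m) instead of re-running the full expression.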

SLOs with sloth

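version: "prometheus/v1"
service: web
slos:
  - name: web-availability
    objective: 99.9
    description: ...
    sli:
      events:
        # Sloth substitutes {{.window}} with each burn-rate window.
        error_query: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total[{{.window}}]))
    alerting:
      name: WebHighErrorRate
      page_alert: { labels: { severity: page } }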

Sloth generates the SLI recording rules and multi-window burn-rate alerts from this spec.
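
Burn rate is the observed error ratio divided by the error budget, so a burn rate of 1 spends the budget exactly over the SLO period. A small worked example in Python for the 99.9 objective above follows. The 30-day period and the sample error ratio are assumptions for illustration, and the thresholds the generated alerts actually use depend on the Sloth configuration.

```python
def error_budget(objective_percent: float) -> float:
    """Fraction of events allowed to fail, e.g. 99.9 -> 0.001."""
    return 1 - objective_percent / 100


def burn_rate(error_ratio: float, objective_percent: float) -> float:
    """How many times faster than 'exactly on budget' errors are being consumed."""
    return error_ratio / error_budget(objective_percent)


def days_to_exhaustion(rate: float, period_days: float = 30) -> float:
    """Days until the whole budget is spent if the current burn rate persists."""
    return period_days / rate


# Hypothetical: 2% of requests are failing against a 99.9% objective.
rate = burn_rate(error_ratio=0.02, objective_percent=99.9)
print(round(rate, 1))                      # 20.0
print(round(days_to_exhaustion(rate), 2))  # 1.5
```

At 20x the budget for a 30-day period is gone in about a day and a half, which is the kind of situation a page-severity burn-rate alert is meant to catch.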

Common mistakes

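  • ServiceMonitor labels or selector do not match (for example, no release label for the chart's Prometheus to select on), so the target is never scraped.
  • Cardinality explosion from high-cardinality labels such as user IDs or raw request paths.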
  • Loki retention too short — lose history.
  • No alerts → silent failures.
  • Loki / Tempo without object storage backend → data loss on pod restart.
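
To make the cardinality point concrete: the worst-case number of time series for one metric is the product of the distinct values of each label. A short Python sketch with made-up label counts:

```python
from math import prod


def series_count(label_values: dict[str, int]) -> int:
    """Worst-case series for one metric: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical label cardinalities for http_requests_total.
bounded = {"status": 5, "method": 4, "pod": 20}
with_user_id = {**bounded, "user_id": 10_000}

print(series_count(bounded))       # 400
print(series_count(with_user_id))  # 4000000
```

One unbounded label turns a few hundred series into millions, which is why identifiers like user IDs belong in logs or traces rather than metric labels.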

Read this next

If you want my kube-prometheus-stack + Loki + Tempo bundle, it’s at rajpoot.dev.

