Observability cheatsheet.
kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kps prometheus-community/kube-prometheus-stack -n monitoring --create-namespace
Bundles: Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics.
Expose Grafana
kubectl -n monitoring port-forward svc/kps-grafana 3000:80
# Default user: admin, password: prom-operator
ServiceMonitor
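apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: web
  namespace: prod
  labels: { release: kps }  # chart default: Prometheus only selects monitors carrying the Helm release label
spec:
  selector:
    matchLabels: { app: web }
  endpoints:
    - port: metrics  # name of the Service port, not a number
      path: /metrics
      interval: 30s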
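The Prometheus Operator watches ServiceMonitor objects and generates the scrape config for you, so there is no Prometheus config to edit by hand. With the chart defaults, Prometheus only selects ServiceMonitors labeled release: <helm release> (kps here).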
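A ServiceMonitor picks up a Service only when the Service's labels contain every pair listed under matchLabels. Here is a minimal Python sketch of that equality-based matching. The function name `matches` and the sample label sets are hypothetical, and the real selector also supports matchExpressions, which is left out here.

```python
def matches(match_labels: dict, object_labels: dict) -> bool:
    """Return True when every key/value in match_labels is present on the object."""
    return all(object_labels.get(k) == v for k, v in match_labels.items())


# Hypothetical Services and their labels.
services = {
    "web": {"app": "web", "tier": "frontend"},
    "api": {"app": "api"},
    "unlabelled": {},
}

selector = {"app": "web"}
selected = [name for name, labels in services.items() if matches(selector, labels)]
print(selected)
```

Extra labels on the Service do not matter. A missing or misspelled label means the target is silently skipped.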
PodMonitor
Same idea, but it scrapes pods directly when there is no Service in front of them.
PrometheusRule
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: web-rules
  namespace: prod
  labels: { release: kps }  # chart default: rules are selected by the Helm release label
spec:
  groups:
    - name: web
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
          for: 5m
          labels: { severity: page }
          annotations:
            summary: "High 5xx error rate"
        - alert: HighLatency
          # Aggregate buckets by le so the quantile covers the whole service, not each series
          expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1.0
          for: 10m
          labels: { severity: warning }
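The for: field means the expression has to stay true across consecutive evaluations for that long before the alert moves from pending to firing. A rough Python sketch of that state machine follows. The function `alert_states` and the sample evaluation timeline are made up for illustration, and it ignores details such as keep_firing_for.

```python
def alert_states(samples, for_seconds):
    """samples: list of (timestamp_seconds, condition_is_true), one per rule evaluation.
    Returns the alert state after each evaluation."""
    states = []
    active_since = None
    for ts, condition in samples:
        if not condition:
            # Any false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held = ts - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# One evaluation per minute; the condition drops out once at t=120.
evals = [(0, True), (60, True), (120, False), (180, True), (240, True),
         (300, True), (360, True), (420, True), (480, True)]
states = alert_states(evals, for_seconds=300)
print(states)
```

The single false evaluation at t=120 restarts the clock, so the alert only fires at t=480, five minutes after the condition came back at t=180.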
Common PromQL
# Rate of requests
sum(rate(http_requests_total[5m])) by (status)
# Error ratio
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
# p95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
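# CPU usage (cores); container!="" drops the pod-level cgroup series so usage is not double counted
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod)
# Memory usage (bytes)
sum(container_memory_working_set_bytes{container!=""}) by (pod)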
# Pod restarts
sum(increase(kube_pod_container_status_restarts_total[1h])) by (pod)
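histogram_quantile estimates a percentile from cumulative le buckets by linear interpolation inside the bucket that contains the target rank. The Python sketch below is a simplified version of that idea, not Prometheus's exact implementation. It assumes 0 < q <= 1, a non-empty histogram, and positive bucket bounds. The bucket values are invented.

```python
import math


def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    with the last entry being (math.inf, total_count)."""
    total = buckets[-1][1]
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    prev_count = buckets[i - 1][1] if i > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Hypothetical request-duration buckets (seconds, cumulative counts).
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
p99 = histogram_quantile(0.99, buckets)
print(round(p95, 4), p99)
```

The p99 here lands in the +Inf bucket, so the result is capped at the largest finite bound. That is why bucket boundaries should bracket the latencies you actually alert on.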
Alertmanager
# AlertmanagerConfig
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata: { name: my-config, namespace: monitoring }
spec:
  route:
    receiver: slack
    routes:
      - matchers: [{ name: severity, value: page }]
        receiver: pagerduty
  receivers:
    - name: slack
      slackConfigs:
        - apiURL:
            name: slack-url
            key: url
          channel: "#alerts"
    - name: pagerduty
      pagerdutyConfigs:
        - serviceKey:
            name: pd-key
            key: key
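The route tree above sends severity=page alerts to pagerduty and everything else to the root slack receiver. A small Python sketch of that lookup follows. `pick_receiver` is a hypothetical helper. It handles equality matchers only and ignores continue, regex matchers, grouping, and the namespace matcher the operator adds to AlertmanagerConfig routes by default.

```python
def pick_receiver(route, labels):
    """Walk the tree depth-first: the first child whose matchers all match wins,
    otherwise the current route's receiver is used."""
    for child in route.get("routes", []):
        matchers = child.get("matchers", [])
        if all(labels.get(m["name"]) == m["value"] for m in matchers):
            return pick_receiver(child, labels)
    return route["receiver"]


# Same shape as the spec.route section above.
root = {
    "receiver": "slack",
    "routes": [
        {"matchers": [{"name": "severity", "value": "page"}], "receiver": "pagerduty"},
    ],
}

paged = pick_receiver(root, {"alertname": "HighErrorRate", "severity": "page"})
warned = pick_receiver(root, {"alertname": "HighLatency", "severity": "warning"})
print(paged, warned)
```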
Loki (logs)
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack -n logging --create-namespace \
  --set promtail.enabled=true
Promtail, Vector, or Fluent Bit ship logs to Loki; Grafana queries them with LogQL.
{namespace="prod", app="web"} |= "error"
{namespace="prod"} | json | level="error"
rate({namespace="prod", app="web"}[5m])
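The first two queries differ in what they filter: |= "error" is a plain substring match on the raw line, while | json | level="error" parses the line and filters on an extracted field. Here is a rough Python analogy with invented sample lines. It mimics the behaviour only and is not how Loki is implemented.

```python
import json

lines = [
    '{"level": "error", "msg": "db timeout"}',
    '{"level": "info", "msg": "no error here"}',
    "plain text error line",
]

# Analogous to: {namespace="prod", app="web"} |= "error"
contains_error = [line for line in lines if "error" in line]


def level_is_error(line: str) -> bool:
    """Analogous to: | json | level="error" (unparseable lines do not match)."""
    try:
        parsed = json.loads(line)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and parsed.get("level") == "error"


structured_errors = [line for line in lines if level_is_error(line)]
print(len(contains_error), len(structured_errors))
```

The substring filter also matches the info line that merely mentions the word, which is the usual argument for structured logs and a field filter.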
Tempo (traces)
helm install tempo grafana/tempo -n tracing --create-namespace
OpenTelemetry collector ships traces:
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata: { name: otel }
spec:
  config: |
    receivers:
      otlp: { protocols: { grpc: {}, http: {} } }
    exporters:
      otlphttp:
        endpoint: http://tempo:4318
    service:
      pipelines:
        traces: { receivers: [otlp], exporters: [otlphttp] }
App instrumentation
Use the OpenTelemetry SDK in your app. Auto-instrumentation is available for most stacks.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Batch spans and send them over OTLP/HTTP to the collector.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4318/v1/traces"))
)
trace.set_tracer_provider(provider)
kubernetes-event-exporter
Ships K8s events to logs/Slack:
config.yaml:
  # Verify field names against the exporter version you deploy.
  route:
    routes:
      - match:
          - type: "Warning"
            receiver: slack
  receivers:
    - name: slack
      webhook:
        endpoint: $SLACK_URL
OpenCost (cost monitoring)
helm install opencost opencost/opencost
Per-pod/namespace cost breakdown.
Dashboard imports
Pre-built Grafana dashboards by ID:
- 1860 — Node Exporter Full.
- 7249 — Kubernetes cluster.
- 14584 — ArgoCD.
- 12740 — Loki dashboard.
Karpenter / autoscaler events
# Karpenter pending pods
karpenter_pods_state{phase="Pending"}
# HPA current replicas (changes over time show scale events)
kube_horizontalpodautoscaler_status_current_replicas
Recording rules
- record: namespace:http_requests:rate5m
  expr: sum(rate(http_requests_total[5m])) by (namespace)
Recording rules pre-compute expensive queries so dashboards and alerts read a cheap, already-aggregated series.
SLOs with sloth
slos:
  - name: web-availability
    objective: 99.9
    description: ...
    sli:
      events:
        # sloth fills in {{.window}} for each burn-rate window; a fixed [5m] range is rejected
        error_query: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total[{{.window}}]))
    alerting:
      page_alert: { labels: { severity: page } }
Sloth generates the SLI recording rules and the multi-window burn-rate alerts from this spec.
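Burn rate is the observed error ratio divided by the error budget (1 minus the objective). The Python sketch below works through the arithmetic for the 99.9 objective above. The 14.4x over one hour figure is the page threshold from the Google SRE Workbook; treating it as what your sloth version generates is an assumption, and both function names are hypothetical.

```python
def burn_rate(error_ratio: float, objective_percent: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    budget = 1 - objective_percent / 100
    return error_ratio / budget


def budget_consumed(rate: float, window_hours: float, period_days: int = 30) -> float:
    """Fraction of the period's error budget used if `rate` holds for the whole window."""
    return rate * window_hours / (period_days * 24)


# 1.44% of requests failing against a 99.9% objective.
rate_now = burn_rate(0.0144, 99.9)
used = budget_consumed(rate_now, window_hours=1)
print(round(rate_now, 2), round(used, 4))
```

At that rate, one hour of errors uses roughly 2% of a 30-day budget, which is why a short window with a high multiplier is a reasonable paging condition.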
Common mistakes
- ServiceMonitor missing label match — not scraped.
- Cardinality explosion (high-cardinality labels).
- Loki retention too short — lose history.
- No alerts → silent failures.
- Loki / Tempo without object storage backend → data loss on pod restart.
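Cardinality explosion is easiest to see as multiplication: the worst-case series count for one metric is the product of the distinct values of each label. Here is a small Python sketch with invented label sets. `series_count` is a hypothetical helper and gives an upper bound, since not every combination necessarily occurs.

```python
from math import prod


def series_count(label_values: dict) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


base = {
    "status": ["200", "404", "500"],
    "method": ["GET", "POST"],
    "pod": [f"web-{i}" for i in range(10)],
}
# Adding a per-user label multiplies everything by the number of users.
with_user = dict(base, user_id=[str(i) for i in range(10_000)])

small = series_count(base)
large = series_count(with_user)
print(small, large)
```

One unbounded label (user IDs, request IDs, full URLs) turns a few dozen series into hundreds of thousands. Keep such values in logs or traces instead.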
Read this next
If you want my kube-prometheus-stack + Loki + Tempo bundle, it’s at rajpoot.dev.