Kubernetes debugging is a skill — half mechanics, half pattern recognition. The mechanics rarely change; the patterns do. This post is the working set for K8s 1.30+.

The five-second checks

kubectl get pods -A | grep -v Running
kubectl describe pod <pod>
kubectl logs <pod> -p   # previous instance (CrashLoop)
kubectl logs <pod> -f
kubectl get events --sort-by='.lastTimestamp' | tail -30

90% of pod issues surface here. Don’t skip to the fancy tools first.
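grep -v Running is a blunt filter: it keeps the header row, lists Completed Job pods, and hides pods that are Running but not Ready. A slightly stricter version, as a minimal sketch. The function name unhealthy_pods is made up for this post, and it assumes the default column order of kubectl get pods -A (NAMESPACE, NAME, READY, STATUS).

```shell
# Filter `kubectl get pods -A` output down to pods that need attention.
# Reads the table from stdin, so it also works on saved output.
unhealthy_pods() {
  awk '
    NR == 1 { next }               # drop the header row
    $4 == "Completed" { next }     # finished Job pods are not a problem
    {
      split($3, ready, "/")        # READY column looks like 1/2
      if ($4 != "Running" || ready[1] != ready[2]) {
        print $1 "/" $2, $3, $4
      }
    }'
}

# Usage against a live cluster:
#   kubectl get pods -A | unhealthy_pods
```

Reading from stdin keeps the filter independent of the cluster, so you can try it on a pasted table first.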

Common failures

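Symptom                      Common cause
ImagePullBackOff             Wrong image name or tag, or no registry credentials
ErrImagePull                 Same causes; this is the first failure, before back-off starts
CreateContainerConfigError   Referenced ConfigMap or Secret is missing
CrashLoopBackOff             App crashes on start; check kubectl logs --previous
Pending                      No node fits (resources, taints, affinity); check kubectl describe pod
OOMKilled                    Memory limit too low, or a leak
Unknown                      Node unreachable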
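Most of these symptoms are explained by the pod's Warning events. A small wrapper that pulls just those, as a sketch: pod_warnings is a hypothetical helper name, and the namespace defaults to "default" when omitted.

```shell
# Show only Warning events for one pod, oldest first.
pod_warnings() {
  pod=$1
  ns=${2:-default}
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod,type=Warning" \
    --sort-by=.lastTimestamp
}

# Usage:
#   pod_warnings mypod production
```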

Ephemeral debug containers

kubectl debug -it mypod --image=busybox --target=app

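Adds an ephemeral container to the running pod. It shares the pod’s network namespace and, because of --target=app, the process namespace of the app container. You get a shell and tools without restarting the pod or changing the original image.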

For network tools:

kubectl debug -it mypod --image=nicolaka/netshoot --target=app
# Now: nslookup, dig, curl, tcpdump, iperf, etc.

The nicolaka/netshoot image is the de-facto network-debug Swiss Army knife.

Network debugging

# DNS resolution from inside the pod
kubectl exec -it mypod -- nslookup other-service

# HTTP from inside
kubectl exec -it mypod -- curl -v http://other-service:8080/health

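# Connectivity (busybox-based images ship wget, not curl)
kubectl exec -it mypod -- wget -qO- http://other-service:8080

# Latency: time the full request
kubectl exec -it mypod -- curl -s -o /dev/null -w '%{time_total}s\n' http://other-service:8080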

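Most “service is down” tickets turn out to be DNS or NetworkPolicy, so verify resolution first. These commands assume the image ships nslookup, curl, or wget; if it doesn’t, run them from a netshoot debug container instead.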
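The "resolution first, then HTTP" order can be wrapped in one function that stops at the first failing layer. This is a sketch under assumptions: check_service is a made-up name, the image is assumed to contain nslookup and curl, and /health is the same example path used above.

```shell
# Check DNS first, then HTTP, from inside a pod. Stops at the first failure.
check_service() {
  pod=$1; svc=$2; port=$3
  if ! kubectl exec "$pod" -- nslookup "$svc" >/dev/null 2>&1; then
    echo "DNS: $svc does not resolve from $pod"
    return 1
  fi
  if ! kubectl exec "$pod" -- curl -sf --max-time 5 -o /dev/null "http://$svc:$port/health"; then
    echo "HTTP: $svc resolves, but http://$svc:$port/health failed"
    return 2
  fi
  echo "OK: $svc resolves and answers"
}

# Usage:
#   check_service mypod other-service 8080
```

A return code of 1 points at DNS; 2 points at the network path or the service itself, which is where NetworkPolicy is the next thing to check.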

Logs at scale

kubectl logs is fine for one pod. For many:

# stern — multi-pod live tail
stern -n production "api-server.*"

# Or via Loki / Grafana
{namespace="prod", app="api"} |= "ERROR"

For prod debugging, Loki/Grafana dashboards with saved queries beat ad-hoc CLI. See Observability.
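If stern isn’t installed on the box you’re on, plain kubectl can get part of the way with a label selector. A sketch, assuming your pods carry a label such as app=api; tail_app_logs is a hypothetical helper and the time window and limits are arbitrary picks.

```shell
# Recent logs from every pod matching a label selector, each line prefixed with its source.
tail_app_logs() {
  ns=$1; selector=$2
  kubectl logs -n "$ns" -l "$selector" \
    --all-containers --prefix --since=10m --tail=100 --max-log-requests=20
}

# Usage:
#   tail_app_logs production app=api
```

--max-log-requests is raised because kubectl caps concurrent log streams at a low default, which a large Deployment exceeds quickly.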

Resource issues

kubectl top pods -n production
kubectl top nodes

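OOMKilled? Check kubectl describe pod: Last State shows Reason: OOMKilled with exit code 137. Bump the memory limit as a stopgap, then investigate the leak. CPU limits fail differently: the container is throttled rather than killed, so it shows up as latency, not restarts.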

kubectl get pod mypod -o yaml | grep -A5 resources

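Limit too low → throttling or OOM kills. Requests too high → wasted capacity (the scheduler reserves by request, and a request defaults to the limit when only the limit is set).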
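The 137 exit code mentioned above is not magic: by shell convention, codes above 128 mean "terminated by signal (code - 128)", and 137 - 128 = 9 is SIGKILL. A minimal decoder, with explain_exit_code as a made-up name:

```shell
# Translate a container exit code into something readable.
# Codes above 128 conventionally mean "terminated by signal (code - 128)".
explain_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  elif [ "$code" -eq 0 ]; then
    echo "exited cleanly"
  else
    echo "application error, exit status $code"
  fi
}

# Usage (reads the last terminated state of the first container):
#   explain_exit_code "$(kubectl get pod mypod \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```

Signal 9 alone doesn’t prove an OOM kill; confirm with the Reason field in kubectl describe pod.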

Port-forward for local poking

kubectl port-forward svc/api 8080:80
# Now hit localhost:8080

Useful for tools that don’t speak K8s natively (psql, redis-cli, browsers).

tcpdump in a pod

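# -c names the debug container so kubectl cp can find it later
kubectl debug -it mypod --image=nicolaka/netshoot --target=app -c debugger
# inside:
tcpdump -i any -nn -s 0 -w /tmp/cap.pcap port 5432
# Copy out from a second terminal while the debug session is still open:
kubectl cp mypod:/tmp/cap.pcap ./cap.pcap -c debugger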

Open in Wireshark. Useful for “this connection just hangs” mysteries.

Service mesh debugging

If you’re on a mesh (Istio / Linkerd / Cilium), start with its own CLI. For Istio:

istioctl proxy-config cluster mypod
istioctl proxy-config endpoint mypod
istioctl analyze

The mesh adds complexity. Verify traffic actually flows through the mesh; check sidecar logs:

kubectl logs mypod -c istio-proxy

Liveness / readiness probes

Misconfigured probes cause restarts:

livenessProbe:
  httpGet: { path: /health, port: 8080 }
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3

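If your app takes 60s to start, this probe will kill it before it’s ready: checks begin at 10s, and three failures 5s apart trigger a restart after roughly 20 to 25 seconds. Use a startupProbe for slow-starting apps; liveness checks are held off until it succeeds.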
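The arithmetic is worth doing before you ship a probe. A rough sketch that treats the window as initialDelaySeconds + periodSeconds × failureThreshold; this is an approximation that ignores timeoutSeconds and kubelet jitter, and probe_budget is a hypothetical helper name.

```shell
# Approximate seconds a container gets before a never-passing probe triggers a restart.
probe_budget() {
  initial_delay=$1; period=$2; failure_threshold=$3
  echo $((initial_delay + period * failure_threshold))
}

# Usage:
#   probe_budget 10 5 3     # the liveness probe above: about 25s
#   probe_budget 0 10 30    # an example startupProbe (values assumed): about 300s
```

If the number is smaller than your worst-case startup time, the pod will restart-loop no matter how healthy the app is.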

Persistent volume issues

kubectl get pv,pvc -A
kubectl describe pvc <pvc>  # events tell you why

A Pending PVC usually means a missing or misnamed storage class, no available volume, or an AZ mismatch between the volume and the node.
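On a busy cluster the kubectl get pvc -A table is long. A small filter that keeps only claims that aren’t Bound, as a sketch: pending_pvcs is a made-up name, and it assumes STATUS is the third column, which holds for the -A output.

```shell
# Filter `kubectl get pvc -A` output down to claims that are not Bound.
pending_pvcs() {
  awk 'NR > 1 && $3 != "Bound" { print $1 "/" $2, $3 }'
}

# Usage:
#   kubectl get pvc -A | pending_pvcs
```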

kubectl debug node

kubectl debug node/mynode -it --image=busybox
# Now you have a pod with hostfs mounted at /host

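From there you can inspect the kubelet, the container runtime, and the host filesystem. The debug pod is left behind when you exit, so delete it when you’re done.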

k9s

Terminal UI: k9s. Live pod list, log streaming, exec, describe, port-forward — all keyboard-driven. Steeper learning curve than kubectl; faster once you know it.

kubeshark

Wireshark for K8s. Captures and decodes traffic between pods. Brilliant for debugging service-to-service issues, slow APIs, weird TLS errors. Run on-demand, not always-on (heavy).

Common mistakes

1. Editing live resources

Running kubectl edit deploy ... to fix prod, then forgetting to commit the change to git. That’s drift. Use GitOps; see GitOps with Argo CD.

2. Killing pods to “fix” things

Running kubectl delete pod ... when it’s CrashLooping. It might help; it might lose state; it doesn’t address the root cause.

3. No timeouts in probes

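timeoutSeconds defaults to 1s, which is rarely a deliberate choice. Too short and a slow endpoint flaps; too long and a hung pod still counts as “alive”. Set it explicitly.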

4. Reading the wrong logs

Init container failed; you’re looking at app logs. kubectl logs <pod> -c <container>.

5. Ignoring events

kubectl describe pod events tell you exactly why scheduling/pulling/starting failed. Read them.

What I’d ship today

For new K8s clusters:

  • Loki + Grafana for logs at scale.
  • kube-prometheus-stack for metrics.
  • Tempo for traces.
  • Argo CD for GitOps; never edit live.
  • k9s + stern + netshoot as standard team tooling.
  • Network policies so debugging surfaces real network issues.
  • Runbooks for top-N alert types.

Read this next

If you want my K8s debugging cheat sheet + netshoot manifests, they’re at rajpoot.dev.

