Kubernetes debugging is a skill — half mechanics, half pattern recognition. The mechanics rarely change; the patterns do. This post is the working set for K8s 1.30+.
The five-second checks
kubectl get pods -A | grep -v Running
kubectl describe pod <pod>
kubectl logs <pod> -p # previous instance (CrashLoop)
kubectl logs <pod> -f
kubectl get events --sort-by='.lastTimestamp' | tail -30
90% of pod issues surface here. Don’t skip to the fancy tools first.
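One caveat on the first command: grep -v Running hides pods that are Running but not ready (0/1, say), and it lists Completed Job pods as if they were broken. A minimal sketch of a stricter filter, assuming the default column order of kubectl get pods -A; the function name unhealthy_pods is made up for this post:

```shell
# unhealthy_pods: filter `kubectl get pods -A` output read from stdin.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk '
    NR == 1            { print; next }   # keep the header row
    $4 == "Completed"  { next }          # finished Job pods are not failures
    { split($3, ready, "/") }            # READY looks like "1/2"
    $4 != "Running" || ready[1] != ready[2] { print }
  '
}

# Usage (needs a cluster):
#   kubectl get pods -A | unhealthy_pods
```

It reads stdin rather than calling kubectl itself, so the same filter works on a saved listing from an incident channel.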
Common failures
| Symptom | Common cause |
|---|---|
| ImagePullBackOff | Wrong image / no creds for registry |
| ErrImagePull | Same |
| CreateContainerConfigError | ConfigMap / Secret missing |
| CrashLoopBackOff | App crashes; check logs --previous |
| Pending | No node has resources; describe pod |
| OOMKilled | Memory limit too low |
| Unknown | Node unreachable |
Ephemeral debug containers
kubectl debug -it mypod --image=busybox --target=app
Adds an ephemeral container to the running pod. It shares the pod’s network namespace, and --target=app also puts it in the app container’s process namespace. You get shell + tools without modifying the original.
For network tools:
kubectl debug -it mypod --image=nicolaka/netshoot --target=app
# Now: nslookup, dig, curl, tcpdump, iperf, etc.
The nicolaka/netshoot image is the de-facto network-debug Swiss Army knife.
Network debugging
# DNS resolution from inside the pod
kubectl exec -it mypod -- nslookup other-service
# HTTP from inside
kubectl exec -it mypod -- curl -v http://other-service:8080/health
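# Connectivity fallback when the image ships wget but not curl
kubectl exec -it mypod -- wget -qO- http://other-service:8080
# Latency: print status code and total request time
kubectl exec mypod -- curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' http://other-service:8080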
Most “service is down” tickets are DNS or NetworkPolicy. Verify resolution first.
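The "resolution first" order can be wrapped in a small helper so the failing layer is named explicitly. This is a sketch, not a standard tool: check_service is a hypothetical name, and it assumes the pod image ships nslookup and curl and that the service answers on /health as in the commands above.

```shell
# check_service POD HOST PORT
# Runs the checks in order (DNS, then HTTP) and reports the first one that fails.
# Caveat: any kubectl exec failure (e.g. pod not found) is reported as a DNS failure.
check_service() {
  cs_pod=$1 cs_host=$2 cs_port=$3
  if ! kubectl exec "$cs_pod" -- nslookup "$cs_host" >/dev/null 2>&1; then
    echo "dns: $cs_host does not resolve from $cs_pod"
    return 1
  fi
  if ! kubectl exec "$cs_pod" -- curl -sf --max-time 5 \
       "http://$cs_host:$cs_port/health" >/dev/null 2>&1; then
    echo "http: $cs_host resolves but port $cs_port is not answering"
    return 2
  fi
  echo "ok: $cs_host:$cs_port reachable from $cs_pod"
}

# Usage (needs a cluster):
#   check_service mypod other-service 8080
```

A return code of 2 (name resolves, HTTP fails) is the case where NetworkPolicy and Service endpoints are worth checking next.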
Logs at scale
kubectl logs is fine for one pod. For many:
# stern — multi-pod live tail
stern -n production "api-server.*"
# Or via Loki / Grafana
{namespace="prod", app="api"} |= "ERROR"
For prod debugging, dashboards in Loki/Grafana with saved queries beat ad-hoc CLI. See Observability.
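When neither stern nor Loki is at hand, kubectl can still tail by label, and --prefix tags each line with its source pod. A rough sketch that counts error lines per pod; errors_by_pod is a hypothetical helper, and both the app=api label (borrowed from the LogQL query above) and the literal string ERROR are assumptions about your workloads:

```shell
# errors_by_pod: read prefixed log lines on stdin, count lines containing
# "ERROR" per source, busiest first. With `kubectl logs --prefix`, the first
# whitespace-separated field of each line is the [pod/<name>/<container>] tag.
errors_by_pod() {
  awk '/ERROR/ { count[$1]++ }
       END     { for (src in count) print count[src], src }' | sort -rn
}

# Usage (needs a cluster). --tail=-1 matters: with a label selector,
# kubectl logs otherwise returns only the last 10 lines per container.
#   kubectl logs -n production -l app=api --all-containers --prefix \
#     --since=10m --tail=-1 | errors_by_pod
```

One pod far ahead of its siblings usually points at a node or a single bad replica rather than the release.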
Resource issues
kubectl top pods -n production
kubectl top nodes
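OOMKilled? kubectl describe pod shows Last State: Terminated with Reason: OOMKilled and exit code 137. Bump the memory limit, then investigate the leak. CPU behaves differently: a container that hits its CPU limit is throttled, not killed, so it shows up as latency rather than restarts.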
kubectl get pod mypod -o yaml | grep -A5 resources
Limit too low → throttle / OOM. Request too high → wasted capacity, because the scheduler reserves requests, not limits.
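Exit code 137 is not OOM-specific: it is 128 + 9, meaning the process died from SIGKILL, which is what the kernel OOM killer sends. A small sketch for pulling the last termination out of pod status and decoding the code; last_exit and exit_signal are hypothetical helper names, and the jsonpath fields come from the pod's status.containerStatuses:

```shell
# last_exit POD: one line per container: name, last termination reason, exit code.
# Reason and code are empty for containers that have never been restarted.
last_exit() {
  kubectl get pod "$1" -o jsonpath='{range .status.containerStatuses[*]}{.name}{" "}{.lastState.terminated.reason}{" "}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# exit_signal CODE: exit codes above 128 mean "terminated by signal CODE - 128".
exit_signal() {
  if [ "$1" -gt 128 ]; then
    echo "signal $(( $1 - 128 ))"
  else
    echo "exit $1"
  fi
}

# exit_signal 137 prints "signal 9"  (9 is SIGKILL)
# exit_signal 143 prints "signal 15" (15 is SIGTERM)
```

A 137 with a reason other than OOMKilled usually means something else sent the SIGKILL, for example the kubelet after the termination grace period ran out.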
Port-forward for local poking
kubectl port-forward svc/api 8080:80
# Now hit localhost:8080
For tools that don’t speak K8s natively (psql, redis-cli, browsers).
tcpdump in a pod
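# -c names the ephemeral container so it can be addressed later
kubectl debug -it mypod --image=nicolaka/netshoot --target=app -c debugger
# inside:
tcpdump -i any -nn -s 0 -w /tmp/cap.pcap port 5432
# Stop tcpdump with Ctrl-C, then copy out from a second terminal (keep the debug shell open):
kubectl cp mypod:/tmp/cap.pcap ./cap.pcap -c debugger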
Open in Wireshark. Useful for “this connection just hangs” mysteries.
Service mesh debugging
If you’re on a mesh (Istio / Linkerd / Cilium), start with its own CLI. For Istio:
istioctl proxy-config cluster mypod
istioctl proxy-config endpoint mypod
istioctl analyze
The mesh adds complexity. Verify traffic actually flows through the mesh; check sidecar logs:
kubectl logs mypod -c istio-proxy
Liveness / readiness probes
Misconfigured probes cause restarts:
livenessProbe:
  httpGet: { path: /health, port: 8080 }
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
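If your app takes 60s to start, this probe kills it long before it’s ready: the first check fires at 10s, and three failures 5s apart mean a restart at around 20s. Use startupProbe for slow-starting apps; liveness and readiness checks are held back until it succeeds.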
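The arithmetic behind that is worth keeping handy when sizing probes. A back-of-the-envelope sketch with two hypothetical helpers; it ignores probe timeouts and kubelet jitter, so treat the numbers as approximate:

```shell
# liveness_deadline INITIAL_DELAY PERIOD FAILURE_THRESHOLD
# Approximate seconds after container start at which a never-passing liveness
# probe records its last allowed failure.
liveness_deadline() {
  echo $(( $1 + $2 * ($3 - 1) ))
}

# startup_threshold STARTUP_SECONDS PERIOD
# Smallest failureThreshold for a startupProbe so that
# failureThreshold * periodSeconds covers the startup time (rounded up).
startup_threshold() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

liveness_deadline 10 5 3    # the probe above: prints 20
startup_threshold 60 5      # a 60s startup probed every 5s: prints 12
```

In practice you would pad the startupProbe threshold beyond the computed minimum, since cold nodes and image pulls make startup times vary.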
Persistent volume issues
kubectl get pv,pvc -A
kubectl describe pvc <pvc> # events tell you why
Pending PVC: storage class issue, no available volume, AZ mismatch.
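On a cluster with many claims, it helps to cut the listing down to the ones that are actually stuck before running describe. A minimal sketch, assuming the default column order of kubectl get pvc -A; pending_pvcs is a made-up name:

```shell
# pending_pvcs: filter `kubectl get pvc -A` output on stdin down to Pending claims.
# Assumes the default columns: NAMESPACE NAME STATUS ...
pending_pvcs() {
  awk 'NR == 1 || $3 == "Pending"'
}

# Usage (needs a cluster):
#   kubectl get pvc -A | pending_pvcs
```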
kubectl debug node
kubectl debug node/mynode -it --image=busybox
# Now you have a pod with hostfs mounted at /host
Inspect the kubelet, container runtime, host filesystem.
k9s
Terminal UI: k9s. Live pod list, log streaming, exec, describe, port-forward — all keyboard-driven. Steeper learning curve than kubectl; faster once you know it.
kubeshark
Wireshark for K8s. Captures and decodes traffic between pods. Brilliant for debugging service-to-service issues, slow APIs, weird TLS errors. Run on-demand, not always-on (heavy).
Common mistakes
1. Editing live resources
kubectl edit deploy ... to fix prod, then forget to commit to git. Drift. Use GitOps; see GitOps with Argo CD.
2. Killing pods to “fix” things
kubectl delete pod ... when CrashLooping. Maybe fixes; maybe loses state; doesn’t address root cause.
3. No timeouts in probes
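The kubelet applies timeoutSeconds: 1 unless you override it, and that default rarely matches the endpoint. Too generous and a hung pod stays “alive” for a long time before anyone notices; too tight and one slow response counts as a failure. Set timeoutSeconds explicitly on every probe.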
4. Reading the wrong logs
Init container failed; you’re looking at app logs. kubectl logs <pod> -c <container>.
5. Ignoring events
kubectl describe pod events tell you exactly why scheduling/pulling/starting failed. Read them.
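Events can also be queried directly, which is handy when describe output is long or the pod has already been replaced. A sketch using field selectors on the events API; pod_events is a hypothetical wrapper, and the namespace falls back to default when omitted:

```shell
# pod_events POD [NAMESPACE]: only the events that reference this pod, oldest first.
pod_events() {
  kubectl get events -n "${2:-default}" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}

# Usage (needs a cluster):
#   pod_events mypod production
```

Events are retained for only a limited time (about an hour by default), so capture them early in an incident.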
What I’d ship today
For new K8s clusters:
- Loki + Grafana for logs at scale.
- kube-prometheus-stack for metrics.
- Tempo for traces.
- Argo CD for GitOps; never edit live.
- k9s + stern + netshoot as standard team tooling.
- Network policies so debugging surfaces real network issues.
- Runbooks for top-N alert types.
Read this next
- GitOps with Argo CD 2026
- Observability Stack 2026
- Incident Response 2026
- Service Mesh 2026 — Istio, Linkerd, Cilium
If you want my K8s debugging cheat sheet + netshoot manifests, it’s at rajpoot.dev.