Kubernetes resource limits decide whether your service is fast, gets killed, or starves its neighbors. Most teams guess; the cost of guessing wrong shows up as 3am pages. This post is the working playbook.
requests vs limits
resources:
  requests:          # what the scheduler reserves; min guarantee
    cpu: "500m"
    memory: "512Mi"
  limits:            # hard ceiling; over memory → OOMKilled, over CPU → throttled
    cpu: "1"
    memory: "1Gi"
- Requests = what you’re guaranteed; how much capacity the scheduler reserves.
- Limits = the cap; over memory limit → OOMKilled; over CPU limit → throttled.
OOMKill
Container exceeds memory limit → kernel kills it. Fast and brutal.
kubectl describe pod <pod-name>
# Last State: Terminated, Reason: OOMKilled, Exit Code: 137
Mitigation:
- Set memory limit higher (with monitoring data).
- Find the leak (heap dumps, profiling).
- Add HPA so load spreads across more replicas (this helps when memory grows with traffic, not when there is a leak).
Memory limits are necessary: without them, one pod’s leak can take down the node. See Kubernetes Debugging.
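To catch OOM kills before they turn into a restart loop, an alert along these lines can help. This is a sketch, assuming Prometheus scrapes kube-state-metrics; the group name, alert name, time window, and severity label are placeholders rather than values from a real setup.

```yaml
# Prometheus rule file (sketch). Assumes kube-state-metrics is being scraped.
groups:
  - name: memory-limits              # hypothetical group name
    rules:
      - alert: ContainerOOMKilled    # hypothetical alert name
        # Fires when the container's last termination was an OOM kill
        # AND it restarted within the last 15 minutes.
        expr: |
          (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1)
          and on (namespace, pod, container)
          (increase(kube_pod_container_status_restarts_total[15m]) > 0)
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) was OOMKilled"
```

The restart check keeps the alert from firing forever on a container that was OOMKilled once, long ago.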
CPU throttling
Container hits CPU limit → kernel throttles. Latency spikes; no kill.
container_cpu_cfs_throttled_seconds_total
Throttling shows up in latency metrics, not in errors. Easy to miss.
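Because throttling is silent, it is worth alerting on directly. The sketch below uses the ratio of throttled CFS periods to total periods (two sibling cAdvisor counters) rather than the seconds counter shown above, since a ratio is easier to threshold. The 25% threshold, the 15m duration, and all names are assumptions to tune, not recommendations from measurement.

```yaml
# Prometheus rule file (sketch). Assumes cAdvisor metrics are scraped via the kubelet.
groups:
  - name: cpu-throttling               # hypothetical group name
    rules:
      - alert: ContainerCPUThrottled   # hypothetical alert name
        # Fraction of CFS scheduling periods in which the container was throttled.
        expr: |
          sum by (namespace, pod, container) (
            rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])
          )
          /
          sum by (namespace, pod, container) (
            rate(container_cpu_cfs_periods_total{container!=""}[5m])
          )
          > 0.25
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is throttled in over 25% of CPU periods"
```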
The “no CPU limits” debate
Setting CPU limits often hurts more than helps:
- Bursty workloads can use spare cycles when limits aren’t set.
- With limits, the limit caps CPU even when the node has headroom.
- Latency-sensitive apps suffer throttling at limits.
Common production pattern: set CPU requests (fairness / scheduling) but no CPU limits. Let the workload burst when there’s slack.
This requires good capacity planning: node CPU should never be fully saturated.
For multi-tenant clusters or cost control: keep CPU limits but at headroom multiples (2–3x request).
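As a sketch, the "CPU requests, no CPU limits" pattern looks like this in a container spec. The values reuse the earlier example and are placeholders, not sizing advice.

```yaml
# Container-level stanza: CPU can burst, memory stays capped.
resources:
  requests:
    cpu: "500m"        # used for scheduling and as the fair share under contention
    memory: "512Mi"
  limits:
    memory: "1Gi"      # memory is still capped
    # no cpu limit: the container may use idle CPU on the node
```

A pod configured this way lands in the Burstable QoS class, because its requests and limits are not equal.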
Memory limits — always set
Unlike CPU, memory can't be throttled: there is no “pause” mode. Without limits, one pod can consume all node memory and trigger OOM kills or evictions of its neighbors. Always set memory limits.
QoS classes
| QoS | When |
|---|---|
| Guaranteed | every container has requests == limits for both CPU and memory |
| Burstable | at least one request or limit set, but not Guaranteed (e.g. limits higher than requests, or only a memory limit) |
| BestEffort | no requests or limits on any container |
When a node is under memory pressure, the kubelet evicts pods whose usage exceeds their requests first, which in practice means roughly this order:
- BestEffort first.
- Burstable next.
- Guaranteed last.
Production: aim for Guaranteed or Burstable with sensible limits.
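For reference, here is one container stanza per class. These are illustrative fragments with placeholder values; the class is assigned per pod, so every container in the pod has to meet the rule.

```yaml
# Guaranteed: requests equal limits for both CPU and memory.
resources:
  requests: { cpu: "1", memory: "1Gi" }
  limits:   { cpu: "1", memory: "1Gi" }
---
# Burstable: something is set, but not enough to qualify as Guaranteed.
resources:
  requests: { cpu: "500m", memory: "512Mi" }
  limits:   { memory: "1Gi" }
---
# BestEffort: nothing set on any container in the pod.
resources: {}
```

The class Kubernetes actually assigned is recorded on the pod under `status.qosClass`.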
Sizing methodology
1. Deploy with NO limits (or generous limits) to staging / canary.
2. Run representative load for a week.
3. Measure: p50, p95, p99 CPU and memory.
4. Set requests = p95 + 20% headroom.
5. Set memory limit = observed peak * 1.5.
6. Watch for throttling / OOM in production.
7. Iterate quarterly.
Without measurement, you’re guessing.
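A worked example of steps 4 and 5, using made-up measurements (the p95 and peak numbers below are hypothetical, not from a real service):

```yaml
# Hypothetical week of measurements:
#   CPU p95 = 400m, memory p95 = 600Mi, memory peak = 1000Mi
resources:
  requests:
    cpu: "480m"        # 400m   * 1.2  (p95 + 20% headroom)
    memory: "720Mi"    # 600Mi  * 1.2  (p95 + 20% headroom)
  limits:
    memory: "1500Mi"   # 1000Mi * 1.5  (observed peak * 1.5)
```

The CPU limit is left out here to match the burst-friendly pattern discussed earlier; add one if the cluster is multi-tenant.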
VPA (Vertical Pod Autoscaler)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata: { name: api-vpa }
spec:
  targetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  updatePolicy: { updateMode: "Auto" }
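VPA observes usage and either recommends or automatically sets requests/limits. It can run alongside HPA (HPA for horizontal scaling, VPA for vertical sizing), but don't let both act on the same CPU or memory metric, or they will fight each other.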
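If automatic updates feel too aggressive to start with, a recommendation-only variant is a lower-risk first step. The sketch below assumes the same `api` Deployment; the min/max bounds are placeholder values.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata: { name: api-vpa }
spec:
  targetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  updatePolicy: { updateMode: "Off" }    # recommend only; never evict or resize pods
  resourcePolicy:
    containerPolicies:
      - containerName: "*"               # applies to every container in the pod
        minAllowed: { cpu: "100m", memory: "128Mi" }   # floor for recommendations
        maxAllowed: { cpu: "2", memory: "4Gi" }        # ceiling for recommendations
```

In this mode the recommendations appear in the VPA object's status (for example via `kubectl describe vpa api-vpa`), and you apply them yourself.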
Goldilocks
helm install goldilocks fairwinds/goldilocks
Goldilocks is a web UI that shows recommended requests/limits per workload, based on VPA observations. It is a quick way to find oversized and undersized pods.
HPA (Horizontal Pod Autoscaler)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: api }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }
Scale on CPU utilization (relative to requests). For better signals: scale on RPS, queue depth, or custom metrics via KEDA.
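As a sketch of scaling on RPS with KEDA, assuming KEDA is installed and a Prometheus server is reachable in-cluster. The server address, the `http_requests_total` metric, its `service` label, and the threshold are all assumptions for illustration.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: api-rps }              # hypothetical name
spec:
  scaleTargetRef: { name: api }          # targets the Deployment named "api"
  minReplicaCount: 3
  maxReplicaCount: 30
  triggers:
    - type: prometheus
      metadata:
        # Assumed in-cluster Prometheus address.
        serverAddress: http://prometheus.monitoring.svc:9090
        # Assumed metric and label; total requests per second for the service.
        query: 'sum(rate(http_requests_total{service="api"}[2m]))'
        # Target value per replica (about 100 RPS each).
        threshold: "100"
```

KEDA creates and manages an HPA for the target behind the scenes, so this would replace the hand-written HPA above rather than sit next to it.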
Common mistakes
1. Copy-paste limits from another service
Different workloads have different shapes. Measure each.
2. Memory limit too low → OOMKill
Pod restart loop. Increase memory limit; investigate why.
3. CPU limit too low → throttling
Latency spikes; users complain. Either remove the CPU limit or raise it.
4. No requests set
The scheduler reserves nothing for the pod, so it can land on an already-busy node. Node packing becomes erratic, and the pod is among the first evicted under pressure.
5. requests = limits for everything
“Always Guaranteed” means oversizing. Burstable + monitoring is often more cost-effective.
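One guard against mistake 4 is a namespace-level LimitRange that fills in defaults when a container sets nothing. This is a sketch; the name, namespace, and values are hypothetical, and the defaults are a safety net rather than a substitute for measuring each workload.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults     # hypothetical name
  namespace: my-team           # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:          # applied when a container sets no requests
        cpu: "100m"
        memory: "128Mi"
      default:                 # applied when a container sets no limits
        memory: "512Mi"        # memory only, so no CPU limit is injected
```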
Capacity planning
Total cluster capacity = sum of node capacities
Allocatable per node ≈ 90% of node total (kubelet, OS overhead)
Bin-packing efficiency ≈ 70–85% in practice
Reserve headroom ≈ 20–30% for spikes / failover
Usable for workloads ≈ 60–70% of raw capacity
Don’t size at 100%. Save 25%+ for autoscaling, failures, deploys.
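The "allocatable ≈ 90%" figure comes from what the kubelet holds back on each node. The sketch below shows the relevant kubelet settings for a hypothetical 4 CPU / 16Gi node; the reserved amounts are illustrative, and managed Kubernetes services usually set them for you.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Hypothetical node: 4 CPU (4000m), 16Gi (16384Mi) memory.
kubeReserved:                  # held back for the kubelet and container runtime
  cpu: "200m"
  memory: "1Gi"
systemReserved:                # held back for OS daemons
  cpu: "200m"
  memory: "512Mi"
evictionHard:
  memory.available: "100Mi"    # kubelet starts evicting below this
# Allocatable = capacity - kubeReserved - systemReserved - evictionHard
#   CPU:    4000m   - 200m   - 200m           = 3600m    (90%)
#   Memory: 16384Mi - 1024Mi - 512Mi - 100Mi  = 14748Mi  (~90%)
```

`kubectl describe node` lists both Capacity and Allocatable, so the real figure for your nodes is easy to check.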
Cost optimization
| Cost lever | Effect |
|---|---|
| Right-size requests | Avoid wasted reserve (Goldilocks) |
| Spot instances | Often ~70% off; for interruption-tolerant workloads |
| Bin-pack | Larger nodes; better utilization |
| HPA | Scale down off-peak |
| Karpenter | Just-in-time node provisioning |
By 2026, Karpenter has largely replaced Cluster Autoscaler, at least on clouds where it is supported.
Pod priority and preemption
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata: { name: high-priority }
value: 1000000
globalDefault: false
description: "User-facing services"
When a high-priority pod can't be scheduled, the scheduler can preempt (evict) lower-priority pods to make room. For mixed workloads (batch + serving), set priorities.
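A PriorityClass only takes effect once pods reference it. As a sketch, here is a second, lower class for batch work plus a Deployment that uses the `high-priority` class defined above. The `batch-low` name, its value, the labels, and the image are placeholders.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata: { name: batch-low }            # hypothetical name
value: 1000
preemptionPolicy: Never                  # batch pods wait in the queue instead of preempting others
globalDefault: false
description: "Batch jobs; may be preempted by serving pods"
---
apiVersion: apps/v1
kind: Deployment
metadata: { name: api }
spec:
  replicas: 3
  selector: { matchLabels: { app: api } }
  template:
    metadata: { labels: { app: api } }
    spec:
      priorityClassName: high-priority   # the user-facing class
      containers:
        - name: api
          image: example.com/api:1.0     # placeholder image
```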
Real-world example
A team I worked with had a Python API:
- Memory request: 2Gi (guess).
- Actual p95: 600Mi.
- Cost waste: 1.4Gi × N replicas reserved unused.
After Goldilocks + observation:
- Memory request: 800Mi.
- Memory limit: 1.5Gi (peak * 1.5).
- Saved ~40% capacity in the cluster.
Sizing matters at scale.
What I’d ship today
For a new K8s deployment:
- Memory limits always.
- CPU requests, no CPU limits (unless multi-tenant fairness required).
- VPA + Goldilocks to find right size.
- HPA on CPU or custom metrics.
- Karpenter for node autoscaling.
- Monitoring for OOM and CPU throttling.
- Quarterly sizing review.
Read this next
If you want my K8s sizing playbook + Goldilocks dashboards, it’s at rajpoot.dev.