Kubernetes resource limits decide whether your service is fast, gets killed, or starves its neighbors. Most teams guess; the cost of guessing wrong shows up as 3am pages. This post is the working playbook.

Requests vs limits

resources:
  requests:        # what the scheduler reserves; min guarantee
    cpu: "500m"
    memory: "512Mi"
  limits:          # hard ceiling; memory overage is OOM-killed, CPU overage is throttled
    cpu: "1"
    memory: "1Gi"

  • Requests = what you’re guaranteed; the capacity the scheduler reserves when placing the pod.
  • Limits = the cap; over the memory limit → OOMKilled; over the CPU limit → throttled.
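
The quantity strings in the manifest above are easier to compare once they are plain numbers. Here is a minimal Python sketch of that conversion. The function names are hypothetical, and it covers only the common suffixes, not the full Kubernetes quantity grammar.

```python
from decimal import Decimal

# Common memory suffixes (binary and decimal). Assumption: this subset is
# enough for typical manifests; Kubernetes accepts a few more.
_MEMORY_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "500m" or "1" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "512Mi" or "1Gi" to bytes."""
    for suffix, factor in _MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(Decimal(quantity))  # no suffix: a plain number of bytes
```

With the values above, the CPU limit ("1") is twice the request ("500m"), and the memory limit ("1Gi") is twice the request ("512Mi").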

OOMKill

Container exceeds memory limit → kernel kills it. Fast and brutal.

kubectl describe pod <pod-name>
# Last State: Terminated, Reason: OOMKilled, Exit Code: 137

Mitigation:

  • Set memory limit higher (with monitoring data).
  • Find the leak (heap dumps, profiling).
  • Add HPA for horizontal scaling (this helps only when per-pod memory grows with load, not when there’s a leak).

Memory limits are necessary: without them, one pod’s leak can take down the node. See Kubernetes Debugging.

CPU throttling

Container hits CPU limit → kernel throttles. Latency spikes; no kill.

container_cpu_cfs_throttled_seconds_total

Throttling shows up in latency metrics, not in errors. Easy to miss.
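
To see why throttling hurts latency without causing errors, it helps to model one scheduling period. The sketch below is a simplified Python model, not the kernel’s actual accounting. It assumes the default 100 ms CFS period, threads that stay runnable for the whole period, and a free core for each thread. The function name is hypothetical.

```python
def throttled_ms_per_period(limit_cores: float, busy_threads: int,
                            period_ms: float = 100.0) -> float:
    """Wall-clock ms per CFS period that a container spends throttled.

    Simplified model: every thread wants CPU for the entire period.
    """
    quota_ms = limit_cores * period_ms    # CPU time allowed per period
    demand_ms = busy_threads * period_ms  # CPU time the threads want
    if demand_ms <= quota_ms:
        return 0.0
    # Threads burn the shared quota in parallel, so it runs out early
    # and the container sits idle until the next period starts.
    running_ms = quota_ms / busy_threads
    return period_ms - running_ms
```

Under these assumptions, a container with a limit of 1 CPU and 4 busy threads runs for 25 ms and then waits 75 ms in every period. A request that lands in that window waits too, which is why the symptom is tail latency rather than failures.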

The “no CPU limits” debate

Setting CPU limits often hurts more than helps:

  • Bursty workloads can use spare cycles when limits aren’t set.
  • With limits, the limit caps CPU even when the node has headroom.
  • Latency-sensitive apps suffer throttling at limits.

Common production pattern: set CPU requests (fairness / scheduling) but no CPU limits. Let the workload burst when there’s slack.

This requires good capacity planning: node CPU should never be fully saturated.

For multi-tenant clusters or cost control: keep CPU limits but at headroom multiples (2–3x request).
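
The reason requests alone still give fairness: when a node is saturated, CPU time is divided in proportion to requests (they map to cgroup CPU weights). A rough Python sketch of that split follows. It assumes every pod wants more CPU than its share and none has a limit; the function and pod names are hypothetical.

```python
def contended_cpu_split(requests_m: dict[str, int], node_m: int) -> dict[str, int]:
    """Split a saturated node's CPU (millicores) in proportion to requests.

    Assumption: every pod is CPU-hungry, so nothing is left unclaimed.
    """
    total = sum(requests_m.values())
    return {pod: node_m * req // total for pod, req in requests_m.items()}
```

For example, pods requesting 500m and 1500m on a saturated 4-core node would get roughly 1000m and 3000m. When the node is not saturated, either pod can burst past that share.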

Memory limits — always set

Unlike CPU, memory has no “pause” mode. Without limits, one pod can consume all node memory and OOM-kill its neighbors. Always set memory limits.

QoS classes

QoS         When
Guaranteed  requests == limits for both CPU and memory, in every container
Burstable   at least one request or limit set, but not Guaranteed (e.g. limits higher than requests, or only a memory limit)
BestEffort  no requests or limits on any container

When a node is under memory pressure, eviction order:

  1. BestEffort first.
  2. Burstable next.
  3. Guaranteed last.

Production: aim for Guaranteed or Burstable with sensible limits.
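
The classification rules can be written down as a small function. This is a sketch for a single-container pod only, with a hypothetical function name. It compares quantity strings literally, so "1" and "1000m" count as different here even though Kubernetes treats them as equal.

```python
def qos_class(requests: dict[str, str], limits: dict[str, str]) -> str:
    """Classify a single-container pod from its requests and limits."""
    if not requests and not limits:
        return "BestEffort"
    # Kubernetes defaults a missing request to the limit for that resource.
    effective = {**limits, **requests}
    guaranteed = all(
        res in limits and effective[res] == limits[res]
        for res in ("cpu", "memory")
    )
    return "Guaranteed" if guaranteed else "Burstable"
```

Setting only limits for both resources yields Guaranteed, because the requests default to the same values. Setting only a memory limit yields Burstable.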

Sizing methodology

1. Deploy with NO limits (or generous limits) to staging / canary.
2. Run representative load for a week.
3. Measure: p50, p95, p99 CPU and memory.
4. Set requests = p95 + 20% headroom.
5. Set memory limit = observed peak * 1.5.
6. Watch for throttling / OOM in production.
7. Iterate quarterly.

Without measurement: you’re guessing.
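
Steps 3 to 5 are simple arithmetic once you have the samples. Here is a minimal Python sketch, assuming you have exported memory usage samples in MiB from your monitoring system. The function names are hypothetical, and nearest-rank is just one of several percentile definitions.

```python
import math


def percentile(samples: list[int], pct: int) -> int:
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def size_memory(samples_mib: list[int]) -> dict[str, int]:
    """Apply steps 4 and 5: request = p95 + 20%, limit = peak * 1.5."""
    p95 = percentile(samples_mib, 95)
    peak = max(samples_mib)
    return {
        "request_mib": math.ceil(p95 * 120 / 100),
        "limit_mib": math.ceil(peak * 150 / 100),
    }
```

For samples with a p95 of 95 MiB and a peak of 100 MiB, this gives a request of 114 MiB and a limit of 150 MiB. In practice you would round both up to tidy values.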

VPA (Vertical Pod Autoscaler)

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata: { name: api-vpa }
spec:
  targetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  updatePolicy: { updateMode: "Auto" }

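VPA observes usage; recommends or auto-sets requests/limits. It can run alongside HPA (HPA for horizontal scaling, VPA for vertical sizing), but don’t let both act on the same metric: if HPA scales on CPU, run VPA in recommendation-only mode (updateMode: "Off") or restrict it to memory.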

Goldilocks

helm repo add fairwinds https://charts.fairwinds.com/stable
helm install goldilocks fairwinds/goldilocks

Web UI showing recommended requests/limits per workload from VPA observations. Quick way to find oversized / undersized pods.

HPA (Horizontal Pod Autoscaler)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: api }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }

Scale on CPU utilization (relative to requests). For better signals: scale on RPS, queue depth, or custom metrics via KEDA.
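
The core of the HPA calculation is a single ratio: desired replicas = ceil(current replicas × current metric / target metric), clamped to the min and max. A Python sketch of that formula follows. It deliberately ignores the tolerance band and stabilization windows that the real controller applies, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replica count the HPA ratio formula suggests, clamped to the bounds."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))
```

With the manifest above (target 70%, 3 to 30 replicas), 4 replicas averaging 90% of their CPU request would scale to 6. Because utilization is measured against requests, an inflated CPU request makes the service look idle and delays scale-up.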

Common mistakes

1. Copy-paste limits from another service

Different workloads have different shapes. Measure each.

2. Memory limit too low → OOMKill

Pod restart loop. Increase memory limit; investigate why.

3. CPU limit too low → throttling

Latency spikes; users complain. Either remove the CPU limit or raise it.

4. No requests set

Scheduler doesn’t reserve resources. Pods can run anywhere; node packing erratic.

5. requests = limits for everything

“Always Guaranteed” means oversizing. Burstable + monitoring is often more cost-effective.

Capacity planning

Total cluster capacity  = sum of node capacities
Allocatable per node    ≈ 90% of node total (kubelet, OS overhead)
Bin-packing efficiency  ≈ 70–85% in practice
Reserve headroom        ≈ 20–30% for spikes / failover
Usable for workloads    ≈ 60–70% of raw capacity

Don’t size at 100%. Save 25%+ for autoscaling, failures, deploys.
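
As a quick sanity check, you can apply the 60–70% rule of thumb to a cluster and compare it with the sum of your CPU requests. A small Python sketch with hypothetical function names and a made-up cluster shape:

```python
def usable_range_m(nodes: int, cores_per_node: int) -> tuple[int, int]:
    """Millicores to plan for, using the 60-70% of raw capacity rule."""
    raw_m = nodes * cores_per_node * 1000
    return raw_m * 60 // 100, raw_m * 70 // 100


def fits(total_requests_m: int, nodes: int, cores_per_node: int) -> bool:
    """True when summed CPU requests stay under the conservative bound."""
    low, _high = usable_range_m(nodes, cores_per_node)
    return total_requests_m <= low
```

For an illustrative cluster of 10 nodes with 16 cores each, that is 96,000 to 112,000 millicores of plannable capacity out of 160,000 raw.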

Cost optimization

Lever                Effect
Right-size requests  Avoid wasted reserve (Goldilocks)
Spot instances       Up to ~70% off; for interruption-tolerant workloads
Bin-pack             Larger nodes; better utilization
HPA                  Scale down off-peak
Karpenter            Just-in-time node provisioning

Karpenter has largely replaced Cluster Autoscaler as of 2026.

Pod priority and preemption

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata: { name: high-priority }
value: 1000000
globalDefault: false
description: "User-facing services"

Critical workloads can preempt low-priority ones during pressure. For mixed workloads (batch + serving): set priorities.

Real-world example

A team I worked with had a Python API:

  • Memory request: 2Gi (guess).
  • Actual p95: 600Mi.
  • Cost waste: 1.4Gi × N replicas reserved unused.

After Goldilocks + observation:

  • Memory request: 800Mi.
  • Memory limit: 1.5Gi (peak * 1.5).
  • Saved ~40% capacity in the cluster.

Sizing matters at scale.
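
The arithmetic behind that example is easy to reproduce for your own services. Here is a sketch in Python; the replica count of 20 is a hypothetical stand-in for N, since the real number isn’t given above.

```python
def reclaimed_mib(old_request_mib: int, new_request_mib: int, replicas: int) -> int:
    """Memory (MiB) no longer reserved after lowering the request."""
    return (old_request_mib - new_request_mib) * replicas


# 2Gi -> 800Mi per replica, with a hypothetical 20 replicas.
freed = reclaimed_mib(2048, 800, 20)
```

Each replica gives back 1248 MiB of reserved memory, so 20 replicas would free 24,960 MiB, a little over 24 GiB that the scheduler can hand to other pods.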

What I’d ship today

For a new K8s deployment:

  • Memory limits always.
  • CPU requests, no CPU limits (unless multi-tenant fairness required).
  • VPA + Goldilocks to find right size.
  • HPA on CPU or custom metrics.
  • Karpenter for node autoscaling.
  • Monitoring for OOM and CPU throttling.
  • Quarterly sizing review.

Read this next

If you want my K8s sizing playbook + Goldilocks dashboards, they’re at rajpoot.dev.

