Should I set CPU limits?

Controversial. Many SREs argue for no CPU limits, only CPU requests (which makes the pod Burstable, not Guaranteed), to avoid throttling under bursty load. Always set memory limits. For CPU, it depends on tenancy and fairness needs.

How do I size requests / limits?

Measure first. Run without limits (or with generous ones) in staging or a canary; observe p50, p95, and p99 usage over a representative period. Set requests at observed p95 plus headroom, and the memory limit at observed peak plus a buffer.

Kubernetes Resource Limits in 2026 — CPU, Memory, and the Cost of Getting It Wrong

Kubernetes resource limits decide whether your service is fast, gets killed, or starves its neighbors. Most teams guess; the cost of guessing wrong shows up as 3am pages. This post is the working playbook.

requests vs limits

resources:
  requests:        # what the scheduler reserves; min guarantee
    cpu: "500m"
    memory: "512Mi"
  limits:          # hard ceiling; over memory → container OOMKilled, over CPU → throttled
    cpu: "1"
    memory: "1Gi"

Requests = what you’re guaranteed; how much capacity the scheduler reserves.
Limits = the cap; over memory limit → OOMKilled; over CPU limit → throttled.
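To compare requests and limits numerically, the quantity strings have to be converted to plain numbers first. Here is a minimal sketch in Python; the helper names `parse_cpu` and `parse_memory` are mine, and it handles only the common suffixes, not the full Kubernetes quantity grammar.

```python
# Memory suffixes handled by this sketch (a subset of what Kubernetes accepts).
_MEMORY_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,  # binary
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,     # decimal
}

def parse_cpu(quantity: str) -> float:
    """Return CPU cores: "500m" is 0.5 cores, "1" is 1.0 core."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)

def parse_memory(quantity: str) -> int:
    """Return bytes: "512Mi" is 512 * 2**20 bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: already a byte count

# The manifest above: the limit is twice the request for both resources.
cpu_burst = parse_cpu("1") / parse_cpu("500m")
mem_burst = parse_memory("1Gi") / parse_memory("512Mi")
print(cpu_burst, mem_burst)
```

The ratio of limit to request is a quick way to see how much burst room a container has been given.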

OOMKill

Container exceeds memory limit → kernel kills it. Fast and brutal.

kubectl describe pod <pod-name>
# Last State: Terminated, Reason: OOMKilled, Exit Code: 137

Mitigation:

Raise the memory limit, backed by monitoring data.
Find the leak (heap dumps, profiling).
Add an HPA to scale horizontally, if per-pod memory grows with load.

Memory limits are necessary: without them, one pod’s leak takes down the node. See Kubernetes Debugging.

CPU throttling

Container hits CPU limit → kernel throttles. Latency spikes; no kill.

container_cpu_cfs_throttled_seconds_total

Throttling shows up in latency metrics, not in errors. Easy to miss.
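A common way to quantify throttling is the share of CFS periods in which the container was throttled. cAdvisor exposes `container_cpu_cfs_throttled_periods_total` and `container_cpu_cfs_periods_total` alongside the seconds counter above. A minimal sketch of the ratio, assuming you already have the increase of both counters over the same window (the sample numbers are hypothetical):

```python
def throttle_ratio(throttled_periods_delta: int, total_periods_delta: int) -> float:
    """Fraction of CFS enforcement periods in which the container was throttled."""
    if total_periods_delta <= 0:
        return 0.0  # no periods elapsed in the window, so nothing to report
    return throttled_periods_delta / total_periods_delta

# Hypothetical counter increases over a 5-minute window.
ratio = throttle_ratio(throttled_periods_delta=450, total_periods_delta=3000)
print(f"throttled in {ratio:.0%} of periods")
```

What ratio is worth alerting on depends on the workload; a latency-sensitive API tolerates far less than a batch job.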

The “no CPU limits” debate

Setting CPU limits often hurts more than it helps:

Bursty workloads can use spare cycles when limits aren’t set.
With limits, the cap applies even when the node has idle CPU.
Latency-sensitive apps suffer throttling at limits.

Common production pattern: set CPU requests (fairness / scheduling) but no CPU limits. Let the workload burst when there’s slack.

This requires good capacity planning — node CPU never fully saturated.

For multi-tenant clusters or cost control: keep CPU limits but at headroom multiples (2–3x request).

Memory limits — always set

Unlike CPU, memory has no “pause” mode. Without limits, one pod can consume all node memory and OOM-kill its neighbors. Always set memory limits.

QoS classes

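QoS class	When it applies
Guaranteed	requests == limits for both CPU and memory, in every container
Burstable	at least one request or limit set, but short of Guaranteed (e.g. limits higher than requests, or only a memory limit)
BestEffort	no requests or limits on any container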

When a node is under memory pressure, the eviction order is roughly:

BestEffort first.
Burstable next.
Guaranteed last.

Production: aim for Guaranteed or Burstable with sensible limits.
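The classification rules in the table can be written down as a small function. This is a sketch under simplifying assumptions: `qos_class` is a hypothetical helper, each container is a plain dict with optional `requests` and `limits`, and quantities are compared as literal strings (so `"1"` and `"1000m"` would not count as equal here).

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod from its containers' requests and limits."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When only a limit is given, the request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"

# CPU request without a CPU limit: always Burstable, never Guaranteed.
pod = [{"requests": {"cpu": "500m", "memory": "512Mi"},
        "limits": {"memory": "512Mi"}}]
print(qos_class(pod))
```

Note the consequence for the "no CPU limits" pattern: dropping the CPU limit means the pod can never be Guaranteed.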

Sizing methodology

1. Deploy with NO limits (or generous limits) to staging / canary.
2. Run representative load for a week.
3. Measure: p50, p95, p99 CPU and memory.
4. Set requests = p95 + 20% headroom.
5. Set memory limit = observed peak * 1.5.
6. Watch for throttling / OOM in production.
7. Iterate quarterly.

Without measurement: you’re guessing.
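Steps 3 to 5 are plain arithmetic once you have usage samples. A minimal sketch for memory, assuming the samples are in MiB and using a nearest-rank percentile; the function names and the sample data are illustrative, not taken from any tool.

```python
import math

def percentile(samples: list[int], pct: int) -> int:
    """Nearest-rank percentile of the samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def size_memory(samples_mib: list[int], headroom_pct: int = 20,
                limit_factor: float = 1.5) -> tuple[int, int]:
    """Return (request, limit) in MiB: p95 plus headroom, and peak times a factor."""
    request = math.ceil(percentile(samples_mib, 95) * (100 + headroom_pct) / 100)
    limit = math.ceil(max(samples_mib) * limit_factor)
    return request, limit

# Hypothetical week of samples: mostly 400 MiB, a p95 of 600 MiB, one 1000 MiB peak.
samples = [400] * 90 + [600] * 6 + [800] * 3 + [1000]
request_mib, limit_mib = size_memory(samples)
print(f"request={request_mib}Mi limit={limit_mib}Mi")
```

The same function works for CPU requests if you feed it millicores and ignore the limit.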

VPA (Vertical Pod Autoscaler)

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata: { name: api-vpa }
spec:
  targetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  updatePolicy: { updateMode: "Auto" }

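VPA observes usage and either recommends or automatically sets requests/limits. It can run alongside HPA (HPA for horizontal scaling, VPA for vertical sizing), but don’t let both act on the same CPU or memory metric for one workload: pair VPA with an HPA driven by custom metrics, or set updateMode: "Off" to get recommendations only.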

Goldilocks

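helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install goldilocks fairwinds-stable/goldilocks --namespace goldilocks --create-namespace

Web UI showing recommended requests/limits per workload, based on VPA observations (so the VPA recommender must be installed). Namespaces are opted in with the label goldilocks.fairwinds.com/enabled=true. Quick way to find oversized / undersized pods.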

HPA (Horizontal Pod Autoscaler)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: api }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }

Scale on CPU utilization (relative to requests). For better signals: scale on RPS, queue depth, or custom metrics via KEDA.
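The replica count HPA aims for follows the formula in the Kubernetes documentation: desired = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1.0 (10% by default). A sketch of that calculation; the function name is mine, and the real controller also applies stabilization windows and handles unready pods, which this ignores.

```python
import math

def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float, min_replicas: int,
                     max_replicas: int, tolerance: float = 0.10) -> int:
    """Approximate the HPA target replica count for a utilization metric."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

# Utilization is relative to requests, so it can exceed 100% when there is
# no CPU limit: a pod requesting 500m and using 700m reports 140%.
print(desired_replicas(3, 140, 70, min_replicas=3, max_replicas=30))
```

This is why the request value matters even without a CPU limit: it is the denominator of the utilization HPA reacts to.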

Common mistakes

1. Copy-paste limits from another service

Different workloads have different shapes. Measure each.

2. Memory limit too low → OOMKill

Pod restart loop. Increase memory limit; investigate why.

3. CPU limit too low → throttling

Latency spikes; users complain. Either remove the CPU limit or raise it.

4. No requests set

Without requests the scheduler reserves nothing, so pods can land on already-crowded nodes and node packing becomes erratic.

5. requests = limits for everything

“Always Guaranteed” means oversizing. Burstable + monitoring is often more cost-effective.

Capacity planning

Total cluster capacity = sum of node capacities
Allocatable per node ≈ 90% of node total (kubelet, OS overhead)
Bin-packing efficiency ≈ 70–85% in practice
Reserve headroom ≈ 20–30% for spikes / failover
Usable for workloads ≈ 60–70% of raw capacity

Don’t size at 100%. Save 25%+ for autoscaling, failures, deploys.
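To see how these factors interact, here is a rough sketch that treats them as independent multipliers. That is an assumption on my part: if the three compound fully, the usable share comes out at roughly 44–61% of raw capacity, a bit below the 60–70% rule of thumb above, which presumably counts some bin-packing slack as part of the headroom.

```python
def usable_fraction(allocatable: float, bin_packing: float, headroom: float) -> float:
    """Share of raw capacity left for workloads if all factors compound."""
    return allocatable * bin_packing * (1 - headroom)

# Pessimistic and optimistic ends of the ranges listed above.
low = usable_fraction(allocatable=0.90, bin_packing=0.70, headroom=0.30)
high = usable_fraction(allocatable=0.90, bin_packing=0.85, headroom=0.20)
print(f"usable: {low:.0%} to {high:.0%} of raw capacity")

# Example: a hypothetical cluster with 400 raw cores.
raw_cores = 400
print(f"plan for {raw_cores * low:.0f} to {raw_cores * high:.0f} cores of requests")
```

Either way, the point stands: the sum of requests you can safely place is well below the number on the node spec sheet.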

Cost optimization

Cost lever	How it helps
Right-size requests	Avoid wasted reserve (Goldilocks)
Spot instances	Roughly 70% off; for interruption-tolerant workloads
Bin-pack	Larger nodes; better utilization
HPA	Scale down off-peak
Karpenter	Just-in-time node provisioning

For many teams, particularly on AWS, Karpenter has largely replaced Cluster Autoscaler in 2026.

Pod priority and preemption

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata: { name: high-priority }
value: 1000000
globalDefault: false
description: "User-facing services"

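When a high-priority pod cannot be scheduled, the scheduler can preempt (evict) lower-priority pods to make room. Pods opt in by setting priorityClassName: high-priority in their spec. For mixed workloads (batch + serving), set priorities.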

Real-world example

A team I worked with had a Python API:

Memory request: 2Gi (guess).
Actual p95: 600Mi.
Cost waste: 1.4Gi × N replicas reserved unused.

After Goldilocks + observation:

Memory request: 800Mi.
Memory limit: 1.5Gi (peak * 1.5).
Saved ~40% capacity in the cluster.

Sizing matters at scale.
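The waste in this example is easy to reproduce. A small sketch, with the replica count as a placeholder since the original number isn't given here:

```python
MIB_PER_GIB = 1024

def reserved_unused_mib(request_mib: int, p95_usage_mib: int, replicas: int) -> int:
    """Memory reserved by the scheduler but not used at p95, across all replicas."""
    return max(0, request_mib - p95_usage_mib) * replicas

replicas = 10  # hypothetical; substitute your own N
before = reserved_unused_mib(2 * MIB_PER_GIB, 600, replicas)
after = reserved_unused_mib(800, 600, replicas)
print(f"before: {before / MIB_PER_GIB:.1f} GiB idle, after: {after / MIB_PER_GIB:.1f} GiB idle")
```

Per replica, 2Gi minus 600Mi is 1448Mi, which is the ~1.4Gi quoted above.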

What I’d ship today

For a new K8s deployment:

Memory limits always.
CPU requests, no CPU limits (unless multi-tenant fairness required).
VPA + Goldilocks to find right size.
HPA on CPU or custom metrics.
Karpenter for node autoscaling.
Monitoring for OOM and CPU throttling.
Quarterly sizing review.

Read this next

If you want my K8s sizing playbook + Goldilocks dashboards, it’s at rajpoot.dev.

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.
