Kubernetes Workloads & Configuration

Resource Requests & Limits

Every container in Kubernetes can declare two numbers for each compute resource (CPU and memory): a request and a limit. These four values (CPU request, CPU limit, memory request, memory limit) drive three distinct mechanisms: scheduling decisions, CPU throttling, and OOM kills. Confusing them, or omitting them entirely, is a common cause of production incidents in Kubernetes clusters.

This lesson covers all three mechanisms in depth, explains how they map to Linux kernel primitives, and shows how Kubernetes Quality of Service (QoS) classes determine which Pods survive a memory-starved node.

Requests: What the Scheduler Sees

The request is the amount of a resource the scheduler reserves for a container. It tells the Kubernetes scheduler: "I need at least this much CPU and memory to run." The scheduler sums the requests of all Pods already assigned to a node and places a new Pod there only if enough allocatable capacity remains. This placement decision is based purely on requests, not on what the containers are actually using at runtime.
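
To make the fit check concrete, here is a minimal Python sketch of the idea. It is a simplified model, not the scheduler's actual code: the Node class, the fits function, and the example numbers are hypothetical, and the real scheduler also applies many other filters (taints, affinity, and so on).

```python
from dataclasses import dataclass


@dataclass
class Node:
    allocatable_cpu_m: int       # allocatable CPU in millicores
    allocatable_mem_mi: int      # allocatable memory in MiB
    requested_cpu_m: int = 0     # sum of CPU requests of Pods already placed
    requested_mem_mi: int = 0    # sum of memory requests of Pods already placed


def fits(node: Node, cpu_request_m: int, mem_request_mi: int) -> bool:
    """True if the new Pod's requests fit in the node's unreserved capacity."""
    cpu_free = node.allocatable_cpu_m - node.requested_cpu_m
    mem_free = node.allocatable_mem_mi - node.requested_mem_mi
    return cpu_request_m <= cpu_free and mem_request_mi <= mem_free


# A node whose CPU is almost fully reserved, even if the Pods on it are idle.
node = Node(allocatable_cpu_m=4000, allocatable_mem_mi=8192,
            requested_cpu_m=3800, requested_mem_mi=2048)

print(fits(node, 250, 256))  # False: only 200m of CPU is unreserved
print(fits(node, 200, 256))  # True
```

Note that actual usage never appears in the function: a node full of idle but over-requesting Pods still rejects new work.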

CPU requests are expressed in millicores (suffix m). One core = 1000m, so a request of 250m means one-quarter of a CPU. Memory requests are in bytes, usually written with the binary suffixes Mi (mebibytes) and Gi (gibibytes). Under the hood, the CPU request sets the cpu.shares value (cpu.weight on cgroup v2) used by the Linux CFS (Completely Fair Scheduler), which controls proportional CPU time when the system is under contention. A container with request: 500m gets twice the CPU time of one with request: 250m when the node is saturated and both are competing for CPU.
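
The millicore arithmetic can be sketched in a few lines of Python. The helper names are hypothetical, and the conversion (1024 shares per core, with a floor of 2) is an assumption based on the commonly documented cgroup v1 behaviour of the kubelet; treat it as an illustration rather than a reference implementation.

```python
def parse_cpu_millicores(value: str) -> int:
    """Parse a CPU quantity such as "250m", "0.5", or "2" into millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(round(float(value) * 1000))


def cpu_shares(millicores: int) -> int:
    """Convert a CPU request to cgroup v1 cpu.shares (assumed: 1024 per core, minimum 2)."""
    return max(millicores * 1024 // 1000, 2)


small = cpu_shares(parse_cpu_millicores("250m"))
large = cpu_shares(parse_cpu_millicores("500m"))
print(small, large)   # 256 512
print(large / small)  # 2.0: twice the CPU time under contention
```

Shares only matter when CPU is contended. On an idle node, either container can use more than its proportional share, up to its limit.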

Key insight: The scheduler bin-packs based on requests, not actual usage. If you request 4Gi but only ever use 500Mi, you are blocking 3.5Gi of allocatable capacity on every node where your Pod lands. Over-requesting wastes cluster capacity and can leave Pods stuck in Pending; under-requesting lets the scheduler overcommit nodes, which leads to CPU contention, evictions, or unexpected OOM kills.

Limits: Throttling and OOM

The limit is a hard ceiling enforced by the Linux kernel, not by Kubernetes itself. CPU and memory limits behave very differently:

CPU limit (throttling): Implemented via CFS bandwidth control (cpu.cfs_quota_us; cpu.max on cgroup v2). If a container uses up its CPU quota within a scheduling period (default 100ms), the kernel throttles it: its threads are paused for the remainder of that period. The container stays running; it just gets fewer CPU cycles. The application receives no error or signal, so throttling shows up only as higher latency and reduced throughput. In production, CPU throttling is one of the hardest problems to diagnose because the Pod appears healthy.
Memory limit (OOM kill): Memory is not compressible. If a container's memory usage exceeds its limit, the Linux kernel OOM killer sends SIGKILL to a process in that cgroup, often the container's main process. Kubernetes then restarts the container according to the Pod's restartPolicy. You will see OOMKilled as the reason in kubectl describe pod.
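
The throttling arithmetic is easier to see with numbers. The following Python sketch is an idealized model under stated assumptions: the function names are hypothetical, the quota is assumed to be limit × period / 1000, and every busy thread is assumed to run flat out on its own core.

```python
CFS_PERIOD_US = 100_000  # default CFS period: 100ms


def cfs_quota_us(limit_millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time in microseconds the container may consume per period."""
    return limit_millicores * period_us // 1000


def throttled_ms_per_period(limit_millicores: int, busy_threads: int,
                            period_us: int = CFS_PERIOD_US) -> float:
    """Wall-clock milliseconds per period during which the container is paused."""
    if busy_threads < 1:
        raise ValueError("busy_threads must be at least 1")
    quota = cfs_quota_us(limit_millicores, period_us)
    # N busy threads burn the quota N times faster than wall-clock time.
    wall_us_until_exhausted = quota / busy_threads
    return max(0.0, (period_us - wall_us_until_exhausted) / 1000)


print(cfs_quota_us(1000))                # 100000
print(throttled_ms_per_period(1000, 1))  # 0.0
print(throttled_ms_per_period(1000, 4))  # 75.0
```

With a 1000m limit, a single busy thread is never throttled, but four busy threads exhaust the quota after 25ms and sit paused for the remaining 75ms of each period. This is why multi-threaded services can be throttled while their average CPU usage looks modest.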

# A realistic production-grade Pod spec with requests and limits
apiVersion: v1
kind: Pod
metadata:
  name: api-server
spec:
  containers:
  - name: api
    image: mycompany/api:v2.1.0
    resources:
      requests:
        cpu: "250m"
        memory: "256Mi"
      limits:
        cpu: "1000m"
        memory: "512Mi"

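# Show actual usage per node (requires metrics-server)
kubectl top nodes

# Compare with the requests and limits already allocated on a node
kubectl describe node <node-name> | grep -A 10 "Allocated resources:"

# Show actual usage per Pod, highest memory first
kubectl top pods --all-namespaces --sort-by=memory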

# Detect OOMKilled containers — look for reason OOMKilled in Last State
kubectl describe pod <pod-name> | grep -A 10 "Last State:"

# Detect CPU throttling using cAdvisor metrics (via Prometheus query)
# container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total
# (both are counters, so compare their increase over the same time window)
# A ratio above 0.25 (25%) signals meaningful throttle pressure
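
As a small illustration of the throttle ratio, the Python sketch below computes the share of CFS periods in which a container was throttled from two counter deltas. The function name, the sample numbers, and the 5-minute window in the PromQL string are assumptions for illustration; the 0.25 threshold is the rule of thumb quoted above.

```python
# Illustrative PromQL (the 5m window is an arbitrary choice):
PROMQL = (
    "sum by (pod, container) (rate(container_cpu_cfs_throttled_periods_total[5m]))"
    " / "
    "sum by (pod, container) (rate(container_cpu_cfs_periods_total[5m]))"
)

THROTTLE_THRESHOLD = 0.25  # rule-of-thumb threshold from the text


def throttle_ratio(throttled_periods_delta: float, total_periods_delta: float) -> float:
    """Fraction of CFS periods in which the container was throttled."""
    if total_periods_delta <= 0:
        return 0.0  # no periods elapsed (for example, the container has no CPU limit)
    return throttled_periods_delta / total_periods_delta


# Over one minute there are 600 periods of 100ms; suppose 150 were throttled.
ratio = throttle_ratio(150, 600)
print(ratio)                       # 0.25
print(ratio > THROTTLE_THRESHOLD)  # False: exactly at the threshold, not above it
```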

QoS Classes: Who Lives When the Node Is Starved

Kubernetes assigns every Pod one of three Quality of Service classes at creation time. The class influences the order in which Pods are evicted or OOM-killed when a node runs out of memory. It is derived automatically from the requests and limits you set; you cannot set it directly.

Guaranteed: Every container in the Pod has both a CPU and a memory limit, and each request equals its limit (an omitted request defaults to the limit). These Pods are evicted last. Use this class for latency-sensitive services (payment processing, authentication) and for StatefulSet members holding data.
Burstable: The Pod does not meet the Guaranteed criteria, but at least one container has a CPU or memory request or limit set. This is the most common class in practice. Under sustained memory pressure, the kubelet turns to Burstable Pods once no BestEffort Pods remain to evict.
BestEffort: No container in the Pod has any requests or limits set. These Pods are the first to be evicted under memory pressure. Appropriate only for batch jobs or development workloads that can tolerate arbitrary termination.

QoS classes and eviction order: BestEffort Pods are evicted first, Guaranteed Pods last, when a node runs low on memory.
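
The classification rules can be expressed as a short function. This Python sketch is a simplified approximation, not Kubernetes source code: the qos_class name and the input shape are hypothetical, and quantities are compared as literal strings, so "1" and "1000m" would be treated as different values.

```python
from typing import Dict, List

Resources = Dict[str, Dict[str, str]]  # {"requests": {...}, "limits": {...}}


def qos_class(containers: List[Resources]) -> str:
    """Approximate the QoS class of a Pod from its containers' resources."""
    any_set = False
    guaranteed = True
    for c in containers:
        limits = c.get("limits", {})
        # An omitted request defaults to the limit, so start from the limits.
        requests = {**limits, **c.get("requests", {})}
        if requests or limits:
            any_set = True
        for res in ("cpu", "memory"):
            if res not in limits or requests.get(res) != limits[res]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


api = {"requests": {"cpu": "250m", "memory": "256Mi"},
       "limits": {"cpu": "1000m", "memory": "512Mi"}}
print(qos_class([api]))                                            # Burstable
print(qos_class([{"limits": {"cpu": "500m", "memory": "256Mi"}}])) # Guaranteed
print(qos_class([{}]))                                             # BestEffort
```

On a live cluster, the assigned class can be read from the Pod's status.qosClass field.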

Setting the Right Values in Production

The hard problem is choosing the right numbers. Too low a memory limit causes OOM kills under normal load spikes; too high a request prevents the scheduler from fitting Pods onto nodes. A common production workflow follows this pattern:

Profile first: Deploy with generous limits and no LimitRange. Run realistic load (production traffic replay or a load test at peak). Collect container_memory_working_set_bytes and container_cpu_usage_seconds_total from Prometheus over 48–72 hours covering weekday and weekend patterns.
Set the request at p90 of observed usage. This leaves headroom for bursts while accurately reflecting typical consumption for scheduling.
Set the memory limit at p99 + 20% buffer. Do not pin the memory limit to the p90 request value: any spike above p90 would then end in an OOM kill.
For Guaranteed Pods: Set requests equal to limits only when you need predictable latency and can afford to pre-reserve the full allocation at all times. Accept the cost: a Guaranteed Pod with cpu: 2 reserves 2 full cores of the node's allocatable capacity even when idle.
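
The percentile rules above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names and sample data are hypothetical, the percentile uses the nearest-rank method, and the samples stand in for container_memory_working_set_bytes values already converted to MiB.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommend_memory(samples_mi: List[float], buffer_pct: int = 20) -> Dict[str, int]:
    """Request at p90, limit at p99 plus a buffer, both rounded up to whole MiB."""
    request = percentile(samples_mi, 90)
    limit = percentile(samples_mi, 99) * (100 + buffer_pct) / 100
    return {"request_mi": math.ceil(request), "limit_mi": math.ceil(limit)}


# Hypothetical profile: mostly 200Mi, occasional 300Mi spikes, one 400Mi outlier.
samples = [200] * 90 + [300] * 9 + [400]
print(recommend_memory(samples))  # {'request_mi': 200, 'limit_mi': 360}
```

The single 400Mi outlier sits above p99 and would still exceed the 360Mi limit, which is the trade-off this rule accepts; widen the buffer if such outliers must survive.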

# Use VPA (Vertical Pod Autoscaler) in recommendation mode to get data-driven suggestions
# Install VPA (if not present) with the script shipped in the kubernetes/autoscaler repository
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

# Create a VPA object in recommend-only mode (does not mutate Pods)
cat <<'EOF' | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"   # recommend only; change to Auto to auto-resize
EOF

# After 24 hours, read the recommendations
kubectl describe vpa api-server-vpa
# Look for: "Lower Bound", "Target", "Upper Bound" per container

Production tip — namespace LimitRange: Set a LimitRange in every namespace so that containers without explicit requests/limits get sensible defaults and are never BestEffort by accident. Define a defaultRequest and a default (limit). This prevents a developer from deploying a container that quietly grabs all available memory and starves neighbours.

# Namespace-level LimitRange — set defaults and enforce a ceiling
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - type: Container
    default:            # applied as limit when none is specified
      cpu: "500m"
      memory: "256Mi"
    defaultRequest:     # applied as request when none is specified
      cpu: "100m"
      memory: "128Mi"
    max:                # hard ceiling — no container may exceed this
      cpu: "4"
      memory: "4Gi"
    min:
      cpu: "50m"
      memory: "64Mi"
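
To show how these defaults land on a container, here is a minimal Python sketch of the defaulting logic. It is an approximation under stated assumptions, not the admission controller's code: the apply_defaults name is hypothetical, min/max validation is left out, and a container that sets a limit but no request is assumed to get a request equal to that limit.

```python
from typing import Dict

# Defaults copied from the LimitRange example.
LIMIT_RANGE = {
    "default": {"cpu": "500m", "memory": "256Mi"},
    "defaultRequest": {"cpu": "100m", "memory": "128Mi"},
}


def apply_defaults(resources: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Fill in missing requests and limits the way the namespace defaults would."""
    limits = dict(resources.get("limits", {}))
    requests = dict(resources.get("requests", {}))
    for res in ("cpu", "memory"):
        had_limit = res in limits
        if not had_limit:
            limits[res] = LIMIT_RANGE["default"][res]
        if res not in requests:
            # Explicit limit without a request: the request follows the limit.
            # Neither set: fall back to defaultRequest.
            requests[res] = limits[res] if had_limit else LIMIT_RANGE["defaultRequest"][res]
    return {"requests": requests, "limits": limits}


print(apply_defaults({}))
# {'requests': {'cpu': '100m', 'memory': '128Mi'}, 'limits': {'cpu': '500m', 'memory': '256Mi'}}
```

A container with no resources block therefore ends up Burstable rather than BestEffort. One thing to watch for: a container that sets only a request larger than the default limit receives a limit below its request and is rejected at admission.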

Production pitfall (CPU limits in latency-sensitive services): Many SRE teams have removed CPU limits from latency-sensitive Deployments while keeping CPU requests. The reason: CFS throttling can add 10–100ms of latency per request even when the node has spare capacity, because the kernel pauses the container as soon as it has used up its quota for the current 100ms period. If your SLO is p99 < 50ms, a CPU limit is likely your hidden enemy. Monitor the ratio of container_cpu_cfs_throttled_periods_total to container_cpu_cfs_periods_total in Grafana and act when it stays above 20%.

ResourceQuota: Namespace-Level Governance

A ResourceQuota object enforces aggregate ceilings across an entire namespace. For example, a staging namespace cannot collectively consume more than 20 CPU cores and 40Gi of memory. It works alongside LimitRange: the LimitRange defines per-container defaults and ceilings; the ResourceQuota defines the total budget for the namespace. Once a quota covers CPU or memory requests or limits, Pods that omit those values are rejected in that namespace, which is another reason to pair the quota with LimitRange defaults. Both are strongly recommended in multi-tenant clusters where teams share a single cluster.

# ResourceQuota for a team namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    pods: "50"
    services: "10"
    persistentvolumeclaims: "20"
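
The budget check a quota performs can be modelled in a few lines. This Python sketch is a simplification under stated assumptions: the admits function and the usage numbers are hypothetical, only the four compute entries of the example quota are modelled, and CPU is expressed in millicores and memory in MiB to keep the arithmetic in integers.

```python
from typing import Dict

# Compute entries of the example quota (CPU in millicores, memory in MiB).
HARD = {
    "requests.cpu": 10_000,
    "requests.memory": 20 * 1024,
    "limits.cpu": 20_000,
    "limits.memory": 40 * 1024,
}


def admits(hard: Dict[str, int], used: Dict[str, int], pod: Dict[str, int]) -> bool:
    """True if adding the Pod keeps every tracked total within the quota."""
    for key, ceiling in hard.items():
        if key not in pod:
            return False  # the quota tracks this resource, so the Pod must declare it
        if used.get(key, 0) + pod[key] > ceiling:
            return False
    return True


used = {"requests.cpu": 9_800, "requests.memory": 4_096,
        "limits.cpu": 12_000, "limits.memory": 8_192}
pod = {"requests.cpu": 250, "requests.memory": 256,
       "limits.cpu": 1_000, "limits.memory": 512}

print(admits(HARD, used, pod))                   # False: requests.cpu would reach 10050m
print(admits(HARD, {}, pod))                     # True
print(admits(HARD, {}, {"requests.cpu": 250}))   # False: other tracked values are missing
```

Current consumption against each ceiling is visible with kubectl describe resourcequota in the namespace.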

Understanding resource requests, limits, QoS classes, LimitRanges, and ResourceQuotas is foundational to operating Kubernetes at any serious scale. These primitives interact with the autoscaler (covered in Lesson 8) and with node eviction, making them one of the highest-leverage configurations you control as a DevOps engineer.