Resource Limits & cgroups
A container is, at its core, a process — or a tree of processes — running on the host kernel. Without explicit limits, a single runaway container can consume every CPU cycle and every byte of RAM on the host, bringing down every other container and the node itself. Linux control groups (cgroups) are the kernel mechanism that prevents this: they enforce hard boundaries on CPU, memory, I/O, and PIDs per process group. Docker and Kubernetes both sit on top of cgroups to implement their resource-limit models.
How cgroups Work Under the Hood
When Docker starts a container, the daemon creates a cgroup for it under /sys/fs/cgroup/ and places every container process inside it. The kernel then enforces the limits set on that cgroup, no matter how aggressively the container tries to exceed them. With cgroups v2 (available since Linux 4.5 and the default on most current distributions), the hierarchy is unified into a single tree and the accounting is more accurate, particularly for memory.
In cgroups v2, the memory.max knob replaces memory.limit_in_bytes from v1, and CPU is controlled via cpu.max instead of cpu.cfs_quota_us and cpu.cfs_period_us. The Docker and kubectl CLI flags remain the same regardless: the engine translates for you.
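To make that translation concrete, here is a minimal sketch of how a --cpus value and a byte count map onto the strings held in the cgroup v2 cpu.max and memory.max files. The function names are hypothetical, and the 100 ms period is the usual CFS default, assumed here rather than stated above.

```python
from typing import Optional


def cpu_max_line(cpus: float, period_us: int = 100_000) -> str:
    """Build the 'QUOTA PERIOD' string that cgroup v2 keeps in cpu.max."""
    if cpus <= 0:
        # The literal word 'max' means "no quota" in cgroup v2.
        return f"max {period_us}"
    quota_us = int(round(cpus * period_us))
    return f"{quota_us} {period_us}"


def memory_max_line(limit_bytes: Optional[int]) -> str:
    """Build the value cgroup v2 keeps in memory.max ('max' means unlimited)."""
    return "max" if limit_bytes is None else str(limit_bytes)


if __name__ == "__main__":
    print(cpu_max_line(1.5))                    # 150000 100000
    print(memory_max_line(512 * 1024 * 1024))   # 536870912
```

On a cgroups v2 host, comparing these strings with the contents of the container's cpu.max and memory.max files is a quick way to confirm that a flag took effect.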
Memory Limits and OOM Behavior
Memory is the most dangerous unbounded resource. When a container exceeds its memory limit, the kernel OOM (Out-Of-Memory) killer terminates a process inside the cgroup. In Docker, this manifests as the container exiting with status 137 (SIGKILL from the kernel). Without a limit, the OOM killer may choose any process on the host — including the Docker daemon or a process in a completely unrelated container.
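The 137 status follows the common convention of reporting 128 plus the signal number for a signal-terminated process. The sketch below decodes such a status; the helper name is hypothetical, and it assumes a Unix host where SIGKILL is signal 9.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Translate a container exit status into a human-readable cause."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with code {status}"


if __name__ == "__main__":
    print(describe_exit_status(137))  # killed by SIGKILL
    print(describe_exit_status(1))    # exited with code 1
```

Status 137 only says that the process received SIGKILL. Running `docker inspect --format '{{.State.OOMKilled}}' <container>` reports whether the OOM killer was the sender.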
Setting --memory-swap equal to --memory disables swap entirely for the container, which sounds safe but causes OOM kills at the memory limit with no soft landing. Keep in mind that --memory-swap is the combined total of memory plus swap, not the swap amount on its own. In production, either allow a small amount of swap (a --memory-swap value of 1.5x to 2x the memory limit) or disable swap at the host level entirely for latency-sensitive workloads and rely on fast OOM kills as your circuit breaker.
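The arithmetic behind these flags is easy to get backwards, so here is a small sketch. It relies on Docker's documented semantics that --memory-swap is the combined memory-plus-swap total and that -1 means unlimited swap. The helper name is hypothetical.

```python
from typing import Optional

MIB = 1024 ** 2


def swap_allowance(memory_bytes: int, memory_swap_bytes: int) -> Optional[int]:
    """Return the swap a container may use, or None when swap is unlimited."""
    if memory_swap_bytes == -1:
        return None  # -1 requests unlimited swap
    if memory_swap_bytes < memory_bytes:
        raise ValueError("--memory-swap must be at least as large as --memory")
    # The flag is a combined total, so the swap share is the difference.
    return memory_swap_bytes - memory_bytes


if __name__ == "__main__":
    print(swap_allowance(512 * MIB, 512 * MIB) // MIB)  # 0   (swap disabled)
    print(swap_allowance(512 * MIB, 768 * MIB) // MIB)  # 256 (1.5x the limit)
```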
You can also influence which process the OOM killer picks by using --oom-score-adj. The flag sets the oom_score_adj value of the container's processes, which ranges from -1000 (never kill) to +1000 (kill first). Setting a high value on your container makes it the preferred victim when the host itself runs out of memory, protecting host-level processes.
CPU Limits: Shares, Quotas, and Periods
CPU limiting works differently from memory because CPU time is compressible — a container that requests more than its share is throttled, not killed. Docker exposes two independent knobs:
- --cpus (or --cpu-quota + --cpu-period): sets a hard ceiling. --cpus=1.5 means the container may use at most 1.5 CPU-seconds per second, regardless of available capacity. Implemented as a CFS (Completely Fair Scheduler) quota.
- --cpu-shares: sets a relative weight (default 1024). Only takes effect when CPUs are contended. A container with 2048 shares gets twice the CPU time of a 1024-share container when both are busy. When the host is idle, shares are irrelevant: a low-share container can burst freely.
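The interaction of the two knobs can be sketched numerically. The model below is deliberately simplified: it assumes every container is fully busy, and it does not redistribute the capacity a capped container leaves unused. Function names are hypothetical.

```python
from typing import Dict


def contended_split(shares: Dict[str, int], host_cpus: float) -> Dict[str, float]:
    """Divide host CPUs in proportion to --cpu-shares when all containers are busy."""
    total = sum(shares.values())
    return {name: host_cpus * weight / total for name, weight in shares.items()}


def apply_ceiling(allocation: float, cpus_limit: float) -> float:
    """A --cpus quota caps whatever the share-based split would have granted."""
    return min(allocation, cpus_limit)


if __name__ == "__main__":
    split = contended_split({"api": 2048, "batch": 1024}, host_cpus=3.0)
    print(split)                             # {'api': 2.0, 'batch': 1.0}
    print(apply_ceiling(split["api"], 1.5))  # 1.5
```

The 2048-share container gets twice the CPU time of the 1024-share one under contention, and a --cpus=1.5 quota still cuts it off at 1.5 CPUs.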
ulimits: File Descriptors and Process Counts
Beyond CPU and memory, two other resources cause subtle production failures: open file descriptors and process/thread counts. Both are controlled via ulimit-style settings (--ulimit on docker run) that each container inherits from the daemon defaults unless you override them per container.
A high-concurrency service (database, web server, message broker) can exhaust file descriptor limits under load, causing cryptic "too many open files" errors long before it runs out of CPU or memory. Likewise, a fork bomb or a thread-leaking JVM can consume all PIDs on the host, rendering the node unable to start any new process — including recovery scripts.
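To see how close a process is to the file descriptor ceiling, a service can inspect its own limit and usage. This is a minimal sketch that assumes a Linux host with /proc mounted. The helper names are hypothetical; the `resource` module is part of the Python standard library on Unix.

```python
import os
import resource
from typing import Tuple


def fd_ratio(open_fds: int, soft_limit: int) -> float:
    """Fraction of the soft RLIMIT_NOFILE in use (0.0 when the limit is unlimited)."""
    if soft_limit == resource.RLIM_INFINITY or soft_limit <= 0:
        return 0.0
    return open_fds / soft_limit


def current_fd_usage() -> Tuple[int, int]:
    """Return (open descriptors, soft limit) for the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Each entry in /proc/self/fd is one open descriptor (Linux only).
    open_fds = len(os.listdir("/proc/self/fd"))
    return open_fds, soft


if __name__ == "__main__" and os.path.isdir("/proc/self/fd"):
    used, limit = current_fd_usage()
    print(f"{used} of {limit} descriptors in use ({fd_ratio(used, limit):.1%})")
```

Running this inside a container shows the limit the container actually received, which may differ from the host's.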
Kubernetes Resource Requests and Limits
In Kubernetes, CPU and memory are configured at the container level inside a Pod spec. There are two distinct concepts: requests (what the scheduler reserves: the node must have this much unallocated capacity) and limits (the cgroup ceiling: the container cannot exceed this). If you set only a limit, Kubernetes copies it into the request. The common mistake runs the other way: omitting both, or setting requests far below real usage, leads the scheduler to place too many pods on a node.
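Because the scheduler packs nodes by requests and ignores limits, the size of the request determines how many pods land on a node. The sketch below shows that arithmetic. It handles only a subset of Kubernetes quantity notation (the "m" CPU suffix and Ki/Mi/Gi memory suffixes), and all function names are hypothetical.

```python
_MEMORY_SUFFIXES = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '250m' or '1.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '128Mi' or a plain byte count to bytes."""
    for suffix, factor in _MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def pods_that_fit(node_cpu: str, node_memory: str,
                  request_cpu: str, request_memory: str) -> int:
    """How many identical pods fit on a node when packed by requests alone."""
    by_cpu = cpu_millicores(node_cpu) // cpu_millicores(request_cpu)
    by_memory = memory_bytes(node_memory) // memory_bytes(request_memory)
    return min(by_cpu, by_memory)


if __name__ == "__main__":
    print(pods_that_fit("4", "8Gi", "250m", "1Gi"))    # 8
    print(pods_that_fit("4", "8Gi", "100m", "128Mi"))  # 40
```

The second call shows the overcommit risk: with tiny requests, 40 pods are admitted to a 4-CPU node, no matter how high their limits are.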
Monitor CPU throttling (container_cpu_cfs_throttled_seconds_total): heavy throttling is as damaging as OOM kills for latency-sensitive services.
Verifying and Monitoring Limits in Production
Setting limits is only half the work. You must also verify they are being enforced and alert when containers approach them. Key signals to watch:
- Memory usage percentage: alert at 80% of the limit. At 100% you get killed with no warning.
- OOM kill counter: container_oom_events_total in Prometheus; any value above zero is a production incident.
- CPU throttle ratio: container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total; above 25% suggests your CPU limit is too low.
- File descriptor usage: compare the /proc/<pid>/fd count against the ulimit for long-running services.
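The thresholds in this list can be expressed as a small check. This sketch uses hypothetical function and alert names. It also assumes the throttle counters are passed as deltas over a recent window, since the raw Prometheus counters are cumulative.

```python
from typing import List


def throttle_ratio(throttled_periods: int, total_periods: int) -> float:
    """Share of CFS periods in which the container was throttled."""
    return throttled_periods / total_periods if total_periods else 0.0


def check_container(mem_usage: int, mem_limit: int,
                    throttled_periods: int, total_periods: int,
                    oom_events: int) -> List[str]:
    """Return the alerts that fire under the thresholds listed above."""
    alerts = []
    if mem_limit and mem_usage / mem_limit >= 0.80:
        alerts.append("memory_near_limit")
    if oom_events > 0:
        alerts.append("oom_kill")
    if throttle_ratio(throttled_periods, total_periods) > 0.25:
        alerts.append("cpu_throttled")
    return alerts


if __name__ == "__main__":
    # 90% memory use and 30% throttled periods: two alerts fire.
    print(check_container(900, 1000, 30, 100, 0))
```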
Deploy a LimitRange object in every namespace: it automatically injects default requests and limits into any container that omits them, so a misconfigured deployment cannot bypass the policy.
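The defaulting behaviour can be modelled in a few lines. This is a simplified sketch, not the admission controller's actual code. It assumes the LimitRange defines both `default` (limits) and `defaultRequest` for each resource, and it covers only the three documented cases: neither value set, only a limit set, and only a request set. The function name is hypothetical.

```python
from typing import Dict

Resources = Dict[str, Dict[str, str]]


def apply_limitrange_defaults(resources: Resources,
                              default_limits: Dict[str, str],
                              default_requests: Dict[str, str]) -> Resources:
    """Fill in missing requests and limits the way a LimitRange default would."""
    limits = dict(resources.get("limits", {}))
    requests = dict(resources.get("requests", {}))
    for name in sorted(set(default_limits) | set(default_requests)):
        had_own_limit = name in limits
        if not had_own_limit and name in default_limits:
            limits[name] = default_limits[name]
        if name not in requests:
            if had_own_limit:
                # An explicit limit without a request: the request matches the limit.
                requests[name] = limits[name]
            elif name in default_requests:
                requests[name] = default_requests[name]
    return {"limits": limits, "requests": requests}


if __name__ == "__main__":
    defaults = {"cpu": "500m", "memory": "256Mi"}
    default_reqs = {"cpu": "100m", "memory": "128Mi"}
    print(apply_limitrange_defaults({}, defaults, default_reqs))
    print(apply_limitrange_defaults({"limits": {"memory": "1Gi"}}, defaults, default_reqs))
```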