Containers vs VMs
Before you write a single Dockerfile, you need a precise mental model of what a container actually is at the kernel level. Engineers who skip this step treat Docker as a black box and hit mysterious failures in production — from privilege-escalation security holes to processes that escape their expected resource limits. This lesson builds the model from first principles.
The Virtual Machine Model
A virtual machine achieves isolation by emulating an entire hardware stack. A hypervisor (VMware ESXi, KVM, Hyper-V, AWS Nitro) sits between the physical hardware and one or more guest operating systems. Each guest gets a slice of CPU, RAM, and disk that looks — from the guest's perspective — like dedicated hardware. The guest boots its own kernel, manages its own memory pages, and runs its own init system (systemd, OpenRC, etc.).
This isolation is strong: a bug in the guest kernel cannot corrupt the host kernel because the two kernels never share memory. The attack surface between tenant VMs is the hypervisor, and hypervisors are comparatively small, heavily audited codebases. Multi-tenant cloud providers (AWS, GCP, Azure) rely on this guarantee to run competing customers on the same physical host.
But the model is also heavy. Booting a VM requires loading a full kernel (seconds to minutes), allocating memory for the OS overhead (often 200 MB–1 GB of RAM just for the guest OS before your application starts), and maintaining a complete disk image. Cold-starting 50 VMs in response to a traffic spike is measured in minutes, not seconds.
The Container Model: Two Kernel Primitives
Containers are not a new concept invented by Docker. They are built from Linux kernel features that Docker packaged into a usable developer experience in 2013. Two kernel subsystems do the real work: namespaces and cgroups.
Namespaces: What You Can See
A Linux namespace is a wrapper around a global system resource that makes the process inside the namespace believe it has its own isolated instance of that resource. The kernel maintains separate namespaces for:
- pid — Process ID namespace. PID 1 inside a container is just a regular process on the host (say, PID 3841), but the container sees it as PID 1. The container cannot see or signal host processes or processes in other containers.
- net — Network namespace. Each container gets its own loopback interface, its own routing table, its own iptables rules, and its own set of sockets. Two containers can both bind port 8080 without conflict because they live in different network namespaces.
- mnt — Mount namespace. The container's filesystem view is isolated. The container sees a root filesystem (the image layers) that is different from the host's /. You can mount host directories into this namespace, which is the foundation of Docker volumes.
- uts — UNIX Time-sharing System namespace. The container can have its own hostname and domain name, independent of the host.
- ipc — Inter-process communication namespace. Shared memory segments and semaphores are isolated per namespace, preventing cross-container IPC.
- user — User namespace. Maps user IDs inside the container to different user IDs on the host. The container's root (UID 0) can map to an unprivileged UID (e.g., UID 100000) on the host — the foundation of rootless containers.
- cgroup (Linux 4.6+) — Hides the host cgroup hierarchy from the container, so it sees only its own resource limits as the top-level limits.
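To make the user namespace entry in the list above concrete, here is a minimal Python sketch of how a UID inside a container translates to a UID on the host. It assumes the mapping is read from /proc/<pid>/uid_map on the host, where each line holds three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range. The sample mapping and the helper names parse_uid_map and to_host_uid are hypothetical, chosen for illustration.

```python
from typing import List, Optional, Tuple

# One uid_map line: (first ID inside, first ID outside, range length).
UidRange = Tuple[int, int, int]


def parse_uid_map(text: str) -> List[UidRange]:
    """Parse the text of /proc/<pid>/uid_map into (inside, outside, length) tuples."""
    ranges = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            ranges.append((int(fields[0]), int(fields[1]), int(fields[2])))
    return ranges


def to_host_uid(container_uid: int, ranges: List[UidRange]) -> Optional[int]:
    """Translate a UID seen inside the namespace to the UID the host sees."""
    for inside, outside, length in ranges:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None  # unmapped UIDs have no host equivalent


# Hypothetical mapping for a rootless container:
# 65536 IDs, starting at host UID 100000.
sample = "         0     100000      65536\n"
ranges = parse_uid_map(sample)
print(to_host_uid(0, ranges))     # container root -> 100000
print(to_host_uid(1000, ranges))  # ordinary container user -> 101000
```

With this mapping, a process that is root inside the container is an unprivileged user from the host's point of view, which is the property rootless containers rely on.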
You can inspect the namespace membership of any process directly from the host:
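One way to do this, sketched in Python: the kernel exposes each process's namespaces as symbolic links under /proc/<pid>/ns, and the link target (for example pid:[4026531836]) identifies the namespace instance. Two processes whose links have the same target are in the same namespace. The function names and the proc_root parameter are illustrative assumptions, and reading the links of another user's process generally requires root.

```python
import os
from typing import Dict, List


def namespace_ids(pid: str = "self", proc_root: str = "/proc") -> Dict[str, str]:
    """Return {namespace type: identifier} for one process."""
    ns_dir = os.path.join(proc_root, pid, "ns")
    ids = {}
    for name in sorted(os.listdir(ns_dir)):
        # Each entry is a symlink whose target names the namespace instance.
        ids[name] = os.readlink(os.path.join(ns_dir, name))
    return ids


def shared_namespaces(a: Dict[str, str], b: Dict[str, str]) -> List[str]:
    """Namespace types for which two processes are in the same instance."""
    return sorted(name for name in a.keys() & b.keys() if a[name] == b[name])
```

Calling namespace_ids() with no arguments inspects the current process. Passing the host PID of a container's main process and comparing the result against namespace_ids("1") with shared_namespaces shows which namespaces the container shares with the host's init process.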
cgroups: What You Can Consume
Namespaces control visibility. Control groups (cgroups) control consumption. A cgroup is a kernel mechanism for grouping processes and enforcing limits on the resources they collectively use. Docker translates every --memory, --cpus, and --pids-limit flag into cgroup entries in /sys/fs/cgroup/.
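As an illustration of that translation, the sketch below maps the three flags to the cgroup v2 interface files you would expect to find in the container's cgroup directory. It assumes a cgroup v2 host and Docker's default CPU period of 100000 microseconds. The helper names are hypothetical, and the real translation is performed by Docker and the runtime, not by code like this.

```python
from typing import Dict, Optional

CPU_PERIOD_US = 100_000  # assumed default CFS period, in microseconds

_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(value: str) -> int:
    """Convert a Docker-style size such as '512m' or '2g' to bytes."""
    value = value.strip().lower()
    if value[-1] in _UNITS:
        return int(value[:-1]) * _UNITS[value[-1]]
    return int(value)  # a bare number is already in bytes


def cgroup_v2_settings(memory: Optional[str] = None,
                       cpus: Optional[float] = None,
                       pids_limit: Optional[int] = None) -> Dict[str, str]:
    """Map docker run limits to the cgroup v2 files they correspond to."""
    files = {}
    if memory is not None:
        files["memory.max"] = str(parse_size(memory))
    if cpus is not None:
        # cpu.max holds "<quota> <period>": the group may run for
        # quota microseconds out of every period microseconds.
        files["cpu.max"] = f"{round(cpus * CPU_PERIOD_US)} {CPU_PERIOD_US}"
    if pids_limit is not None:
        files["pids.max"] = str(pids_limit)
    return files


print(cgroup_v2_settings(memory="512m", cpus=1.5, pids_limit=100))
# {'memory.max': '536870912', 'cpu.max': '150000 100000', 'pids.max': '100'}
```

Reading these files back from the host is a quick way to confirm that a limit you passed on the command line was actually applied.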
The two cgroup versions differ significantly:
- cgroups v1 (legacy) — Each resource controller (cpu, memory, blkio, pids, …) has its own independent hierarchy under /sys/fs/cgroup/<controller>/. A process can sit at different positions in different hierarchies simultaneously, which is complex and error-prone.
- cgroups v2 (unified; the default on current distributions such as Ubuntu 22.04 and RHEL 9) — A single unified hierarchy. All controllers are under /sys/fs/cgroup/. Simpler delegation model, better support for rootless containers, pressure stall information (PSI) for memory and CPU, and improved OOM killer behavior. Which version is mounted is decided by the distribution and its init system, not by the kernel version alone. Prefer v2 on new systems.
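To check which layout a host uses, a common heuristic is to look for the cgroup.controllers file, which exists only at the root of a unified (v2) hierarchy. The Python sketch below applies that heuristic. The function name and return values are illustrative, and a host running in hybrid mode (v1 controllers plus a separate unified mount) is reported as v1 here.

```python
import os


def cgroup_version(mount_point: str = "/sys/fs/cgroup") -> str:
    """Classify the cgroup layout at mount_point as 'v2', 'v1', or 'none'."""
    if os.path.isfile(os.path.join(mount_point, "cgroup.controllers")):
        return "v2"  # present only at the root of a unified hierarchy
    if os.path.isdir(os.path.join(mount_point, "memory")):
        return "v1"  # per-controller directories indicate the legacy layout
    return "none"
```

Calling cgroup_version() with no arguments inspects the running host. The mount_point parameter exists so the check can also be pointed at a different mount or at a test directory.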
--memory without --memory-swap allows the container to use additional swap equal to the memory limit, so memory plus swap can total 2× the --memory value. On a host with heavy swap usage this causes severe latency spikes. Set both flags explicitly; for latency-sensitive workloads, set --memory-swap equal to --memory so the container gets no swap at all.
The Architectural Difference — Visualized
The diagram below shows exactly what is shared and what is isolated in each model. This is the diagram to internalize:
Why Containers Won (Operationally)
The container model delivers three practical advantages that drove adoption at scale:
- Startup time: A container process starts in 50–300 ms because no kernel needs to boot. A VM needs 5–60 seconds even with an optimized image. At Kubernetes scale — where pods are created and destroyed continuously in response to load — this difference determines whether autoscaling can keep up with traffic spikes.
- Density: A 4 vCPU / 16 GB RAM VM running bare Ubuntu loses roughly 1–2 GB to the OS before your application sees a byte. Containers add almost no OS overhead (the host kernel is already running). On the same hardware you might run 5 VMs or 150 containers, which is the economic driver behind container orchestration.
- Image portability: A container image bundles exactly the libraries and binaries the application needs. The image that passes your CI pipeline is the exact same binary artifact that runs in production. With VMs, the "works on my machine" problem was partly replaced by "works in staging" — configuration drift between image builds and VM base images was a constant operational burden.
The Security Trade-off — What Containers Are NOT
The shared kernel is the source of containers' speed and also their primary security limitation. If a process inside a container exploits a kernel vulnerability (a container escape), it can gain access to the host and all other containers on it. VMs do not expose this path: a kernel vulnerability in a guest compromises only that guest's kernel, and an attacker would need a separate hypervisor flaw to reach the host.
Google's gVisor and Amazon's Firecracker address this by adding an additional isolation layer: gVisor interposes on syscalls with a userspace kernel, while Firecracker runs containers inside lightweight VMs (microVMs) that can boot in as little as about 125 ms. Kubernetes supports the RuntimeClass API to schedule specific pods onto more isolated runtimes.
For most workloads on a private cluster, Linux namespaces + cgroups + seccomp profiles + AppArmor/SELinux provide adequate isolation. For multi-tenant SaaS (running untrusted customer code) or anything processing sensitive regulated data on shared infrastructure, the defense-in-depth argument for gVisor or Firecracker is strong.
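Whether some of these hardening layers are active for a given process can be read from the host. The sketch below parses three fields of /proc/<pid>/status: Seccomp (0 = disabled, 1 = strict, 2 = filter), NoNewPrivs, and CapEff (the effective capability bitmask, in hex). The sample text and the function name are hypothetical, and AppArmor or SELinux labels are not covered by this sketch.

```python
from typing import Dict

SECCOMP_MODES = {"0": "disabled", "1": "strict", "2": "filter"}


def hardening_summary(status_text: str) -> Dict[str, str]:
    """Extract seccomp mode, no_new_privs and CapEff from status file text."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {
        "seccomp": SECCOMP_MODES.get(fields.get("Seccomp", ""), "unknown"),
        "no_new_privs": fields.get("NoNewPrivs", "unknown"),
        "effective_caps": fields.get("CapEff", "unknown"),
    }


# Hypothetical excerpt of a status file for a containerized process.
sample = "Name:\tnginx\nCapEff:\t00000000a80425fb\nNoNewPrivs:\t0\nSeccomp:\t2\n"
print(hardening_summary(sample))
# {'seccomp': 'filter', 'no_new_privs': '0', 'effective_caps': '00000000a80425fb'}
```

A seccomp value of "disabled" on a container that is supposed to run with a seccomp profile is a sign that the profile was turned off or never applied.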
Putting It Together: What Happens When You Run a Container
When you execute docker run nginx:alpine, the following sequence happens, most of it in under 300 ms once the image is available locally:
- Docker CLI sends an HTTP request to the Docker daemon (dockerd) through the Docker Engine REST API, by default over a local Unix socket.
- dockerd delegates to containerd (the industry-standard container runtime, now a CNCF project), talking to it over gRPC.
- containerd invokes runc (the OCI-compliant low-level runtime) to create the container.
- runc calls clone(2) with namespace flags (CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC) to create a new process in fresh namespaces.
- runc writes cgroup entries under /sys/fs/cgroup/ to enforce CPU and memory limits.
- The image layers are mounted as an overlay filesystem (OverlayFS) on the host and presented to the container as its root filesystem.
- The container process starts — it sees PID 1, a private network interface, and an isolated filesystem.
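The namespace flags in the runc step are plain bit flags that get OR-ed into a single integer argument. The Python sketch below uses the constant values from <linux/sched.h> to build and decode that mask. It only illustrates the flag arithmetic: it does not create any namespaces, and the combine and decode helpers are hypothetical names.

```python
# Namespace flags from <linux/sched.h>.
CLONE_FLAGS = {
    "mnt": 0x00020000,     # CLONE_NEWNS
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "uts": 0x04000000,     # CLONE_NEWUTS
    "ipc": 0x08000000,     # CLONE_NEWIPC
    "user": 0x10000000,    # CLONE_NEWUSER
    "pid": 0x20000000,     # CLONE_NEWPID
    "net": 0x40000000,     # CLONE_NEWNET
}


def combine(names):
    """OR together the clone flags for the given namespace types."""
    mask = 0
    for name in names:
        mask |= CLONE_FLAGS[name]
    return mask


def decode(mask):
    """List the namespace types whose flag is set in mask."""
    return sorted(name for name, bit in CLONE_FLAGS.items() if mask & bit)


mask = combine(["pid", "net", "mnt", "uts", "ipc"])
print(hex(mask))     # 0x6c020000
print(decode(mask))  # ['ipc', 'mnt', 'net', 'pid', 'uts']
```

The user and cgroup flags are absent from this mask, matching the five flags named in the sequence above. A runtime configured for user or cgroup namespaces would set those bits as well.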
This is the complete mental model: namespaces for visibility isolation, cgroups for resource enforcement, and a layered filesystem for image portability. Everything else in Docker — Dockerfiles, volumes, networks, Compose — is built on top of these three primitives.