Running Containers Securely
A container is not a VM. It shares the host kernel, so a compromised process in a misconfigured container can escape to the host, read secrets from other containers, or pivot across your entire cluster. This lesson covers four runtime controls that production teams commonly apply as a baseline: non-root users, read-only filesystems, Linux capabilities, and seccomp / AppArmor profiles. Each layer narrows the blast radius of a compromise independently; used together they provide defense in depth, because an attacker has to defeat every layer rather than just one.
Why Root Inside a Container Is Dangerous
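By default, processes inside a Docker container run as UID 0 (root). Unless user-namespace remapping (userns-remap) or rootless mode is enabled, and neither is on by default, that UID 0 is the same UID 0 as on the host. Namespaces, cgroups, and the default capability set constrain what it can do, but if a container escape vulnerability exists, and several have been found (CVE-2019-5736 in runc, for example), the escaped process is root on the host. That means full access to every file on the host, every mounted volume, and every running process.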
The fix is to declare a non-root user in your Dockerfile and give it a high, fixed UID. A widely used convention is to pick a UID in the range 10000–65534 to avoid collisions with host system accounts.
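A minimal sketch of such a Dockerfile, assuming a statically linked Go binary on a distroless base image that has no shell. The image tags, paths, and UID 10001 are illustrative assumptions, not values prescribed by this lesson.

```dockerfile
# Build stage: compile a static binary (Go is assumed here for illustration).
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /out/server .

# Runtime stage: distroless has no shell or adduser, so the user is set numerically.
FROM gcr.io/distroless/static-debian12
COPY --from=build /out/server /app/server
# A numeric UID:GID needs no /etc/passwd entry, and Kubernetes can verify it against runAsNonRoot.
USER 10001:10001
ENTRYPOINT ["/app/server"]
```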
For images that have a shell and user-management tools (Alpine, Debian-slim), create the user explicitly so that its UID and home directory are deterministic.
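A sketch for an Alpine base, using BusyBox's addgroup and adduser. The user name app, the IDs, and the file paths are hypothetical.

```dockerfile
FROM alpine:3.20
# Fixed GID and UID; -D creates the account without a password, -h pins the home directory.
RUN addgroup -g 10001 app \
 && adduser -D -u 10001 -G app -h /home/app -s /sbin/nologin app
# On Debian-based images the rough equivalent is:
#   groupadd --gid 10001 app && useradd --uid 10001 --gid app --create-home app
COPY --chown=10001:10001 ./server /app/server
USER 10001:10001
ENTRYPOINT ["/app/server"]
```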
Avoid USER nobody. The nobody user (UID 65534) is a shared catch-all identity: other daemons may run as it, and NFS and user namespaces map unmapped IDs to it, so it can have unexpected filesystem permissions on some host configurations. Use a purpose-built user with a fixed, documented UID instead.
Read-Only Filesystems
Even with a non-root user, a compromised process can write malicious binaries into /tmp, overwrite application files, or install persistence tools if the container filesystem is writable. Running the container with --read-only mounts its root filesystem read-only, so the kernel rejects every such write with a read-only filesystem error (EROFS): no dropped binary, overwritten file, or persistence script ever lands on disk.
Most applications need some writable space — for /tmp, PID files, or caches. The correct pattern is to grant write access only to specific, known directories using tmpfs, while keeping everything else immutable:
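One way to express this pattern with the Docker CLI is sketched below. The image name, the choice of /tmp and /run, and the size limits are assumptions; adjust them to the paths your application really writes to.

```bash
# Hypothetical image name; substitute your own.
IMAGE="registry.example.com/myapp:1.0"

readonly_flags=(
  --read-only                               # root filesystem becomes immutable
  --tmpfs /tmp:rw,noexec,nosuid,size=64m    # scratch space; files in it cannot be executed
  --tmpfs /run:rw,noexec,nosuid,size=1m     # PID files and sockets
)

docker run -d --name myapp "${readonly_flags[@]}" "$IMAGE"
```

Keeping the flags in a bash array allows a comment per flag, which a backslash-continued command does not. The noexec option matters: it stops an attacker from dropping a binary into the one writable location and running it.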
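In Kubernetes, the equivalent is readOnlyRootFilesystem: true in the container's securityContext, with an emptyDir volume mounted at each path that must stay writable.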
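A sketch of a pod spec along these lines. The pod name, image, UID 10001, and the 64Mi limit are hypothetical; the securityContext field names are the standard Kubernetes ones.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-app
spec:
  securityContext:            # pod-level: applies to every container
    runAsNonRoot: true
    runAsUser: 10001
    runAsGroup: 10001
  containers:
    - name: app
      image: registry.example.com/myapp:1.0
      securityContext:        # container-level settings
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
      volumeMounts:
        - name: tmp
          mountPath: /tmp     # the only writable path
  volumes:
    - name: tmp
      emptyDir:
        medium: Memory        # tmpfs-backed, like --tmpfs in Docker
        sizeLimit: 64Mi
```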
Setting allowPrivilegeEscalation: false in the container's securityContext sets the kernel's no_new_privs flag on the container process, so setuid binaries such as sudo cannot grant extra privileges even if they exist in the image. This is separate from the read-only filesystem and should always be set alongside it.
Linux Capabilities — Precision Privilege Pruning
Root is not a single switch. The Linux kernel divides root privilege into ~40 independent capabilities — CAP_NET_BIND_SERVICE (bind ports below 1024), CAP_SYS_PTRACE (attach debuggers), CAP_SYS_ADMIN (almost everything else), and so on. Docker grants a container a default set of about 14 capabilities. That set is narrower than full root but still far wider than most application processes need.
The production approach is drop all capabilities, then add back only what the process actually requires:
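As a sketch, assuming a hypothetical image whose process only needs to bind port 80:

```bash
# Hypothetical image name; substitute your own.
IMAGE="registry.example.com/myapp:1.0"

cap_flags=(
  --cap-drop ALL                # start from an empty capability set
  --cap-add NET_BIND_SERVICE    # add back only the ability to bind ports below 1024
)

docker run -d --name myapp -p 80:80 "${cap_flags[@]}" "$IMAGE"
```

If the application can listen on an unprivileged port such as 8080 instead, the --cap-add line can be removed entirely and the container runs with no capabilities at all.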
Run docker run --rm -it --cap-drop ALL ubuntu:24.04 capsh --print to inspect the capability set interactively (capsh ships in the libcap2-bin package, so install that first if the image lacks it). Use pscap (from the libcap-ng-utils package) on the host to verify which capabilities running container processes actually hold.
Seccomp — Syscall Filtering
Even with all capabilities dropped, a container process can still invoke hundreds of Linux system calls. A compromised process can use syscalls like ptrace, keyctl, or clone with specific flags to attempt privilege escalation. Seccomp (Secure Computing Mode) is a Linux kernel feature that restricts which syscalls a process may invoke. Docker ships a default seccomp profile, written as an allowlist, that ends up blocking around 44 of the 300+ available syscalls while allowing everything a well-behaved application needs.
For higher assurance, generate a tight application-specific profile. The workflow: run your application under a profile whose default action is SCMP_ACT_LOG, which logs syscalls instead of blocking them, exercise every code path, then build an allowlist from that log. Tools like oci-seccomp-bpf-hook or Falco can automate the recording step.
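A sketch of the recording step with oci-seccomp-bpf-hook. It assumes Podman with the hook installed, because the hook is triggered through an OCI annotation rather than through a Docker CLI flag. The image name and output path are hypothetical.

```bash
# Hypothetical image name and output path.
IMAGE="registry.example.com/myapp:1.0"
PROFILE="/tmp/myapp-seccomp.json"

# Record: the hook traces the container's syscalls and writes an allowlist
# profile to the "of:" (output file) path when the container exits.
# Run as root, since the hook attaches eBPF probes.
podman run --rm --annotation io.containers.trace-syscall="of:${PROFILE}" "$IMAGE"

# Enforce: run again under the generated profile and confirm nothing breaks.
podman run --rm --security-opt seccomp="${PROFILE}" "$IMAGE"
```

The generated profile only covers syscalls seen during the recording run, so drive the application through its full workload (startup, requests, shutdown) before trusting the result.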
A minimal seccomp profile JSON has this structure — it defaults to allow and blocks named syscalls (or vice versa with an allowlist approach):
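A sketch of the denylist form, using the JSON layout that Docker accepts via --security-opt seccomp=<file>. The particular syscalls listed are an illustrative selection, not a recommended complete set.

```json
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
  "syscalls": [
    {
      "names": ["ptrace", "keyctl", "add_key", "request_key"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
```

For the allowlist form, swap the two actions: set defaultAction to SCMP_ACT_ERRNO and list the permitted syscalls under an entry whose action is SCMP_ACT_ALLOW.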
AppArmor — Mandatory Access Control for Paths and Networks
AppArmor (Application Armor) is a Linux Security Module that enforces a Mandatory Access Control policy on top of discretionary Unix permissions. Where seccomp operates at the syscall level, AppArmor operates at the object level — it controls which files, directories, sockets, and network operations a specific executable is allowed to perform, regardless of the Unix user running it.
Docker applies a default AppArmor profile named docker-default to every container on systems that have AppArmor enabled (Ubuntu, Debian, openSUSE). You can load and assign a custom profile per container:
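A sketch of the load-and-assign steps, assuming a profile named myapp-profile saved under /etc/apparmor.d/containers/. The profile name, file path, and image are hypothetical.

```bash
# Hypothetical names; substitute your own.
IMAGE="registry.example.com/myapp:1.0"
PROFILE_NAME="myapp-profile"
PROFILE_FILE="/etc/apparmor.d/containers/${PROFILE_NAME}"

# Load the profile into the kernel (-r replaces an already-loaded copy,
# -W writes the compiled profile to the cache). Requires root.
apparmor_parser -r -W "$PROFILE_FILE"

# Attach the loaded profile to a container. The value must match the name
# declared inside the profile file, not the file name.
docker run -d --name myapp --security-opt "apparmor=${PROFILE_NAME}" "$IMAGE"
```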
A production AppArmor profile for a Go HTTP service looks like this — it whitelists only the paths the process legitimately touches:
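A sketch of what such a profile might contain. The profile name, the /app paths, and the exact rule set are assumptions about a hypothetical service; a real profile should be derived from the paths and sockets your binary actually uses.

```text
#include <tunables/global>

profile myapp-profile flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>

  # TCP for HTTP traffic, UDP for DNS lookups; no other socket types.
  network inet tcp,
  network inet6 tcp,
  network inet udp,
  network inet6 udp,

  # The service binary may be read, mapped, and executed under this same profile.
  /app/server mrix,

  # Read-only configuration and CA certificates for outbound TLS.
  /app/config/** r,
  /etc/ssl/certs/** r,

  # Scratch space, expected to be backed by tmpfs.
  /tmp/** rw,

  # Explicit denials for sensitive paths.
  deny /proc/sys/** w,
  deny /sys/** w,
  deny /etc/shadow r,
}
```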
Putting It All Together — A Hardened Run Command
These four controls compose cleanly. A production container launch that applies all of them simultaneously:
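One possible shape for that launch is sketched below. The image name, UID, profile name, and seccomp file path are hypothetical, and the AppArmor profile is assumed to be loaded on the host already.

```bash
# Hypothetical image name; substitute your own.
IMAGE="registry.example.com/myapp:1.0"

hardening=(
  --user 10001:10001                                      # non-root UID:GID
  --read-only                                             # immutable root filesystem
  --tmpfs /tmp:rw,noexec,nosuid,size=64m                  # the only writable path
  --cap-drop ALL                                          # no capabilities; the app listens on port 8080
  --security-opt no-new-privileges:true                   # block privilege gain across execve
  --security-opt seccomp=/etc/docker/seccomp/myapp.json   # custom syscall filter
  --security-opt apparmor=myapp-profile                   # custom MAC profile
)

docker run -d --name myapp -p 8080:8080 "${hardening[@]}" "$IMAGE"
```

Listening on an unprivileged port lets the container run with no capabilities at all, which avoids having to add NET_BIND_SERVICE back.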
The flag --security-opt no-new-privileges:true deserves special mention: it sets the kernel's no_new_privs bit (via prctl with PR_SET_NO_NEW_PRIVS), which prevents any process in the container, including child processes, from gaining privileges across an execve call, whether through setuid or setgid binaries or through file capabilities on executables. It is the final safety net after all the controls above.
Enforce this baseline at admission time with an OPA Gatekeeper ConstraintTemplate that rejects pods missing any of readOnlyRootFilesystem: true, runAsNonRoot: true, or allowPrivilegeEscalation: false. Gate policies in CI with conftest before any manifest reaches the cluster.