Advanced Docker & Container Security

Running Containers Securely

A container is not a VM. It shares the host kernel, so a compromised container process with the wrong runtime configuration can escape to the host, read secrets from other containers, or pivot across your entire cluster. This lesson covers the four runtime controls that production teams at large tech companies apply as a baseline: non-root users, read-only filesystems, Linux capabilities, and seccomp / AppArmor profiles. Each layer independently narrows the blast radius of a compromise; used together they provide defense in depth.

Why Root Inside a Container Is Dangerous

By default, processes inside a Docker container run as UID 0 (root). Docker does not enable user-namespace remapping by default, so that UID 0 is the same UID 0 the host kernel sees. Namespaces and the default capability set constrain it, but if a container escape vulnerability exists (and such vulnerabilities have been found repeatedly), a root container process becomes root on the host. That means full access to every file on the host, every mounted volume, and every running process.

The fix is to declare a non-root user in your Dockerfile with a high, fixed UID. A common convention is to pick a UID in the range 10000–65533, which avoids collisions with host system accounts and with nobody (65534):

# ---- multi-stage: build as root, run as non-root ----
FROM golang:1.22-alpine AS builder
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app ./cmd/server

FROM scratch
# Copy the binary and nothing else
COPY --from=builder /app /app

# On scratch images there is no adduser, so set the USER directly.
# Use a numeric UID that has no special meaning on any host.
USER 65532:65532

ENTRYPOINT ["/app"]

For images that do have a shell (Alpine, Debian-slim), create the user explicitly so its home directory and UID are deterministic:

FROM node:20-alpine

# Create a dedicated group and user with fixed GID and UID (both 10001)
RUN addgroup -S -g 10001 appgroup && adduser -S -G appgroup -u 10001 appuser

WORKDIR /app
COPY --chown=appuser:appgroup package*.json ./
RUN npm ci --omit=dev

COPY --chown=appuser:appgroup . .

# Drop to non-root before the entrypoint
USER 10001

EXPOSE 3000
CMD ["node", "server.js"]

Avoid USER nobody. The nobody user (UID 65534) is shared by many system daemons and can have unexpected filesystem permissions on some host configurations. Use a purpose-built user with a fixed, documented UID instead.
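
The rule above can be checked mechanically before an image is built. The following is a minimal sketch, not an established tool: the function names final_user and runs_as_fixed_nonroot are hypothetical, and the check only looks at the text of the Dockerfile, so it cannot see a user inherited from a base image.

```shell
# Print the argument of the last USER instruction in the final build stage
# (empty if that stage never sets USER).
final_user() {
  awk 'toupper($1) == "FROM" { u = "" }
       toupper($1) == "USER" { u = $2 }
       END { print u }' "$1"
}

# Succeed only if the final USER is numeric and not 0 (UID or UID:GID form).
runs_as_fixed_nonroot() (
  uid="$(final_user "$1")"
  uid="${uid%%:*}"
  case "$uid" in
    ''|*[!0-9]*) exit 1 ;;   # unset, or a name such as "nobody"
  esac
  [ "$uid" -ne 0 ]
)
```

A CI step could call runs_as_fixed_nonroot Dockerfile and fail the build on a non-zero exit status. Requiring a numeric UID is a deliberate choice here: it keeps the check independent of whatever /etc/passwd the image happens to contain.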

Read-Only Filesystems

Even with a non-root user, a compromised process can write malicious binaries into /tmp, overwrite application files, or install persistence tools if the container filesystem is writable. Mounting the container's root filesystem read-only with --read-only prevents this at the kernel level: any attempt to create, modify, move, or chmod a file fails with a "Read-only file system" error, so an attacker cannot drop a script or replace a binary in the image's paths.

Most applications need some writable space — for /tmp, PID files, or caches. The correct pattern is to grant write access only to specific, known directories using tmpfs, while keeping everything else immutable:

# Run with a read-only root filesystem.
# Grant writable tmpfs only where the app actually needs it.
docker run -d \
  --name api \
  --read-only \
  --tmpfs /tmp:rw,size=32m,noexec,nosuid \
  --tmpfs /run:rw,size=4m,noexec,nosuid \
  --user 10001:10001 \
  myorg/api:v2.4.1

# Confirm: try to write to / — it must fail
docker exec api touch /pwned  # should return: Read-only file system
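
The touch test needs a shell utility inside the image. As an alternative, the mount table itself can be inspected. The sketch below assumes the standard /proc/mounts layout (device, mount point, type, options); the function name mounted_readonly is hypothetical.

```shell
# Succeed if the mount table in file $1 (default: /proc/mounts) lists the
# mount point $2 (default: /) with the "ro" option.
mounted_readonly() {
  awk -v target="${2:-/}" '
    $2 == target { opts = $4 }          # last matching entry wins
    END {
      if (("," opts ",") ~ /,ro,/) exit 0
      exit 1
    }
  ' "${1:-/proc/mounts}"
}
```

Run inside the container it needs no arguments; from the host, a copy of the container's /proc/mounts can be passed as the first argument. Where the image has no tools at all, docker inspect --format '{{.HostConfig.ReadonlyRootfs}}' api reports whether the flag was set at launch.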

In Kubernetes the equivalent is readOnlyRootFilesystem in the container-level securityContext, alongside a pod-level securityContext that sets the user and group:

# kubernetes/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
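spec:
  # A Deployment requires a selector that matches the pod template labels.
  # The label app: api is illustrative.
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec: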
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        runAsGroup: 10001
        fsGroup: 10001
      containers:
        - name: api
          image: myorg/api:v2.4.1
          securityContext:
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
          volumeMounts:
            - name: tmp
              mountPath: /tmp
            - name: run
              mountPath: /run
      volumes:
        - name: tmp
          emptyDir:
            medium: Memory
            sizeLimit: 32Mi
        - name: run
          emptyDir:
            medium: Memory
            sizeLimit: 4Mi

Setting allowPrivilegeEscalation: false in Kubernetes sets the no_new_privs flag on the container process, so setuid binaries and sudo cannot grant more privileges than the process already has, even if they exist in the image. This is separate from the read-only filesystem and should always be set alongside it.
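
Whether the flag actually reached the process can be read from the NoNewPrivs line of /proc/<pid>/status, which recent Linux kernels expose. A minimal sketch, with the hypothetical function name no_new_privs_set:

```shell
# Succeed if the status file $1 (default: /proc/self/status) reports
# "NoNewPrivs: 1".
no_new_privs_set() {
  awk '$1 == "NoNewPrivs:" { v = $2 }
       END { if (v == 1) exit 0; exit 1 }' "${1:-/proc/self/status}"
}
```

Inside a running container, no_new_privs_set /proc/1/status checks the main process. The function treats a missing NoNewPrivs line as "not set", which is the cautious reading on kernels that do not report the field.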

Linux Capabilities — Precision Privilege Pruning

Root is not a single switch. The Linux kernel divides root privilege into ~40 independent capabilities — CAP_NET_BIND_SERVICE (bind ports below 1024), CAP_SYS_PTRACE (attach debuggers), CAP_SYS_ADMIN (almost everything else), and so on. Docker grants a container a default set of about 14 capabilities. That set is narrower than full root but still far wider than most application processes need.

The production approach is to drop all capabilities, then add back only what the process actually requires:

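# Drop ALL capabilities, then add back only NET_BIND_SERVICE
# (needed if the process binds port 443 directly).
# Caveat: with a non-root --user, an added capability stays in the bounding
# set only; the process gains it only if the binary carries a matching
# file capability (setcap).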
docker run -d \
  --name nginx \
  --user 65532 \
  --read-only \
  --cap-drop ALL \
  --cap-add NET_BIND_SERVICE \
  myorg/nginx:1.27

# Most stateless API services need ZERO capabilities:
docker run -d \
  --name grpc-service \
  --user 10001 \
  --read-only \
  --cap-drop ALL \
  myorg/grpc-svc:latest

Dropping all Linux capabilities and restoring only those the process needs shrinks the set of privileged kernel operations available to a compromised process from Docker's default of about 14 capabilities to, often, zero.

Run docker run --rm -it --cap-drop ALL ubuntu:24.04 capsh --print to inspect the capability set interactively. Use pscap (from the libcap-ng-utils package) on the host to verify what capabilities running container processes actually hold.
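
Where neither capsh nor pscap is installed, the CapEff and CapBnd lines in /proc/<pid>/status carry the same information as a hexadecimal bitmask. The sketch below decodes such a mask by hand; the function name decode_caps is hypothetical, and the name table follows the capability numbering in the kernel's capability.h (bit 0 is chown, bit 40 is checkpoint_restore).

```shell
# Decode a capability bitmask (hex, as found in the CapEff/CapBnd lines of
# /proc/<pid>/status) into one capability name per line.
decode_caps() (
  names="chown dac_override dac_read_search fowner fsetid kill setgid setuid
         setpcap linux_immutable net_bind_service net_broadcast net_admin
         net_raw ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace
         sys_pacct sys_admin sys_boot sys_nice sys_resource sys_time
         sys_tty_config mknod lease audit_write audit_control setfcap
         mac_override mac_admin syslog wake_alarm block_suspend audit_read
         perfmon bpf checkpoint_restore"
  hex="${1#0x}"            # accept the mask with or without a 0x prefix
  mask=$(( 0x${hex} ))
  bit=0
  for name in $names; do   # the position in the list is the bit number
    if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
      echo "$name"
    fi
    bit=$(( bit + 1 ))
  done
)
```

For example, decode_caps 00000000a80425fb lists 14 names, and a process started with --cap-drop ALL should show an all-zero CapEff that decodes to nothing. Where capsh is available, capsh --decode=<mask> performs the same translation.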

Seccomp — Syscall Filtering

Even with all capabilities dropped, a container process can still invoke hundreds of Linux system calls. A compromised process can use syscalls like ptrace, keyctl, or clone with specific flags to attempt privilege escalation. Seccomp (Secure Computing Mode) is a Linux kernel feature that restricts which syscalls a process may invoke. Docker ships a default seccomp profile that blocks ~44 dangerous syscalls while allowing everything a well-behaved application needs.

For higher assurance, generate a tight application-specific profile. The workflow: run your application in SCMP_ACT_LOG mode to record every syscall it makes, then build an allowlist from that log. Tools like oci-seccomp-bpf-hook or Falco can automate the recording step:

# Apply a custom seccomp profile at runtime
docker run -d \
  --name api \
  --security-opt seccomp=/etc/docker/seccomp/api-profile.json \
  myorg/api:v2.4.1

# Use Docker's built-in default profile by omitting the seccomp flag
docker run -d myorg/api:v2.4.1

# Disable seccomp entirely — ONLY for debugging, NEVER in production
docker run --security-opt seccomp=unconfined myorg/api:v2.4.1

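A minimal seccomp profile JSON has this structure. The example is an allowlist: defaultAction rejects every syscall with an error (SCMP_ACT_ERRNO), and only the named syscalls are allowed. A denylist inverts this, defaulting to SCMP_ACT_ALLOW and blocking the named syscalls. Treat the list as illustrative; the real one should come from recording your own application: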

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_AARCH64"],
  "syscalls": [
    {
      "names": [
        "accept4", "arch_prctl", "bind", "brk", "clone", "close", "connect",
        "epoll_create1", "epoll_ctl", "epoll_pwait", "epoll_wait", "execve",
        "exit", "exit_group", "fcntl", "fstat", "futex",
        "getpid", "getrandom", "getsockname", "getsockopt", "gettid",
        "listen", "madvise", "mmap", "mprotect", "munmap", "nanosleep",
        "open", "openat", "poll", "prctl", "read", "recvfrom",
        "recvmsg", "rt_sigaction", "rt_sigprocmask", "rt_sigreturn",
        "sched_yield", "sendmsg", "sendto", "setitimer", "setsockopt",
        "sigaltstack", "socket", "stat", "tgkill", "uname", "write"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
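
Writing this JSON by hand gets tedious once the syscall list comes from a recording. The sketch below generates the same allowlist structure from names passed as arguments. The function name make_seccomp_allowlist is hypothetical, and the architecture list is copied from the example above rather than detected.

```shell
# Emit an allowlist seccomp profile for the syscall names given as arguments.
make_seccomp_allowlist() (
  list=""
  for name in "$@"; do
    list="${list:+$list, }\"$name\""   # builds: "a", "b", "c"
  done
  cat <<EOF
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_AARCH64"],
  "syscalls": [
    { "names": [$list], "action": "SCMP_ACT_ALLOW" }
  ]
}
EOF
)
```

A call such as make_seccomp_allowlist read write exit_group > profile.json produces a file that can be passed to --security-opt seccomp=. The function does no validation, so a misspelled syscall name ends up in the profile unchanged.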

AppArmor — Mandatory Access Control for Paths and Networks

AppArmor (Application Armor) is a Linux Security Module that enforces a Mandatory Access Control policy on top of discretionary Unix permissions. Where seccomp operates at the syscall level, AppArmor operates at the object level — it controls which files, directories, sockets, and network operations a specific executable is allowed to perform, regardless of the Unix user running it.

Docker applies a default AppArmor profile named docker-default to every container on systems that have AppArmor enabled (Ubuntu, Debian, openSUSE). You can load and assign a custom profile per container:

# Load a custom AppArmor profile (run once on the host)
apparmor_parser -r -W /etc/apparmor.d/docker-api-profile

# Apply the profile when running the container
docker run -d \
  --name api \
  --security-opt apparmor=docker-api-profile \
  myorg/api:v2.4.1

# Verify which profile is active
docker inspect api | grep -i apparmor
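
The inspect output shows which profile name was requested, not whether the host loaded it in enforce mode. The sketch below reads the kernel's profile list, assuming the usual "name (mode)" line format of /sys/kernel/security/apparmor/profiles; the function name apparmor_mode is hypothetical.

```shell
# Print the mode (enforce, complain, ...) of AppArmor profile $1 as listed in
# the profiles file $2 (default: /sys/kernel/security/apparmor/profiles).
# Fails if the profile is not loaded.
apparmor_mode() {
  awk -v want="$1" '
    {
      mode = $NF                      # last field, e.g. "(enforce)"
      name = $0
      sub(/ \([a-z]+\)$/, "", name)   # strip the trailing " (mode)"
      if (name == want) {
        gsub(/[()]/, "", mode)
        print mode
        found = 1
      }
    }
    END { if (!found) exit 1 }
  ' "${2:-/sys/kernel/security/apparmor/profiles}"
}
```

On the host, apparmor_mode docker-api-profile is expected to print enforce once the profile is loaded. Reading the default file usually requires root.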

A production AppArmor profile for a Go HTTP service looks like this. It allows only the paths the process legitimately touches:

#include <tunables/global>

profile docker-api-profile flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>

  network inet tcp,
  network inet udp,

  # Binary and shared libraries
  /app rix,
  /lib/** mr,
  /usr/lib/** mr,

  # App writes only to its own tmp directory
  /tmp/** rw,
  /run/** rw,

  # Explicit denies: these override any allow rule, including ones pulled in by abstractions
  deny /proc/** w,
  deny /sys/** w,
  deny /etc/shadow r,
}

AppArmor and seccomp are complementary, not alternatives. Seccomp filters at the syscall number level; AppArmor filters at the resource-path and network-operation level. Google's gVisor and Amazon's Firecracker take isolation even further: gVisor intercepts syscalls in a user-space kernel, and Firecracker runs each workload in a lightweight virtual machine with its own guest kernel. Both are relevant context once you move into high-assurance multi-tenant environments.

Putting It All Together — A Hardened Run Command

These four controls compose cleanly. A production container launch that applies all of them at once looks like this:

docker run -d \
  --name api \
  --user 10001:10001 \
  --read-only \
  --tmpfs /tmp:rw,size=32m,noexec,nosuid,uid=10001 \
  --cap-drop ALL \
  --cap-add NET_BIND_SERVICE \
  --security-opt no-new-privileges:true \
  --security-opt seccomp=/etc/docker/seccomp/api-tight.json \
  --security-opt apparmor=docker-api-profile \
  --restart unless-stopped \
  myorg/api:v2.4.1

The flag --security-opt no-new-privileges:true deserves special mention: it sets the PR_SET_NO_NEW_PRIVS prctl bit, which prevents any process in the container — including child processes — from gaining additional privileges through execve, setuid binaries, or filesystem capabilities on executables. It is the final safety net after all the controls above.
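
A launch script can guard against someone quietly dropping one of these flags. The following is a rough sketch rather than a complete linter: the function name check_run_flags is hypothetical, it matches flags as plain text, and it does not verify that the --user value is non-root.

```shell
# Succeed only if the given `docker run` arguments include the baseline
# hardening flags. Prints one line per missing control.
check_run_flags() (
  args=" $* "
  missing=0
  need() {  # need <description> <pattern>...
    desc="$1"; shift
    for pat in "$@"; do
      case "$args" in *"$pat"*) return 0 ;; esac
    done
    echo "missing: $desc"
    missing=1
  }
  need "read-only root filesystem" " --read-only "
  need "drop all capabilities"     " --cap-drop ALL " " --cap-drop=ALL "
  need "no-new-privileges"         "no-new-privileges:true " \
                                   "no-new-privileges=true " "no-new-privileges "
  need "explicit --user"           " --user " " --user=" " -u "
  [ "$missing" -eq 0 ]
)
```

Passing the same argument list to check_run_flags before docker run lets a wrapper script refuse to start a container that lacks a control. Seccomp and AppArmor options are left out of the check because Docker applies its default profiles when they are omitted.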

Automate compliance with these controls using Open Policy Agent / Gatekeeper in Kubernetes. Write a ConstraintTemplate that rejects pods missing readOnlyRootFilesystem: true, runAsNonRoot: true, or allowPrivilegeEscalation: false. Gate policies in CI with conftest before any manifest reaches the cluster.
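
Before a Rego policy is in place, a plain text check can act as a stopgap in CI. The sketch below is a deliberately crude stand-in for a real Gatekeeper or conftest policy: it greps a manifest for the three settings and does not understand YAML structure, so it cannot tell whether every container in the file is covered. The function name manifest_has_baseline is hypothetical.

```shell
# Succeed only if the manifest file $1 contains all three baseline settings.
# Prints one line per setting that is missing.
manifest_has_baseline() (
  status=0
  for setting in \
    "readOnlyRootFilesystem: true" \
    "runAsNonRoot: true" \
    "allowPrivilegeEscalation: false"
  do
    key="${setting%%:*}"
    value="${setting##*: }"
    # Match "key: value" on its own line, allowing indentation, a list dash,
    # and a trailing comment.
    if ! grep -Eq "^[[:space:]-]*${key}:[[:space:]]*${value}[[:space:]]*(#.*)?\$" "$1"; then
      echo "missing: $setting"
      status=1
    fi
  done
  exit "$status"
)
```

Calling manifest_has_baseline kubernetes/deployment.yaml in a pipeline fails the job when a setting is absent. A structural policy engine remains the better long-term choice because it evaluates each container separately.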