Advanced Docker & Container Security

Registries in Production

Every container image that runs in production was pulled from a registry: a versioned, content-addressable artifact store indexed by name and tag. Docker Hub is fine for open-source experimentation, but production teams at scale run private registries for security, latency, cost, and compliance reasons. This lesson covers the major options (the managed services ECR and GAR, and self-hosted Harbor), the operational practices that keep them healthy (image lifecycle policies, vulnerability scanning hooks, and pull-through caches), and the failure modes that bite teams that skip this infrastructure work.

Why Private Registries Matter

Public registries impose rate limits. Docker Hub allows 100 pulls per six hours per IP address for anonymous clients and 200 per six hours per account for authenticated free-tier users. In a Kubernetes cluster with 50 nodes behind a shared NAT address, each pulling a few images on startup, anonymous pulls hit that wall almost immediately. Beyond rate limits, private registries provide:

Access control: IAM policies or RBAC govern who can push or pull which repositories.
Network locality: Images pulled from a registry in the same cloud region avoid cross-region and internet data-transfer charges and cut pull time for large images from minutes to seconds.
Audit trail: Pushes and pulls can be logged (CloudTrail on AWS, Cloud Audit Logs on GCP), which supplies the access evidence that SOC 2 and ISO 27001 audits typically ask for.
Immutability enforcement: Tag overwriting can be disabled, ensuring that v1.4.2 always refers to the same digest.
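
To make the rate-limit arithmetic concrete, here is a small Python sketch. It assumes every node egresses through one shared IP address and that each image pull costs one unit of quota; the function name and the three-images-per-node figure are illustrative assumptions, not values from Docker's documentation.

```python
def rejected_pulls(nodes: int, images_per_node: int, quota: int = 100) -> int:
    """Pulls that exceed the quota when all nodes start in the same window."""
    demanded = nodes * images_per_node
    return max(0, demanded - quota)


# 50 nodes, each pulling 3 images (app, sidecar, init container) at startup.
print(rejected_pulls(nodes=50, images_per_node=3))  # prints 50
```

Under these assumptions a third of the startup pulls are throttled, and every later rollout in the same six-hour window is rejected outright.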

Amazon ECR (Elastic Container Registry)

ECR is the natural choice for workloads running on AWS (ECS, EKS, Lambda). Authentication uses short-lived tokens that ECR issues and the AWS CLI retrieves, so no long-lived credentials are stored in ~/.docker/config.json.

# Authenticate the Docker client to ECR (token expires after 12 hours)
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin \
    123456789012.dkr.ecr.us-east-1.amazonaws.com

# Create a private repository (done once, usually in Terraform)
aws ecr create-repository \
  --repository-name myapp/api \
  --image-scanning-configuration scanOnPush=true \
  --image-tag-mutability IMMUTABLE \
  --region us-east-1

# Tag and push
docker tag myapp:latest \
  123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp/api:v1.4.2
docker push \
  123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp/api:v1.4.2

# List images with their digest and pushed-at timestamp
aws ecr describe-images \
  --repository-name myapp/api \
  --query 'sort_by(imageDetails, &imagePushedAt)[*].[imageTags[0],imageDigest,imagePushedAt]' \
  --output table
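
For readers curious what the login step passes around: the ECR GetAuthorizationToken API returns a base64 string that decodes to AWS:<password>, which is why the username in the docker login command is always AWS. The sketch below decodes a fabricated token; the helper name and the sample value are illustrative only.

```python
import base64


def split_ecr_token(authorization_token: str) -> tuple[str, str]:
    """Decode a base64 'user:password' token into its two parts."""
    decoded = base64.b64decode(authorization_token).decode("utf-8")
    user, _, password = decoded.partition(":")
    return user, password


# Fabricated token for demonstration; a real one comes from ECR.
sample = base64.b64encode(b"AWS:example-password").decode("ascii")
print(split_ecr_token(sample)[0])  # prints AWS
```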

Enable immutable tags in ECR from day one. Mutable tags mean a CI pipeline can silently overwrite production with a broken image — and the rollback you thought was pointing at the old digest is now pointing at the new broken one. Immutable tags force a unique version tag per release and make rollbacks deterministic.
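
The difference between mutable and immutable tags can be shown with a toy model. The class below is a hypothetical stand-in for a repository's tag-to-digest mapping, not ECR's implementation, and the digests are placeholders.

```python
class TagStore:
    """Toy model of a repository's tag -> digest mapping."""

    def __init__(self, immutable: bool) -> None:
        self.immutable = immutable
        self.tags: dict[str, str] = {}

    def push(self, tag: str, digest: str) -> None:
        if self.immutable and tag in self.tags and self.tags[tag] != digest:
            raise ValueError(f"tag {tag!r} already exists and is immutable")
        self.tags[tag] = digest


mutable = TagStore(immutable=False)
mutable.push("v1.4.2", "sha256:good")
mutable.push("v1.4.2", "sha256:broken")  # silently replaces the release
print(mutable.tags["v1.4.2"])            # prints sha256:broken

locked = TagStore(immutable=True)
locked.push("v1.4.2", "sha256:good")
try:
    locked.push("v1.4.2", "sha256:broken")
except ValueError as err:
    print("rejected:", err)
```

With the immutable store, a rollback to v1.4.2 still resolves to the digest that was originally released.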

ECR lifecycle policies automate cleanup. Without them, a repository that receives 20 pushes per day accumulates thousands of images within a year and the storage bill grows without bound. A lifecycle policy like the one below keeps the 30 most recent v-prefixed releases and expires untagged images (left behind by failed builds or by tags that moved to a newer image) one day after they were pushed:

# ecr-lifecycle-policy.json (filename only; JSON has no comments, so leave this line out of the file)
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire untagged images after 1 day",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 1
      },
      "action": { "type": "expire" }
    },
    {
      "rulePriority": 2,
      "description": "Keep last 30 tagged releases",
      "selection": {
        "tagStatus": "tagged",
        "tagPrefixList": ["v"],
        "countType": "imageCountMoreThan",
        "countNumber": 30
      },
      "action": { "type": "expire" }
    }
  ]
}

# Apply the policy
aws ecr put-lifecycle-policy \
  --repository-name myapp/api \
  --lifecycle-policy-text file://ecr-lifecycle-policy.json
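
To see what the two rules select, the following Python sketch mimics the policy's intent on an in-memory image list. It is an approximation for reasoning about the policy, not ECR's actual evaluator; the Image type, the expired function, and the sample data are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Image:
    digest: str
    tags: list[str]
    pushed_at: datetime


def expired(images: list[Image], now: datetime, keep: int = 30) -> set[str]:
    """Digests the two rules would expire, evaluated in priority order."""
    # Rule 1: untagged images pushed more than one day ago.
    doomed = {
        img.digest
        for img in images
        if not img.tags and now - img.pushed_at > timedelta(days=1)
    }
    # Rule 2: among images with a tag starting with "v", keep the newest `keep`.
    releases = sorted(
        (img for img in images if any(t.startswith("v") for t in img.tags)),
        key=lambda img: img.pushed_at,
        reverse=True,
    )
    doomed.update(img.digest for img in releases[keep:])
    return doomed


now = datetime(2025, 1, 31)
images = [
    Image(f"sha256:{i:02d}", [f"v1.0.{i}"], now - timedelta(days=i))
    for i in range(35)
]
images.append(Image("sha256:orphan", [], now - timedelta(days=3)))
print(sorted(expired(images, now)))
```

With 35 releases and one stale untagged image, the five oldest releases and the orphan are selected for expiry.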

Google Artifact Registry (GAR)

GAR is the successor to Google Container Registry (GCR) and is the right choice for GKE workloads. It supports multiple formats (Docker, Helm, npm, Maven) in a single service. On GKE, nodes pull images with the service account attached to the node pool, which needs roles/artifactregistry.reader, so no service account key files are involved.

# Authenticate (on a workstation; GKE nodes use their node service account automatically)
gcloud auth configure-docker us-central1-docker.pkg.dev

# Create a repository (Terraform is preferred for production)
gcloud artifacts repositories create myapp \
  --repository-format=docker \
  --location=us-central1 \
  --description="Production images for myapp"

# Push an image
docker tag myapp:latest \
  us-central1-docker.pkg.dev/my-gcp-project/myapp/api:v1.4.2
docker push \
  us-central1-docker.pkg.dev/my-gcp-project/myapp/api:v1.4.2

# Set a cleanup policy: delete untagged images older than 7 days.
# --policy takes the path to a JSON file containing a list of policies.
cat > cleanup-policy.json <<'EOF'
[
  {
    "name": "delete-untagged",
    "action": {"type": "Delete"},
    "condition": {
      "tagState": "untagged",
      "olderThan": "7d"
    }
  }
]
EOF
gcloud artifacts repositories set-cleanup-policies myapp \
  --location=us-central1 \
  --policy=cleanup-policy.json
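
Artifact Registry image names follow the pattern LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG, as in the push command above. A minimal Python sketch that splits such a reference into its parts may help when scripting around it; the GarImage type and parse_gar_reference helper are hypothetical names, and the parser handles only tagged references.

```python
from typing import NamedTuple


class GarImage(NamedTuple):
    location: str
    project: str
    repository: str
    image: str
    tag: str


def parse_gar_reference(ref: str) -> GarImage:
    """Split LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG into parts."""
    host, project, repository, rest = ref.split("/", 3)
    suffix = "-docker.pkg.dev"
    if not host.endswith(suffix):
        raise ValueError(f"not an Artifact Registry Docker host: {host}")
    image, _, tag = rest.rpartition(":")
    if not image:
        raise ValueError("reference has no tag")
    return GarImage(host[: -len(suffix)], project, repository, image, tag)


ref = "us-central1-docker.pkg.dev/my-gcp-project/myapp/api:v1.4.2"
print(parse_gar_reference(ref))
```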

Harbor — Self-Hosted, Cloud-Agnostic

Harbor is the CNCF-graduated open-source registry that teams choose when they need on-premises storage (air-gapped environments, strict data residency, multi-cloud without coupling to one vendor). It adds a UI, RBAC, quotas, replication rules, and built-in Trivy scanning on top of the OCI Distribution Spec.

Image flow: CI/CD pushes to a private registry; Kubernetes nodes pull through a regional cache; a vulnerability scanner inspects every pushed image.

Pull-Through Caches

A pull-through cache is a registry proxy that sits between your nodes and an upstream registry (Docker Hub, Quay, or even your own private registry in a remote region). On the first pull, the proxy fetches the image's manifest and layers from upstream and caches them. Subsequent pulls are served locally, so they survive upstream outages and do not count against upstream rate limits.
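
The hit-or-miss behavior described above can be sketched in a few lines of Python. This is a toy model under simplifying assumptions: the upstream is a plain dictionary, images are opaque bytes, and the proxy never re-checks whether a tag has moved upstream. The class and attribute names are invented for the example.

```python
class PullThroughCache:
    """Toy proxy: serve from the local store, fall back to upstream on a miss."""

    def __init__(self, upstream: dict[str, bytes]) -> None:
        self.upstream = upstream          # stands in for the remote registry
        self.local: dict[str, bytes] = {}
        self.upstream_fetches = 0
        self.upstream_online = True

    def pull(self, ref: str) -> bytes:
        if ref in self.local:             # cache hit: no upstream traffic
            return self.local[ref]
        if not self.upstream_online:
            raise ConnectionError(f"upstream unreachable and {ref} not cached")
        self.upstream_fetches += 1        # cache miss: counts against rate limit
        self.local[ref] = self.upstream[ref]
        return self.local[ref]


cache = PullThroughCache({"library/nginx:1.27": b"image-bytes"})
for _ in range(50):                       # 50 nodes pull the same image
    cache.pull("library/nginx:1.27")
print(cache.upstream_fetches)             # prints 1

cache.upstream_online = False             # simulate an upstream outage
print(cache.pull("library/nginx:1.27") == b"image-bytes")  # prints True
```

Fifty node pulls cost a single upstream request, and the cached image keeps serving while the upstream is down.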

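ECR has supported pull-through cache rules natively since late 2021; Docker Hub became a supported upstream in 2023 and, as an authenticated upstream, requires credentials stored in AWS Secrets Manager. The setup is a one-time Terraform or CLI operation: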

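# Create a pull-through cache rule that mirrors Docker Hub into ECR.
# Docker Hub is an authenticated upstream: --credential-arn must reference a
# Secrets Manager secret whose name starts with ecr-pullthroughcache/
# (the ARN below is a placeholder).
aws ecr create-pull-through-cache-rule \
  --ecr-repository-prefix "dockerhub" \
  --upstream-registry-url "registry-1.docker.io" \
  --credential-arn "arn:aws:secretsmanager:us-east-1:123456789012:secret:ecr-pullthroughcache/dockerhub" \
  --region us-east-1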

# After this, any node pulling:
#   123456789012.dkr.ecr.us-east-1.amazonaws.com/dockerhub/library/nginx:1.27
# will transparently pull from Docker Hub on cache miss, then serve from ECR.

# For a multi-region setup, replicate images with ECR replication rules
aws ecr put-replication-configuration --replication-configuration '{
  "rules": [{
    "destinations": [
      {"region": "eu-west-1", "registryId": "123456789012"},
      {"region": "ap-southeast-1", "registryId": "123456789012"}
    ],
    "repositoryFilters": [{
      "filter": "myapp/",
      "filterType": "PREFIX_MATCH"
    }]
  }]
}'
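
Switching workloads to the cache means rewriting image references so they go through the ECR prefix. A small Python sketch of that rewrite follows; the helper name is hypothetical, the registry host and the dockerhub prefix are the example values used in this lesson, and the only rule it encodes is that official Docker Hub images live under the implicit library/ namespace.

```python
def to_ecr_cache_ref(image: str, registry: str, prefix: str = "dockerhub") -> str:
    """Rewrite a Docker Hub reference so it resolves through the ECR cache."""
    # Official images such as "nginx" live under the implicit "library/" namespace.
    name = image if "/" in image.split(":")[0] else f"library/{image}"
    return f"{registry}/{prefix}/{name}"


REGISTRY = "123456789012.dkr.ecr.us-east-1.amazonaws.com"
print(to_ecr_cache_ref("nginx:1.27", REGISTRY))
print(to_ecr_cache_ref("bitnami/redis:7.2", REGISTRY))
```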

Harbor pull-through (proxy cache): Harbor calls this feature "Proxy Cache." You configure an endpoint pointing at Docker Hub (or any OCI registry) and create a dedicated project backed by it. Teams then pull from harbor.internal/dockerhub-cache/library/nginx:1.27 instead of Docker Hub. On each pull Harbor checks the upstream manifest; if the tag now points at a different digest it re-fetches the image, and if the upstream is unreachable it serves the cached copy.

Image Retention and Lifecycle Best Practices

At big-tech scale, a single service with 10 pushes per day generates 3,650 images per year. Multiply by 200 services and you have 730,000 images — most of which have never been pulled after their initial deploy. Good lifecycle hygiene follows these rules:

Tag semver releases explicitly (v1.4.2) and never deploy from latest. Aggressive expiry rules should target only tag prefixes that map to ephemeral builds (e.g., sha- or branch-); release tags should age out only through a rolling retention window.
Retain a rolling window of recent releases (30–90 days, or last N versions) so rollback is always available without needing to rebuild.
Expire untagged images aggressively (24–48 hours). They are usually leftovers from failed or aborted CI runs, or from a mutable tag that moved to a newer image; they consume storage and no tag refers to them.
Gate on critical CVEs before deployment. Scanning runs after the push completes, so have CI wait for the result: ECR enhanced scanning (backed by Amazon Inspector) and Artifact Registry's vulnerability scanning can both be queried from the pipeline, which then fails when a CRITICAL vulnerability is found, before the image ever reaches production.
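
A CI gate on scan results can be as small as the sketch below. It assumes the pipeline has already fetched a severity-to-count map (ECR scan results expose one as findingSeverityCounts); the function name, the blocking tuple, and the sample numbers are illustrative assumptions.

```python
BLOCKING = ("CRITICAL",)


def should_block(
    severity_counts: dict[str, int], blocking: tuple[str, ...] = BLOCKING
) -> bool:
    """True when any blocking severity has at least one finding."""
    return any(severity_counts.get(level, 0) > 0 for level in blocking)


# Shape mirrors a severity-count map from a scan result; the values are made up.
print(should_block({"HIGH": 4, "MEDIUM": 11}))   # prints False
print(should_block({"CRITICAL": 1, "HIGH": 4}))  # prints True
```

A pipeline step would exit non-zero when should_block returns True; teams with stricter policies can pass blocking=("CRITICAL", "HIGH").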

Never delete an image that is currently deployed. Lifecycle policies run asynchronously, not at push time. If you aggressively expire "old" images and Kubernetes has to schedule a pod onto a node that does not have the image cached (for example, after a node is replaced), the kubelet will fail to pull the now-deleted image and the pod will enter ImagePullBackOff. Size the retention window so it always covers what is deployed, and track deployed images by digest (sha256:abc123) rather than by tag, managing them separately from the expiry window.
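
If cleanup is driven by your own tooling rather than a registry-side policy, the guard can be a plain set difference. The sketch below assumes you can collect the digests currently running in your clusters (for instance from pod status); the function name and the digests are placeholders.

```python
def safe_to_expire(candidates: set[str], deployed: set[str]) -> set[str]:
    """Drop every digest that is still running somewhere from the expiry list."""
    return candidates - deployed


candidates = {"sha256:aaa", "sha256:bbb", "sha256:ccc"}
deployed = {"sha256:bbb"}  # e.g. collected from running pods
print(sorted(safe_to_expire(candidates, deployed)))
# prints ['sha256:aaa', 'sha256:ccc']
```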

Authentication Patterns at Scale

Long-lived credentials in imagePullSecrets are a security liability: rotate them and you need to update every namespace. The production pattern is node-level IAM: on EKS, grant the node IAM role read access to ECR; on GKE, grant the node service account roles/artifactregistry.reader. The kubelet then authenticates transparently through the node's credential provider, and no secrets need to be managed at the application level.

For cross-account or cross-project pulls (e.g., a shared "golden image" registry accessed by many teams), use ECR repository policies that name the consuming accounts, or GAR IAM bindings that grant the consuming project's service account roles/artifactregistry.reader on the source repository or project. No credentials change hands; access is pure IAM delegation.
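
On the ECR side, the delegation is expressed as a repository policy. The Python sketch below builds one that grants pull-only access to a list of consumer accounts; the helper name, the Sid, and the account ID are placeholders, and a real policy should be reviewed against your organization's IAM conventions before use.

```python
import json


def cross_account_pull_policy(consumer_account_ids: list[str]) -> str:
    """Build an ECR repository policy that lets other accounts pull images."""
    statement = {
        "Sid": "AllowCrossAccountPull",
        "Effect": "Allow",
        "Principal": {
            "AWS": [f"arn:aws:iam::{acct}:root" for acct in consumer_account_ids]
        },
        # The three read actions a client needs to pull an image.
        "Action": [
            "ecr:BatchCheckLayerAvailability",
            "ecr:BatchGetImage",
            "ecr:GetDownloadUrlForLayer",
        ],
    }
    return json.dumps({"Version": "2012-10-17", "Statement": [statement]}, indent=2)


policy_text = cross_account_pull_policy(["210987654321"])  # placeholder account ID
print(policy_text)
```

The resulting JSON can be attached to the source repository, for example with aws ecr set-repository-policy; the consuming account still needs its own IAM permissions to call ECR.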