Registries in Production
Every container image that runs in production was pulled from a registry: a versioned, content-addressable artifact store indexed by name and tag. Docker Hub is fine for open-source experimentation, but production teams at scale run private registries for security, latency, cost, and compliance reasons. This lesson covers the major options (the managed ECR and GAR, and the self-hosted Harbor), the operational practices that keep them healthy (image lifecycle policies, vulnerability scanning hooks, and pull-through caches), and the failure modes that bite teams that skip this infrastructure work.
Why Private Registries Matter
Public registries impose rate limits. Docker Hub has historically allowed 100 pulls per six hours per IP address for anonymous users and 200 per six hours per account for authenticated free users. In a Kubernetes cluster with 50 nodes all pulling the same image on startup, often through a shared NAT egress IP, you hit that wall in seconds. Beyond rate limits, private registries provide:
- Access control: IAM policies or RBAC govern who can push or pull which repositories.
- Network locality: Images pulled from a registry in the same cloud region skip egress charges and reduce pull latency from minutes to seconds for large images.
- Audit trail: Every push and pull is logged, which auditors typically expect as evidence for SOC 2 and ISO 27001 compliance.
- Immutability enforcement: Tag overwriting can be disabled, ensuring that v1.4.2 always refers to the same digest.
Amazon ECR (Elastic Container Registry)
ECR is the natural choice for workloads running on AWS (ECS, EKS, Lambda). Authentication uses short-lived tokens (valid for 12 hours) that ECR issues and that you fetch through the AWS CLI or SDK, so no long-lived credentials need to live in ~/.docker/config.json.
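To make the token flow concrete, here is a minimal sketch of what a client does with an ECR authorization token. It assumes the token is the base64 string returned by the GetAuthorizationToken API, which decodes to a `username:password` pair with the username `AWS`. The function name `decode_ecr_token` and the stand-in token are illustrative, not part of any AWS SDK.

```python
import base64
from typing import Tuple


def decode_ecr_token(authorization_token: str) -> Tuple[str, str]:
    """Split a base64 ECR authorization token into (username, password).

    With boto3, the token would come from
    ecr.get_authorization_token()["authorizationData"][0]["authorizationToken"].
    """
    decoded = base64.b64decode(authorization_token).decode("utf-8")
    username, _, password = decoded.partition(":")
    if not password:
        raise ValueError("token is not in 'username:password' form")
    return username, password


# Stand-in token for illustration only; a real one is issued by ECR.
fake_token = base64.b64encode(b"AWS:example-password").decode("ascii")
user, password = decode_ecr_token(fake_token)
print(user)
```

The resulting pair is what `docker login` receives. Because the token expires, anything that caches it has to refresh it on a schedule.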
Enable immutable tags on production repositories. With mutable tags, a push that reuses an existing tag can put a broken image into production, and the rollback you thought was pointing at the old digest now points at the new broken one. Immutable tags force a unique version tag per release and make rollbacks deterministic.
ECR lifecycle policies automate cleanup. Without them, a repository that receives 20 pushes per day accumulates thousands of images within months, and your storage bill grows without bound. A typical lifecycle policy keeps the last 30 tagged releases and expires untagged images (left over from failed builds or re-pushed tags).
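A policy of that shape could be generated as follows. This is a sketch under stated assumptions: release tags start with `v`, untagged images expire after one day, and the helper name `build_lifecycle_policy` is hypothetical. The JSON keys follow the ECR lifecycle policy document format.

```python
import json


def build_lifecycle_policy(keep_tagged: int = 30,
                           untagged_days: int = 1,
                           tag_prefix: str = "v") -> str:
    """Return an ECR lifecycle policy document as a JSON string."""
    rules = [
        {
            # Rules are evaluated in ascending rulePriority order.
            "rulePriority": 1,
            "description": "Expire untagged images after a short grace period",
            "selection": {
                "tagStatus": "untagged",
                "countType": "sinceImagePushed",
                "countUnit": "days",
                "countNumber": untagged_days,
            },
            "action": {"type": "expire"},
        },
        {
            "rulePriority": 2,
            "description": "Keep only the most recent tagged releases",
            "selection": {
                "tagStatus": "tagged",
                "tagPrefixList": [tag_prefix],  # assumed release tag prefix
                "countType": "imageCountMoreThan",
                "countNumber": keep_tagged,
            },
            "action": {"type": "expire"},
        },
    ]
    return json.dumps({"rules": rules}, indent=2)


policy_text = build_lifecycle_policy()
print(policy_text)
```

The printed document can be attached to a repository as its lifecycle policy text, for example through the CLI or Terraform.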
Google Artifact Registry (GAR)
GAR is the successor to Google Container Registry (GCR) and is the right choice for GKE workloads. It supports multiple formats (Docker, Helm, npm, Maven) in a single service. On GKE, image pulls authenticate through IAM using the node's service account, so no service account key files are needed.
Harbor — Self-Hosted, Cloud-Agnostic
Harbor is the CNCF-graduated open-source registry that teams choose when they need on-premises storage (air-gapped environments, strict data residency, multi-cloud without coupling to one vendor). It adds a UI, RBAC, quotas, replication rules, and built-in Trivy scanning on top of the OCI Distribution Spec.
Pull-Through Caches
A pull-through cache is a registry proxy that sits between your nodes and an upstream registry (Docker Hub, Quay, or even your own private registry in a remote region). On the first pull, the proxy fetches the image from upstream and caches its layers. Subsequent pulls are served locally, so already-cached images stay available through upstream outages and rate limits.
ECR has supported pull-through cache rules natively since late 2021, with authenticated upstreams such as Docker Hub added in 2023. The setup is a one-time Terraform or CLI operation.
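With Harbor, you create a proxy-cache project (dockerhub-cache here) and have nodes pull harbor.internal/dockerhub-cache/library/nginx:1.27 instead of pulling from Docker Hub. Harbor checks the upstream manifest when an image is requested and re-fetches it when the upstream image has changed.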
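The reference rewrite that a cache requires can be sketched as a small helper. It assumes the input is a Docker Hub reference (bare names such as `nginx:1.27`, or names prefixed with `docker.io/`) and that the cache lives at `harbor.internal/dockerhub-cache` as in the example path. The function name `to_cache_ref` is hypothetical.

```python
def to_cache_ref(image: str,
                 cache_prefix: str = "harbor.internal/dockerhub-cache") -> str:
    """Rewrite a Docker Hub image reference to pull through a cache.

    Assumes `image` refers to Docker Hub; references to other registries
    are not detected and would be rewritten incorrectly.
    """
    # Drop an explicit Docker Hub host if one was given.
    for host in ("index.docker.io/", "docker.io/"):
        if image.startswith(host):
            image = image[len(host):]
            break
    # Official images live in the implicit "library/" namespace.
    repository = image.split("@", 1)[0]
    if "/" not in repository:
        image = "library/" + image
    return f"{cache_prefix}/{image}"


print(to_cache_ref("nginx:1.27"))
```

A rewrite like this is usually applied once, in deployment manifests or through a registry mirror setting in the container runtime, rather than at pull time.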
Image Retention and Lifecycle Best Practices
At big-tech scale, a single service with 10 pushes per day generates 3,650 images per year. Multiply by 200 services and you have 730,000 images, most of which are never pulled again after their initial deploy. Good lifecycle hygiene follows these rules:
- Tag semver releases explicitly (v1.4.2), never deploy from latest. Lifecycle policies should only expire images tagged with a prefix that maps to ephemeral builds (e.g., sha- or branch-).
- Retain a rolling window of recent releases (30–90 days, or last N versions) so rollback is always available without needing to rebuild.
- Expire untagged images aggressively (24–48 hours). Untagged images are typically orphaned push artifacts from failed or aborted CI runs, or images whose tag has since moved to a newer push. They consume storage and are not what deployments reference.
- Block on critical CVEs at push time. ECR Enhanced Scanning (backed by Amazon Inspector) and GAR's built-in scanning can be wired into CI to fail the pipeline when a CRITICAL vulnerability is found, before the image ever reaches production.
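Be careful with aggressive expiry: if a lifecycle policy deletes an image that is still running in production, the next pod reschedule or node replacement fails with ImagePullBackOff. Keep at least the last-deployed digest pinned, or use digest-based references (sha256:abc123) that you explicitly manage separately from the expiry window.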
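The retention rules above can be sketched as a local planning function that decides which digests are safe to expire. This is an illustration of the logic, not the behavior of any registry's policy engine. The `Image` type, the `plan_expiry` name, the `v` release prefix, and the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Set


@dataclass
class Image:
    digest: str
    tags: List[str]
    pushed_at: datetime


def plan_expiry(images: List[Image],
                deployed_digests: Set[str],
                now: datetime,
                keep_releases: int = 30,
                untagged_max_age: timedelta = timedelta(hours=48),
                release_prefix: str = "v") -> List[str]:
    """Return digests to expire; deployed digests are never returned."""
    expire = []
    # Release images, newest first; everything past the window is a candidate.
    releases = sorted(
        (i for i in images if any(t.startswith(release_prefix) for t in i.tags)),
        key=lambda i: i.pushed_at,
        reverse=True,
    )
    for img in releases[keep_releases:]:
        if img.digest not in deployed_digests:
            expire.append(img.digest)
    # Untagged images older than the grace period are candidates too.
    for img in images:
        too_old = now - img.pushed_at > untagged_max_age
        if not img.tags and too_old and img.digest not in deployed_digests:
            expire.append(img.digest)
    return expire


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
images = [
    Image("sha256:aaa", ["v1.0.0"], datetime(2024, 1, 1, tzinfo=timezone.utc)),
    Image("sha256:bbb", ["v1.1.0"], datetime(2024, 1, 10, tzinfo=timezone.utc)),
    Image("sha256:ccc", ["v1.2.0"], datetime(2024, 1, 20, tzinfo=timezone.utc)),
    Image("sha256:ddd", [], datetime(2024, 1, 25, tzinfo=timezone.utc)),
    Image("sha256:eee", [], datetime(2024, 1, 30, 12, tzinfo=timezone.utc)),
]
# v1.0.0 falls outside the two-release window but is still deployed.
protected = plan_expiry(images, {"sha256:aaa"}, now, keep_releases=2)
unprotected = plan_expiry(images, set(), now, keep_releases=2)
print(protected, unprotected)
```

In the sample data, the oldest release survives only while its digest is in the deployed set, which is the pinning behavior the warning above asks for.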
Authentication Patterns at Scale
Long-lived credentials in imagePullSecrets are a security liability, and rotating them means updating every namespace that holds a copy. The production pattern is node-level IAM: on EKS, give the node IAM role ECR read permissions; on GKE, grant the node service account roles/artifactregistry.reader. The kubelet then authenticates transparently using the node's credential provider, and no secrets need to be managed at the application level.
For cross-account or cross-project pulls (e.g., a shared "golden image" registry accessed by many teams), use ECR cross-account policies or GAR IAM bindings that grant the consuming project's service account roles/artifactregistry.reader on the source project — no credentials, just IAM delegation.
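For the ECR side, a cross-account repository policy could look like the sketch below. It assumes consumers are identified by AWS account ID and need only pull access. The helper name `cross_account_pull_policy` and the account ID are placeholders. Principals in the consuming account still need `ecr:GetAuthorizationToken` in their own identity policies.

```python
import json
from typing import List


def cross_account_pull_policy(consumer_account_ids: List[str]) -> str:
    """Return an ECR repository policy granting pull-only access."""
    statement = {
        "Sid": "AllowCrossAccountPull",
        "Effect": "Allow",
        # Delegates to each consuming account; that account's own IAM
        # policies decide which of its roles may actually pull.
        "Principal": {
            "AWS": [f"arn:aws:iam::{acct}:root" for acct in consumer_account_ids]
        },
        "Action": [
            "ecr:BatchCheckLayerAvailability",
            "ecr:BatchGetImage",
            "ecr:GetDownloadUrlForLayer",
        ],
    }
    return json.dumps({"Version": "2012-10-17", "Statement": [statement]}, indent=2)


# Placeholder account ID for illustration.
repo_policy = cross_account_pull_policy(["111122223333"])
print(repo_policy)
```

The policy is attached to the golden-image repository itself, so adding a consuming team is a one-line change to the account list.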