Artifact Management & Release Engineering

Release Engineering as a Discipline

At most early-stage companies, "releasing" means an engineer SSHes into a server and runs git pull. At a company operating at Google's scale, releasing a core backend can involve hundreds of engineers, dozens of automated gates, artifact promotion across several environments, cryptographic signing, coordinated rollbacks, and a dedicated Release Engineering (RelEng) team whose job is to make all of that invisible to product engineers. This lesson defines what release engineering is, what it owns, and why "build once, promote everywhere" is the foundational principle that makes safe delivery at scale possible.

What Release Engineering Actually Owns

Release engineering sits at the intersection of software development, operations, and security. It is often confused with CI/CD or with deployment, but it is broader than both. A mature RelEng team owns the entire lifecycle from the moment code merges to the moment a user receives the change.

Concretely, at a big-tech organisation, a release engineering team is responsible for:

Build system design and hermetic builds: defining how source code is compiled into binaries or container images, guaranteeing that the same source always produces the same artifact regardless of when or where the build runs.
Artifact management: where artifacts live (registries, blob stores), how they are versioned, how long they are retained, and who can access them.
Promotion pipelines: the automated machinery that takes an artifact from built to in front of users, gate by gate, environment by environment.
Version schemes and release cadence: defining what constitutes a release, how versions are numbered, and how frequently releases happen (per-commit continuous delivery vs. scheduled release trains).
Rollback and hotfix procedures: ensuring that every release can be undone quickly without manual intervention, and defining the escalation path for emergency patches.
Supply chain security: signing artifacts, verifying provenance (SLSA), managing build credentials, and preventing tampering between build and deployment.

RelEng is a force multiplier, not a bottleneck. The goal of a release engineering team is to make deployment so safe and boring that individual feature teams can release independently, frequently, and without fear. If teams are afraid to release on Fridays, or if releases require manual approval from a senior engineer, the RelEng function is not doing its job.

The Foundational Principle: Build Once, Promote Everywhere

The single most important concept in release engineering is deceptively simple: an artifact is built exactly once, and that identical artifact is promoted through every environment until it reaches production. You do not rebuild from source for staging. You do not rebuild for production. You take the binary or image that was validated in development, and you move it forward.

This principle solves a class of bugs that plagued software teams for decades: the "it works in staging but not in production" failure, caused by staging and production being built from subtly different source trees, at different times, with different versions of transitive dependencies. If you build twice, the second build is a new artifact that none of your earlier gates ever tested, and you can never be certain it is identical to the first.

The "everywhere" in "build once, promote everywhere" refers to a promotion pipeline: a sequence of environments, each with increasing confidence gates, through which a single artifact travels. A typical big-tech promotion path looks like this:

Build → Dev/CI → Staging → Canary → Production

The artifact is built once, stored with a content-addressed digest, and promoted through each environment. Every gate runs against the same binary.
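As a rough illustration of why the content-addressed digest matters, the sketch below hashes artifact bytes with SHA-256 and records promotions by digest. The names digest_of and PromotionLog are hypothetical, and the byte strings stand in for real build outputs.

```python
import hashlib


def digest_of(artifact: bytes) -> str:
    """Return a content-addressed identifier in the sha256:<hex> form."""
    return "sha256:" + hashlib.sha256(artifact).hexdigest()


class PromotionLog:
    """Tracks which digest each environment runs (hypothetical helper)."""

    def __init__(self, environments: list[str]) -> None:
        self.environments = environments
        self.deployed: dict[str, str] = {}

    def promote(self, digest: str, to_env: str) -> None:
        index = self.environments.index(to_env)
        if index > 0:
            previous = self.environments[index - 1]
            # Only an artifact already validated one stage earlier may advance.
            if self.deployed.get(previous) != digest:
                raise ValueError(f"digest was never validated in {previous}")
        self.deployed[to_env] = digest


# Stand-ins for two builds of the "same" commit with different dependencies.
build = b"api-binary built from commit a7f3d91"
rebuild = b"api-binary built from commit a7f3d91 (newer transitive dependency)"

log = PromotionLog(["dev", "staging", "canary", "production"])
log.promote(digest_of(build), "dev")
log.promote(digest_of(build), "staging")       # same bytes, same digest: allowed

try:
    log.promote(digest_of(rebuild), "canary")  # a rebuild is a different artifact
    rebuild_promoted = True
except ValueError:
    rebuild_promoted = False
```

Because the digest is derived from the bytes themselves, a rebuild that differs by even one byte gets a new identifier and has to start again from the first environment.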

Each gate in the promotion pipeline is an automated quality check. Gates fail fast and block promotion. Examples of gates at each stage:

Dev/CI: unit tests, linting, SAST (static application security testing), license scanning
Staging: integration tests, contract tests, load tests at a fraction of production traffic
Canary: real user traffic at 1–5%, automated comparison of error rates and latency percentiles (p50/p99) against the stable release
Production: progressive rollout (10% → 50% → 100%), SLO burn-rate monitors, automated rollback triggers
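The gates listed above can be sketched as ordinary functions that a pipeline runs in order and stops at on the first failure. This is a minimal sketch, not a real deployment system: Metrics, canary_gate, run_pipeline, and the thresholds (0.001 error-rate delta, 10% p99 regression) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Metrics:
    error_rate: float       # fraction of failed requests, e.g. 0.002
    p99_latency_ms: float


def canary_gate(stable: Metrics, canary: Metrics,
                max_error_delta: float = 0.001,
                max_latency_ratio: float = 1.10) -> bool:
    """Pass only if the canary is not meaningfully worse than stable."""
    error_ok = canary.error_rate - stable.error_rate <= max_error_delta
    latency_ok = canary.p99_latency_ms <= stable.p99_latency_ms * max_latency_ratio
    return error_ok and latency_ok


Gate = Callable[[str], bool]


def run_pipeline(digest: str, stages: list[tuple[str, list[Gate]]]) -> str:
    """Run stages in order; return the last stage whose gates all passed."""
    passed = "built"
    for env, gates in stages:
        for gate in gates:
            if not gate(digest):
                return passed   # fail fast: block promotion at the first failure
        passed = env
    return passed


stable = Metrics(error_rate=0.002, p99_latency_ms=180.0)
bad_canary = Metrics(error_rate=0.010, p99_latency_ms=185.0)

stages = [
    ("dev", [lambda d: True]),        # stand-in for unit tests, linting, SAST
    ("staging", [lambda d: True]),    # stand-in for integration and load tests
    ("canary", [lambda d: canary_gate(stable, bad_canary)]),
    ("production", [lambda d: True]),
]
result = run_pipeline("sha256:abc123", stages)
```

Here the canary's error rate is well above the stable release, so the artifact stops after staging and never reaches production. Every gate receives the same digest, which is what "every gate runs against the same binary" means in practice.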

Why Release Engineering Exists as a Separate Function

In the early days of a company, a single senior engineer handles deployments. This is fine until you have ten teams all releasing independently, stepping on each other's deployment windows, causing cascading failures, and arguing about who broke production at 2 AM. At scale, coordination fails unless it is systematized.

Release engineering introduces consistency — every team uses the same pipeline, the same artifact format, the same version scheme, the same rollback procedure. This consistency has downstream benefits that compound over time:

Incident response is faster: every on-call engineer knows exactly which artifact is running in production (the one whose version tag or digest is recorded in the deployment config repo) and exactly how to roll it back.
Compliance is automatable: artifact signing, SBOM (Software Bill of Materials) generation, and provenance attestation can be built into the pipeline once and applied to every release automatically.
Debugging cross-team issues is tractable: if a production incident involves services from three different teams, you can look up the exact artifact versions deployed at the time and trace back to the specific commits — because every artifact has a build provenance record.
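To make the cross-team lookup concrete, here is a small sketch that answers "which commit was each service running at the time of the incident?" from a deployment history and a provenance index. The data, the DEPLOYS and PROVENANCE structures, and the function names are invented for illustration; a real system would query the deployment tool and the registry.

```python
from datetime import datetime
from typing import Optional

# Hypothetical deployment history: (deployed_at, service, artifact digest).
DEPLOYS = [
    (datetime(2024, 5, 1, 9, 0), "api", "sha256:aaa"),
    (datetime(2024, 5, 2, 14, 30), "billing", "sha256:bbb"),
    (datetime(2024, 5, 3, 1, 15), "api", "sha256:ccc"),
]

# Hypothetical provenance index: artifact digest -> source commit.
PROVENANCE = {
    "sha256:aaa": "a7f3d91",
    "sha256:bbb": "4c2e810",
    "sha256:ccc": "9d01b7e",
}


def digest_running_at(service: str, when: datetime) -> Optional[str]:
    """Return the digest of the latest deploy of `service` at or before `when`."""
    latest = None
    for deployed_at, name, digest in DEPLOYS:
        if name == service and deployed_at <= when:
            if latest is None or deployed_at > latest[0]:
                latest = (deployed_at, digest)
    return latest[1] if latest else None


def commit_running_at(service: str, when: datetime) -> Optional[str]:
    """Trace the running artifact back to the commit that built it."""
    digest = digest_running_at(service, when)
    return PROVENANCE.get(digest) if digest else None


incident = datetime(2024, 5, 3, 2, 0)
suspects = {svc: commit_running_at(svc, incident) for svc in ("api", "billing")}
```

The lookup only works because every deploy is recorded by digest and every digest has a provenance record, which is the consistency the shared pipeline provides.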

Google SRE on release engineering: Google's SRE book dedicates an entire chapter to release engineering. Their key insight is that reliable releases require making the release process self-service for product teams while keeping the mechanics of releasing centrally maintained. Product teams define what to release and when; RelEng owns how. This division prevents each team from solving the same hard problems independently and badly.

Build Provenance: Knowing What You Shipped

A fundamental requirement of modern release engineering is provenance: the ability to answer, for any artifact in production, which source commit built it, on which CI runner, at what time, and with which dependency versions. Without provenance, a supply chain attack (like the SolarWinds breach, where attackers injected malicious code into the build process itself) is extremely hard to detect or scope after the fact.

The SLSA framework (Supply-chain Levels for Software Artifacts, pronounced "salsa") defines graduated levels of provenance assurance. The original v0.1 draft described four levels; the v1.0 specification keeps Build Levels 1 to 3 and leaves the former Level 4 requirements for future versions. Most production systems should target SLSA Level 3:

SLSA 1: provenance exists (the build process emits a record of how the artifact was produced)
SLSA 2: provenance is generated and signed by a hosted build service
SLSA 3: provenance is generated by a hardened, auditable build platform (e.g., GitHub Actions with OIDC signing, Google Cloud Build, Tekton Chains)
SLSA 4 (v0.1 only): hermetic, reproducible build; two-person review of all changes

At a practical level, SLSA 3 means your CI system generates a provenance attestation — a cryptographically signed JSON document that says "artifact sha256:abc123 was built from commit a7f3d91 in repository github.com/myorg/api by GitHub Actions workflow run 12345678". This attestation is stored alongside the artifact in the registry and verified at deployment time.

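# Example: generate SLSA provenance with slsa-github-generator + GitHub Actions OIDC
# In .github/workflows/release.yml

jobs:
  build:
    runs-on: ubuntu-24.04
    permissions:
      contents: read
      packages: write      # push the image to GHCR
    outputs:
      digest: ${{ steps.build.outputs.digest }}

    steps:
      - uses: actions/checkout@v4

      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push container image
        id: build            # exposes steps.build.outputs.digest
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ghcr.io/myorg/api:${{ github.sha }}

  # slsa-github-generator is a reusable workflow, so it runs as its own job
  # (jobs.<id>.uses) rather than as a step inside the build job.
  provenance:
    needs: [build]
    permissions:
      actions: read        # read workflow run metadata
      id-token: write      # required for OIDC provenance signing
      packages: write      # upload the attestation to the registry
    uses: slsa-framework/slsa-github-generator/.github/workflows/generator_container_slsa3.yml@v2.0.0
    with:
      image: ghcr.io/myorg/api
      digest: ${{ needs.build.outputs.digest }}
      registry-username: ${{ github.actor }}
    secrets:
      registry-password: ${{ secrets.GITHUB_TOKEN }}

# Verify provenance at deployment time. The certificate identity is the
# generator workflow that signed the attestation, not the source repository:
# cosign verify-attestation \
#   --type slsaprovenance \
#   --certificate-identity-regexp '^https://github.com/slsa-framework/slsa-github-generator/\.github/workflows/generator_container_slsa3\.yml@refs/tags/v[0-9]+\.[0-9]+\.[0-9]+$' \
#   --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
#   ghcr.io/myorg/api@sha256:abc123...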
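Once the signature has been verified, a deployment policy still has to compare the attestation's contents with what it expects to deploy. The sketch below shows that comparison on a heavily trimmed, hypothetical statement. The field names follow the in-toto Statement and SLSA provenance v0.2 layout as an assumption; check them against the attestation your generator actually emits. The function statement_matches is an illustrative name, and it deliberately does no signature checking.

```python
import json


def statement_matches(statement: dict, image_digest: str, source_prefix: str) -> bool:
    """Compare an already signature-verified statement with deploy expectations."""
    wanted = image_digest.removeprefix("sha256:")
    # The attestation must be about the exact artifact being deployed.
    subject_ok = any(
        subject.get("digest", {}).get("sha256") == wanted
        for subject in statement.get("subject", [])
    )
    # The build must have come from the expected source repository.
    source_uri = (
        statement.get("predicate", {})
        .get("invocation", {})
        .get("configSource", {})
        .get("uri", "")
    )
    return subject_ok and source_uri.startswith(source_prefix)


# Hypothetical, trimmed statement; real ones carry many more fields.
statement = json.loads("""
{
  "_type": "https://in-toto.io/Statement/v0.1",
  "predicateType": "https://slsa.dev/provenance/v0.2",
  "subject": [{"name": "ghcr.io/myorg/api", "digest": {"sha256": "abc123"}}],
  "predicate": {
    "invocation": {
      "configSource": {"uri": "git+https://github.com/myorg/api@refs/heads/main"}
    }
  }
}
""")

expected_source = "git+https://github.com/myorg/api@"
ok = statement_matches(statement, "sha256:abc123", expected_source)
wrong_artifact = statement_matches(statement, "sha256:def456", expected_source)
```

The expected source ends with "@" so that a repository such as myorg/api-fork cannot satisfy the prefix check by accident.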

The Artifact Registry: Central Source of Truth for Deployable Things

Release engineering needs a place to store artifacts between build and deployment — a place that is versioned, access-controlled, retention-managed, and queryable. This is the artifact registry (or artifact repository). Different ecosystems use different storage formats, but the concept is identical:

Container images: Docker Registry v2 protocol — AWS ECR, Google Artifact Registry, GitHub Container Registry (GHCR), JFrog Artifactory, Harbor
JVM artifacts: Maven/Gradle — Nexus, Artifactory, GitHub Packages
Python packages: PyPI-compatible — Artifactory, AWS CodeArtifact, Google Artifact Registry
Helm charts: OCI registries or ChartMuseum
Generic binaries / tarballs: S3, GCS, Azure Blob Storage, Artifactory Generic repos

A well-operated registry enforces three policies automatically:

Immutability: once an artifact with a given version tag is pushed, it cannot be overwritten. If you need to change it, you publish a new version. (Mutable tags like latest are acceptable only as a convenience alias, never as a deployment identifier.)
Vulnerability scanning: all pushed images are scanned against CVE databases (Trivy, Grype, Amazon Inspector) and flagged automatically. High-severity CVEs block promotion.
Retention policies: automatically delete artifacts older than N days or after N versions, except for artifacts tagged as release candidates or production releases, which are retained indefinitely or for the compliance period (typically 7 years in regulated industries).
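The immutability and retention rules above can be modelled in a few lines. This is a toy in-memory model for reasoning about the policies, not how any particular registry implements them; the Registry class and its methods are hypothetical.

```python
class Registry:
    """Toy registry: immutable tags plus count-based retention per prefix."""

    def __init__(self) -> None:
        self._tags: dict[str, str] = {}   # tag -> digest, kept in push order

    def push(self, tag: str, digest: str) -> None:
        existing = self._tags.get(tag)
        if existing is not None and existing != digest:
            # Immutability: a tag can never be repointed to different content.
            raise ValueError(f"tag {tag!r} is immutable")
        self._tags[tag] = digest

    def expire(self, prefix: str, keep_last: int) -> list[str]:
        """Delete all but the newest `keep_last` tags that start with `prefix`."""
        matching = [tag for tag in self._tags if tag.startswith(prefix)]
        surplus = len(matching) - keep_last
        expired = matching[:surplus] if surplus > 0 else []
        for tag in expired:
            del self._tags[tag]
        return expired

    def tags(self) -> list[str]:
        return list(self._tags)


registry = Registry()
for n in range(1, 6):
    registry.push(f"dev-{n}", f"sha256:d{n}")
registry.push("v1.4.2", "sha256:rel")

try:
    registry.push("v1.4.2", "sha256:other")   # attempt to overwrite a release
    overwrite_allowed = True
except ValueError:
    overwrite_allowed = False

expired = registry.expire("dev-", keep_last=3)
```

Release tags survive here simply because the retention rule is scoped to the dev- prefix, which mirrors the approach of exempting production tags by never matching them.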

# AWS ECR — set immutable tags and enable scanning on a new repository
aws ecr create-repository \
  --repository-name myorg/api \
  --image-tag-mutability IMMUTABLE \
  --image-scanning-configuration scanOnPush=true \
  --region us-east-1

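# Apply a lifecycle policy: keep only the last 30 images per build prefix.
# ECR selects an image only if it carries ALL prefixes listed in one rule,
# so "dev-" and "sha-" get separate rules. Release tags (prefix "v") match
# neither rule and are never expired.
aws ecr put-lifecycle-policy \
  --repository-name myorg/api \
  --region us-east-1 \
  --lifecycle-policy-text '{
    "rules": [
      {
        "rulePriority": 1,
        "description": "Retain last 30 dev- builds",
        "selection": {
          "tagStatus": "tagged",
          "tagPrefixList": ["dev-"],
          "countType": "imageCountMoreThan",
          "countNumber": 30
        },
        "action": {"type": "expire"}
      },
      {
        "rulePriority": 2,
        "description": "Retain last 30 sha- builds",
        "selection": {
          "tagStatus": "tagged",
          "tagPrefixList": ["sha-"],
          "countType": "imageCountMoreThan",
          "countNumber": 30
        },
        "action": {"type": "expire"}
      }
    ]
  }'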

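# List images with their digest and push date, oldest first (useful for incident triage).
# Sorting happens in the JMESPath query; piping table output through sort would scramble its borders.
aws ecr describe-images \
  --repository-name myorg/api \
  --region us-east-1 \
  --query 'sort_by(imageDetails,&imagePushedAt)[*].{Tag:imageTags[0],Digest:imageDigest,Pushed:imagePushedAt}' \
  --output table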

Never use mutable image tags as deployment identifiers in production. Tagging an image :latest or :stable and referencing that tag in your Kubernetes manifests means a node restart or a pod rescheduling event can pull a completely different image than what you originally deployed — silently, with no audit trail. Always pin deployments to the immutable content-addressed digest (sha256:...) or to an immutable version tag like v1.4.2. Many teams use both: the version tag for human readability, the digest for enforcement in the manifest.
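A pipeline can enforce the digest rule mechanically. The following is a minimal sketch of such a check, assuming image references are available as plain strings; the helper names and the placeholder digest are illustrative only.

```python
import re

# Matches references that end in a full sha256 digest: <repo>[:tag]@sha256:<64 hex chars>.
DIGEST_PINNED = re.compile(r"^\S+@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """True if the reference is pinned to an immutable content digest."""
    return DIGEST_PINNED.match(image_ref) is not None


def unpinned_images(image_refs: list[str]) -> list[str]:
    """Return the references a deployment check should reject."""
    return [ref for ref in image_refs if not is_digest_pinned(ref)]


full_digest = "sha256:" + "a" * 64   # placeholder digest for illustration
manifest_images = [
    "ghcr.io/myorg/api:v1.4.2@" + full_digest,   # tag for humans, digest for enforcement
    "ghcr.io/myorg/api:latest",
    "ghcr.io/myorg/api:v1.4.2",
]
violations = unpinned_images(manifest_images)
```

In this example the two tag-only references are reported as violations, while the reference that carries both the version tag and the digest passes.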

From Theory to Practice: What a Release Engineer Does Monday Morning

Abstracting release engineering to principles risks making it sound purely theoretical. In practice, a release engineer on Monday morning might:

Review the weekend's automated promotion run — check which services advanced from staging to canary, which failed their canary SLO checks and stayed back, and why.
Triage a CVE that Trivy flagged on a base image used by 12 services — determine severity, whether a patched base image is available, and coordinate an expedited rebuild.
Review a pull request from a product team asking to add a new integration test gate to their promotion pipeline.
Update the SLSA attestation verification policy in the deployment admission controller to start enforcing provenance on a new class of services that just went to production.
Run a chaos exercise: deliberately mark a production artifact as "quarantined" in the registry and verify that the deployment system refuses to promote it.
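The quarantine exercise and the provenance policy in the list above both come down to an admission decision made just before deployment. A minimal sketch of that decision, assuming the quarantine list and the set of verified attestations are available as plain sets (the admit function and the digests are hypothetical):

```python
def admit(digest: str, quarantined: set[str], attested: set[str]) -> tuple[bool, str]:
    """Decide whether an artifact may be deployed, with a reason for the audit log."""
    if digest in quarantined:
        return False, "artifact is quarantined"
    if digest not in attested:
        return False, "no verified provenance attestation"
    return True, "ok"


attested = {"sha256:aaa", "sha256:bbb"}
quarantined = {"sha256:bbb"}   # deliberately quarantined for the exercise

healthy = admit("sha256:aaa", quarantined, attested)
blocked = admit("sha256:bbb", quarantined, attested)
unknown = admit("sha256:zzz", quarantined, attested)
```

The exercise passes if the quarantined digest is refused even though it has a valid attestation, which is why the quarantine check runs first.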

The discipline is concrete, hands-on, and deeply cross-cutting. The remaining lessons in this series (versioning, packaging, changelogs, hotfixes) build on this foundation. The next lesson dives into the first building block: Semantic Versioning and the release schemes used by different types of software at scale.