Capstone: A Big-Tech Production Platform

CI/CD & GitOps Delivery

The delivery pipeline is where platform investment becomes business velocity. At Google, a single engineer's commit can reach production in under an hour — automatically tested, scanned, built into an immutable artifact, promoted through environments, and progressively rolled out to a fraction of traffic before full release. That outcome is not accidental. It is the result of deliberate pipeline architecture, a GitOps control loop, and a progressive delivery strategy that lets teams ship confidently at 100+ deploys per day without burning out an on-call rotation.

This lesson traces the complete path from a developer's git push to a production canary, covering the engineering decisions that separate a real big-tech delivery system from a toy CI script.

Phase 1 — Pull Request and CI Gates

The pipeline starts before a line of code merges. The CI system must enforce a hard quality gate on the PR itself. At scale, this means running everything in parallel against an ephemeral, isolated environment rather than sharing a single long-lived staging cluster that becomes a coordination bottleneck.

A production-grade CI job enforces four gates: lint and static analysis; unit and integration tests with real service dependencies (Testcontainers or in-cluster ephemeral namespaces); security scanning (SAST via Semgrep, SCA via Trivy or Grype for dependency CVEs); and finally a container image build. The first three run in parallel, and the image build starts only after all of them pass. Only if all four gates pass does CI mark the PR as mergeable. Branch protection rules enforce this: no bypass, no override, not even for staff engineers.

# .github/workflows/ci.yaml  (GitHub Actions, runs on every PR)
name: CI Gate

on:
  pull_request:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run golangci-lint
        uses: golangci/golangci-lint-action@v6
        with:
          version: v1.59

  test:
    runs-on: ubuntu-latest
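    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: test
        ports:
          - 5432:5432   # publish to the runner host so the tests can reach the database
        options: >-
          --health-cmd pg_isready --health-interval 10s --health-timeout 5s --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5   # pin the toolchain to the version declared in go.mod
        with:
          go-version-file: go.mod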
      - name: Run tests with coverage
        run: go test -race -coverprofile=coverage.out ./...
      - name: Enforce 80% coverage floor
        run: |
          COVERAGE=$(go tool cover -func coverage.out | grep total | awk '{print $3}' | tr -d '%')
          if (( $(echo "$COVERAGE < 80" | bc -l) )); then echo "Coverage $COVERAGE% below 80%"; exit 1; fi

  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: SAST (Semgrep)
        uses: semgrep/semgrep-action@v1
        with:
          config: p/ci
      - name: SCA (Trivy filesystem scan)
        uses: aquasecurity/trivy-action@0.24.0
        with:
          scan-type: fs
          severity: HIGH,CRITICAL
          exit-code: 1

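  build:
    needs: [lint, test, scan]
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write      # push to the registry (assumes GHCR)
      id-token: write      # OIDC token for Cosign keyless signing
    env:
      REGISTRY: ghcr.io/org   # placeholder: replace with your registry and namespace
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3   # the gha cache backend needs a Buildx builder
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build & push image
        id: build
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ${{ env.REGISTRY }}/myapp:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
      - uses: sigstore/cosign-installer@v3
      - name: Sign image with Cosign (keyless, by digest)
        run: cosign sign --yes ${{ env.REGISTRY }}/myapp@${{ steps.build.outputs.digest }}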

Immutable artifacts, commit-addressed tags. Every image is tagged with the full Git SHA, never latest and never a branch name. A SHA tag is deterministic, and as long as the registry rejects tag overwrites (enable tag immutability where the registry supports it, or deploy by digest), the same tag never points to a different binary. This is a prerequisite for trustworthy GitOps promotion and meaningful rollback. Use Cosign keyless signing (OIDC-based, via GitHub Actions OIDC) so that any system can later verify which pipeline produced the image, without managing long-lived signing keys.
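
As a rough illustration of the verification side, a later pipeline stage could check the keyless signature before promoting an image. This is a sketch under assumptions: the repository path org/app-repo and the REGISTRY variable are placeholders, and the identity pattern assumes the image was signed by the ci.yaml workflow shown above.

```yaml
# Hypothetical promotion-time check; adjust the identity pattern to your repo and workflow.
- uses: sigstore/cosign-installer@v3
- name: Verify image signature before promotion
  run: |
    cosign verify \
      --certificate-oidc-issuer https://token.actions.githubusercontent.com \
      --certificate-identity-regexp '^https://github.com/org/app-repo/\.github/workflows/ci\.yaml@' \
      ${{ env.REGISTRY }}/myapp:${{ github.sha }}
```

The two certificate flags pin the signature to a specific OIDC issuer and workflow identity, which is what lets a consumer trust the image without any shared signing key.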

Phase 2 — GitOps Promotion Through Environments

Once the artifact is built and pushed, a CI job — not a human — opens a pull request against the GitOps config repository. This is the handoff point between the application team's world (code) and the platform's world (desired state). The config repo contains Kustomize overlays or Helm value files for each environment: dev/, staging/, production/. The PR bumps the image tag in the relevant overlay. Argo CD (or Flux) detects the merge and syncs the cluster.

Figure: Full delivery pipeline, from PR merge through CI gates and artifact registry to GitOps-driven environment promotion and production canary release.

The promotion model between environments is explicit and auditable. Dev auto-syncs on every merge. Staging promotion is triggered either automatically after dev smoke tests pass (for low-risk services) or via a manual approval step in the CI pipeline (for services with strict SLOs). Production promotion is always gated: an engineer approves the GitOps PR, Argo CD syncs, and an Argo Rollouts progressive delivery strategy takes over from there.
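
The dev auto-sync behavior could be expressed with an Argo CD Application along these lines. Treat it as a sketch: the Application name, the argocd and myapp-dev namespaces, the default project, and the repository URL are assumptions chosen to match the examples in this lesson.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp-dev            # hypothetical name
  namespace: argocd          # assumes Argo CD runs in the argocd namespace
spec:
  project: default
  source:
    repoURL: https://github.com/org/config-repo
    targetRevision: main
    path: k8s/overlays/dev   # the dev overlay from the config repo
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD itself runs in
    namespace: myapp-dev
  syncPolicy:
    automated:               # sync on every merge to main without human action
      prune: true
      selfHeal: true
```

A staging or production Application would point path at its own overlay; the approval gate lives in the config-repo PR rather than in this manifest.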

# config-repo structure (Kustomize)
k8s/
  base/
    deployment.yaml
    service.yaml
  overlays/
    dev/
      kustomization.yaml      # image tag: sha-abc123
    staging/
      kustomization.yaml      # image tag: sha-abc123  (promoted after dev gates)
    production/
      kustomization.yaml      # image tag: sha-def456  (previous stable release)
      rollout.yaml            # Argo Rollouts canary strategy

# CI job that opens the config-repo PR after artifact is built:
# (runs in the app-repo CI, separate job after 'build')
- name: Bump dev image tag
  env:
    GH_TOKEN: ${{ secrets.CONFIG_REPO_TOKEN }}   # the gh CLI reads its token from GH_TOKEN
  run: |
    git clone https://x-access-token:${{ secrets.CONFIG_REPO_TOKEN }}@github.com/org/config-repo
    cd config-repo
    git checkout -b bump-dev-${{ github.sha }}
    yq e -i '.images[0].newTag = "${{ github.sha }}"' k8s/overlays/dev/kustomization.yaml
    git config user.email "ci-bot@org.com"
    git config user.name "ci-bot"
    git commit -am "chore: bump dev image to ${{ github.sha }}"
    # The branch must exist on the remote before a PR can be opened against it.
    git push origin bump-dev-${{ github.sha }}
    gh pr create --title "Deploy ${{ github.sha }} to dev" --body "Auto-promotion from CI" --base main
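
For context, the yq expression in the CI step assumes the overlay's kustomization.yaml has an images list whose first entry is the application image. A minimal sketch of such a file follows; the image name, registry path, and tag value are illustrative placeholders.

```yaml
# k8s/overlays/dev/kustomization.yaml (illustrative shape)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: myapp                  # image name as written in base/deployment.yaml (assumed)
    newName: ghcr.io/org/myapp   # placeholder registry path
    newTag: 0000000000000000000000000000000000000000   # CI overwrites this with the commit SHA
```

If an overlay ever lists more than one image, selecting the entry by name instead of by index keeps the bump from touching the wrong image.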

Phase 3 — Progressive Delivery with Argo Rollouts

A production deployment that flips 100% of traffic instantly is not a deployment strategy — it is a bet. Progressive delivery shifts the risk model: you expose the new version to a small percentage of real production traffic, verify key SLOs (error rate, p99 latency, business metrics via analysis templates), and either proceed or roll back automatically. No human needs to be awake at 03:00 watching dashboards.

# k8s/overlays/production/rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
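spec:
  replicas: 40
  workloadRef:                  # assumption: reuse the pod template from base/deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    name: myapp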
  strategy:
    canary:
      canaryService: myapp-canary
      stableService: myapp-stable
      trafficRouting:
        istio:
          virtualService:
            name: myapp-vsvc
          destinationRule:
            name: myapp-destrule
            canarySubsetName: canary
            stableSubsetName: stable
      steps:
        - setWeight: 5          # 5% canary traffic for 5 min
        - pause: {duration: 5m}
        - analysis:             # automated SLO check
            templates:
              - templateName: success-rate
        - setWeight: 25
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: success-rate
              - templateName: latency-p99
        - pause: {}             # indefinite pause: require explicit promotion in prod
        - setWeight: 100        # (autoPromotionEnabled applies only to blueGreen, not canary)

---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.995      # 99.5% success rate floor
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
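          # Caveat: job="myapp" matches stable and canary pods alike, so the stable
          # fleet can mask a failing canary. Narrow the selector to canary pods where
          # your metrics carry a label that distinguishes them.
          query: |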
            sum(rate(http_requests_total{job="myapp",status!~"5.."}[2m]))
            /
            sum(rate(http_requests_total{job="myapp"}[2m]))

Metric choice for analysis templates matters more than the canary percentage. Error rate alone is a lagging signal: by the time it spikes, users have already seen failures. Add a latency analysis template (p99 > 200ms triggers rollback), and for revenue-critical services, add a custom metric tracking business events (orders placed, checkout completions) via a counter in your application that Prometheus scrapes. A 15% drop in conversion rate in canary traffic is a stronger signal than any infrastructure metric.
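
The Rollout above references a latency-p99 template that is not shown. One possible shape is sketched here; the histogram metric name http_request_duration_seconds_bucket is an assumption about how the application is instrumented, and the threshold mirrors the 200 ms figure mentioned above.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-p99
spec:
  metrics:
    - name: latency-p99
      interval: 1m
      successCondition: result[0] <= 0.2   # 200 ms, expressed in seconds
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(
              0.99,
              sum by (le) (rate(http_request_duration_seconds_bucket{job="myapp"}[2m]))
            )
```

A business-metric template would follow the same structure, swapping the query for a ratio of canary to stable conversion counters.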

Production Failure Modes You Will Actually Hit

At scale, these failure modes recur enough to be worth designing against from day one:

Config-repo PR merge race. Two services promoted simultaneously both modify the same overlay file. Git conflict on automerge blocks both deployments. Mitigation: scope each service to its own Kustomize path; use yq to target a precise field, not line-based sed.
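Argo CD sync storm. A change to the base Kustomize directory triggers a sync of all overlays simultaneously across 50 services. All 50 deployments start rolling at once, saturating cluster pod scheduling capacity. Mitigation: setting syncPolicy.automated.prune: false does not help here, because it only controls deletion of removed resources. Instead, turn off automated sync for the affected Applications (or apply an Argo CD sync window) while a base change lands, then sync services in waves; gate bulk base changes behind a staged rollout manifest.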
Image pull latency blocking rollout. A 1.2 GB Java fat-jar image takes 4 minutes to pull on a cold node, so new pods stay unready long enough that the rollout stalls or aborts even though the application is healthy. Mitigation: enforce a <200 MB image size limit in CI (for example, fail the build on the size reported by docker image inspect); use multi-stage builds aggressively; enable lazy image pulling (the containerd stargz snapshotter with eStargz images) on your node class for large images.
Analysis template false positive at low traffic. Traffic is so low in canary (5% of 40 replicas = 2 pods) that statistical noise causes a 99.5% success-rate threshold to fail. Mitigation: use a minimum request count guard in your PromQL, so that the metric is not evaluated until at least 100 requests have been observed in the interval.
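
The minimum-request guard from the last item could look roughly like the following. It is a sketch, not a drop-in: the template name is hypothetical, and the 100-request threshold and 2m window are taken from the examples in this lesson.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate-guarded   # hypothetical name
spec:
  metrics:
    - name: success-rate
      interval: 1m
      # An empty result means the guard suppressed the sample (too little traffic),
      # so treat it as a pass instead of indexing into an empty vector.
      successCondition: len(result) == 0 || result[0] >= 0.995
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            (
              sum(rate(http_requests_total{job="myapp",status!~"5.."}[2m]))
              /
              sum(rate(http_requests_total{job="myapp"}[2m]))
            )
            and on()
            (sum(increase(http_requests_total{job="myapp"}[2m])) >= 100)
```

The and on() clause drops the ratio whenever fewer than 100 requests arrived in the window. The trade-off is that a canary receiving almost no traffic passes analysis by default, so the pause steps in the Rollout still matter.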

Never route canary traffic based on pod percentage alone in Istio/Envoy environments. Without an explicit VirtualService weight, Kubernetes' default Service load balancing spreads traffic across ready endpoints in proportion to their count, not your intended canary weight. If your stable Deployment has 38 pods and your canary has 2, you get roughly 5%, but a single canary pod crash cuts that to about 2.6%. Always wire Argo Rollouts to an Istio VirtualService or an AWS ALB weighted target group so that traffic weight is an explicit, load-balancer-enforced setting independent of replica count.
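
For reference, the Istio objects that the Rollout above points at might start out like this. The host name myapp and the app: myapp label are assumptions; the object and subset names match the Rollout's trafficRouting block.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp-vsvc
spec:
  hosts:
    - myapp                  # assumed in-mesh service host
  http:
    - name: primary
      route:
        - destination:
            host: myapp
            subset: stable
          weight: 100        # Argo Rollouts rewrites these weights at each setWeight step
        - destination:
            host: myapp
            subset: canary
          weight: 0
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp-destrule
spec:
  host: myapp
  subsets:
    - name: canary
      labels:
        app: myapp           # Argo Rollouts adds the pod-template-hash label per subset
    - name: stable
      labels:
        app: myapp
```

Because the split is enforced in the sidecar proxies, the 5% and 25% steps hold even while the canary ReplicaSet is scaling or a pod restarts.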

Scale Considerations and Pipeline Throughput

A team of 10 engineers might run 15 deploys per day. A platform serving 300 service teams may need 500+ deploys per day. The architectural differences are non-trivial. At 500 daily deploys, CI runner capacity becomes a first-order cost and latency concern. GitHub Actions hosted runners can add 30–60 seconds of cold-start latency per job; multiplied across parallel jobs at scale, queue time can dominate pipeline duration. Large companies run self-hosted runner fleets (Actions Runner Controller on Kubernetes, or AWS CodeBuild-backed runners) to cut cold-start latency and control runner class (memory-optimized for integration tests, ARM Graviton for build cost savings).

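Argo CD at scale requires careful ApplicationSet and App-of-Apps design. As a rule of thumb, a single Argo CD instance can manage roughly 2,000 Applications before controller memory and API-server watch overhead become a problem. Beyond that, shard the application controller across replicas (scale the controller StatefulSet and set ARGOCD_CONTROLLER_REPLICAS to match; shards are assigned per managed cluster, so this helps only when load is spread over several clusters), or split into multiple Argo CD instances, each using ApplicationSets with a cluster generator to stamp out Applications for the clusters it owns.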
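
A cluster-generator ApplicationSet might be sketched as follows. The env: production cluster label, the namespaces, and the repository URL are assumptions for illustration; the generator fills in name and server from each cluster registered with that Argo CD instance.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: myapp
  namespace: argocd              # assumes Argo CD runs in the argocd namespace
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            env: production      # hypothetical label on registered cluster secrets
  template:
    metadata:
      name: 'myapp-{{name}}'     # one Application per matching cluster
    spec:
      project: default
      source:
        repoURL: https://github.com/org/config-repo
        targetRevision: main
        path: k8s/overlays/production
      destination:
        server: '{{server}}'
        namespace: myapp
```

Each Argo CD instance then carries only the Applications for its own clusters, which keeps per-controller memory and watch load bounded.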