Capstone: A Big-Tech Production Platform

CI/CD & GitOps Delivery

18 min Lesson 5 of 30

The delivery pipeline is where platform investment becomes business velocity. At Google, a single engineer's commit can reach production in under an hour — automatically tested, scanned, built into an immutable artifact, promoted through environments, and progressively rolled out to a fraction of traffic before full release. That outcome is not accidental. It is the result of deliberate pipeline architecture, a GitOps control loop, and a progressive delivery strategy that lets teams ship confidently at 100+ deploys per day without burning out an on-call rotation.

This lesson traces the complete path from a developer's git push to a production canary, covering the engineering decisions that separate a real big-tech delivery system from a toy CI script.

Phase 1 — Pull Request and CI Gates

The pipeline starts before a line of code merges. The CI system must enforce a hard quality gate on the PR itself. At scale, this means running the gates in parallel against an ephemeral, isolated environment rather than sharing a single long-lived staging cluster that becomes a coordination bottleneck.

A production-grade CI pipeline runs three gates in parallel: lint and static analysis; unit and integration tests with real service dependencies (Testcontainers or in-cluster ephemeral namespaces); and security scanning (SAST via Semgrep, SCA via Trivy or Grype for dependency CVEs). A container image build follows only after all three pass. Only if every job succeeds does CI mark the PR as mergeable. Branch protection rules enforce this: no bypass, no override, not even for staff engineers.

# .github/workflows/ci.yaml (GitHub Actions, runs on every PR)
name: CI Gate
on:
  pull_request:
    branches: [main]
env:
  # Assumption: REGISTRY was never defined in the original; ghcr.io/org is a placeholder.
  REGISTRY: ghcr.io/org
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run golangci-lint
        uses: golangci/golangci-lint-action@v6
        with:
          version: v1.59
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: test
        ports:
          - 5432:5432   # expose Postgres to tests running on the runner host
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version-file: go.mod
      - name: Run tests with coverage
        run: go test -race -coverprofile=coverage.out ./...
      - name: Enforce 80% coverage floor
        run: |
          COVERAGE=$(go tool cover -func coverage.out | grep total | awk '{print $3}' | tr -d '%')
          if (( $(echo "$COVERAGE < 80" | bc -l) )); then
            echo "Coverage $COVERAGE% below 80%"; exit 1
          fi
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: SAST (Semgrep)
        uses: semgrep/semgrep-action@v1
        with:
          config: p/ci
      - name: SCA (Trivy filesystem scan)
        uses: aquasecurity/trivy-action@0.24.0
        with:
          scan-type: fs
          severity: HIGH,CRITICAL
          exit-code: 1
  build:
    needs: [lint, test, scan]
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
      id-token: write   # required for Cosign keyless (OIDC) signing
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3   # the gha cache backend needs Buildx
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build & push image
        id: build
        uses: docker/build-push-action@v6
        with:
          push: true
          tags: ${{ env.REGISTRY }}/myapp:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
      - uses: sigstore/cosign-installer@v3
      - name: Sign image with Cosign (by digest rather than by tag)
        run: cosign sign --yes ${{ env.REGISTRY }}/myapp@${{ steps.build.outputs.digest }}
Immutable artifacts, commit-addressed tags. Every image is tagged with the full Git SHA, never latest, never a branch name. The SHA tag is deterministic, and with tag immutability enabled in the registry the same tag will never point to a different binary; the image digest gives the same guarantee by construction, which is why signatures should reference the digest. This is a prerequisite for trustworthy GitOps promotion and meaningful rollback. Use Cosign keyless signing (OIDC-based, via GitHub Actions OIDC) so that any system can later verify which pipeline produced the image, without managing long-lived signing keys.
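
Signing only pays off if something verifies the signature later. As a minimal sketch, a downstream job could check that the image was signed by the expected workflow. The image path ghcr.io/org/myapp and the repository name org/app-repo are placeholders assumed for illustration, not values taken from the pipeline above.

```yaml
# Hypothetical verification steps for a deploy-time job.
# Assumed names: ghcr.io/org/myapp (image), org/app-repo (source repository).
- uses: sigstore/cosign-installer@v3
- name: Verify image signature (keyless)
  env:
    IMAGE: ghcr.io/org/myapp:${{ github.sha }}
  run: |
    # Accept only signatures issued to the CI workflow of the assumed app repo,
    # with GitHub Actions as the OIDC issuer.
    cosign verify \
      --certificate-identity-regexp '^https://github\.com/org/app-repo/\.github/workflows/ci\.yaml@' \
      --certificate-oidc-issuer https://token.actions.githubusercontent.com \
      "$IMAGE"
```

Pinning both the certificate identity and the issuer matters: without them, verification would accept a signature from any keyless signer.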

Phase 2 — GitOps Promotion Through Environments

Once the artifact is built and pushed, a CI job — not a human — opens a pull request against the GitOps config repository. This is the handoff point between the application team's world (code) and the platform's world (desired state). The config repo contains Kustomize overlays or Helm value files for each environment: dev/, staging/, production/. The PR bumps the image tag in the relevant overlay. Argo CD (or Flux) detects the merge and syncs the cluster.
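
To make the "bump the image tag" step concrete, here is a sketch of what a dev overlay might look like. The image name and registry path are assumptions, and the tag is a placeholder that CI would overwrite.

```yaml
# k8s/overlays/dev/kustomization.yaml (illustrative sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: myapp                   # image name as written in base/deployment.yaml (assumed)
    newName: ghcr.io/org/myapp    # assumed registry path
    newTag: "<full-git-sha>"      # placeholder; the promotion PR changes only this field
```

Because the promotion PR touches a single field, the diff a reviewer sees is one line: the old SHA replaced by the new one.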

[Figure: CI/CD & GitOps Pipeline, PR to Production. Stages, left to right: PR / Push (app-repo) → CI Gates (lint, test, SAST/SCA, build + sign; parallel jobs) → Artifact (OCI image, SHA tag, Cosign signature; registry push) → GitOps PR (bump tag in config-repo dev overlay; auto-opened PR) → Argo CD sync (reconcile loop; dev → staging → prod canary).]
Full delivery pipeline: from PR merge through CI gates and artifact registry to GitOps-driven environment promotion and production canary release.

The promotion model between environments is explicit and auditable. Dev auto-syncs on every merge. Staging promotion is triggered either automatically after dev smoke tests pass (for low-risk services) or via a manual approval step in the CI pipeline (for services with strict SLOs). Production promotion is always gated: an engineer approves the GitOps PR, Argo CD syncs, and an Argo Rollouts progressive delivery strategy takes over from there.
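
The "dev auto-syncs on every merge" behavior maps onto an Argo CD Application with an automated sync policy. The following is a sketch under assumptions: the Application name, target namespace, and repository URL are illustrative.

```yaml
# Hypothetical Argo CD Application for the dev overlay.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp-dev
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/config-repo   # assumed URL
    targetRevision: main
    path: k8s/overlays/dev
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp-dev                          # assumed namespace
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # revert manual drift in the cluster
```

A production Application would point at k8s/overlays/production in the same way; the gate there is the human approval on the GitOps PR, so the merge itself remains the deployment trigger.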

# config-repo structure (Kustomize)
k8s/
  base/
    deployment.yaml
    service.yaml
  overlays/
    dev/
      kustomization.yaml      # image tag: sha-abc123
    staging/
      kustomization.yaml      # image tag: sha-abc123 (promoted after dev gates)
    production/
      kustomization.yaml      # image tag: sha-def456 (previous stable release)
      rollout.yaml            # Argo Rollouts canary strategy

# CI job that opens the config-repo PR after artifact is built:
# (runs in the app-repo CI, separate job after 'build')
- name: Bump dev image tag
  env:
    GH_TOKEN: ${{ secrets.CONFIG_REPO_TOKEN }}   # the gh CLI reads its token from GH_TOKEN
  run: |
    git clone https://x-access-token:${{ secrets.CONFIG_REPO_TOKEN }}@github.com/org/config-repo
    cd config-repo
    git config user.email "ci-bot@org.com"
    git config user.name "ci-bot"
    git checkout -b bump-dev-${{ github.sha }}
    yq e -i '.images[0].newTag = "${{ github.sha }}"' k8s/overlays/dev/kustomization.yaml
    git commit -am "chore: bump dev image to ${{ github.sha }}"
    # The branch must exist on the remote before a PR can be opened from it.
    git push origin bump-dev-${{ github.sha }}
    gh pr create --title "Deploy ${{ github.sha }} to dev" --body "Auto-promotion from CI" --base main

Phase 3 — Progressive Delivery with Argo Rollouts

A production deployment that flips 100% of traffic instantly is not a deployment strategy — it is a bet. Progressive delivery shifts the risk model: you expose the new version to a small percentage of real production traffic, verify key SLOs (error rate, p99 latency, business metrics via analysis templates), and either proceed or roll back automatically. No human needs to be awake at 03:00 watching dashboards.

# k8s/overlays/production/rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 40
  # selector and pod template omitted for brevity
  strategy:
    canary:
      canaryService: myapp-canary
      stableService: myapp-stable
      trafficRouting:
        istio:
          virtualService:
            name: myapp-vsvc
          destinationRule:
            name: myapp-destrule
            canarySubsetName: canary
            stableSubsetName: stable
      steps:
        - setWeight: 5              # 5% canary traffic for 5 min
        - pause: {duration: 5m}
        - analysis:                 # automated SLO check
            templates:
              - templateName: success-rate
        - setWeight: 25
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: success-rate
              - templateName: latency-p99   # defined separately
        # Require explicit promotion in prod: an indefinite pause waits for a manual
        # promote. (autoPromotionEnabled is a blueGreen field and has no effect here.)
        - pause: {}
        - setWeight: 100
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 1m
      count: 5                      # illustrative; bounds the run so the inline step can complete
      successCondition: result[0] >= 0.995   # 99.5% success rate floor
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          # Note: job="myapp" matches stable and canary pods alike; add a
          # canary-only label matcher if your metrics carry one.
          query: |
            sum(rate(http_requests_total{job="myapp",status!~"5.."}[2m]))
            /
            sum(rate(http_requests_total{job="myapp"}[2m]))
Metric choice for analysis templates matters more than the canary percentage. Error rate alone is a lagging signal: by the time it spikes, users have already seen failures. Add a latency analysis template (p99 > 200ms triggers rollback), and for revenue-critical services, add a custom metric tracking business events (orders placed, checkout completions) via a counter in your application that Prometheus scrapes. A 15% drop in conversion rate in canary traffic carries more signal than any infrastructure metric.
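
The latency-p99 template referenced by the Rollout could look like the sketch below. It assumes the application exports a Prometheus histogram named http_request_duration_seconds (a hypothetical metric name, measured in seconds); the 200ms threshold comes from the text above, and count is an illustrative value.

```yaml
# Hypothetical latency AnalysisTemplate; the histogram metric name is an assumption.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-p99
spec:
  metrics:
    - name: latency-p99
      interval: 1m
      count: 5                                # illustrative; bounds the analysis run
      successCondition: result[0] <= 0.2      # p99 at or below 200ms (seconds)
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(
              0.99,
              sum(rate(http_request_duration_seconds_bucket{job="myapp"}[2m])) by (le)
            )
```

A business-metric template would follow the same shape, with the query comparing the canary's event rate against the stable baseline.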

Production Failure Modes You Will Actually Hit

At scale, these failure modes recur often enough to be worth designing against from day one:

  • Config-repo PR merge race. Two services promoted simultaneously both modify the same overlay file. Git conflict on automerge blocks both deployments. Mitigation: scope each service to its own Kustomize path; use yq to target a precise field, not line-based sed.
  • Argo CD sync storm. A change to the base Kustomize directory triggers a sync of all overlays simultaneously across 50 services. All 50 deployments start rolling at once, saturating cluster pod scheduling capacity. Mitigation: keep syncPolicy.automated.prune: false initially so a bad base change cannot delete resources across the fleet, and gate bulk base changes behind a staged rollout manifest so that services pick up the new base in waves rather than all at once.
  • Image pull latency blocking rollout. A 1.2 GB Java fat-jar image takes 4 minutes to pull on a cold node. Readiness probes do not start until the image is pulled, so new pods can sit in ContainerCreating past a tight rollout progress deadline and trigger an abort even though the application is healthy. Mitigation: enforce a <200 MB image size limit in CI (for example, by checking the size reported by docker image inspect); use multi-stage builds aggressively; enable lazy image pulling in containerd (the stargz snapshotter with eStargz images) on your node class for large images.
  • Analysis template false positive at low canary traffic. Canary traffic is so low (5% of 40 replicas = 2 pods) that statistical noise alone can push the success rate below the 99.5% threshold. Mitigation: use a minimum request count guard in your PromQL, so that the metric is not evaluated until at least 100 requests have been observed in the interval.
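
One way to express the minimum request count guard from the last item is directly in PromQL. In this sketch, the template name and count are illustrative, and treating "too few requests" as a pass (returning 1) is a design assumption; some teams prefer to mark such runs inconclusive instead.

```yaml
# Hypothetical guarded variant of the success-rate template.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate-guarded
spec:
  metrics:
    - name: success-rate
      interval: 1m
      count: 5                                 # illustrative
      successCondition: result[0] >= 0.995
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          # Return the real ratio only when at least 100 requests were seen in
          # the window; otherwise fall back to 1 so noise cannot fail the run.
          query: |
            (
              (
                sum(rate(http_requests_total{job="myapp",status!~"5.."}[2m]))
                /
                sum(rate(http_requests_total{job="myapp"}[2m]))
              )
              and on()
              (sum(increase(http_requests_total{job="myapp"}[2m])) >= 100)
            )
            or on() vector(1)
```

The `and on()` clause drops the ratio when the request-count filter returns nothing, and `or on() vector(1)` supplies the fallback value only in that case.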
Never route canary traffic based on pod percentage alone in Istio/Envoy environments. Without an explicit VirtualService weight, a plain Kubernetes Service spreads traffic roughly evenly across ready endpoints, so the split follows pod counts rather than your intended canary weight. If your stable set has 38 pods and your canary has 2, you get about 5%, but a single pod crash in either subset swings the ratio materially. Always wire Argo Rollouts to an Istio VirtualService or an AWS ALB weighted target group so that traffic weight is an explicit, load-balancer-enforced setting independent of replica count.
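
For the Istio case, the Rollout needs a VirtualService and DestinationRule whose names match its trafficRouting block. The sketch below assumes a Service host named myapp and an app: myapp pod label; Argo Rollouts rewrites the two weights at each setWeight step.

```yaml
# Hypothetical Istio objects; the host and labels are assumptions.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp-vsvc
spec:
  hosts:
    - myapp
  http:
    - name: primary
      route:
        - destination:
            host: myapp
            subset: stable
          weight: 100        # managed by Argo Rollouts during a rollout
        - destination:
            host: myapp
            subset: canary
          weight: 0          # managed by Argo Rollouts during a rollout
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp-destrule
spec:
  host: myapp
  subsets:
    - name: canary           # matches canarySubsetName in the Rollout
      labels:
        app: myapp           # Rollouts injects the pod-template-hash label
    - name: stable           # matches stableSubsetName in the Rollout
      labels:
        app: myapp
```

With this in place, a canary pod crash changes capacity but not the share of requests routed to the canary subset.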

Scale Considerations and Pipeline Throughput

A team of 10 engineers might run 15 deploys per day. A platform serving 300 service teams will need 500+ deploys per day. The architectural differences are non-trivial. At 500 daily deploys, CI runner capacity becomes a first-order cost and latency concern. GitHub Actions hosted runners have cold-start latency of 30–60 seconds per job — multiply that across parallel jobs at scale and the queue time dominates pipeline duration. Large companies run self-hosted runner fleets (Actions Runner Controller on Kubernetes, or AWS CodeBuild-backed runners) to eliminate cold-start and control runner class (memory-optimized for integration tests, ARM Graviton for build cost savings).

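Argo CD at scale requires careful ApplicationSet and App-of-Apps design. As a rough guide, a single Argo CD instance can manage ~2,000 Applications before controller memory and API-server watch overhead becomes a problem. Beyond that, shard the application controller by scaling the argocd-application-controller StatefulSet and setting ARGOCD_CONTROLLER_REPLICAS to match (sharding is by managed cluster, so it helps only when Applications are spread across several clusters), or split to multiple Argo CD instances, using ApplicationSets with a cluster generator to stamp out Applications per cluster.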
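
As an illustration of the cluster generator, the sketch below creates one Application per registered cluster that carries a matching label. The env: production label, repository URL, and namespace are assumptions.

```yaml
# Hypothetical ApplicationSet using the cluster generator.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: myapp
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            env: production          # assumed label on the cluster Secret
  template:
    metadata:
      name: 'myapp-{{name}}'         # {{name}} is the registered cluster name
    spec:
      project: default
      source:
        repoURL: https://github.com/org/config-repo   # assumed URL
        targetRevision: main
        path: k8s/overlays/production
      destination:
        server: '{{server}}'         # API server URL of the generated cluster
        namespace: myapp             # assumed namespace
```

Adding a cluster then becomes a matter of registering it with the right label, with no per-cluster Application manifests to maintain.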