CI/CD & GitOps Delivery
The delivery pipeline is where platform investment becomes business velocity. At Google, a single engineer's commit can reach production in under an hour — automatically tested, scanned, built into an immutable artifact, promoted through environments, and progressively rolled out to a fraction of traffic before full release. That outcome is not accidental. It is the result of deliberate pipeline architecture, a GitOps control loop, and a progressive delivery strategy that lets teams ship confidently at 100+ deploys per day without burning out an on-call rotation.
This lesson traces the complete path from a developer's git push to a production canary, covering the engineering decisions that separate a real big-tech delivery system from a toy CI script.
Phase 1 — Pull Request and CI Gates
The pipeline starts before a line of code merges. The CI system must enforce a hard quality gate on the PR itself. At scale, this means running everything in parallel against an ephemeral, isolated environment rather than sharing a single long-lived staging cluster that becomes a coordination bottleneck.
A production-grade CI job does four things in strict order: lint and static analysis, unit and integration tests with real service dependencies (Testcontainers or in-cluster ephemeral namespaces), security scanning (SAST via Semgrep, SCA via Trivy or Grype for dependency CVEs), and finally a container image build. Only if all four gates pass does the CI mark the PR as mergeable. Branch protection rules enforce this — no bypass, no override, not even for staff engineers.
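The ordered, fail-fast gating described above can be sketched in a few lines. This is a minimal illustration, not a real CI configuration: the Gate type, the evaluate_gates function, and the lambda checks are hypothetical stand-ins for actual lint, test, scan, and build steps.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Gate:
    name: str
    check: Callable[[], bool]  # returns True when the gate passes


def evaluate_gates(gates: List[Gate]) -> Tuple[bool, List[str]]:
    """Run gates in order and stop at the first failure.

    Returns (mergeable, names_of_gates_that_ran).
    """
    ran: List[str] = []
    for gate in gates:
        ran.append(gate.name)
        if not gate.check():
            return False, ran  # later gates never run
    return True, ran


# Hypothetical pipeline: the security scan reports a finding.
pipeline = [
    Gate("lint", lambda: True),
    Gate("tests", lambda: True),
    Gate("security-scan", lambda: False),
    Gate("image-build", lambda: True),
]
mergeable, ran = evaluate_gates(pipeline)
print(mergeable, ran)  # False ['lint', 'tests', 'security-scan']
```

Because the image build is last, no artifact is produced for a commit that failed an earlier gate.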
Tag every image with the Git commit SHA: never latest, never a branch name. The SHA tag is deterministic and immutable: the same tag will never point to a different binary. This is a prerequisite for trustworthy GitOps promotion and meaningful rollback. Use Cosign keyless signing (OIDC-based, via GitHub Actions OIDC) so that any system can later verify which pipeline produced the image, without managing long-lived signing keys.
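One way to enforce the SHA-only tagging rule is to build the image reference through a helper that refuses anything that is not a full commit SHA. The sketch below assumes 40-character hexadecimal Git SHAs; the image_ref function, the registry host, and the service name are hypothetical.

```python
import re

# Assumption: tags are full 40-character hexadecimal Git commit SHAs.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def image_ref(registry: str, service: str, commit_sha: str) -> str:
    """Build an image reference tagged with a full Git commit SHA.

    Rejects mutable tags such as "latest" or a branch name.
    """
    sha = commit_sha.strip().lower()
    if not _FULL_SHA.fullmatch(sha):
        raise ValueError(f"tag must be a full commit SHA, got {commit_sha!r}")
    return f"{registry}/{service}:{sha}"


# Hypothetical registry and service names.
print(image_ref("registry.example.com", "checkout", "a" * 40))
```

Calling image_ref with "latest" or "main" raises ValueError, so a misconfigured job fails before it pushes a mutable tag.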
Phase 2 — GitOps Promotion Through Environments
Once the artifact is built and pushed, a CI job — not a human — opens a pull request against the GitOps config repository. This is the handoff point between the application team's world (code) and the platform's world (desired state). The config repo contains Kustomize overlays or Helm value files for each environment: dev/, staging/, production/. The PR bumps the image tag in the relevant overlay. Argo CD (or Flux) detects the merge and syncs the cluster.
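The tag bump itself is a small structured edit. The sketch below operates on an already-parsed kustomization document (a plain dict) and relies on Kustomize's images list with its name and newTag fields. The bump_image_tag function and the sample overlay are hypothetical, and reading or writing the YAML file is left out to keep the example standard-library only.

```python
import copy
from typing import Any, Dict


def bump_image_tag(kustomization: Dict[str, Any], image_name: str, new_tag: str) -> Dict[str, Any]:
    """Return a copy of the kustomization with one image's newTag replaced.

    Targets exactly one entry in the `images` list by name, so unrelated
    fields and other services' entries are never touched.
    """
    updated = copy.deepcopy(kustomization)
    matches = [e for e in updated.get("images", []) if e.get("name") == image_name]
    if len(matches) != 1:
        raise KeyError(f"expected exactly one images entry for {image_name!r}, found {len(matches)}")
    matches[0]["newTag"] = new_tag
    return updated


# Hypothetical staging overlay, as it would look after parsing the YAML.
overlay = {
    "resources": ["../../base"],
    "images": [
        {"name": "registry.example.com/checkout", "newTag": "1" * 40},
        {"name": "registry.example.com/cart", "newTag": "2" * 40},
    ],
}
bumped = bump_image_tag(overlay, "registry.example.com/checkout", "3" * 40)
print(bumped["images"][0]["newTag"])
```

Editing the parsed structure rather than the raw text keeps the resulting PR diff limited to the single field that changed.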
The promotion model between environments is explicit and auditable. Dev auto-syncs on every merge. Staging promotion is triggered either automatically after dev smoke tests pass (for low-risk services) or via a manual approval step in the CI pipeline (for services with strict SLOs). Production promotion is always gated: an engineer approves the GitOps PR, Argo CD syncs, and an Argo Rollouts progressive delivery strategy takes over from there.
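The promotion rules above fit in a single decision function. This is a sketch under stated assumptions: the Decision enum, the Service type, and the strict_slo flag are hypothetical names, and the BLOCKED outcome for a low-risk service whose dev smoke tests failed is an assumption the text implies but does not spell out.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO_SYNC = "auto-sync"
    AUTO_PROMOTE = "auto-promote"
    MANUAL_APPROVAL = "manual-approval"
    BLOCKED = "blocked"  # assumption: failed smoke tests stop auto-promotion


@dataclass(frozen=True)
class Service:
    name: str
    strict_slo: bool  # hypothetical risk flag


def promotion_decision(env: str, service: Service, dev_smoke_passed: bool) -> Decision:
    """Map an environment and service risk profile to a promotion decision."""
    if env == "dev":
        return Decision.AUTO_SYNC  # dev syncs on every merge
    if env == "staging":
        if service.strict_slo:
            return Decision.MANUAL_APPROVAL
        return Decision.AUTO_PROMOTE if dev_smoke_passed else Decision.BLOCKED
    if env == "production":
        return Decision.MANUAL_APPROVAL  # production is always gated
    raise ValueError(f"unknown environment: {env!r}")


search = Service("search", strict_slo=False)
print(promotion_decision("staging", search, dev_smoke_passed=True).value)  # auto-promote
```

Keeping the policy in one function makes it auditable in the same way as the config repository: a change to the promotion rules is itself a reviewed diff.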
Phase 3 — Progressive Delivery with Argo Rollouts
A production deployment that flips 100% of traffic instantly is not a deployment strategy — it is a bet. Progressive delivery shifts the risk model: you expose the new version to a small percentage of real production traffic, verify key SLOs (error rate, p99 latency, business metrics via analysis templates), and either proceed or roll back automatically. No human needs to be awake at 03:00 watching dashboards.
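The control flow of a canary can be sketched as a loop over traffic weights that stops at the first failed analysis. Argo Rollouts expresses this declaratively in its Rollout resource; the Python below only illustrates the logic. The run_canary function, the weight schedule, and the stand-in analysis callback are all hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def run_canary(weights: Iterable[int], analysis_passes: Callable[[int], bool]) -> Tuple[str, List[int]]:
    """Walk a canary through traffic weights, aborting on the first failed analysis.

    Returns the outcome and the list of weights that were actually reached.
    """
    reached: List[int] = []
    for weight in weights:
        reached.append(weight)  # traffic is shifted to `weight` percent here
        if not analysis_passes(weight):
            return "rolled-back", reached  # automatic rollback, no human involved
    return "promoted", reached


# Hypothetical schedule; the stand-in analysis starts failing at 50% traffic.
print(run_canary([5, 25, 50, 100], lambda w: w < 50))  # ('rolled-back', [5, 25, 50])
```

In this run, a regression that only shows up under heavier load is caught at the 50% step, so the remaining half of production traffic never sees the new version.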
Production Failure Modes You Will Actually Hit
At scale, these failure modes recur enough to be worth designing against from day one:
- Config-repo PR merge race. Two services promoted simultaneously both modify the same overlay file. Git conflict on automerge blocks both deployments. Mitigation: scope each service to its own Kustomize path; use yq to target a precise field, not line-based sed.
- Argo CD sync storm. A change to the base Kustomize directory triggers a sync of all overlays simultaneously across 50 services. All 50 deployments start rolling at once, saturating cluster pod scheduling capacity. Mitigation: use syncPolicy.automated.prune: false initially; gate bulk base changes behind a staged rollout manifest.
- Image pull latency blocking rollout. A 1.2 GB Java fat-jar image takes 4 minutes to pull on a cold node, causing rollout pods to time out the readiness probe and triggering a rollback even though the application is healthy. Mitigation: enforce a <200 MB image size limit in CI (Trivy's image scan reports uncompressed size); use multi-stage builds aggressively; enable containerd image streaming (Stargz/eStargz) on your node class for large images.
- Analysis template false positive on p50/p95 metrics. Traffic is so low in canary (5% of 40 replicas = 2 pods) that statistical noise causes a 99.5% success-rate threshold to fail. Mitigation: use a minimum request count guard in your PromQL — do not evaluate the metric until at least 100 requests have been observed in the interval.
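The minimum-request guard from the last item can be sketched outside PromQL to make the logic explicit. The 100-request minimum and the 99.5% threshold come from the text above; the Verdict enum and the evaluate_success_rate function are hypothetical, and treating an under-sampled interval as inconclusive rather than as a pass is an assumption.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    INCONCLUSIVE = "inconclusive"  # assumption: too little data is neither pass nor fail


def evaluate_success_rate(
    total_requests: int,
    failed_requests: int,
    min_requests: int = 100,
    threshold: float = 0.995,
) -> Verdict:
    """Judge a canary interval, refusing to decide on too few requests."""
    if total_requests < 0 or not 0 <= failed_requests <= max(total_requests, 0):
        raise ValueError("request counts are inconsistent")
    if total_requests < min_requests:
        return Verdict.INCONCLUSIVE  # statistical noise dominates, do not evaluate
    success_rate = (total_requests - failed_requests) / total_requests
    return Verdict.PASS if success_rate >= threshold else Verdict.FAIL


# One failure in 40 requests is 97.5% success, below the threshold, but it is
# not enough traffic to justify a rollback.
print(evaluate_success_rate(40, 1).value)     # inconclusive
print(evaluate_success_rate(1000, 3).value)   # pass
print(evaluate_success_rate(1000, 10).value)  # fail
```

Without the guard, the first call would fail the analysis on a single error, which is exactly the false positive described above.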
Scale Considerations and Pipeline Throughput
A team of 10 engineers might run 15 deploys per day. A platform serving 300 service teams will need 500+ deploys per day. The architectural differences are non-trivial. At 500 daily deploys, CI runner capacity becomes a first-order cost and latency concern. GitHub Actions hosted runners have cold-start latency of 30–60 seconds per job — multiply that across parallel jobs at scale and the queue time dominates pipeline duration. Large companies run self-hosted runner fleets (Actions Runner Controller on Kubernetes, or AWS CodeBuild-backed runners) to eliminate cold-start and control runner class (memory-optimized for integration tests, ARM Graviton for build cost savings).
Argo CD at scale requires careful ApplicationSet and App-of-Apps design. A single Argo CD instance can manage ~2,000 Applications before controller memory and API-server watch overhead becomes a problem. Beyond that, shard the application controller by raising its replica count and setting ARGOCD_CONTROLLER_REPLICAS to match, which spreads the managed clusters across controller replicas, or split to multiple Argo CD instances federated via Argo CD ApplicationSets with a cluster generator.
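The idea behind controller sharding, where each replica owns a stable subset of the managed clusters, can be illustrated with a hash-mod assignment. This is a generic sketch of the technique, not Argo CD's actual distribution algorithm. The assign_shards function and the cluster names are hypothetical, and CRC32 is an arbitrary choice of stable hash.

```python
import zlib
from typing import Dict, Iterable, List


def assign_shards(cluster_ids: Iterable[str], replicas: int) -> Dict[int, List[str]]:
    """Assign each cluster to one controller shard by stable hash modulo replicas.

    The same cluster ID always lands on the same shard for a given replica count.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    shards: Dict[int, List[str]] = {i: [] for i in range(replicas)}
    for cluster_id in sorted(cluster_ids):
        shard = zlib.crc32(cluster_id.encode("utf-8")) % replicas
        shards[shard].append(cluster_id)
    return shards


# Hypothetical fleet of six clusters spread over three controller replicas.
fleet = [f"prod-cluster-{n}" for n in range(6)]
for shard, members in assign_shards(fleet, 3).items():
    print(shard, members)
```

A hash-mod scheme gives no guarantee of even load: one shard can end up with the largest clusters, which is one reason to consider splitting into separate instances instead.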