Project: A Policy-as-Code Framework
The previous nine lessons introduced the concepts, tools, and individual patterns of Policy as Code. This capstone lesson pulls them together into a layered guardrail framework — the kind you would design and present at a principal-engineer review before rolling it out across a multi-team organization. By the end you will have a blueprint covering three enforcement planes: the organization (cloud account level), the cluster (Kubernetes admission), and the pipeline (CI shift-left gates).
The core design principle is defense in depth for policy. No single layer is perfect, so every critical control is enforced at least twice, at different points in the delivery lifecycle. A change that sidesteps the pipeline gate (for example, one pushed from a branch that skips CI) is still evaluated at cluster admission, and a resource created outside the cluster is still subject to the organization-level guardrails.
Layer 1 — Organizational Guardrails (Cloud Account Level)
At the organization layer, policies run in the cloud control plane itself. Nothing that fails these policies can be provisioned, regardless of what Terraform requests or what reaches the cluster. In AWS this is Service Control Policies (SCPs) attached at the Organizational Unit (OU) level. In GCP it is Organization Policy constraints. In Azure it is Azure Policy at the Management Group level.
The canonical SCP set every production OU should carry:
- Deny root account usage: condition on aws:PrincipalArn matching *:root, effect Deny across all actions.
- Require MFA for console actions: Deny when aws:MultiFactorAuthPresent is false for human IAM users.
- Region lock: Deny actions whose aws:RequestedRegion is not in your approved list. Prevents shadow workloads in unmonitored regions.
- Deny disabling S3 block-public-access: Deny s3:PutBucketPublicAccessBlock that would turn off the block.
- Require encryption at rest: Deny EBS volume and RDS instance creation when the encrypted flag is false.
SCPs never grant permissions. An Allow * statement in an SCP does nothing on its own: it adds no deny, so the IAM policies attached to the principal still decide what is permitted. Always model SCPs as guardrails, never as grants.
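Two of the SCP statements listed above (deny root usage and the region lock) can be sketched as generated JSON. This is a minimal illustration, not a vetted production policy: the approved regions and the list of exempted global services are assumptions, and the helper names are hypothetical. The condition keys aws:PrincipalArn and aws:RequestedRegion are the ones named in the list.

```python
import json

# Hypothetical approved-region list; replace with your organization's own.
APPROVED_REGIONS = ["eu-west-1", "eu-central-1"]

# Assumption: services with global endpoints that are commonly exempted from a
# region lock. The right list depends on which services your accounts use.
GLOBAL_SERVICE_ACTIONS = ["iam:*", "organizations:*", "route53:*", "sts:*"]


def deny_root_statement():
    """Deny every action when the caller is an account root user."""
    return {
        "Sid": "DenyRootUser",
        "Effect": "Deny",
        "Action": "*",
        "Resource": "*",
        "Condition": {"StringLike": {"aws:PrincipalArn": "arn:aws:iam::*:root"}},
    }


def region_lock_statement(approved_regions):
    """Deny regional actions requested outside the approved regions."""
    return {
        "Sid": "DenyUnapprovedRegions",
        "Effect": "Deny",
        "NotAction": GLOBAL_SERVICE_ACTIONS,  # global services stay reachable
        "Resource": "*",
        "Condition": {
            "StringNotEquals": {"aws:RequestedRegion": sorted(approved_regions)}
        },
    }


def build_scp(statements):
    """Wrap statements in a policy document."""
    return {"Version": "2012-10-17", "Statement": list(statements)}


scp = build_scp([deny_root_statement(), region_lock_statement(APPROVED_REGIONS)])
scp_json = json.dumps(scp, indent=2)
print(scp_json)
```

Every statement carries Effect Deny, which matches the guardrail model: the document only removes permissions and leaves granting to IAM.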
Layer 2 — Cluster Guardrails (Kubernetes Admission)
The cluster admission layer intercepts API server requests before they are persisted to etcd. This is where Gatekeeper (OPA) or Kyverno enforce container-level policy: no privilege escalation, no hostNetwork, required labels, approved image registries, and resource quota mandates. Because the control is in the admission webhook, it applies to every actor — human kubectl apply, Helm, ArgoCD, Flux, and CI bots alike.
A production cluster policy set should enforce at minimum:
- Approved registry: only images from your internal registry or a vetted public mirror are permitted. Prevents pulling from arbitrary Docker Hub repos with no image scanning.
- No privileged containers: securityContext.privileged: true is denied. Privileged containers effectively give root on the node.
- Read-only root filesystem: forces write operations to explicitly declared volumes, making post-exploit persistence harder.
- Required labels: app.kubernetes.io/name, app.kubernetes.io/version, and team must be present. This makes cost attribution and incident scoping possible at scale.
- Resource limits required: every container must specify resources.limits.cpu and resources.limits.memory. Prevents a single runaway pod from starving a node.
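To make the five cluster rules concrete, here is a plain-Python model of the checks applied to a Pod manifest loaded as a dictionary. It is not Kyverno or Gatekeeper policy code, only a sketch of the logic those engines would express declaratively. The registry prefix and the function name are hypothetical, and the sketch looks only at spec.containers, not init or ephemeral containers.

```python
# Hypothetical internal registry prefix; an assumption for illustration only.
APPROVED_REGISTRIES = ("registry.internal.example.com/",)
REQUIRED_LABELS = ("app.kubernetes.io/name", "app.kubernetes.io/version", "team")


def pod_violations(pod):
    """Return human-readable violations for a Pod manifest given as a dict."""
    violations = []

    labels = (pod.get("metadata") or {}).get("labels") or {}
    for key in REQUIRED_LABELS:
        if key not in labels:
            violations.append(f"missing label {key}")

    for container in (pod.get("spec") or {}).get("containers", []):
        name = container.get("name", "<unnamed>")
        image = container.get("image", "")
        if not image.startswith(APPROVED_REGISTRIES):
            violations.append(f"{name}: image {image} is not from an approved registry")

        security = container.get("securityContext") or {}
        if security.get("privileged") is True:
            violations.append(f"{name}: privileged container")
        if security.get("readOnlyRootFilesystem") is not True:
            violations.append(f"{name}: root filesystem is not read-only")

        limits = (container.get("resources") or {}).get("limits") or {}
        for resource in ("cpu", "memory"):
            if resource not in limits:
                violations.append(f"{name}: missing resources.limits.{resource}")

    return violations


bad_pod = {
    "metadata": {"labels": {"team": "payments"}},
    "spec": {
        "containers": [
            {
                "name": "web",
                "image": "docker.io/library/nginx:latest",
                "securityContext": {"privileged": True},
            }
        ]
    },
}

for violation in pod_violations(bad_pod):
    print(violation)
```

An admission controller would reject the request when this list is non-empty in Enforce mode, and only record the findings in Audit mode.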
Layer 3 — Pipeline Guardrails (CI Shift-Left Gates)
The pipeline layer is your cheapest and fastest feedback loop. Policy checks run in seconds during a pull request and block the merge if any control fails — long before an artifact reaches the cluster or the cloud account. This is where Conftest (OPA-powered, reads Rego) and Checkov / tfsec (Terraform static analysis) live.
Three gate categories that belong in every CI pipeline for infrastructure changes:
- Terraform plan policy: run Conftest against the Terraform plan JSON (terraform show -json). Catch public S3 buckets, unencrypted volumes, and overly broad IAM policies before apply.
- Container image scan: run Trivy or Grype against the built image. Fail on CRITICAL CVEs or on the presence of a root user in the image entrypoint.
- Kubernetes manifest lint: run Kyverno CLI or kubeconform against the rendered Helm/Kustomize manifests. The same policies you enforce at admission time should also run in CI against the repo, so developers get feedback locally.
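The Terraform plan gate can be illustrated with a small Python stand-in that walks the plan JSON. In the pipeline described above this check would be written in Rego and run by Conftest; the Python version below is only a sketch of the same idea. It assumes the resource_changes structure produced by terraform show -json and checks two AWS provider attributes (encrypted on aws_ebs_volume, acl on aws_s3_bucket). The function name and sample plan are hypothetical.

```python
PUBLIC_ACLS = ("public-read", "public-read-write")


def plan_violations(plan):
    """Return violations found in a Terraform plan loaded from JSON."""
    violations = []
    for change in plan.get("resource_changes", []):
        after = (change.get("change") or {}).get("after")
        if after is None:
            # "after" is null when the resource is being destroyed.
            continue

        address = change.get("address", "<unknown>")
        resource_type = change.get("type")

        # Conservative: a missing or not-yet-known value is treated as a failure.
        if resource_type == "aws_ebs_volume" and after.get("encrypted") is not True:
            violations.append(f"{address}: EBS volume is not encrypted")

        if resource_type == "aws_s3_bucket" and after.get("acl") in PUBLIC_ACLS:
            violations.append(f"{address}: bucket ACL {after['acl']} is public")

    return violations


# Hypothetical plan fragment, trimmed to the fields the check reads.
sample_plan = {
    "resource_changes": [
        {"address": "aws_ebs_volume.data", "type": "aws_ebs_volume",
         "change": {"actions": ["create"], "after": {"encrypted": False}}},
        {"address": "aws_ebs_volume.logs", "type": "aws_ebs_volume",
         "change": {"actions": ["create"], "after": {"encrypted": True}}},
        {"address": "aws_s3_bucket.assets", "type": "aws_s3_bucket",
         "change": {"actions": ["create"], "after": {"acl": "public-read"}}},
        {"address": "aws_ebs_volume.old", "type": "aws_ebs_volume",
         "change": {"actions": ["delete"], "after": None}},
    ]
}

found = plan_violations(sample_plan)
for violation in found:
    print(violation)
```

In CI, the job would load the real plan with json.load and exit non-zero when the list is non-empty, which is what blocks the merge.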
The Layered Architecture in One View
Rollout Strategy — Audit Before Enforce
The most dangerous mistake when deploying a policy framework is enabling Enforce mode on day one across a live organization: you will almost certainly break production. A safer rollout follows a three-phase pattern that is common in large-scale deployments:
- Audit mode everywhere (week 1–2): deploy all Kyverno policies as validationFailureAction: Audit and all SCPs as Deny but targeting only a sandbox OU. Collect violation logs from all three layers without blocking anything.
- Fix and enforce in staging (week 2–4): analyze the audit findings, open remediation PRs for every violation, and switch staging to Enforce. Let one full sprint of normal development run to confirm zero false positives.
- Production rollout by OU (week 4+): enable Enforce one Organizational Unit at a time, starting with the least critical services. Monitor violation dashboards. Keep a break-glass process documented: which SCPs can be suspended, by whom, and for how long.
Keep policies in a dedicated policy/ Git repository with its own CI pipeline and versioned releases. Reference policies in your cluster and CI by their semver tag, not by main. This makes it possible to roll back a policy change in minutes when it causes a false-positive block in production, the same way you roll back a broken container image.
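The advice to reference policies by semver tag rather than by main can be checked mechanically. The sketch below assumes you can collect the policy references used by each consumer into a name-to-ref mapping. That mapping, the consumer names, and the function name are hypothetical, and the pattern accepts only plain MAJOR.MINOR.PATCH tags with an optional leading v.

```python
import re

# Plain semver tag such as v1.4.2 or 2.0.0; pre-release suffixes are rejected
# here on purpose (an assumption, loosen it if you release candidates).
SEMVER_TAG = re.compile(r"v?\d+\.\d+\.\d+")


def unpinned_refs(refs):
    """Return the consumer names whose policy ref is not a semver tag."""
    return sorted(
        name for name, ref in refs.items() if not SEMVER_TAG.fullmatch(ref)
    )


# Hypothetical consumers of the policy repository and the refs they use.
policy_refs = {
    "cluster-policies": "v1.4.2",
    "ci-policies": "main",
    "terraform-policies": "2.0.0",
}

print(unpinned_refs(policy_refs))
```

Running a check like this in the policy repository's own CI keeps a floating reference from slipping in unnoticed.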
Observability: Making the Policy Framework Visible
A policy framework that nobody can observe is a policy framework that will be turned off the first time it causes friction. Every layer must emit structured signals that feed your existing observability stack:
- Kyverno: emits Kubernetes Events and PolicyReport / ClusterPolicyReport custom resources. Scrape these with the policy-reporter exporter and visualize in Grafana. Build a dashboard showing violation count by policy, by namespace, and by team over time.
- AWS Config Rules: non-compliant resource counts feed into CloudWatch metrics. Set an alarm on any compliance score below 100% for production accounts.
- CI gates: export gate pass/fail counts to your build analytics system. Track the mean-time-to-fix a policy violation just like you track mean-time-to-recovery for incidents.
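The two metrics mentioned above, violation counts by dimension and mean-time-to-fix, are simple aggregations once violations are collected in one place. The sketch below assumes a hypothetical record shape (policy, namespace, team, detected_at, fixed_at). Real records would come from PolicyReport resources, AWS Config, or CI logs and would need mapping into this form.

```python
from collections import Counter
from datetime import datetime, timedelta


def violations_by(records, key):
    """Count violations grouped by one field, e.g. 'policy' or 'team'."""
    return Counter(record[key] for record in records)


def mean_time_to_fix(records):
    """Average detected-to-fixed duration over resolved violations, or None."""
    durations = [
        record["fixed_at"] - record["detected_at"]
        for record in records
        if record.get("fixed_at") is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


# Hypothetical violation records for illustration.
records = [
    {"policy": "require-limits", "namespace": "shop", "team": "payments",
     "detected_at": datetime(2024, 5, 1, 9, 0), "fixed_at": datetime(2024, 5, 1, 11, 0)},
    {"policy": "require-limits", "namespace": "search", "team": "discovery",
     "detected_at": datetime(2024, 5, 2, 9, 0), "fixed_at": datetime(2024, 5, 2, 13, 0)},
    {"policy": "approved-registry", "namespace": "shop", "team": "payments",
     "detected_at": datetime(2024, 5, 3, 9, 0), "fixed_at": None},
]

print(violations_by(records, "policy"))
print(mean_time_to_fix(records))
```

Open violations are excluded from the mean here, so a dashboard should show the count of unresolved violations alongside it.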
Record every policy exception in version control (for example, an exceptions/ directory with a YAML file per exception), require two approvers, set an expiry date, and auto-alert when the expiry is approaching. Treat an exception as technical debt that must be repaid.
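The exception rules (two approvers, an expiry date, an alert before expiry) lend themselves to an automated check. The sketch below works on exception records already parsed into dictionaries. The field names, the 14-day warning window, and the function name are assumptions, and loading the YAML files themselves is left out.

```python
from datetime import date, timedelta


def exception_status(exception, today, warn_days=14):
    """Classify one policy exception record against the governance rules."""
    approvers = set(exception.get("approvers", []))
    if len(approvers) < 2:
        return "invalid: needs two approvers"

    expires = exception["expires"]
    if expires < today:
        return "expired"
    if expires - today <= timedelta(days=warn_days):
        return "expiring soon"
    return "active"


# Hypothetical exception records, as they might look after parsing.
exceptions = [
    {"id": "EX-1", "policy": "approved-registry",
     "approvers": ["alice", "bob"], "expires": date(2024, 6, 10)},
    {"id": "EX-2", "policy": "require-limits",
     "approvers": ["alice"], "expires": date(2024, 12, 31)},
    {"id": "EX-3", "policy": "read-only-rootfs",
     "approvers": ["carol", "dave"], "expires": date(2024, 5, 1)},
]

today = date(2024, 6, 1)
for exception in exceptions:
    print(exception["id"], exception_status(exception, today))
```

A scheduled job could run this daily and notify the owning team for anything that is not active.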
Bringing It Together
You now have the complete blueprint: SCPs lock down the cloud control plane so no misconfigured resource can even exist; Kyverno policies at admission time ensure every workload that runs on your clusters meets your security and operational standards; and CI gates give developers fast feedback before their changes leave the pull-request stage. Policy violations in any layer are observable, attributed to a team, and feed back into a remediation workflow.
This is the same architecture used by security-mature engineering organizations running compliance-regulated workloads at scale. The tools change over time — Gatekeeper may give way to Kyverno, Checkov to a vendor alternative — but the three-layer pattern and the audit-before-enforce rollout discipline are durable engineering practices that will serve you through every toolchain evolution.