Compliance & Policy as Code

Project: A Policy-as-Code Framework


The previous nine lessons introduced the concepts, tools, and individual patterns of Policy as Code. This capstone lesson pulls them together into a layered guardrail framework — the kind you would design and present at a principal-engineer review before rolling it out across a multi-team organization. By the end you will have a blueprint covering three enforcement planes: the organization (cloud account level), the cluster (Kubernetes admission), and the pipeline (CI shift-left gates).

The core design principle is defense in depth for policy. No single layer is perfect, so every critical control is enforced at least twice, at different points in the delivery lifecycle. A change that skips the pipeline gate, for example by being applied from a workstation instead of through CI, still has to pass cluster admission. A workload that slips past admission still runs into the org-level guardrails as soon as it asks the cloud control plane for something forbidden.

Layer 1 — Organizational Guardrails (Cloud Account Level)

At the organization layer, policies run in the cloud control plane itself. Nothing that fails these policies can be provisioned, regardless of what Terraform requests or what reaches the cluster. In AWS this is Service Control Policies (SCPs) attached at the Organizational Unit (OU) level. In GCP it is Organization Policy constraints. In Azure it is Azure Policy at the Management Group level.

The canonical SCP set every production OU should carry:

Deny root account usage — condition on aws:PrincipalArn matching *:root, effect Deny across all actions.
Require MFA for console actions — Deny when aws:MultiFactorAuthPresent is false for human IAM users.
Region lock — Deny actions whose aws:RequestedRegion is not in your approved list. Prevents shadow workloads in unmonitored regions.
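Deny disabling S3 block-public-access — Deny s3:PutAccountPublicAccessBlock and s3:PutBucketPublicAccessBlock for everyone except a designated admin role. An SCP cannot inspect the settings in the request body, so it has to block the change itself rather than only the calls that turn the block off.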
Require encryption at rest — Deny EBS volume and RDS instance creation when the encrypted flag is false.

# scp-deny-unencrypted-storage.json
# Attach at the Production OU via AWS Organizations → Policies
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUnencryptedEBS",
      "Effect": "Deny",
      "Action": ["ec2:CreateVolume", "ec2:RunInstances"],
      "Resource": "*",
      "Condition": {
        "Bool": { "ec2:Encrypted": "false" }
      }
    },
    {
      "Sid": "DenyUnencryptedRDS",
      "Effect": "Deny",
      "Action": "rds:CreateDBInstance",
      "Resource": "*",
      "Condition": {
        "Bool": { "rds:StorageEncrypted": "false" }
      }
    },
    {
      "Sid": "DenyNonApprovedRegions",
      "Effect": "Deny",
      "NotAction": [
        "iam:*", "sts:*", "route53:*",
        "cloudfront:*", "waf:*", "support:*"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": ["us-east-1", "us-west-2", "eu-west-1"]
        }
      }
    }
  ]
}
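
The NotAction and StringNotEquals combination in the region-lock statement is easy to misread, so here is a minimal Python sketch of how that one statement decides. It is a simplified model for illustration, not the AWS evaluation engine; the function name region_lock_denies is hypothetical, and the action and region lists are copied from the JSON above.

```python
from fnmatch import fnmatchcase

# Values copied from the DenyNonApprovedRegions statement.
EXEMPT_ACTIONS = ["iam:*", "sts:*", "route53:*", "cloudfront:*", "waf:*", "support:*"]
APPROVED_REGIONS = {"us-east-1", "us-west-2", "eu-west-1"}


def region_lock_denies(action: str, requested_region: str) -> bool:
    """Return True if the region-lock statement would deny this call.

    NotAction: the statement applies to every action EXCEPT the listed ones.
    StringNotEquals: the condition matches when the requested region is
    outside the approved set.
    """
    # IAM action names are case-insensitive, so compare in lower case.
    exempt = any(fnmatchcase(action.lower(), pat) for pat in EXEMPT_ACTIONS)
    outside = requested_region not in APPROVED_REGIONS
    return (not exempt) and outside


print(region_lock_denies("ec2:RunInstances", "ap-south-1"))  # True
print(region_lock_denies("ec2:RunInstances", "eu-west-1"))   # False
print(region_lock_denies("iam:CreateRole", "ap-south-1"))    # False
```

The exempt list exists because global services would otherwise be caught by the region condition; extend it with care, since every entry is a hole in the lock.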

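SCPs never grant permissions. They set the ceiling on what IAM policies in member accounts can grant: an action succeeds only if the SCPs in effect allow it (by default through the AWS-managed FullAWSAccess policy), an IAM policy also allows it, and no explicit Deny matches at either level. An SCP that says Allow * therefore does nothing on its own; it simply leaves the decision to IAM. SCPs also do not apply to the organization's management account, so keep workloads out of it. Always model SCPs as guardrails, never as grants.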
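
The relationship between SCPs and IAM policies can be sketched as a small decision function. This is a deliberately simplified model (actions only, no resources or conditions), and the name is_permitted and its parameters are hypothetical, chosen for illustration.

```python
from fnmatch import fnmatchcase
from typing import Iterable


def _matches(action: str, patterns: Iterable[str]) -> bool:
    """Case-insensitive wildcard match of an action against policy patterns."""
    return any(fnmatchcase(action.lower(), p.lower()) for p in patterns)


def is_permitted(
    action: str,
    scp_allow: Iterable[str],
    scp_deny: Iterable[str],
    iam_allow: Iterable[str],
    iam_deny: Iterable[str] = (),
) -> bool:
    """Simplified evaluation: an explicit deny at either level wins;
    otherwise the action must be allowed by BOTH the SCP and IAM."""
    if _matches(action, scp_deny) or _matches(action, iam_deny):
        return False
    return _matches(action, scp_allow) and _matches(action, iam_allow)


# An SCP "Allow *" with no IAM allow grants nothing.
print(is_permitted("s3:GetObject", ["*"], [], []))                 # False
# The same SCP plus an IAM allow lets the call through.
print(is_permitted("s3:GetObject", ["*"], [], ["s3:*"]))           # True
# An SCP deny overrides even an IAM administrator policy.
print(is_permitted("ec2:CreateVolume", ["*"], ["ec2:CreateVolume"], ["*"]))  # False
```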

Layer 2 — Cluster Guardrails (Kubernetes Admission)

The cluster admission layer intercepts API server requests before they are persisted to etcd. This is where Gatekeeper (OPA) or Kyverno enforce container-level policy: no privilege escalation, no hostNetwork, required labels, approved image registries, and resource quota mandates. Because the control is in the admission webhook, it applies to every actor — human kubectl apply, Helm, ArgoCD, Flux, and CI bots alike.

A production cluster policy set should enforce at minimum:

Approved registry — only images from your internal registry or a vetted public mirror are permitted. Prevents pulling from arbitrary Docker Hub repos with no image scanning.
No privileged containers — securityContext.privileged: true is denied. Privileged containers effectively give root on the node.
Read-only root filesystem — forces write operations to explicitly declared volumes, making post-exploit persistence harder.
Required labels — app.kubernetes.io/name, app.kubernetes.io/version, and team must be present. This makes cost attribution and incident scoping possible at scale.
Resource limits required — every container must specify resources.limits.cpu and resources.limits.memory. Prevents a single runaway pod from starving a node.
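
To make the baseline concrete, the following Python sketch applies four of the checks above (labels, privileged, read-only root filesystem, resource limits) to a Pod manifest that has already been parsed into a dict. It only illustrates the rules; in the cluster they would be written as Kyverno or Gatekeeper policies. The function name baseline_violations and the message strings are assumptions made for this example.

```python
REQUIRED_LABELS = ("app.kubernetes.io/name", "app.kubernetes.io/version", "team")


def baseline_violations(pod: dict) -> list[str]:
    """Return human-readable violations for a parsed Pod manifest."""
    problems = []

    labels = (pod.get("metadata") or {}).get("labels") or {}
    for key in REQUIRED_LABELS:
        if not labels.get(key):
            problems.append(f"missing label {key}")

    spec = pod.get("spec") or {}
    # Init containers run on the node too, so they get the same checks.
    containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
    for c in containers:
        name = c.get("name", "<unnamed>")
        sc = c.get("securityContext") or {}
        if sc.get("privileged") is True:
            problems.append(f"{name}: privileged container")
        if sc.get("readOnlyRootFilesystem") is not True:
            problems.append(f"{name}: root filesystem is writable")
        limits = (c.get("resources") or {}).get("limits") or {}
        for res in ("cpu", "memory"):
            if res not in limits:
                problems.append(f"{name}: no {res} limit")
    return problems


pod = {
    "metadata": {"labels": {"team": "payments"}},
    "spec": {"containers": [{"name": "api", "securityContext": {"privileged": True}}]},
}
for line in baseline_violations(pod):
    print(line)
```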

# kyverno-require-labels.yaml
# ClusterPolicy that blocks workloads missing mandatory labels
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-labels
  annotations:
    policies.kyverno.io/title: Require Team Labels
    policies.kyverno.io/severity: medium
    policies.kyverno.io/description: >
      Deployments, StatefulSets and DaemonSets must carry
      the 'team' label for cost attribution and on-call routing.
spec:
  validationFailureAction: Enforce   # Audit mode first in staging; Enforce in prod
  background: true
  rules:
    - name: check-team-label
      match:
        any:
          - resources:
              kinds: [Deployment, StatefulSet, DaemonSet]
      validate:
        message: >
          Resource must have label 'team' set to a known value.
          Add: metadata.labels.team: <your-team-slug>
        pattern:
          metadata:
            labels:
              team: "?*"   # non-empty string

---
# kyverno-approved-registry.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registry
spec:
  validationFailureAction: Enforce
  background: false   # Admission only — existing pods are grandfathered
  rules:
    - name: validate-registry
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "Images must be sourced from registry.example.internal or public.ecr.aws"
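        pattern:
          spec:
            # =(...) anchors: validate these lists only when they are present,
            # so init and ephemeral containers cannot bypass the registry check
            =(ephemeralContainers):
              - image: "registry.example.internal/* | public.ecr.aws/*"
            =(initContainers):
              - image: "registry.example.internal/* | public.ecr.aws/*"
            containers:
              - image: "registry.example.internal/* | public.ecr.aws/*"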
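
The image pattern uses two Kyverno features: * as a wildcard and | as a logical OR between alternatives. A rough Python equivalent of that matching is sketched below, assuming the same two registries as the policy. The helper names are hypothetical, and the sketch walks initContainers and ephemeralContainers as well, since a registry rule that looks only at containers leaves those as a way around it.

```python
from fnmatch import fnmatchcase

ALLOWED = "registry.example.internal/* | public.ecr.aws/*"


def image_allowed(image: str, pattern: str = ALLOWED) -> bool:
    """True if the image matches any '|'-separated wildcard alternative."""
    alternatives = [p.strip() for p in pattern.split("|")]
    return any(fnmatchcase(image, alt) for alt in alternatives)


def disallowed_images(pod_spec: dict) -> list[str]:
    """Collect images from every container list that fail the registry rule."""
    images = []
    for field in ("containers", "initContainers", "ephemeralContainers"):
        for c in pod_spec.get(field) or []:
            images.append(c.get("image", ""))
    return [i for i in images if not image_allowed(i)]


spec = {
    "containers": [{"image": "registry.example.internal/payments/api:1.4.2"}],
    "initContainers": [{"image": "docker.io/library/busybox:1.36"}],
}
print(disallowed_images(spec))  # ['docker.io/library/busybox:1.36']
```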

Layer 3 — Pipeline Guardrails (CI Shift-Left Gates)

The pipeline layer is your cheapest and fastest feedback loop. Policy checks run in seconds during a pull request and block the merge if any control fails — long before an artifact reaches the cluster or the cloud account. This is where Conftest (OPA-powered, reads Rego) and Checkov / tfsec (Terraform static analysis) live.

Three gate categories that belong in every CI pipeline for infrastructure changes:

Terraform plan policy — run Conftest against the Terraform plan JSON (terraform show -json). Catch public S3 buckets, unencrypted volumes, overly-broad IAM policies before apply.
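Container image scan — run Trivy or Grype against the built image. Fail on CRITICAL CVEs, and add a misconfiguration check (for example Trivy's Dockerfile scanning) that fails images which do not set a non-root USER.
Kubernetes manifest lint — run kubeconform for schema validation and the Kyverno CLI for policy checks against the rendered Helm/Kustomize manifests. The same policies you enforce at admission time should also run in CI against the repo, so developers get feedback in the pull request (or locally, by running the same CLI) instead of at deploy time.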

# .github/workflows/policy-gates.yml
name: Policy-as-Code Gates

on:
  pull_request:
    paths:
      - 'infra/**'
      - 'k8s/**'
      - 'Dockerfile'

jobs:
  terraform-policy:
    name: Terraform Plan + Conftest
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/TerraformPlanReadOnly
          aws-region: us-east-1

      - uses: hashicorp/setup-terraform@v3

      - name: Terraform Init & Plan
        run: |
          terraform init -input=false
          terraform plan -out=tfplan.binary -input=false
          terraform show -json tfplan.binary > tfplan.json
        working-directory: ./infra

      - name: Conftest — Terraform Policy
        run: |
          docker run --rm \
            -v "$PWD/infra:/project" \
            -v "$PWD/policy:/policy" \
            openpolicyagent/conftest:latest \
            test /project/tfplan.json \
            --policy /policy/terraform \
            --namespace terraform
        # Exit non-zero = policy violation = PR blocked.
        # In real use, pin the conftest image to a release tag instead of latest.

  image-scan:
    name: Container Image Scan (Trivy)
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build image
        run: docker build -t app:${{ github.sha }} .

      - name: Trivy — CRITICAL CVEs
        # In real use, pin to a release tag or commit SHA instead of a moving branch
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: app:${{ github.sha }}
          severity: CRITICAL
          exit-code: '1'       # Fail build on any CRITICAL
          ignore-unfixed: true

  k8s-manifest-policy:
    name: Kyverno CLI — Manifest Policy
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Render Helm manifests
        run: |
          helm template my-app ./charts/my-app \
            --values ./charts/my-app/values-prod.yaml \
            > /tmp/rendered.yaml

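      # The Kyverno CLI is not preinstalled on GitHub-hosted runners
      - name: Install Kyverno CLI
        uses: kyverno/action-install-cli@v0.2.0

      - name: Kyverno CLI apply
        run: |
          kyverno apply ./policy/kyverno/ \
            --resource /tmp/rendered.yaml \
            --detailed-results
        # A non-zero exit from kyverno apply fails the job and blocks the PR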
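
The Conftest step evaluates Rego policies against tfplan.json. To show what such a policy inspects, here is the same kind of check sketched in Python against the plan's resource_changes list. The function name plan_violations is hypothetical, and the sketch covers only two resource types. One caveat is that a value Terraform cannot know until apply is absent from the after object, so this sketch treats it as a violation.

```python
def plan_violations(plan: dict) -> list[str]:
    """Flag unencrypted EBS volumes and RDS instances in a Terraform plan.

    `plan` is the parsed output of `terraform show -json tfplan.binary`.
    """
    problems = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change") or {}
        actions = change.get("actions") or []
        # Only resources being created or updated matter for this gate.
        if "create" not in actions and "update" not in actions:
            continue
        after = change.get("after") or {}
        rtype, addr = rc.get("type"), rc.get("address")
        if rtype == "aws_ebs_volume" and after.get("encrypted") is not True:
            problems.append(f"{addr}: EBS volume is not encrypted")
        if rtype == "aws_db_instance" and after.get("storage_encrypted") is not True:
            problems.append(f"{addr}: RDS storage is not encrypted")
    return problems


sample_plan = {
    "resource_changes": [
        {"address": "aws_ebs_volume.data", "type": "aws_ebs_volume",
         "change": {"actions": ["create"], "after": {"encrypted": False}}},
        {"address": "aws_db_instance.main", "type": "aws_db_instance",
         "change": {"actions": ["create"], "after": {"storage_encrypted": True}}},
        {"address": "aws_ebs_volume.old", "type": "aws_ebs_volume",
         "change": {"actions": ["delete"], "after": None}},
    ]
}
print(plan_violations(sample_plan))
# ['aws_ebs_volume.data: EBS volume is not encrypted']
```

In a real pipeline the dict would come from json.load on the tfplan.json file, and a non-empty result would translate into a non-zero exit code.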

The Layered Architecture in One View

Three-layer guardrail framework: org SCPs block bad cloud resources, cluster admission blocks bad workloads, and pipeline gates block bad code before it ships.

Rollout Strategy — Audit Before Enforce

The most dangerous mistake when deploying a policy framework is enabling Enforce mode on day one across a live organization. Existing workloads that violate a policy nobody has measured yet will be blocked at their next deploy or reschedule, and production breaks. A safer rollout follows three phases:

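Audit mode everywhere (week 1–2): deploy all Kyverno policies as validationFailureAction: Audit and run the CI gates as non-blocking checks. SCPs have no audit mode, so attach the Deny statements to a sandbox OU only and use CloudTrail in the other accounts to find calls the SCPs would have denied. Collect violation logs from all three layers without blocking any production path.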
Fix and enforce in staging (week 2–4): analyze the audit findings, open remediation PRs for every violation, and switch staging to Enforce. Let one full sprint of normal development run to confirm zero false positives.
Production rollout by OU (week 4+): enable Enforce one Organizational Unit at a time, starting with the least critical services. Monitor violation dashboards. Keep a break-glass process documented — which SCPs can be suspended, by whom, and for how long.
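
The move from Audit to Enforce can be turned into an explicit, testable rule rather than a judgment call. The sketch below is one possible rule, assuming you record the dates on which a policy logged audit violations. The function name ready_to_enforce and the 14-day quiet window (roughly one sprint) are illustrative assumptions, not values from any tool.

```python
from datetime import date, timedelta


def ready_to_enforce(violation_days: set[date], today: date, quiet_days: int = 14) -> bool:
    """True if the policy logged no audit violation in the last `quiet_days` days."""
    window_start = today - timedelta(days=quiet_days)
    return not any(window_start < d <= today for d in violation_days)


today = date(2024, 6, 30)
print(ready_to_enforce({date(2024, 6, 1)}, today))   # True: last violation is outside the window
print(ready_to_enforce({date(2024, 6, 20)}, today))  # False: violation 10 days ago
```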

Store all your Rego bundles, Kyverno manifests, and SCP JSON files in a dedicated policy/ Git repository with its own CI pipeline and versioned releases. Reference policies in your cluster and CI by their semver tag, not by main. This makes it possible to roll back a policy change in minutes when it causes a false-positive block in production — the same way you roll back a broken container image.

Observability: Making the Policy Framework Visible

A policy framework that nobody can observe is a policy framework that will be turned off the first time it causes friction. Every layer must emit structured signals that feed your existing observability stack:

Kyverno: emits Kubernetes Events and PolicyReport / ClusterPolicyReport custom resources. Scrape these with the policy-reporter exporter and visualize in Grafana. Build a dashboard showing violation count by policy, by namespace, and by team over time.
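Org layer (SCPs and AWS Config): SCPs produce no compliance report of their own; a denied call appears in CloudTrail as an access-denied error. Pair them with AWS Config rules as the detective control, feed the non-compliant resource counts into your dashboards, and alert on any non-compliant resource in production accounts.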
CI gates: export gate pass/fail counts to your build analytics system. Track the mean-time-to-fix a policy violation just like you track mean-time-to-recovery for incidents.
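
As an illustration of the violation dashboard described above, the sketch below counts failures by policy and by namespace from a list of result entries. The field names (policy, result, resources, namespace) are modelled on the results section of a PolicyReport, but treat the exact shape as an assumption and check it against the reports your cluster actually produces. The function names are hypothetical.

```python
from collections import Counter


def failures_by_policy(results: list[dict]) -> Counter:
    """Count failed results per policy name."""
    return Counter(r.get("policy", "unknown") for r in results if r.get("result") == "fail")


def failures_by_namespace(results: list[dict]) -> Counter:
    """Count failed results per namespace of the affected resources."""
    counts = Counter()
    for r in results:
        if r.get("result") != "fail":
            continue
        for res in r.get("resources") or []:
            counts[res.get("namespace", "cluster-scoped")] += 1
    return counts


results = [
    {"policy": "require-team-labels", "result": "fail",
     "resources": [{"kind": "Deployment", "name": "api", "namespace": "payments"}]},
    {"policy": "require-team-labels", "result": "pass",
     "resources": [{"kind": "Deployment", "name": "web", "namespace": "search"}]},
    {"policy": "restrict-image-registry", "result": "fail",
     "resources": [{"kind": "Pod", "name": "batch-1", "namespace": "payments"}]},
]
print(dict(failures_by_policy(results)))
print(dict(failures_by_namespace(results)))
```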

Policy exceptions need governance too. Every framework eventually has a request to exempt a workload from a policy. Without a formal exception process, the exception list grows without bound and the policy becomes meaningless. Document exceptions in Git (an exceptions/ directory with a YAML file per exception), require two approvers, set an expiry date, and auto-alert when the expiry is approaching. Treat an exception as technical debt that must be repaid.
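
The exception rules above (two approvers, an expiry date, an alert before expiry) are simple enough to check automatically in the policy repository's CI. The sketch below assumes each exception file has already been parsed into a record. The class PolicyExceptionRecord, its fields, and the 14-day warning window are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class PolicyExceptionRecord:
    policy: str                  # policy the workload is exempted from
    workload: str                # e.g. "namespace/name"
    approvers: tuple[str, ...]   # people who signed off
    expires: date                # exception is void after this date


def review(exc: PolicyExceptionRecord, today: date, warn_days: int = 14) -> str:
    """Classify an exception as invalid, expired, expiring-soon, or active."""
    if len(set(exc.approvers)) < 2:
        return "invalid: needs two approvers"
    if exc.expires < today:
        return "expired"
    if exc.expires - today <= timedelta(days=warn_days):
        return "expiring-soon"
    return "active"


exc = PolicyExceptionRecord(
    policy="restrict-image-registry",
    workload="payments/legacy-batch",
    approvers=("alice", "bob"),
    expires=date(2024, 7, 1),
)
print(review(exc, date(2024, 5, 1)))   # active
print(review(exc, date(2024, 6, 20)))  # expiring-soon
print(review(exc, date(2024, 7, 2)))   # expired
```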

Bringing It Together

You now have the complete blueprint: SCPs lock down the cloud control plane so member accounts cannot create the resources you have ruled out; Kyverno policies at admission time ensure every workload admitted to your clusters meets your security and operational standards; and CI gates give developers fast feedback before their changes leave the pull-request stage. Policy violations in any layer are observable, attributed to a team, and feed back into a remediation workflow.

This layered pattern is common in organizations that run compliance-regulated workloads at scale. The tools change over time (Gatekeeper may give way to Kyverno, Checkov to a vendor alternative), but the three-layer pattern and the audit-before-enforce rollout discipline are durable engineering practices that outlast any particular toolchain.