Compliance & Policy as Code

Policy as Code Concepts

Every organization has rules: no container runs as root, every S3 bucket must have encryption, every Kubernetes Deployment must declare resource limits, every IAM role needs a usage justification tag. Traditionally these rules lived in PDFs, wikis, and tribal knowledge — enforced (or not) during quarterly audits. Policy as Code is the practice of expressing those rules as machine-readable, version-controlled, testable code that runs automatically in the same pipeline as your application. The result is compliance that operates at the speed of CI/CD instead of the speed of an audit cycle.

The core shift: Policy as Code moves enforcement from a human checklist after the fact to an automated gate before the fact. Rules are expressed in a formal language, stored in git, reviewed via pull request, and executed by a runtime that can enforce, detect, or remediate — automatically.

The Three Operational Modes: Prevent, Detect, Remediate

Any policy enforcement system operates in one or more of three modes. Understanding the tradeoffs of each is essential before you choose tooling or write a single rule.

Prevent (Admission Control)

The policy engine sits in the critical path of a write operation (a kubectl apply, a terraform apply, a pull request merge, a CloudFormation stack update) and rejects non-compliant resources before they ever exist. Nothing that violates the policy can be created. This is the highest-value mode: it eliminates entire classes of incidents by making the violation impossible.

The cost of prevention is latency and blast radius: every write now waits on the policy engine, and a misconfigured policy that rejects valid resources causes deployment failures and on-call escalations. At Google and Meta, admission webhooks are deployed in dry-run mode first, with metrics on would-have-rejected counts, before being set to enforce. A single overly broad prevent rule can block every deployment in a cluster shared by thousands of teams.
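The dry-run rollout described above can be sketched in a few lines. This is a minimal illustration, not how Gatekeeper or Kyverno implement it: the AdmissionGate class, its counters, and the no_privileged check are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A check returns a violation message, or None when the resource is compliant.
Check = Callable[[dict], Optional[str]]


@dataclass
class AdmissionGate:
    """Hypothetical gate: in dry-run mode it only counts would-be rejections."""
    check: Check
    enforce: bool = False
    would_have_rejected: int = 0   # metric to watch before flipping to enforce
    rejected: int = 0

    def admit(self, resource: dict) -> bool:
        message = self.check(resource)
        if message is None:
            return True
        if self.enforce:
            self.rejected += 1
            return False
        # Dry-run: record the hit, but let the write through.
        self.would_have_rejected += 1
        return True


def no_privileged(pod: dict) -> Optional[str]:
    """Illustrative check: flag any container with privileged set to true."""
    for c in pod.get("spec", {}).get("containers", []):
        if (c.get("securityContext") or {}).get("privileged") is True:
            return f"Container {c.get('name')} must not run as privileged"
    return None


gate = AdmissionGate(check=no_privileged)
bad_pod = {"spec": {"containers": [
    {"name": "app", "securityContext": {"privileged": True}}
]}}
good_pod = {"spec": {"containers": [{"name": "app"}]}}

dry_run_results = [gate.admit(bad_pod), gate.admit(good_pod)]
gate.enforce = True   # graduate the policy once the dry-run count looks sane
enforced_results = [gate.admit(bad_pod), gate.admit(good_pod)]
```

In dry-run mode both writes are admitted and only the counter moves. After the switch to enforce, the same non-compliant Pod is rejected.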

Detect (Continuous Audit)

The policy engine runs against the current state of your environment, periodically or on demand, and reports violations without blocking anything. Resources that already exist and violate policy are surfaced in a dashboard or routed to an alert. Detection is appropriate for legacy environments where you cannot yet prevent violations (too many existing exceptions to enumerate) and for policies that are aspirational rather than hard requirements.

Detection without a remediation path is just expensive reporting. Every finding that sits unresolved for more than a sprint degrades the signal-to-noise ratio of your compliance dashboard until engineers stop looking at it. Detection must always feed into a workflow.

Remediate (Auto-Correction)

When a violation is detected, the system automatically mutates the resource back into compliance — patching a missing label, enabling encryption on a storage bucket, quarantining a non-compliant node. This is the most powerful mode and the most dangerous. Automated mutation in production requires extraordinary confidence in the policy logic and a kill switch: a way to disable remediation when it starts a feedback loop.

A well-known failure mode: a remediation controller that patches a resource triggers a reconciliation loop in the application controller that reverts the patch, which triggers the remediation controller again — a tight CPU-burning loop that floods the Kubernetes API server. Always enforce an exponential backoff and a circuit breaker on any remediation controller.
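One way to combine the two safeguards is a small per-resource guard that the remediation controller consults before every patch. The sketch below is an assumption-laden illustration: RemediationGuard, its field names, and its default limits are hypothetical and not taken from any specific controller framework.

```python
from dataclasses import dataclass


@dataclass
class RemediationGuard:
    """Hypothetical per-resource guard: exponential backoff plus a circuit breaker."""
    base_delay: float = 1.0      # seconds to wait after the first patch
    max_delay: float = 300.0     # upper bound on the backoff
    max_attempts: int = 5        # patches allowed before the breaker trips
    attempts: int = 0
    next_allowed: float = 0.0    # earliest time the next patch may run
    tripped: bool = False        # True means remediation is disabled

    def may_remediate(self, now: float) -> bool:
        return not self.tripped and now >= self.next_allowed

    def record_attempt(self, now: float) -> None:
        """Call after each patch; repeated patches suggest something reverts them."""
        self.attempts += 1
        if self.attempts >= self.max_attempts:
            self.tripped = True  # stop patching and hand off to a human
            return
        delay = min(self.base_delay * 2 ** (self.attempts - 1), self.max_delay)
        self.next_allowed = now + delay

    def record_stable(self) -> None:
        """Call when the resource stays compliant; resets the backoff."""
        self.attempts = 0
        self.next_allowed = 0.0


# Simulate another controller reverting every patch.
guard = RemediationGuard()
now = 0.0
delays = []
while guard.may_remediate(now):
    guard.record_attempt(now)
    if not guard.tripped:
        delays.append(guard.next_allowed - now)
        now = guard.next_allowed
```

With these defaults the simulated controller waits 1, 2, 4, then 8 seconds between patches and stops after the fifth attempt, instead of hammering the API server indefinitely.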

The compliance enforcement loop: prevention gates write operations before resources are created; detection continuously audits live state; remediation corrects violations automatically and feeds back into live state.

Codifying Rules: What a Policy Actually Is

A policy is a boolean function over an input document. The input is the resource being evaluated — a Kubernetes manifest, a Terraform plan, a CloudFormation template, an IAM role definition, a container image layer list. The output is a decision: allow or deny, with a human-readable reason. Everything else — the runtime, the language, the integration point — is infrastructure for executing that function at the right moment.

A well-codified rule has four components:

Scope: what resource types and namespaces does this rule apply to?
Condition: what attribute or combination of attributes triggers a violation?
Action: prevent, warn, or remediate?
Message: a human-readable explanation that tells the engineer exactly what to fix and why.
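The four components map naturally onto a small data structure with one evaluation function. The sketch below is illustrative only: Rule, Decision, evaluate, and the missing_limits condition are hypothetical names, and the example rule is the "every Deployment must declare resource limits" rule from the introduction.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical in-memory form of a codified rule."""
    kinds: FrozenSet[str]                  # scope: resource kinds covered
    namespaces: Optional[FrozenSet[str]]   # scope: None means every namespace
    condition: Callable[[dict], bool]      # True when the resource violates the rule
    action: str                            # "prevent", "warn", or "remediate"
    message: str                           # what to fix and why


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: Optional[str] = None


def evaluate(rule: Rule, resource: dict) -> Decision:
    namespace = resource.get("metadata", {}).get("namespace", "default")
    in_scope = resource.get("kind") in rule.kinds and (
        rule.namespaces is None or namespace in rule.namespaces
    )
    if not in_scope or not rule.condition(resource):
        return Decision(allowed=True)
    # Only "prevent" blocks the write; other actions pass it through with a reason.
    return Decision(allowed=rule.action != "prevent", reason=rule.message)


def missing_limits(deployment: dict) -> bool:
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    return any(
        not c.get("resources", {}).get("limits")
        for c in pod_spec.get("containers", [])
    )


LIMITS_RULE = Rule(
    kinds=frozenset({"Deployment"}),
    namespaces=None,
    condition=missing_limits,
    action="prevent",
    message="Every container in a Deployment must declare resources.limits.",
)

no_limits = {
    "kind": "Deployment",
    "metadata": {"name": "web", "namespace": "shop"},
    "spec": {"template": {"spec": {"containers": [{"name": "web"}]}}},
}
decision = evaluate(LIMITS_RULE, no_limits)
```

Here decision.allowed is False and decision.reason carries the message. A resource outside the rule's scope, such as a ConfigMap, is allowed without the condition ever being consulted.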

Write policies as deny rules, not allow lists. An allow list requires you to enumerate every valid configuration — impossible at scale. A deny rule captures the specific thing that is forbidden. The default posture is permissive; policies selectively tighten. This mirrors how mature firewall rule sets work.

A Concrete Example: Prevent + Detect + Remediate for a Single Rule

Consider the rule: "No Kubernetes Pod may run with privileged: true." Here is how each mode implements it:

# ── PREVENT mode: OPA Rego policy (evaluated by Gatekeeper webhook) ──
# File: policy/privileged-container.rego

package k8scontainerprivileged

violation[{"msg": msg}] {
  c := input.review.object.spec.containers[_]
  c.securityContext.privileged == true
  msg := sprintf("Container %v must not run as privileged", [c.name])
}

violation[{"msg": msg}] {
  c := input.review.object.spec.initContainers[_]
  c.securityContext.privileged == true
  msg := sprintf("Init container %v must not run as privileged", [c.name])
}

# ── DETECT mode: Kyverno ClusterPolicy in audit mode ──
# File: policy/disallow-privileged-audit.yaml

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged-containers
  annotations:
    policies.kyverno.io/description: Detect Pods with privileged containers.
spec:
  validationFailureAction: Audit   # report-only; change to Enforce to prevent
  background: true                 # also scan Pods that already exist
  rules:
    - name: check-privileged
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "Privileged containers are not allowed."
        pattern:
          spec:
            # =(key) is an equality anchor: the check applies only when the key
            # is present, so containers without a securityContext still pass.
            =(initContainers):
              - =(securityContext):
                  =(privileged): "false"
            containers:
              - =(securityContext):
                  =(privileged): "false"

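# ── REMEDIATE mode: Kyverno mutating policy (sets privileged=false if absent) ──
# File: policy/default-nonprivileged-mutate.yaml
# Note: this mutation runs at admission and only fills in a missing key.
# It does not override an explicit privileged: true; the prevent policy
# above is what rejects that case.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: set-nonprivileged-default
spec:
  rules:
    - name: set-privileged-false
      match:
        any:
          - resources:
              kinds: [Pod]
      mutate:
        patchStrategicMerge:
          spec:
            containers:
              - (name): "*"              # conditional anchor: match every container
                securityContext:
                  +(privileged): false   # only set if key is absent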

# WARNING: mutation is not a substitute for education.
# Silently fixing a misconfiguration hides the root cause.
# Use mutation only for defaults you own; use prevention for security boundaries.

The Policy Lifecycle: Shift Left Across the SDLC

The most effective policy implementations operate at multiple stages simultaneously, catching violations as early as possible. The further right a violation is caught (production vs. local IDE), the more expensive it is to fix — by orders of magnitude.

IDE / pre-commit: conftest, kube-linter, tfsec run locally and on every commit. Instant feedback, no infrastructure required. Catches the majority of common mistakes before they ever enter a PR.
CI pipeline: The same tools run as required checks in GitHub Actions, GitLab CI, or Jenkins. The pipeline fails and the PR cannot merge. This is the primary enforcement gate for IaC.
Admission webhook (runtime): Gatekeeper or Kyverno blocks non-compliant kubectl apply calls even if CI was bypassed (manual kubectl by an SRE, a Helm chart installed directly). Defense in depth.
Continuous audit: A reconciliation loop compares the live state of every resource against every policy and emits findings to a SIEM or compliance dashboard. Catches configuration drift, manual changes, and violations that predate the policy.

The "audit forever" anti-pattern: Many organizations deploy policies in Audit mode to avoid disruption, then never graduate them to Enforce. Audit findings accumulate in a dashboard that nobody is accountable for. Treat every audit-mode policy as having a graduation deadline — 30 or 60 days — after which it must either be enforced or explicitly deferred with a documented exception. Policy debt compounds exactly like technical debt.
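A graduation deadline is easy to check mechanically, for example as a scheduled CI job over a policy inventory. The sketch below assumes such an inventory exists; AuditPolicy, its fields, the overdue function, and the sample policy names and dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class AuditPolicy:
    """Hypothetical inventory record for a policy running in Audit mode."""
    name: str
    audit_since: date                        # when the policy entered Audit mode
    grace_days: int = 30                     # 30 or 60 days, per the text above
    exception_until: Optional[date] = None   # documented deferral, if any

    def deadline(self) -> date:
        return self.audit_since + timedelta(days=self.grace_days)


def overdue(policies: List[AuditPolicy], today: date) -> List[str]:
    """Names of audit-mode policies past their deadline with no live exception."""
    result = []
    for policy in policies:
        deferred = (
            policy.exception_until is not None
            and today <= policy.exception_until
        )
        if not deferred and today > policy.deadline():
            result.append(policy.name)
    return result


inventory = [
    AuditPolicy("disallow-privileged-containers", date(2024, 4, 1)),
    AuditPolicy("require-resource-limits", date(2024, 5, 15)),
    AuditPolicy("require-owner-label", date(2024, 1, 10), grace_days=60,
                exception_until=date(2024, 9, 30)),
]
late = overdue(inventory, today=date(2024, 6, 1))
```

With the sample dates, only disallow-privileged-containers is reported: the second policy is still inside its 30-day window and the third has a documented exception that has not yet expired.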

Policy as Code Requires Policy Testing

A policy that has never been tested is just structured documentation. Every policy file must have a companion test suite that exercises both the allow path (valid resources that must pass) and the deny path (invalid resources that must be rejected). OPA ships the opa test command; Kyverno has kyverno test. Run these in CI as part of your policy repository's pipeline — breaking a policy test is a build failure just like breaking an application unit test.

# ── OPA unit test for the privileged-container policy ──
# File: policy/privileged-container_test.rego

package k8scontainerprivileged

# Should PASS (non-privileged container — no violation expected)
test_nonprivileged_allowed {
  count(violation) == 0 with input as {
    "review": {"object": {"spec": {"containers": [
      {"name": "app", "securityContext": {"privileged": false}}
    ]}}}
  }
}

# Should DENY (privileged container — violation must be raised)
test_privileged_denied {
  count(violation) == 1 with input as {
    "review": {"object": {"spec": {"containers": [
      {"name": "app", "securityContext": {"privileged": true}}
    ]}}}
  }
}

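# Run with: opa test ./policy/ -v
# The policy and test above use pre-1.0 Rego syntax; on OPA 1.0 or later,
# run: opa test ./policy/ -v --v0-compatible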