Capstone: A Big-Tech Production Platform

Security & Compliance Layer

18 min Lesson 7 of 30

Arctiq Commerce carries PCI-DSS Level 2 scope and GDPR obligations. A single leaked secret or an unpatched container image that reaches production is not an embarrassment — it is a reportable breach with regulatory fines, customer churn, and potential criminal liability. At this scale, security cannot be a checklist applied at the end of a sprint; it must be a property of the platform that engineering teams inherit automatically. This lesson builds that property layer by layer: secrets management, supply-chain security, admission control, runtime enforcement, and compliance posture.

Secrets Management: HashiCorp Vault at Scale

The most common source of credential leaks in Kubernetes environments is secrets stored in plaintext in Git or baked into container images. Neither kubectl create secret nor sealed-secrets is sufficient at big-tech scale because they do not provide dynamic credential issuance, fine-grained lease management, or audit trails per-service. Vault is the industry standard answer.

Arctiq runs Vault in HA mode on three dedicated EC2 instances rather than inside Kubernetes, which avoids a circular dependency where secrets are needed to bootstrap the cluster that runs Vault. The storage backend is DynamoDB with point-in-time recovery enabled. Auto-unseal uses AWS KMS, so a Vault server restarts cleanly after an instance failure without human intervention.

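# vault-values.yaml (Helm values for the Vault Agent Injector, the webhook that adds the agent sidecar to annotated pods)
# Vault itself runs on EC2, so the chart's global.externalVaultAddr must also point at it (address not shown here).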
injector:
  enabled: true
  replicas: 2
  resources:
    requests:
      cpu: 50m
      memory: 64Mi
    limits:
      cpu: 250m
      memory: 128Mi

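# Enable Kubernetes auth in Vault (run once after cluster creation).
# Vault runs on EC2, outside the cluster, so it cannot reach the in-cluster ClusterIP
# or read pod-mounted service account files. Use the API endpoint and CA from kubeconfig.
vault auth enable kubernetes

# Assumes kubectl's current context is the target cluster and that $TOKEN_REVIEWER_JWT
# holds a token for a service account bound to system:auth-delegator.
kubectl config view --raw --minify --flatten \
  -o jsonpath='{.clusters[0].cluster.certificate-authority-data}' | base64 -d > k8s-ca.crt

vault write auth/kubernetes/config \
  kubernetes_host="$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')" \
  kubernetes_ca_cert=@k8s-ca.crt \
  token_reviewer_jwt="$TOKEN_REVIEWER_JWT"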

# Dynamic database credentials — 15-minute TTL
vault secrets enable database

vault write database/config/arctiq-aurora \
  plugin_name=postgresql-database-plugin \
  allowed_roles="app-readonly,app-readwrite" \
  connection_url="postgresql://{{username}}:{{password}}@aurora-primary.cluster.internal:5432/arctiq" \
  username="vault-root" \
  password="$VAULT_DB_ROOT_PASS"

vault write database/roles/app-readwrite \
  db_name=arctiq-aurora \
  creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO \"{{name}}\";" \
  default_ttl="15m" \
  max_ttl="1h"

Application pods use the Vault Agent Injector via annotations. The sidecar fetches a dynamic credential at pod start, writes it to a tmpfs volume, and renews it automatically before expiry. The application reads a file — it never calls Vault directly and never sees a long-lived password.

# Deployment annotation for dynamic DB creds
spec:
  template:
    metadata:
      annotations:
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/role: "order-service"
        vault.hashicorp.com/agent-inject-secret-db: "database/creds/app-readwrite"
        vault.hashicorp.com/agent-inject-template-db: |
          {{- with secret "database/creds/app-readwrite" -}}
          DATABASE_URL=postgres://{{ .Data.username }}:{{ .Data.password }}@aurora-primary:5432/arctiq
          {{- end }}
        # File name is relative to the secrets volume, so this renders to /vault/secrets/.env
        vault.hashicorp.com/agent-inject-file-db: ".env"

At big-tech scale, database credentials typically rotate automatically every 15 to 60 minutes. An application that cannot pick up new credentials while it is running, for example by refreshing them whenever the connection pool opens a connection, will fail under this model. Enforce this in code review: connection pools must reload credentials from the tmpfs path, not cache them in memory at startup.
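The reload requirement can be sketched in a few lines. This is a minimal illustration rather than Arctiq's implementation: the `RotatingCredentials` class and its method names are hypothetical, and it assumes the rendered secrets file contains `KEY=value` lines like the template above. It re-reads the file only when the modification time changes, so a pool can call `current()` every time it opens a connection.

```python
import os
from typing import Dict, Optional


class RotatingCredentials:
    """Re-reads a KEY=value secrets file whenever it changes on disk (hypothetical helper)."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._mtime_ns: Optional[int] = None
        self._values: Dict[str, str] = {}

    def current(self) -> Dict[str, str]:
        """Return the latest values, reloading only if the file was rewritten."""
        mtime_ns = os.stat(self._path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            self._values = self._parse()
            self._mtime_ns = mtime_ns
        return dict(self._values)

    def _parse(self) -> Dict[str, str]:
        values: Dict[str, str] = {}
        with open(self._path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                # Skip blanks, comments, and anything that is not KEY=value.
                if not line or line.startswith("#") or "=" not in line:
                    continue
                # Split on the first "=" only, so URLs with query strings survive.
                key, _, value = line.partition("=")
                values[key.strip()] = value.strip()
        return values
```

A pool's connection factory would read `RotatingCredentials("/vault/secrets/.env").current()["DATABASE_URL"]` on each new connection instead of capturing the value once at startup.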

Supply-Chain Security: SLSA Level 3

SolarWinds and Log4Shell proved that the attack surface is not just your code; it is every dependency in your build graph. PCI-DSS v4.0 Requirement 6.3.2 requires an inventory of bespoke software and the third-party components it incorporates, which a software bill of materials (SBOM) provides. SLSA (Supply-chain Levels for Software Artifacts) Build Level 3 requires a hardened, isolated build platform that generates signed provenance the build itself cannot forge. Arctiq pairs that with hermetic builds and signature verification at deploy time.

Supply-chain security pipeline: hermetic build → SBOM + Cosign signature → registry scan → admission verification.

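# GitHub Actions: keyless signing with Cosign + Sigstore (OIDC, no long-lived keys).
# The job needs "permissions: id-token: write" so Cosign can request an OIDC token.
# Cosign 2.x signs keylessly by default; COSIGN_EXPERIMENTAL=1 is only needed on 1.x.
- name: Sign image with Cosign
  run: |
    cosign sign --yes \
      "$ECR_REGISTRY/$ECR_REPO@$IMAGE_DIGEST"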

# Generate SBOM and attest it
- name: Generate and attest SBOM
  run: |
    syft $ECR_REGISTRY/$ECR_REPO@$IMAGE_DIGEST \
      -o cyclonedx-json > sbom.json
    cosign attest --yes \
      --predicate sbom.json \
      --type cyclonedx \
      $ECR_REGISTRY/$ECR_REPO@$IMAGE_DIGEST

# Kyverno policy: require verified signature before scheduling
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-image-signature
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces: ["*"]
      verifyImages:
        - imageReferences:
            - "123456789.dkr.ecr.us-east-1.amazonaws.com/*"
          attestors:
            - entries:
                - keyless:
                    subject: "https://github.com/arctiq-commerce/*"
                    issuer: "https://token.actions.githubusercontent.com"

Never use image: myapp:latest in production manifests. latest is a mutable tag: it can be repointed to a different digest silently, and because Cosign signatures are bound to digests, a tag alone gives no guarantee that the image you verified is the image that runs. It also makes rollbacks unreliable, because the previous image is no longer addressable by that tag. Always pin to the immutable digest: image: 123456789.dkr.ecr.us-east-1.amazonaws.com/order-service@sha256:abc123.... The GitOps controller (Argo CD Image Updater) handles digest promotion automatically.
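The pinning rule is easy to check mechanically before a manifest ever reaches the cluster. The sketch below is an illustrative pre-merge check, not part of the Arctiq pipeline as described; the function names are hypothetical, and it treats a reference as pinned only when it ends in `@sha256:` followed by 64 hexadecimal characters.

```python
import re

# A reference counts as pinned only if it ends in "@sha256:" plus 64 hex characters.
_DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image: str) -> bool:
    """Return True when the image reference is pinned to an immutable sha256 digest."""
    return bool(_DIGEST_REF.match(image))


def unpinned_images(pod_spec: dict) -> list:
    """List container images in a pod spec that are not pinned to a digest."""
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    images = [container.get("image", "") for container in containers]
    return [image for image in images if not is_digest_pinned(image)]
```

Running `unpinned_images` over the `spec` of each rendered manifest in CI gives developers the failure earlier than admission would, while the admission policy remains the actual enforcement point.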

Policy Enforcement: Kyverno at Admission

Kubernetes RBAC controls who can call the API. Admission controllers control what those calls are allowed to create. Kyverno policies enforce the security baseline across all 12 product teams without requiring them to know the rules — a pod that violates policy is simply rejected at apply time with a clear error message.

The Arctiq baseline policy set enforces: no privileged containers, no host-network or host-path mounts, resource limits on every container (prevents noisy-neighbor OOM kills), read-only root filesystems, and a restricted seccomp profile. The privilege, host-access, and seccomp rules map to the Kubernetes restricted Pod Security Standard; resource limits and read-only root filesystems go beyond that standard and are Arctiq additions.

# kyverno-baseline.yaml — enforce resource limits and read-only rootfs
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-cpu-memory-limits
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "CPU and memory limits are required on all containers."
        pattern:
          spec:
            containers:
              - name: "*"
                resources:
                  limits:
                    cpu: "?*"
                    memory: "?*"
---
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: readonly-rootfs
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-readonly-rootfs
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces:
                - "production-*"
                - "staging-*"
      validate:
        message: "Root filesystem must be read-only in production and staging namespaces."
        pattern:
          spec:
            containers:
              - securityContext:
                  readOnlyRootFilesystem: true

Runtime Security: Falco Threat Detection

Admission control prevents bad configurations from being deployed. But what about attacks that happen at runtime — a container that spawns an unexpected shell, a process that reads /etc/shadow, or a compromised dependency that calls curl to exfiltrate data? Falco is the CNCF-graduated runtime security engine that detects these behaviors by watching Linux kernel syscalls via eBPF.

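Falco rules alert on behavioral anomalies. A critical rule for PCI scope: any shell spawned inside a container running in the payment-processing namespace triggers a P1 alert and an automatic kubectl cordon of the node. Falco itself only detects and emits the alert; the cordon is carried out by a response component that consumes it, such as Falcosidekick or Falco Talon.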

# falco-custom-rules.yaml
- rule: Shell Spawned in Payment Namespace
  desc: Detect any shell process started in the payment-processing namespace
  condition: >
    spawned_process and
    container and
    k8s.ns.name = "payment-processing" and
    proc.name in (shell_binaries)
  output: >
    Shell spawned in payment namespace
    (pod=%k8s.pod.name user=%user.name shell=%proc.name parent=%proc.pname
     cmdline=%proc.cmdline image=%container.image.repository)
  priority: CRITICAL
  tags: [pci-dss, container, shell]

- rule: Outbound Connection from DB Container
  desc: Database containers must not initiate outbound connections
  condition: >
    outbound and
    container and
    k8s.deployment.name contains "postgres" and
    not fd.sport in (5432)
  output: Unexpected outbound from DB container (dest=%fd.rip:%fd.rport pod=%k8s.pod.name)
  priority: WARNING
  tags: [network, lateral-movement]

Falco's modern eBPF driver requires kernel 5.8+ and CAP_BPF or root. On EKS, use the managed add-on version so that AWS handles the kernel and driver compatibility matrix. Route Falco alerts to a dedicated PagerDuty escalation policy separate from application alerts; a runtime security alert must never be skipped by on-call.
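The routing rule can be expressed as a small function. This is a sketch under assumptions: Falco is configured to emit alerts as JSON lines, the destination names `security-pager` and `security-triage` are hypothetical, and the choice to page on any `pci-dss` tag is an illustration rather than something the rules above mandate.

```python
import json

# Falco priorities that should page a human immediately.
PAGE_PRIORITIES = {"EMERGENCY", "ALERT", "CRITICAL"}


def route_falco_alert(raw_event: str) -> str:
    """Pick a destination for one Falco alert emitted as a JSON line."""
    event = json.loads(raw_event)
    priority = str(event.get("priority", "")).upper()
    tags = set(event.get("tags") or [])
    if priority in PAGE_PRIORITIES or "pci-dss" in tags:
        # Hypothetical name for the dedicated security escalation policy.
        return "security-pager"
    # Hypothetical name for a reviewed queue; still separate from app on-call.
    return "security-triage"
```

Both destinations belong to the security rotation, so no Falco alert ever lands in the application on-call queue.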

Compliance Posture: CIS Benchmarks and Audit Logs

PCI-DSS requires quarterly internal vulnerability scans and annual penetration tests. But a mature platform does not wait for audits; it maintains a continuous compliance score. kube-bench runs CIS Kubernetes Benchmark checks as a DaemonSet on every node, emitting structured results to the central logging stack. Because the EKS control plane is managed by AWS, the checks that apply are the worker-node and cluster-configuration ones rather than checks on control plane hosts. Any benchmark regression in a deploy blocks the CD pipeline via a quality gate in Argo CD.
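The regression gate boils down to a set difference. In this minimal sketch the function names are hypothetical, and it assumes the IDs of failed checks have already been extracted from the kube-bench results for both the accepted baseline and the current run; that extraction step is not shown.

```python
def benchmark_regressions(baseline_failed: set, current_failed: set) -> list:
    """Return check IDs that fail now but did not fail in the accepted baseline."""
    return sorted(current_failed - baseline_failed)


def gate_passes(baseline_failed: set, current_failed: set) -> bool:
    """The gate passes only when no new check has started failing."""
    return not benchmark_regressions(baseline_failed, current_failed)
```

Comparing against a baseline rather than demanding zero failures lets the team accept known exceptions explicitly while still blocking any deploy that makes the score worse.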

Every Kubernetes API server action is persisted in CloudWatch Logs via the EKS audit log integration. Athena queries over the audit log catch privilege-escalation patterns: any create rolebinding or patch clusterrole outside the platform-engineering team triggers an automated Slack alert and a Jira ticket for the security team to review within 24 hours. This supports the audit-log review obligations in PCI-DSS Requirement 10 (numbered 10.4 in v4.0).
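The detection logic behind that query can be sketched against the Kubernetes audit event shape, which records the `verb`, the `objectRef.resource`, and the caller's `user.groups`. The function name is hypothetical, and modelling the platform-engineering team as a group named `platform-engineering` is an assumption about how Arctiq maps teams to identities.

```python
# (verb, resource) pairs named in the text as privilege-escalation signals.
WATCHED_ACTIONS = {
    ("create", "rolebindings"),
    ("patch", "clusterroles"),
}


def is_suspicious(event: dict, trusted_group: str = "platform-engineering") -> bool:
    """Flag a watched RBAC change made by a caller outside the trusted group."""
    verb = event.get("verb")
    resource = event.get("objectRef", {}).get("resource")
    groups = event.get("user", {}).get("groups", [])
    return (verb, resource) in WATCHED_ACTIONS and trusted_group not in groups
```

The same predicate translates directly into a WHERE clause; keeping it as code as well makes it easy to unit-test the rule before changing the production query.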

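Store Kubernetes audit logs for at least 12 months, with the most recent three months immediately available for analysis (PCI-DSS Req 10.7 in v3.2.1, 10.5.1 in v4.0). CloudWatch Logs retention is set per log group, and the default is "never expire," which keeps accruing storage charges. Set retention to 90 days and export the logs to S3 with a lifecycle rule that moves them to Glacier for the remaining 9 months. Storage drops from roughly $0.03/GB/month in CloudWatch to roughly $0.004/GB/month in Glacier; the ~$0.50/GB figure often quoted for CloudWatch is its one-time ingestion charge, which you pay either way.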
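The saving from tiering is simple arithmetic once the retention window is full. The helper below is a hypothetical sketch: the per-GB monthly rates are inputs that should come from the current AWS pricing pages, and it assumes a constant monthly log volume and ignores export and retrieval fees.

```python
def steady_state_monthly_cost(gb_per_month: float, hot_months: int, total_months: int,
                              hot_rate: float, cold_rate: float) -> float:
    """Monthly storage bill once a full retention window of logs is on disk.

    hot_rate and cold_rate are prices per GB per month for the hot tier
    (CloudWatch Logs) and the cold tier (Glacier).
    """
    if not 0 <= hot_months <= total_months:
        raise ValueError("hot_months must be between 0 and total_months")
    hot_gb = gb_per_month * hot_months
    cold_gb = gb_per_month * (total_months - hot_months)
    return hot_gb * hot_rate + cold_gb * cold_rate
```

Calling it once with `hot_months=12` and once with `hot_months=3` (both with `total_months=12`) gives the all-CloudWatch bill and the tiered bill for the same log volume.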

Defense-in-Depth Architecture

Defense-in-depth: six concentric security layers protect the Arctiq platform — each layer independently catches what the outer layer misses.

The key insight is independence: each layer catches a different class of threat. A misconfigured security group is caught by the WAF. A compromised developer credential is contained by IRSA least-privilege. An unsigned image is blocked at admission. A runtime exploit is detected by Falco before data exits. This is the architecture an auditor expects to see — and more importantly, it is the architecture that survives real incidents.

Security is not a one-time configuration. Schedule a quarterly security review that includes: rotating Vault root tokens, reviewing Kyverno policy exceptions (every exception is a debt item), re-running kube-bench after EKS upgrades, and reviewing Falco alert suppression rules for stale entries. Make this a platform team OKR, not an optional task.