Secrets Management & PKI

Secrets Management Principles

Before you reach for HashiCorp Vault or AWS Secrets Manager, you need to understand the principles that every serious secrets management system is built on. These four pillars — central storage, least privilege, rotation, and auditing — are not features of a specific tool. They are the architectural properties that separate a system that survives a breach from one that becomes the breach.

Teams at Google, Netflix, and Stripe do not all use the same secrets manager. What they share is ruthless adherence to these principles, enforced through policy, tooling, and culture. Understanding the "why" behind each principle is what lets you evaluate any tool — and make defensible trade-offs under pressure.

Pillar 1: Central Storage — One Source of Truth

The single most dangerous secrets anti-pattern in production is secret sprawl: the same database password lives in a .env file on the CI server, hardcoded in a Lambda deployment package, copy-pasted into Confluence, stored in a Kubernetes Secret created by hand, and emailed to three engineers who joined six months ago. When you need to rotate that credential, you have no idea where all the copies live. You will miss one. That one will be the one attackers find.

Central storage means there is exactly one authoritative location where a secret lives. Every consumer — your app, your CI pipeline, your Kubernetes pod — reads from that location at runtime. No copies, no caches with unbounded TTLs, no files checked into Git.
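
The "no caches with unbounded TTLs" rule can be sketched as a small wrapper around whatever call fetches from the secrets manager. This is a minimal illustration, not a client library: BoundedSecretCache, the fetch callable, and the injectable clock are all hypothetical names introduced here.

```python
import time
from typing import Callable, Dict, Tuple


class BoundedSecretCache:
    """Caches secret values for at most max_age_seconds, then refetches."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        max_age_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch              # hypothetical call into the secrets manager
        self._max_age = max_age_seconds
        self._clock = clock              # injectable so the expiry logic can be tested
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, path: str) -> str:
        now = self._clock()
        cached = self._entries.get(path)
        if cached is not None and now - cached[1] < self._max_age:
            return cached[0]             # still within the bounded TTL
        value = self._fetch(path)        # the central store stays the source of truth
        self._entries[path] = (value, now)
        return value
```

Because every entry expires, a rotated secret reaches the consumer within max_age_seconds without a redeploy.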

The "no secrets in Git" rule is absolute. A secret committed to a Git repository must be treated as permanently compromised, even if you immediately force-push to remove it. Git history is replicated to every clone, every CI system, and every fork, and a force-push does not purge the copies that already exist. Tools like git-secrets, trufflehog, and gitleaks exist specifically to scan repos for accidentally committed credentials. Mature security teams treat a secret found in a public repo as an incident: the credential is revoked and rotated first, and history cleanup comes second.
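
To make the scanning idea concrete, here is a deliberately tiny sketch of pattern-based detection. It is not how git-secrets, trufflehog, or gitleaks are implemented; the two patterns and the scan_text helper are illustrative assumptions, and real scanners ship far larger rule sets plus entropy checks.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; both names and regexes are assumptions for this sketch.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def scan_text(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, rule_name) for every line that matches a pattern."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, rule_name))
    return findings
```

Running a check like this in a pre-commit hook catches the mistake before it enters history, which is far cheaper than rotating afterwards.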

The practical shape of central storage in 2025: a secrets manager (Vault, AWS Secrets Manager, GCP Secret Manager, Azure Key Vault) with a well-defined path hierarchy, and strict policies about which paths applications can read. A concrete Vault path structure for a multi-environment setup:

# Vault KV v2 path hierarchy — enforced by policy, not convention
secret/
  data/
    production/
      api-service/
        database          # DB_URL, DB_PASSWORD
        stripe            # STRIPE_SECRET_KEY
        jwt               # JWT_SIGNING_KEY
    staging/
      api-service/
        database
        stripe
    ci/
      shared/
        docker-registry   # push credentials for CI builds only

# Reading a secret via the Vault CLI
vault kv get secret/production/api-service/database

# Reading programmatically (app startup)
vault kv get -format=json secret/production/api-service/database \
  | jq -r '.data.data.DB_PASSWORD'

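The hierarchy is not cosmetic. It maps directly onto your ACL (Access Control List) policies. The api-service in production can read secret/production/api-service/* (secret/data/production/api-service/* at the API level, which is the form policies use) and nothing else. It cannot read staging secrets, it cannot read another service's credentials, and it absolutely cannot read CI registry credentials. Vault evaluates the policies attached to the caller's token on every request and denies anything not explicitly granted, so this is enforced by the system, not by the honor system.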
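
The difference between the CLI path and the API path, and the double data nesting that the jq filter above relies on, can be sketched as follows. The helper names and the sample values are assumptions made for illustration; only the data.data and data.metadata layout reflects the KV v2 JSON output.

```python
import json
from typing import Dict


def kv2_api_path(mount: str, environment: str, service: str, name: str) -> str:
    """Build the KV v2 API path; the CLI path omits the 'data/' segment."""
    return f"{mount}/data/{environment}/{service}/{name}"


def extract_kv2_fields(raw_json: str) -> Dict[str, str]:
    """Return the stored key/value pairs from `vault kv get -format=json` output."""
    document = json.loads(raw_json)
    # KV v2 wraps the stored fields in data.data; data.metadata holds version info.
    return document["data"]["data"]


# Hypothetical sample shaped like the CLI's JSON output; the values are made up.
sample_output = json.dumps({
    "data": {
        "data": {"DB_URL": "postgres://db.internal:5432/app",
                 "DB_PASSWORD": "example-only"},
        "metadata": {"version": 3},
    }
})

fields = extract_kv2_fields(sample_output)
print(kv2_api_path("secret", "production", "api-service", "database"))
print(sorted(fields))
```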

Pillar 2: Least Privilege — Access What You Need, Nothing More

Least privilege in secrets management means every consumer has read access to exactly the secrets it needs to function, scoped to the environment it runs in, and no more. This sounds obvious, but violating it is the norm in most engineering organizations.

The most common failure mode: a shared service account or IAM role with read access to all secrets, because "it was easier to set up." When that account's credentials are compromised — stolen from a pod, leaked in a log line, phished from an engineer — the attacker has every secret in the system simultaneously.

Least-privilege scoping: each service identity is bound to only the secret paths it legitimately needs.

Implementing least privilege requires service identities — a unique, verifiable identity for each workload. In Kubernetes, this is the ServiceAccount + Vault's Kubernetes auth method. In AWS, it is an IAM role bound to an ECS task or a pod via IRSA (IAM Roles for Service Accounts). The identity is not a username and password; it is a cryptographically attested fact about what the workload is and where it runs.
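
For intuition about what a workload identity looks like, the sketch below pulls the namespace and name out of a Kubernetes service account token's sub claim, which has the form system:serviceaccount:<namespace>:<name>. It intentionally does NOT verify the signature, so it is only useful for inspection; a real verifier such as Vault validates the token with the cluster before trusting any claim. The function names are hypothetical.

```python
import base64
import json
from typing import Dict, Tuple


def decode_jwt_payload(token: str) -> Dict[str, object]:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    payload_segment = token.split(".")[1]
    # JWTs strip base64 padding; restore it before decoding.
    padded = payload_segment + "=" * (-len(payload_segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def service_account_identity(token: str) -> Tuple[str, str]:
    """Return (namespace, name) from a service account token's sub claim."""
    subject = str(decode_jwt_payload(token)["sub"])
    parts = subject.split(":")
    if len(parts) != 4 or parts[:2] != ["system", "serviceaccount"]:
        raise ValueError(f"not a service account subject: {subject}")
    return parts[2], parts[3]
```

The role binding shown later in this section matches on exactly these two values: the service account name and its namespace.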

# Vault policy: api-service in production (HCL)
# File: policies/api-service-prod.hcl

path "secret/data/production/api-service/*" {
  capabilities = ["read"]
}

# Explicit deny: redundant under deny-by-default, but it overrides any other policy that grants access
path "secret/data/production/payment-service/*" {
  capabilities = ["deny"]
}

# No list capability — service cannot enumerate what else exists
path "secret/metadata/production/*" {
  capabilities = ["deny"]
}

# Apply policy and bind to Kubernetes ServiceAccount (shell)
vault policy write api-service-prod policies/api-service-prod.hcl

vault write auth/kubernetes/role/api-service-prod \
  bound_service_account_names=api-service \
  bound_service_account_namespaces=production \
  token_policies=api-service-prod \
  token_ttl=1h

Deny list capability on secret paths. Vault policies are deny-by-default, and read does not imply list: a service can enumerate keys only if a policy attached to its token grants the list capability, which for KV v2 applies to the secret/metadata/ path. Granting list on secret/metadata/production/ would reveal your entire secret inventory to any compromised service, a serious information disclosure. An explicit deny on the path prefixes above the service's own namespace is a useful safeguard, because deny overrides any grant that another attached policy might add later.
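
The evaluation rules behind the policy above can be modeled in a few lines. This is a simplified sketch, not Vault's actual matcher: it handles only exact paths and a trailing * glob, evaluates a single policy, and ignores features such as + segment wildcards and merging across policies. The Policy type and is_allowed function are names invented for this example.

```python
from typing import Dict, Set, Tuple

Policy = Dict[str, Set[str]]


def _matches(pattern: str, path: str) -> bool:
    # A trailing "*" is a prefix glob; anything else must match exactly.
    if pattern.endswith("*"):
        return path.startswith(pattern[:-1])
    return pattern == path


def _specificity(pattern: str) -> Tuple[bool, int]:
    # Exact paths beat globs; among globs, the longer prefix wins.
    return (not pattern.endswith("*"), len(pattern))


def is_allowed(policy: Policy, path: str, capability: str) -> bool:
    """Deny by default; the most specific matching rule decides."""
    matching = [pattern for pattern in policy if _matches(pattern, path)]
    if not matching:
        return False                     # no rule matched: implicit deny
    capabilities = policy[max(matching, key=_specificity)]
    return "deny" not in capabilities and capability in capabilities


# Mirrors the HCL policy above.
api_service_prod: Policy = {
    "secret/data/production/api-service/*": {"read"},
    "secret/data/production/payment-service/*": {"deny"},
    "secret/metadata/production/*": {"deny"},
}
```

Note that the staging path is refused without any rule mentioning it: the absence of a grant is itself the denial.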

Pillar 3: Rotation — Secrets Have Lifespans

A secret that never rotates is a time bomb. Every day it exists unchanged, the probability that it has been exfiltrated and not yet used grows. Rotation is the practice of replacing secrets on a regular schedule — or immediately upon suspicion of compromise — while keeping dependent services available. This is harder than it sounds.

There are two rotation patterns you need to understand:

Manual rotation: an operator generates a new credential, updates the secrets manager, and rolls pods/services to pick it up. Viable for secrets that change rarely (manually issued TLS certificates renewed annually, long-lived service accounts). High toil and a real risk of human error.
Automatic dynamic secrets: the secrets manager generates a fresh, short-lived credential on every request, and destroys it when the lease expires. The application never holds the same credential twice. This is the production-grade default for database passwords and cloud API keys — covered in depth in Lesson 4.
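
The dynamic pattern can be illustrated with a toy issuer that hands out a unique credential per request and stops honoring it once the lease expires. This is an in-memory sketch under stated assumptions, not how any real secrets engine is built; Lease, DynamicCredentialIssuer, and the username format are hypothetical.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Lease:
    username: str
    password: str
    expires_at: float


class DynamicCredentialIssuer:
    """Toy stand-in for a dynamic secrets engine: one fresh credential per request."""

    def __init__(
        self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._active: Dict[str, Lease] = {}

    def issue(self, role: str) -> Lease:
        # Every call produces a new username and password; nothing is reused.
        username = f"v-{role}-{secrets.token_hex(4)}"
        lease = Lease(username, secrets.token_urlsafe(24), self._clock() + self._ttl)
        self._active[username] = lease
        return lease

    def is_valid(self, username: str, password: str) -> bool:
        lease = self._active.get(username)
        if lease is None or self._clock() >= lease.expires_at:
            self._active.pop(username, None)   # expired leases are destroyed
            return False
        return secrets.compare_digest(lease.password, password)
```

A stolen credential from this issuer is useful for at most one TTL, and revoking it affects only the instance that held it.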

A practical rotation schedule for a production system:

# Secret rotation frequency — production baseline (2025)
#
# Dynamic secrets (Vault / cloud-native):
#   Database passwords:     TTL 1h  (auto-issued per app instance)
#   Cloud API access tokens: TTL 15m (short-lived STS/OIDC tokens)
#   Internal service tokens: TTL 30m
#
# Static secrets (rotated on a schedule):
#   TLS certificates:       90 days (automated via cert-manager + ACME)
#   External API keys:      30-90 days (vendor-dependent, calendar reminder)
#   SSH host keys:          180 days
#   Root CA private key:    2-5 years (offline, HSM-backed)
#
# Forced immediate rotation triggers:
#   - Employee offboarding with secret access
#   - Git history scan finds a secret
#   - Vendor notifies of API key leak
#   - IDS/SIEM alert on anomalous secret usage

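# AWS Secrets Manager: enable automatic rotation on a secret
# (first-time setup also needs a rotation function; the ARN below is a placeholder)
aws secretsmanager rotate-secret \
  --secret-id production/api-service/database \
  --rotation-lambda-arn arn:aws:lambda:REGION:ACCOUNT_ID:function:ROTATION_FUNCTION \
  --rotation-rules AutomaticallyAfterDays=30 \
  --rotate-immediately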
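
The static-secret intervals in the schedule above can also be checked mechanically instead of relying on calendar reminders. The sketch below assumes a hypothetical inventory that maps each secret name to its kind and last rotation date; MAX_AGE_DAYS uses the upper bound of each range, and none of these names come from a real tool.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple

# Maximum ages taken from the static-secret baseline above; adjust to your policy.
MAX_AGE_DAYS: Dict[str, int] = {
    "tls_certificate": 90,
    "external_api_key": 90,
    "ssh_host_key": 180,
}


def overdue_secrets(
    inventory: Dict[str, Tuple[str, date]], today: date
) -> List[str]:
    """Return the names of secrets whose age exceeds the limit for their kind."""
    overdue = []
    for name, (kind, last_rotated) in inventory.items():
        if today - last_rotated > timedelta(days=MAX_AGE_DAYS[kind]):
            overdue.append(name)
    return sorted(overdue)
```

A scheduled job that runs this against the inventory and opens a ticket for each overdue entry turns the schedule from a document into an enforced control.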

Blue/green rotation is the safe pattern: never hard-cut. When rotating a database password, the naive approach is to update the password in the DB, update the secret, and restart all apps. During the restart window, old apps hold the old (now invalid) password and fail with auth errors. The correct pattern: generate new credentials alongside the old ones, update the secret to the new value, let apps pick it up via their normal lease-renewal cycle, then revoke the old credentials only after confirming zero connections are using them. Vault's database secrets engine avoids most of this problem with dynamic credentials: each lease gets its own database user, so issuing a new credential never invalidates one that is still in use, and each credential is revoked when its lease expires.
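
The ordering of the blue/green steps is the whole point, so here is a minimal simulation of it. FakeDatabase models the overlap as a set of accepted passwords; in practice this is usually done with two database users that alternate. All names here are hypothetical, and the connection count would come from your database's own session view.

```python
from typing import Dict, Set


class FakeDatabase:
    """In-memory stand-in for a database that can accept several credentials."""

    def __init__(self, password: str) -> None:
        self.valid_passwords: Set[str] = {password}

    def connect(self, password: str) -> bool:
        return password in self.valid_passwords


def begin_rotation(
    db: FakeDatabase, secret_store: Dict[str, str], key: str, new_password: str
) -> str:
    """Make the new credential valid, then publish it. Returns the old password."""
    old_password = secret_store[key]
    db.valid_passwords.add(new_password)   # 1. new credential works alongside the old
    secret_store[key] = new_password       # 2. consumers pick it up on next refresh
    return old_password


def finish_rotation(
    db: FakeDatabase, old_password: str, connections_using_old: int
) -> bool:
    """Revoke the old credential only once nothing is still using it."""
    if connections_using_old > 0:
        return False                       # 3. not yet safe; try again later
    db.valid_passwords.discard(old_password)
    return True
```

At no point between begin_rotation and finish_rotation does a consumer hold a credential the database rejects, which is what distinguishes this from the hard-cut approach.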

Pillar 4: Auditing — Every Access Must Leave a Trail

Auditing is how you answer three questions after a security event: Was this secret accessed? By whom or what? When? Without audit logs, you cannot scope a breach. You cannot tell the difference between "attacker read the database password once" and "attacker has been exfiltrating it in bulk for six months." Both look identical from the outside.

A complete audit trail for secrets access includes the secret path, the accessor identity (service, pod, human operator), the operation (read, write, delete, renew), the timestamp, the source IP, and the outcome (success or deny). These logs must be:

Immutable: written to a sink the accessor cannot modify (centralized logging, SIEM, write-once S3 bucket). An attacker with root on your app server should not be able to erase evidence of their secret reads.
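Retained: PCI DSS requires at least 12 months of audit log history, with the most recent three months immediately available for analysis. SOC 2 does not prescribe a fixed period, but auditors generally expect logs that cover the full audit window. Store 13 months to cover year-over-year analysis.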
Alertable: anomalous patterns (bulk reads, reads from unexpected IPs, reads outside business hours, first-seen service account) should trigger real-time alerts — not just be discoverable after the fact.
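
As one example of the "alertable" property, bulk reads can be flagged with a sliding window per identity. This is a sketch of the idea rather than a SIEM rule; BulkReadDetector and its thresholds are assumptions, and timestamps are plain seconds for simplicity.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class BulkReadDetector:
    """Flags an identity that exceeds `threshold` reads within `window_seconds`."""

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self._threshold = threshold
        self._window = window_seconds
        self._reads: Dict[str, Deque[float]] = defaultdict(deque)

    def observe(self, identity: str, timestamp: float) -> bool:
        """Record one read; return True if the identity is now over the threshold."""
        reads = self._reads[identity]
        reads.append(timestamp)
        # Drop reads that have fallen out of the window.
        while reads and timestamp - reads[0] > self._window:
            reads.popleft()
        return len(reads) > self._threshold
```

Feeding it the display_name and time of each audit event gives a real-time signal for the "exfiltrating in bulk" case that would otherwise only surface during forensics.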

# Enable Vault audit device — write to a file (pipe to your SIEM)
vault audit enable file file_path=/var/log/vault/audit.log

# Vault audit log entry (JSON; one line per event on disk, pretty-printed here)
{
  "time": "2025-09-14T02:31:07.123Z",
  "type": "response",
  "auth": {
    "client_token": "hmac-sha256:...",
    "accessor": "hmac-sha256:...",
    "display_name": "kubernetes-production-api-service",
    "policies": ["api-service-prod"],
    "metadata": {
      "service_account_name": "api-service",
      "service_account_namespace": "production"
    }
  },
  "request": {
    "id": "3e4f8b2a-...",
    "operation": "read",
    "path": "secret/data/production/api-service/database",
    "remote_address": "10.0.1.42"
  },
  "response": {
    "mount_type": "kv"
  },
  "error": ""
}

# Second audit device: stream to a local log agent over TCP (e.g. one that forwards to Datadog)
vault audit enable socket address=localhost:10514 socket_type=tcp

# Alert rule: secret read from unexpected CIDR (pseudocode for SIEM)
# IF request.path MATCHES "secret/data/production/*"
# AND request.remote_address NOT IN [10.0.0.0/8, 172.16.0.0/12]
# THEN page on-call immediately
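
The pseudocode alert rule above translates almost directly into Python using the standard ipaddress module. This is a sketch of the rule only; should_page and the constants are hypothetical names, and the field names follow the request object in the sample audit entry.

```python
import ipaddress
import json

# Networks the rule treats as expected sources, copied from the pseudocode above.
EXPECTED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
]
PRODUCTION_PREFIX = "secret/data/production/"


def should_page(audit_line: str) -> bool:
    """Return True when a production secret is accessed from an unexpected address."""
    event = json.loads(audit_line)
    request = event.get("request", {})
    if not request.get("path", "").startswith(PRODUCTION_PREFIX):
        return False                     # rule only covers production secrets
    address = ipaddress.ip_address(request["remote_address"])
    return not any(address in network for network in EXPECTED_NETWORKS)
```

Each line of the audit file is one JSON document, so the function can be applied line by line as the log is tailed.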

Vault's audit log is a hard dependency, not an optional extra. Once at least one audit device is enabled, Vault must successfully record each request on at least one device before it responds; if every enabled device is blocked or unreachable, Vault stops serving requests. This is intentional: an unavailable audit log means you cannot prove compliance or investigate incidents. Design your audit pipeline (log shipper, SIEM endpoint) for high availability before enabling it in production. Use two audit devices (file + socket) for redundancy.

Putting It Together: The Principle Hierarchy

These four principles are not independent checkboxes. They reinforce each other into a defense-in-depth posture:

Central storage gives you the control plane — one place to enforce policy, rotate, and audit.
Least privilege limits the blast radius when a service identity is compromised — the attacker gets only that service's secrets.
Rotation limits the time window during which a stolen credential is valid — even if rotation fails, short-lived dynamic secrets expire automatically.
Auditing ensures that compromise is detectable — both in real time (alerting) and after the fact (forensics).

The next lesson goes inside HashiCorp Vault's architecture to see exactly how it implements these principles at scale — the seal/unseal mechanism, auth methods, secret engines, and the policy language that ties them together.