Secrets Management & PKI

Rotation & Incident Response for Secrets

Rotation is the practice of replacing a live credential with a new one before an attacker can make use of the old one. Incident response is what you do when rotation was not fast enough, or never happened at all. At big-tech scale, both are automated, practiced, and treated with the same rigor as database failovers. This lesson covers the full lifecycle: proactive rotation strategies, detection of leaked secrets, revocation mechanics, and the runbook you need when something has already gone wrong.

Why Rotation Is Not Optional

Every long-lived credential is a liability that compounds over time. The longer a secret is valid, the more copies of it accumulate across systems, scripts, caches, and engineer laptops. Rotation resets that window. Beyond risk reduction, rotation is also how you verify your entire secrets pipeline actually works: if you cannot rotate a credential in production without downtime, your secrets architecture is broken — and you want to discover that during a planned drill, not during a breach response at 3 AM.

The two rotation invariants: (1) The old credential must remain valid for a short overlap window so in-flight requests do not drop. (2) The new credential must be distributed to all consumers before the old one is revoked. Violating either causes an outage. Both are solved by the same tool: a secrets manager that handles the swap atomically and notifies consumers via a lease or watch mechanism.

Rotation Strategies

There is no single rotation strategy that fits all credential types. Match the strategy to the credential lifecycle.

1. TTL-Based Rotation (Dynamic Secrets)

The cleanest model: credentials are generated on demand and expire automatically. Vault dynamic secrets and AWS IAM roles with short-session STS tokens implement this. There is nothing to rotate because the credential was never meant to last. A database secret with a 1-hour TTL is usable by an attacker for at most 1 hour, even if it leaks the moment it is issued.
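On the consumer side, TTL-based credentials mean the application must fetch a fresh credential before the current lease runs out. The following is a minimal sketch of that idea, not a Vault client: the `fetch` callable, the `refresh_fraction` of 0.7, and the class name are all hypothetical, and `fetch` is assumed to return a username, a password, and a TTL in seconds.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Lease:
    username: str
    password: str
    expires_at: float  # deadline on the injected clock


class DynamicCredentialCache:
    """Re-fetches a short-lived credential before its lease expires."""

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, str, float]],  # hypothetical fetcher
        refresh_fraction: float = 0.7,                # assumed safety margin
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._refresh_fraction = refresh_fraction
        self._clock = clock
        self._lease: Optional[Lease] = None
        self._refresh_at = 0.0

    def get(self) -> Lease:
        now = self._clock()
        # Fetch on first use, or once the refresh point has passed.
        if self._lease is None or now >= self._refresh_at:
            username, password, ttl = self._fetch()
            self._lease = Lease(username, password, now + ttl)
            self._refresh_at = now + ttl * self._refresh_fraction
        return self._lease
```

Refreshing at a fraction of the TTL rather than at expiry leaves headroom for a slow or failed fetch while the current credential is still valid.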

2. Scheduled Rotation (Static Secrets)

For credentials that cannot be made dynamic (legacy database passwords, third-party API keys, signing keys), schedule automatic rotation on a calendar. There is no single industry standard; a common baseline is 90 days for API keys and 30 days for database passwords, tightened for higher-risk credentials. AWS Secrets Manager and HashiCorp Vault both support rotation schedules with zero-downtime swap logic built in.

# AWS Secrets Manager — enable automatic rotation for a database secret
# Rotation happens via a Lambda function that Secrets Manager invokes

aws secretsmanager rotate-secret \
  --secret-id prod/myapp/db-password \
  --rotation-rules AutomaticallyAfterDays=30 \
  --rotation-lambda-arn arn:aws:lambda:us-east-1:123456789012:function:SecretsManagerRotation

# Check rotation status
aws secretsmanager describe-secret \
  --secret-id prod/myapp/db-password \
  --query '{RotationEnabled:RotationEnabled,LastRotatedDate:LastRotatedDate,NextRotationDate:NextRotationDate}'

# Trigger an immediate rotation (use during incident or drill)
aws secretsmanager rotate-secret \
  --secret-id prod/myapp/db-password \
  --rotate-immediately

# HashiCorp Vault — automated database rotation with zero-downtime swap
vault write database/config/prod-postgres \
  plugin_name=postgresql-database-plugin \
  connection_url="postgresql://{{username}}:{{password}}@prod-db.internal:5432/appdb" \
  allowed_roles="app-role" \
  username="vault-admin" \
  password="initial-root-password"

vault write database/roles/app-role \
  db_name=prod-postgres \
  creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN ENCRYPTED PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; GRANT SELECT,INSERT,UPDATE,DELETE ON ALL TABLES IN SCHEMA public TO \"{{name}}\";" \
  default_ttl=1h \
  max_ttl=4h

# Rotate the bootstrap root credential so humans no longer know it
vault write -force database/rotate-root/prod-postgres

# App reads dynamic credentials on demand
vault read database/creds/app-role
# username: v-appservice-xK9mZ   lease_duration: 1h

Zero-Downtime Rotation: The Four-Step Pattern

For any stateful credential (database password, API key), a naive "change the password, update the app" sequence creates a gap that causes dropped requests. The correct four-step pattern eliminates this gap:

1. Add new credential: create the new password alongside the existing one; both are valid simultaneously at the source.
2. Distribute new credential: push the new secret to the secrets manager and let consumers pick it up via lease renewal. Wait until all consumers confirm they are using the new credential.
3. Revoke old credential: once no consumer holds the old credential, invalidate it at the source (database, identity provider, issuing API).
4. Verify and audit: check audit logs for any attempted use of the old credential during a 15-minute grace window after revocation. Alert on any hit; it means either a consumer was missed or someone else holds the old credential.

Figure: The four-step zero-downtime rotation pattern eliminates the gap between credential swap and consumer update.
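The four steps can be sketched as a small in-memory simulation. This is an illustration under stated assumptions, not a real rotation implementation: `Source`, `Consumer`, `rotate`, and `old_credential_hits` are hypothetical names, the source is assumed to accept several valid passwords at once, and the audit log is just a list of presented passwords.

```python
import secrets
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Source:
    """Stand-in for a database or API that accepts several valid passwords."""
    valid: Set[str]
    attempts: List[str] = field(default_factory=list)  # toy audit log

    def authenticate(self, password: str) -> bool:
        self.attempts.append(password)
        return password in self.valid


@dataclass
class Consumer:
    name: str
    credential: str
    reachable: bool = True

    def pick_up(self, new: str) -> None:
        # An unreachable consumer keeps whatever credential it had.
        if self.reachable:
            self.credential = new


def rotate(source: Source, consumers: List[Consumer], old: str) -> str:
    # Step 1: add the new credential; old and new are both valid.
    new = secrets.token_urlsafe(24)
    source.valid.add(new)

    # Step 2: distribute, then confirm that every consumer switched.
    for consumer in consumers:
        consumer.pick_up(new)
    stragglers = [c.name for c in consumers if c.credential != new]
    if stragglers:
        # Abort before revoking: the old credential stays valid, so no outage.
        raise RuntimeError(f"still on old credential: {stragglers}")

    # Step 3: revoke the old credential at the source.
    source.valid.discard(old)
    return new


def old_credential_hits(source: Source, old: str, since: int = 0) -> int:
    # Step 4: count audit-log entries that presented the old credential.
    return source.attempts[since:].count(old)
```

The important design choice is the ordering: revocation is only reached after the straggler check passes, so a consumer that failed to pick up the new credential aborts the rotation instead of causing an outage.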

Detecting Leaked Secrets

Rotation on a schedule is proactive. Detection is reactive: it closes the gap between when a secret leaks and when you respond. At large companies, secret detection runs in three layers simultaneously: pre-commit (block before the credential reaches a remote), repository scanning (catch what slipped through), and behavioral anomaly detection (notice when a credential is being used from an unexpected location or at an unexpected rate).
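To make the pre-commit and repository-scanning layers concrete, here is a deliberately tiny sketch of what a pattern-based scanner does. It is not how gitleaks or trufflehog are implemented; the two rules and the `scan_text` function are illustrative assumptions, and real tools ship hundreds of rules plus entropy checks and live verification.

```python
import re
from typing import Iterator, Tuple

# Illustrative rules only; real scanners ship far larger rule sets.
PATTERNS = {
    "aws-access-key-id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "private-key-header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def scan_text(text: str) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, rule_name) for every line matching a rule."""
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in PATTERNS.items():
            if pattern.search(line):
                yield lineno, rule


if __name__ == "__main__":
    # AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
    sample = "region = us-east-1\naws_access_key_id = AKIAIOSFODNN7EXAMPLE\n"
    for lineno, rule in scan_text(sample):
        print(f"line {lineno}: {rule}")
```

A pre-commit hook would run a function like this over the staged diff and exit non-zero on any hit, which is what blocks the commit.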

# gitleaks — pre-commit hook and CI gate (blocks secrets from entering repos)
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.2
    hooks:
      - id: gitleaks

# Scan full git history on any repo you inherit
gitleaks detect --source . -v --report-path gitleaks-report.json

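# GitHub Actions: block PRs that introduce secrets (blocking check)
# Check out full history first so gitleaks can scan every commit in the PR
- uses: actions/checkout@v4
  with:
    fetch-depth: 0
- name: Detect secrets
  uses: gitleaks/gitleaks-action@v2
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}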

# trufflehog — scan GitHub org, S3, Docker images for verified secrets
trufflehog github --org=mycompany --only-verified --json \
  | jq -c '{repo:.SourceMetadata.Data.Github.repository,
            file:.SourceMetadata.Data.Github.file,
            type:.DetectorName}'

trufflehog s3 --bucket=my-app-logs-bucket --only-verified
trufflehog docker --image=myregistry.io/myapp:latest --only-verified

# CloudWatch Logs Insights query: anomalous IAM credential use from external IPs
# (Run in CloudWatch Logs Insights against your CloudTrail log group)
fields @timestamp, userIdentity.arn, sourceIPAddress, eventName, awsRegion
| filter userIdentity.type = "IAMUser"
    and sourceIPAddress not like /^10\./
    and sourceIPAddress not like /^172\.(1[6-9]|2[0-9]|3[01])\./
    and sourceIPAddress not like /^192\.168\./
| stats count(*) as calls by userIdentity.arn, sourceIPAddress, awsRegion
| sort calls desc
| limit 50
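The same filter can be applied offline to exported CloudTrail records. The sketch below is one way to do that in Python, assuming each record is a dict with the standard `userIdentity` and `sourceIPAddress` fields; `is_external` and `external_calls` are hypothetical helper names. It treats the three RFC 1918 ranges as internal and ignores non-IP callers such as AWS service hostnames.

```python
import ipaddress
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple

# The three RFC 1918 private ranges.
PRIVATE_NETS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_external(source_ip: str) -> bool:
    """True when source_ip is a literal IP outside the private ranges."""
    try:
        addr = ipaddress.ip_address(source_ip)
    except ValueError:
        # CloudTrail records AWS service callers as hostnames
        # (for example "ec2.amazonaws.com"); skip those.
        return False
    return not any(addr in net for net in PRIVATE_NETS)


def external_calls(
    events: Iterable[Dict[str, Any]], limit: int = 50
) -> List[Tuple[Tuple[str, str], int]]:
    """Count IAM-user calls per (arn, source IP) from external addresses."""
    counts: Counter = Counter()
    for event in events:
        identity = event.get("userIdentity", {})
        ip = event.get("sourceIPAddress", "")
        if identity.get("type") == "IAMUser" and is_external(ip):
            counts[(identity.get("arn", ""), ip)] += 1
    return counts.most_common(limit)
```

Using network membership instead of string patterns avoids the usual pitfall of only partially covering 172.16.0.0/12, which spans 172.16 through 172.31.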

Revocation Mechanics

Revocation is different from rotation: rotation replaces a credential, revocation kills it with no replacement. You revoke when you have confirmed or strongly suspect a breach. Every credential type has its own revocation path — know all of them before you need them.

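AWS IAM key: aws iam delete-access-key --access-key-id AKIA... (add --user-name when the key belongs to a different user than the caller). There is no grace period.
AWS IAM role (via STS): Attach an inline deny policy to the role, conditioned on aws:TokenIssueTime so that it applies to sessions issued before the cutoff. STS tokens otherwise remain valid until they expire, so the deny policy is the only way to kill active sessions before their TTL.
Vault lease: vault lease revoke <lease_id> or revoke all leases from a path: vault lease revoke -prefix auth/token/create.
Kubernetes Secret: kubectl delete secret <name>, then rolling-restart pods that had it mounted so they do not serve from their in-memory copy. This removes only the cluster's copy; the underlying credential must still be revoked at its issuer.
TLS certificate: Revoke it at the issuing CA so that it appears in the CA's CRL and OCSP responses. Revocation checking is inconsistent across clients (many browsers soft-fail or skip the check), so also reissue the certificate with a new key pair and prefer short-lived certificates.
GitHub PAT / OAuth token: Revoke an OAuth app token via the GitHub API: DELETE /applications/{client_id}/token, authenticated as the OAuth app. A personal access token is deleted by its owner under Settings > Developer settings > Personal access tokens.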

STS tokens cannot be revoked by deleting the IAM access key. If a long-lived IAM access key generates an STS session token, deleting the IAM key does NOT invalidate the STS token: it remains valid until it expires (1 hour by default for AssumeRole, configurable up to 12 hours per role; 12 hours by default for GetSessionToken). The correct kill switch is an IAM deny policy applied to the role or user. This is one of the most dangerous misconceptions in AWS incident response.
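A deny policy of this kind is typically conditioned on `aws:TokenIssueTime`, so that it blocks only sessions issued before a cutoff and leaves sessions issued afterwards untouched. The sketch below builds such a policy document; the function name is hypothetical, and the blanket `"Action": "*"` scope is an assumption you may want to narrow.

```python
import json
from datetime import datetime, timezone


def revoke_older_sessions_policy(cutoff: datetime) -> str:
    """Build a deny policy for sessions issued before `cutoff`."""
    if cutoff.tzinfo is None:
        raise ValueError("cutoff must be timezone-aware")
    stamp = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                # Matches only temporary credentials issued before the
                # cutoff; long-term access keys carry no token issue time.
                "Condition": {"DateLessThan": {"aws:TokenIssueTime": stamp}},
            }
        ],
    }
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    print(revoke_older_sessions_policy(datetime.now(timezone.utc)))
```

The output can be attached as an inline policy with aws iam put-role-policy. Because the condition key is absent for long-term access keys, this policy does not replace deleting a leaked IAM user key; the two actions are complementary.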

Incident Response Runbook: Suspected Leaked Secret

When a secret is suspected leaked — via a GitHub alert, engineer report, or anomaly detection — execute this runbook in order. Speed matters: every minute the credential is live is a minute the attacker uses it.

T+0: Contain. Revoke the credential immediately at its source. Do not wait to confirm the leak: false positives cost minutes, true positives cost the company. In AWS: delete the IAM key AND apply a deny policy to the role. In Vault: run vault lease revoke -prefix against the affected path.
T+5: Assess blast radius. Pull the last 24 hours of audit logs for the compromised credential. What was accessed? Was data read or exfiltrated? Were any new credentials or resources created by the attacker? In AWS: filter CloudTrail by the compromised key's userIdentity.accessKeyId.
T+15: Rotate adjacent secrets. Any secret that shared the same source, vault policy, or AWS account boundary should be rotated immediately. An attacker with one credential often uses it to enumerate or assume access to related credentials.
T+30: Notify and document. If PII or payment data was in the blast radius, legal and compliance must be notified for breach notification timelines (GDPR: 72 hours; PCI: immediately to card brands). Start a timestamped incident document now, not after recovery.
T+Recovery: Rotate all long-lived credentials. After the immediate fire is out, schedule rotation for every long-lived static credential in the same environment. A breach is proof that your rotation cadence was too slow.
T+1 week: Post-mortem and systemic fix. Write a blameless post-mortem. The root cause is almost never "engineer made a mistake." It is "the system allowed a long-lived credential to exist in a place where it could be leaked." Fix the system: move to dynamic secrets, shorten TTLs, add detection gates.
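The T+5 blast-radius step can be partly scripted once the CloudTrail records are exported. The following sketch assumes records are dicts in CloudTrail's JSON shape; `blast_radius` and the `PERSISTENCE_EVENTS` set are hypothetical, and the set is a small, non-exhaustive sample of IAM calls an attacker might use to create new access for themselves.

```python
from collections import Counter
from typing import Any, Dict, Iterable

# Non-exhaustive sample of IAM calls that create or widen access.
PERSISTENCE_EVENTS = {
    "CreateAccessKey",
    "CreateUser",
    "CreateLoginProfile",
    "CreateRole",
    "AttachUserPolicy",
    "AttachRolePolicy",
    "PutUserPolicy",
}


def blast_radius(
    events: Iterable[Dict[str, Any]], access_key_id: str
) -> Dict[str, Any]:
    """Summarize what a single access key did in the given records."""
    mine = [
        e for e in events
        if e.get("userIdentity", {}).get("accessKeyId") == access_key_id
    ]
    names = [e.get("eventName", "") for e in mine]
    return {
        "total_calls": len(mine),
        "by_event": Counter(names),
        # Any entry here means the attacker may hold further credentials.
        "persistence": sorted(set(names) & PERSISTENCE_EVENTS),
        "source_ips": sorted({e.get("sourceIPAddress", "") for e in mine}),
    }
```

A non-empty `persistence` list is the signal to widen the T+15 step, since credentials created by the attacker will survive revocation of the original key.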

Run a rotation drill quarterly. Pick one production credential at random, rotate it under simulated incident conditions (war-room call, no advance notice to the team on call), and measure MTTR (Mean Time To Rotate). A well-designed secrets architecture should achieve sub-15-minute rotation for any single credential, with zero user-facing errors. If your drill takes longer, you found your weak point before an attacker did. Document the playbook, automate the steps, and re-drill.
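Tracking drill results takes only a few lines. The sketch below is one possible way to compute Mean Time To Rotate from drill start and end timestamps and compare it with the 15-minute target named above; `drill_report` is a hypothetical helper and the timestamps in the usage example are made up for illustration.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, List, Tuple

TARGET = timedelta(minutes=15)  # the sub-15-minute goal from the text


def drill_report(drills: List[Tuple[datetime, datetime]]) -> Dict[str, object]:
    """Summarize rotation drills given (started, finished) pairs."""
    durations = [finished - started for started, finished in drills]
    mttr = timedelta(seconds=mean(d.total_seconds() for d in durations))
    return {
        "mttr": mttr,
        "worst": max(durations),
        "within_target": all(d <= TARGET for d in durations),
    }


if __name__ == "__main__":
    # Illustrative timestamps, not real drill data.
    q1 = (datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 9, 12))
    q2 = (datetime(2024, 4, 10, 9, 0), datetime(2024, 4, 10, 9, 20))
    print(drill_report([q1, q2]))
```

Reporting the worst drill alongside the mean matters here, because a single slow rotation is exactly the weak point the drill is meant to expose.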

Automating the Runbook with Vault Sentinel and AWS Config

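Manual runbooks drift and are forgotten under pressure. The production-grade approach is to automate the first two steps of the runbook so that detection triggers containment without human latency. Vault Enterprise's Sentinel policies can reject requests that match anomaly conditions (for example, an unexpected source CIDR), and those denials can feed an alert that triggers lease revocation. On AWS, Config rules evaluate resource configuration (for example, flagging access keys that are overdue for rotation), while anomalous usage, such as an IAM key used from a Tor exit node or from a country outside your baseline, is surfaced by a threat-detection service such as GuardDuty. Routing that finding to a Lambda that immediately deletes the key and posts to your incident Slack channel can cut mean time to contain from tens of minutes to about a minute.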
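A containment Lambda of this kind might look like the sketch below. It assumes the trigger is a GuardDuty finding delivered through EventBridge, which is an assumption about the event source rather than something the text prescribes, and it assumes the finding carries `detail.resource.accessKeyDetails`. The IAM client and the `notify` callable are injected so the logic can be exercised without AWS; in a real Lambda the client would be a boto3 IAM client, whose `delete_access_key(UserName=..., AccessKeyId=...)` call is used here.

```python
from typing import Any, Callable, Dict, Optional


def handle_finding(
    event: Dict[str, Any],
    iam: Any,                      # injected; e.g. a boto3 IAM client
    notify: Callable[[str], None]  # hypothetical: posts to the incident channel
) -> Optional[str]:
    """Delete the IAM user key named in a finding; return its ID if deleted."""
    detail = event.get("detail", {})
    key = detail.get("resource", {}).get("accessKeyDetails", {})
    key_id = key.get("accessKeyId")
    user = key.get("userName")

    # Only long-term IAM user keys can be deleted; role sessions need a
    # deny policy instead, so leave those to a human or a separate handler.
    if not key_id or not user or key.get("userType") != "IAMUser":
        notify(f"Finding {detail.get('type', 'unknown')} needs manual review")
        return None

    iam.delete_access_key(UserName=user, AccessKeyId=key_id)
    notify(f"Contained: deleted key {key_id} for user {user}")
    return key_id
```

Keeping the handler free of direct SDK imports is a deliberate choice in this sketch: the same function can be run against a fake client during quarterly drills to confirm the automation still behaves as the runbook expects.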

The next lesson brings all ten topics together into a practical reference architecture for a complete secrets management system — from developer workstation to production Kubernetes cluster — with every decision point justified by the threat model built across this tutorial.