Rotation & Incident Response for Secrets
Rotation is the practice of replacing a live credential with a new one before an attacker can use the old one. Incident response is what you do when rotation was not fast enough, or never happened at all. At big-tech scale, both are automated, practiced, and treated with the same rigor as database failovers. This lesson covers the full lifecycle: proactive rotation strategies, detection of leaked secrets, revocation mechanics, and the runbook you need when something has already gone wrong.
Why Rotation Is Not Optional
Every long-lived credential is a liability that compounds over time. The longer a secret is valid, the more copies of it accumulate across systems, scripts, caches, and engineer laptops. Rotation resets that window. Beyond risk reduction, rotation is also how you verify your entire secrets pipeline actually works: if you cannot rotate a credential in production without downtime, your secrets architecture is broken — and you want to discover that during a planned drill, not during a breach response at 3 AM.
Rotation Strategies
There is no single rotation strategy that fits all credential types. Match the strategy to the credential lifecycle.
1. TTL-Based Rotation (Dynamic Secrets)
The cleanest model: credentials are generated on-demand and expire automatically. Vault dynamic secrets and AWS IAM roles with short-session STS tokens implement this. There is nothing to rotate because the credential was never meant to last. A database secret with a 1-hour TTL is usable by an attacker for at most 1 hour, even if it leaks the moment it is issued.
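To make that exposure bound concrete, here is a minimal sketch of TTL-based validity. The `Lease` class and its field names are illustrative assumptions, not Vault's or AWS's data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Lease:
    """Hypothetical record of a dynamically issued credential."""
    issued_at: datetime
    ttl: timedelta

    @property
    def expires_at(self) -> datetime:
        return self.issued_at + self.ttl

    def is_valid(self, now: datetime) -> bool:
        # Valid from issuance up to, but not including, the expiry instant.
        return self.issued_at <= now < self.expires_at

    def max_exposure(self, leaked_at: datetime) -> timedelta:
        # Worst-case time an attacker could use the credential after a leak.
        return max(self.expires_at - leaked_at, timedelta(0))


issued = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
lease = Lease(issued_at=issued, ttl=timedelta(hours=1))
print(lease.max_exposure(leaked_at=issued))  # leak at issuance: 1:00:00
```

The later the leak happens within the lease, the smaller the remaining window, which is why shortening the TTL directly shrinks the worst case.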
2. Scheduled Rotation (Static Secrets)
For credentials that cannot be made dynamic (legacy database passwords, third-party API keys, signing keys), schedule automatic rotation on a calendar. A common cadence is 90 days for API keys and 30 days for database passwords, though the right interval depends on your compliance requirements and threat model. AWS Secrets Manager and HashiCorp Vault both support rotation schedules with zero-downtime swap logic built in.
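A scheduled-rotation check can be sketched as a simple age comparison. The inventory format and the names `MAX_AGE`, `rotation_due`, and `overdue` are assumptions made for illustration; a real setup would read this metadata from the secrets manager.

```python
from datetime import date, timedelta

# Example cadences taken from the text above; tune them to your own policy.
MAX_AGE = {
    "api_key": timedelta(days=90),
    "db_password": timedelta(days=30),
}


def rotation_due(kind: str, last_rotated: date, today: date) -> bool:
    """Return True when a static secret has reached its maximum age."""
    return today - last_rotated >= MAX_AGE[kind]


def overdue(inventory: list[dict], today: date) -> list[str]:
    """Names of secrets in a (hypothetical) inventory that need rotation."""
    return [
        s["name"] for s in inventory
        if rotation_due(s["kind"], s["last_rotated"], today)
    ]


inventory = [
    {"name": "billing-db", "kind": "db_password", "last_rotated": date(2024, 1, 1)},
    {"name": "maps-api", "kind": "api_key", "last_rotated": date(2024, 1, 1)},
]
print(overdue(inventory, today=date(2024, 2, 15)))  # ['billing-db']
```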
Zero-Downtime Rotation: The Four-Step Pattern
For any stateful credential (database password, API key), a naive "change the password, update the app" sequence creates a gap that causes dropped requests. The correct four-step pattern eliminates this gap:
- Add new credential — create the new password alongside the existing one; both are valid simultaneously at the source.
- Distribute new credential — push the new secret to the secrets manager and let consumers pick it up via lease renewal. Wait until all consumers confirm they are using the new credential.
- Revoke old credential — once no consumer holds the old credential, invalidate it at the source (database, identity provider, issuing API).
- Verify and audit — check audit logs for any access using the old credential during a 15-minute grace window. Alert on any hit; that access is a post-revocation anomaly.
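The four steps above can be sketched against an in-memory stand-in. `FakeSource` and `rotate` are hypothetical names; a real source would be a database or issuing API, distribution would be asynchronous, and verification would also include the audit-log check described in the last step.

```python
import secrets


class FakeSource:
    """Stand-in for a database or API that accepts several credentials at once."""

    def __init__(self, initial: str) -> None:
        self.valid = {initial}

    def add(self, cred: str) -> None:
        self.valid.add(cred)

    def revoke(self, cred: str) -> None:
        self.valid.discard(cred)

    def authenticate(self, cred: str) -> bool:
        return cred in self.valid


def rotate(source: FakeSource, consumers: dict[str, str], old: str) -> str:
    new = secrets.token_urlsafe(24)
    source.add(new)                  # 1. add: old and new are both valid
    for name in consumers:           # 2. distribute to every consumer
        consumers[name] = new
    stragglers = [n for n, c in consumers.items() if c == old]
    if stragglers:                   # never revoke while someone still uses old
        raise RuntimeError(f"still on old credential: {stragglers}")
    source.revoke(old)               # 3. revoke only after full cutover
    if source.authenticate(old) or not source.authenticate(new):
        raise RuntimeError("rotation verification failed")  # 4. verify
    return new


source = FakeSource("old-password")
consumers = {"api": "old-password", "worker": "old-password"}
new = rotate(source, consumers, old="old-password")
print(all(source.authenticate(c) for c in consumers.values()))  # True
```

Because the old credential stays valid until every consumer has switched, no request is rejected at any point in the sequence.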
Detecting Leaked Secrets
Rotation on a schedule is proactive. Detection is reactive: it closes the gap between when a secret leaks and when you respond. At large companies, secret detection runs in three layers simultaneously: pre-commit (block before the credential reaches a remote), repository scanning (catch what slipped through), and behavioral anomaly detection (notice when a credential is being used from an unexpected location or at an unexpected rate).
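As an illustration of the pre-commit layer, the sketch below scans text for AWS access key IDs. It covers only one credential format and is not a substitute for a dedicated scanner; the function name `find_aws_key_ids` is an assumption, and the sample key is AWS's documented example value.

```python
import re

# Long-term IAM user access key IDs start with "AKIA" followed by
# 16 uppercase letters or digits.
AWS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")


def find_aws_key_ids(text: str) -> list[tuple[int, str]]:
    """Return (line_number, key_id) pairs for every match in the text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in AWS_KEY_ID.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits


staged = 'region = "us-east-1"\naws_key = "AKIAIOSFODNN7EXAMPLE"\n'
print(find_aws_key_ids(staged))  # [(2, 'AKIAIOSFODNN7EXAMPLE')]
```

A pre-commit hook built on this would exit non-zero when the list is non-empty, which blocks the commit before the credential reaches a remote.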
Revocation Mechanics
Revocation is different from rotation: rotation replaces a credential, revocation kills it with no replacement. You revoke when you have confirmed or strongly suspect a breach. Every credential type has its own revocation path — know all of them before you need them.
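- AWS IAM key: aws iam delete-access-key --access-key-id AKIA... takes effect immediately, with no grace period.
- AWS IAM role (via STS): Attach an inline deny policy to the role; STS tokens remain valid until their expiry, so the deny policy is the only way to kill active sessions before TTL.
- Vault lease: vault lease revoke <lease_id>, or revoke all leases under a path with vault lease revoke -prefix auth/token/create.
- Kubernetes Secret: kubectl delete secret <name>, then rolling-restart pods that had it mounted so they do not serve from their in-memory copy. Deleting the Secret only removes the cluster's copy; the credential it held must still be revoked at its source.
- TLS certificate: Revoke it with the issuing CA, which publishes the revocation through its CRL and OCSP responder. Client enforcement is inconsistent (many browsers soft-fail or rely on vendor-pushed revocation lists), so also replace the certificate and its private key on the server.
- GitHub PAT / OAuth token: Revoke an OAuth app token via the GitHub API: DELETE /applications/{client_id}/token, authenticated with the app's client ID and secret. A personal access token is deleted by its owner under Settings > Developer settings.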
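For the IAM role case, the deny policy typically keys on when the session token was issued, so that new sessions keep working while older ones are cut off. The sketch below builds such a policy document; the function name is an assumption, and attaching the result (for example with boto3's `put_role_policy`) is left as a comment because it needs real credentials.

```python
import json
from datetime import datetime, timezone


def revoke_older_sessions_policy(cutoff: datetime) -> str:
    """Build a deny-all policy for sessions whose token was issued before cutoff."""
    if cutoff.tzinfo is None:
        raise ValueError("cutoff must be timezone-aware")
    issued_before = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            # Only sessions issued before the cutoff are denied.
            "Condition": {"DateLessThan": {"aws:TokenIssueTime": issued_before}},
        }],
    }
    return json.dumps(policy, indent=2)


document = revoke_older_sessions_policy(
    datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
)
print(document)
# To apply (role and policy names are placeholders):
# iam.put_role_policy(RoleName="...", PolicyName="...", PolicyDocument=document)
```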
Incident Response Runbook: Suspected Leaked Secret
When a secret is suspected leaked (via a GitHub alert, an engineer report, or anomaly detection), execute this runbook in order. Speed matters: every minute the credential is live is a minute the attacker can use it.
- T+0: Contain. Revoke the credential immediately at its source. Do not wait to confirm the leak: false positives cost minutes, true positives cost the company. In AWS: delete the IAM key AND apply a deny policy to the role. In Vault: vault lease revoke -prefix <path>.
- T+5: Assess blast radius. Pull the last 24 hours of audit logs for the compromised credential. What was accessed? Was data read or exfiltrated? Were any new credentials or resources created by the attacker? In AWS: filter CloudTrail by the compromised key's userIdentity.accessKeyId.
- T+15: Rotate adjacent secrets. Any secret that shared the same source, vault policy, or AWS account boundary should be rotated immediately. An attacker with one credential often uses it to enumerate or assume access to related credentials.
- T+30: Notify and document. If PII or payment data was in the blast radius, legal and compliance must be notified for breach notification timelines (GDPR: 72 hours; PCI: immediately to card brands). Start a timestamped incident document now, not after recovery.
- T+Recovery: Rotate all long-lived credentials. After the immediate fire is out, schedule rotation for every long-lived static credential in the same environment. A breach is proof that your rotation cadence was too slow.
- T+1 week: Post-mortem and systemic fix. Write a blameless post-mortem. The root cause is almost never "engineer made a mistake." It is "the system allowed a long-lived credential to exist in a place where it could be leaked." Fix the system: move to dynamic secrets, shorten TTLs, add detection gates.
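The blast-radius step of the runbook above can be sketched as a filter over parsed CloudTrail records. The records here are made-up samples that use CloudTrail's field names (`userIdentity.accessKeyId`, `eventSource`, `eventName`); the helper names and the short `CREDENTIAL_EVENTS` list are assumptions and far from exhaustive.

```python
def events_for_key(records: list[dict], access_key_id: str) -> list[dict]:
    """Keep only the events signed with the compromised access key."""
    return [
        r for r in records
        if r.get("userIdentity", {}).get("accessKeyId") == access_key_id
    ]


# IAM calls that mint new credentials or widen access (illustrative subset).
CREDENTIAL_EVENTS = {"CreateAccessKey", "CreateUser", "CreateLoginProfile",
                     "AttachUserPolicy"}


def created_credentials(events: list[dict]) -> list[str]:
    """Event names suggesting the attacker created persistence."""
    return [e["eventName"] for e in events if e.get("eventName") in CREDENTIAL_EVENTS]


records = [
    {"eventSource": "s3.amazonaws.com", "eventName": "GetObject",
     "userIdentity": {"accessKeyId": "AKIAIOSFODNN7EXAMPLE"}},
    {"eventSource": "iam.amazonaws.com", "eventName": "CreateAccessKey",
     "userIdentity": {"accessKeyId": "AKIAIOSFODNN7EXAMPLE"}},
    {"eventSource": "s3.amazonaws.com", "eventName": "ListBuckets",
     "userIdentity": {"accessKeyId": "AKIAI44QH8DHBEXAMPLE"}},
]
hits = events_for_key(records, "AKIAIOSFODNN7EXAMPLE")
print([e["eventName"] for e in hits])  # ['GetObject', 'CreateAccessKey']
print(created_credentials(hits))       # ['CreateAccessKey']
```

Any hit in the second list means containment is not finished: the newly created credential has to be revoked as well.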
Automating the Runbook with Vault Sentinel and AWS GuardDuty
Manual runbooks drift and are forgotten under pressure. The production-grade approach is to automate the first two steps of the runbook so that detection triggers revocation without human latency. Vault Enterprise's Sentinel policies can deny requests that match anomaly conditions, and a companion automation can revoke the affected leases. On AWS, a detection that an IAM key is being used from a Tor exit node or from a country outside your baseline (the kind of finding GuardDuty produces; AWS Config rules evaluate resource configuration, not where a key is used from) can invoke a Lambda that immediately deletes the key and posts to your incident Slack channel. In that setup, mean time to contain can drop from roughly 45 minutes to under 60 seconds.
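A minimal sketch of the automated containment described above follows. The event shape (`access_key_id`, `user_name`) is a hypothetical, simplified payload, not the format of any AWS service; `contain_key` and `FakeIam` are illustrative names. The IAM client is injected so the logic can be exercised without AWS access; in a real Lambda it would be a boto3 IAM client, whose `delete_access_key` takes the same keyword arguments.

```python
from typing import Callable


def contain_key(event: dict, iam, notify: Callable[[str], None]) -> str:
    """Delete the flagged access key, then tell the incident channel."""
    key_id = event["access_key_id"]   # hypothetical field name
    user = event["user_name"]         # hypothetical field name
    iam.delete_access_key(UserName=user, AccessKeyId=key_id)
    # Never echo the full key ID into chat; the last four characters suffice.
    message = f"Deleted access key ending {key_id[-4:]} for user {user}"
    notify(message)
    return message


class FakeIam:
    """Test double that records calls instead of contacting AWS."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, str]] = []

    def delete_access_key(self, *, UserName: str, AccessKeyId: str) -> None:
        self.calls.append((UserName, AccessKeyId))


posted: list[str] = []
fake = FakeIam()
contain_key(
    {"access_key_id": "AKIAIOSFODNN7EXAMPLE", "user_name": "ci-deployer"},
    iam=fake,
    notify=posted.append,
)
print(posted[0])  # Deleted access key ending MPLE for user ci-deployer
```

Deleting before notifying keeps the containment on the critical path; a failed chat post should not delay revocation.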
The next lesson brings all ten topics together into a practical reference architecture for a complete secrets management system — from developer workstation to production Kubernetes cluster — with every decision point justified by the threat model built across this tutorial.