Site Reliability Engineering (SRE)

Toil & Automation

Google's SRE book introduced a word that immediately resonated with every operations engineer who read it: toil. Not because it was new — every ops team had been drowning in it for years — but because naming it gave teams permission to treat it as a problem worth solving systematically. This lesson defines toil precisely, explains why Google's 50% rule exists and how it is enforced, and shows the practical automation patterns that top-tier SRE teams use to eliminate it at the source.

Defining Toil: What It Is and What It Is Not

Toil is not simply "work I dislike." It has a precise, operational definition that distinguishes it from valuable engineering work. Toil is work that is:

Manual: requires a human to execute each time — no automation runs it.
Repetitive: the same sequence of actions is performed again and again across time or incidents.
Automatable: a machine could perform the task if someone wrote the code.
Reactive and interrupt-driven: triggered by a ticket, a page, or a user request rather than scheduled engineering work.
Tactical, not strategic: it produces no enduring value. When you are done, the service is in the same state it was in before the task came up, and the task will come up again. That is why the on-call runbook exists.
Scaling linearly with service growth: as request volume or fleet size doubles, the volume of this work also doubles, unless eliminated.
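
One way to make this definition usable in a ticket review is a simple checklist. The sketch below is illustrative only: the `Task` record, its field names, and the strict "all six properties must hold" rule are assumptions made for this example, not part of Google's definition. Many teams would flag a task once most properties hold.

```python
from dataclasses import dataclass

# Hypothetical property names, one per item in the list above.
PROPERTIES = (
    "manual",
    "repetitive",
    "automatable",
    "interrupt_driven",
    "no_enduring_value",
    "scales_with_service",
)


@dataclass(frozen=True)
class Task:
    """Hypothetical review record: one boolean per toil property."""
    name: str
    manual: bool
    repetitive: bool
    automatable: bool
    interrupt_driven: bool
    no_enduring_value: bool
    scales_with_service: bool


def toil_score(task: Task) -> int:
    """Count how many of the six toil properties the task satisfies."""
    return sum(bool(getattr(task, prop)) for prop in PROPERTIES)


def is_toil(task: Task) -> bool:
    """Strict rule assumed for this sketch: every property must hold."""
    return toil_score(task) == len(PROPERTIES)


restart = Task("restart OOM-killed pod", True, True, True, True, True, True)
debug = Task("debug novel outage", True, False, False, True, False, False)

print(restart.name, toil_score(restart), is_toil(restart))
print(debug.name, toil_score(debug), is_toil(debug))
```

The score is often more useful than the yes/no answer: a task that scores 4 or 5 is usually worth a second look even if it fails the strict rule.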

Examples of toil in a mature production environment: manually restarting pods that OOM-kill, rotating credentials by hand via copy-paste into a secrets manager, provisioning new database read replicas through a UI every time a service team requests one, responding to Slack pings asking "can you check why service X is slow?", manually trimming a disk that fills every three days on a known schedule, and reviewing a certificate expiry spreadsheet every Monday.

Work that is not toil, even though it feels painful: debugging a novel production failure (non-repetitive, requires human judgment), designing a new alerting schema (produces enduring value), writing the automation that eliminates a toil task (engineering work that pays back every time the task would have recurred), and performing a production readiness review (one-time strategic work).

The cognitive cost of toil is underestimated. Toil is not just about the minutes spent restarting a pod. It is about the context switch: the engineer was in the middle of deep design work, a page fires, they spend 12 minutes on a manual task, and then need 20 minutes to recover concentration. At scale, an SRE team fielding 15 manual tasks per day is not doing 15 × 12 minutes of lost work — they are losing most of their creative engineering capacity to interrupt recovery.
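
The arithmetic behind that claim can be worked through directly. This is a minimal sketch using the figures from the paragraph above (15 tasks, 12 minutes each, 20 minutes of recovery); it assumes every interrupt pays the full recovery cost, which ignores back-to-back interrupts that share one recovery period.

```python
def daily_interrupt_cost_minutes(
    tasks_per_day: int, task_minutes: float, recovery_minutes: float
) -> float:
    """Total minutes lost per day: hands-on time plus concentration recovery."""
    return tasks_per_day * (task_minutes + recovery_minutes)


hands_on_minutes = 15 * 12
total_minutes = daily_interrupt_cost_minutes(15, 12, 20)

print(f"hands-on only: {hands_on_minutes} min")          # 180 min
print(f"with recovery: {total_minutes} min")             # 480 min
print(f"with recovery: {total_minutes / 60:.1f} hours")  # 8.0 hours
```

Counting only hands-on time gives 3 hours per day across the team. Including recovery gives 8 hours, a full engineer-day lost every day.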

The 50% Rule: Engineering Time Belongs to Engineering

Google's SRE model mandates that no SRE should spend more than 50% of their working time on toil. The remaining 50% must be spent on project work: automation, reliability improvements, tooling, and capacity planning — work that permanently reduces future toil or improves service health.

This is not a soft aspiration. It is an operational contract between SRE teams and the product teams they support. If an SRE team consistently exceeds the 50% toil cap:

The SRE manager escalates to product engineering leadership — the product team owns the fix.
The SRE team temporarily stops accepting new services into their portfolio until the root cause is addressed.
In extreme cases, the SRE team "returns" a service to the product team to operate themselves (the "hand-back" mechanism).

The 50% rule exists because of a mathematical reality: if SRE toil grows linearly with service growth and SREs spend 100% of their time on it, you need to hire an SRE for every N servers or requests. This is the Ops model Google was trying to escape. The 50% cap forces automation investment before a service's toil overwhelms the team.
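
A toy model makes the scaling argument concrete. Every number here is an assumption chosen for illustration: a 40-hour week, a constant 4 hours of toil per service per week, and automation that halves per-service toil.

```python
import math


def sres_needed(
    services: int,
    toil_hours_per_service: float,
    toil_cap: float = 0.5,
    week_hours: float = 40.0,
) -> int:
    """SREs required so that nobody exceeds the toil cap (toy linear model)."""
    toil_budget_per_sre = week_hours * toil_cap
    return math.ceil(services * toil_hours_per_service / toil_budget_per_sre)


# Without automation, headcount doubles every time the service count doubles.
for services in (10, 20, 40):
    print(services, "services ->", sres_needed(services, 4.0), "SREs")

# Halving per-service toil through automation absorbs one doubling of services.
print(40, "services, automated ->", sres_needed(40, 2.0), "SREs")
```

In this model, 40 services need 8 SREs at 4 hours of toil each, but only 4 SREs once automation brings that down to 2 hours. The cap matters because it reserves the time in which that automation gets written.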

When toil exceeds 50% of SRE time, the hours needed for automation and reliability work disappear, so the toil that work would have removed keeps growing with the service. Staying under the cap is a structural requirement, not a preference.

Measuring Toil: You Cannot Reduce What You Cannot Count

Before automating anything, quantify toil. SRE teams track toil systematically. At minimum, instrument your on-call rotation to capture three data points per ticket/page: the category (type of task), the time spent (including context-switch cost), and whether it is automatable. Most teams do this with a label in their ticketing system and a weekly review:

# Sample toil audit query. If using PagerDuty, pull incidents via the REST API
# and count them by service + title for a one-month window.
# Note: limit=100 returns only the first page; use the "more" and "offset"
# pagination fields to fetch the rest.

curl -s \
  --header "Authorization: Token token=YOUR_PD_TOKEN" \
  --header "Accept: application/vnd.pagerduty+json;version=2" \
  "https://api.pagerduty.com/incidents?since=2025-05-01&until=2025-06-01&limit=100" \
  | jq '[.incidents[] | {title: .title, service: .service.summary}]
        | group_by([.service, .title])
        | map({service: .[0].service, title: .[0].title, count: length})
        | sort_by(-.count) | .[0:15]'

# For Prometheus-instrumented on-call tooling, track cumulative toil hours with a counter:
# oncall_toil_hours_total{team="sre-platform", category="manual-restart", automatable="true"} 8.5
# Alerting: if toil_ratio (toil_hours / total_hours) > 0.5 over a 2-week window, page the SRE manager.
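
The weekly review itself can be a few lines of code once the three data points are captured. The following is a minimal sketch, assuming toil entries have already been exported from the ticketing system; the `ToilEntry` record, the category names, and the 40-hour total are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ToilEntry:
    """One ticket or page: the three data points named above."""
    category: str
    hours: float        # time spent, including context-switch cost
    automatable: bool


def summarize(entries, total_hours: float) -> dict:
    """Aggregate a week of entries into a toil ratio and a ranked category list."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    by_category = defaultdict(float)
    automatable_hours = 0.0
    for entry in entries:
        by_category[entry.category] += entry.hours
        if entry.automatable:
            automatable_hours += entry.hours
    toil_hours = sum(by_category.values())
    return {
        "toil_ratio": toil_hours / total_hours,
        "automatable_hours": automatable_hours,
        "top_categories": sorted(
            by_category.items(), key=lambda kv: kv[1], reverse=True
        ),
    }


week = [
    ToilEntry("manual-restart", 8.5, True),
    ToilEntry("replica-provisioning", 1.5, True),
    ToilEntry("slack-triage", 2.0, False),
]
report = summarize(week, total_hours=40.0)
print(report)
```

The ranked category list is what goes into the weekly report: it points at the automation project with the largest payoff.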

The toil audit is a forcing function. Once a team publishes a weekly toil report, two things happen automatically. First, product engineering teams who generate the most toil become visible to their own leadership. Second, SRE engineers who are most affected by specific toil items have data to justify automation projects in quarterly planning. Toil that is invisible is toil that is never eliminated.

Automating Toil Away: Patterns and Tools

The goal is not to automate for automation's sake — it is to eliminate the human decision loop from tasks where the right action is deterministic. There are four canonical patterns used at big-tech companies:

1. Runbook-to-Bot Conversion. An on-call runbook that says "if service X OOM-kills, restart the pod by hand" is a bot waiting to be written. The progression has four stages: (1) runbook, fully manual → (2) script invoked manually by on-call → (3) script triggered automatically by the alert → (4) no alert at all because the condition is self-healing. Most teams stop at stage 2. SRE teams push to stage 3 or 4.

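2. Self-Healing Kubernetes Controllers. In Kubernetes environments, the correct place to encode recovery logic is a custom controller or an operator, not an on-call runbook. If a pod consistently OOM-kills at a predictable memory footprint, the right fix is a VPA (Vertical Pod Autoscaler) recommendation applied automatically, not a weekly manual increase. If a node becomes unhealthy, a node auto-repair mechanism (for example the auto-repair feature of a managed node pool, or node-problem-detector feeding a remediation controller) should drain and replace it without human involvement. Cluster Autoscaler is not that mechanism: its job is sizing the pool to pending pods and utilization, not repairing degraded nodes.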

3. Event-Driven Remediation with Lambda/Cloud Functions. CloudWatch Alarm → EventBridge rule → Lambda function that calls the AWS API to take corrective action. This pattern is used heavily for RDS failovers, EC2 instance recovery, and ECS task restarts. Because the function acts through the AWS API, CloudTrail records each call it makes, so a baseline audit trail exists without extra work.

4. Policy Engines as Toil Eliminators. A surprising amount of toil comes from humans manually enforcing policies: "make sure all S3 buckets have versioning enabled," "ensure all new Lambda functions have a DLQ." These are deterministic rules. Push them into a policy engine (AWS Config Rules + auto-remediation, OPA/Gatekeeper for Kubernetes, HashiCorp Sentinel for Terraform) and the enforcement becomes continuous and automatic.

# Pattern: automated recovery, from built-in Kubernetes restarts to cloud auto-remediation
# (Better: use a proper operator; this shows the concept simply)

# 1. Kubernetes liveness probe — kubelet restarts the container automatically:
# In your Deployment spec:
#   livenessProbe:
#     httpGet:
#       path: /healthz
#       port: 8080
#     initialDelaySeconds: 15
#     periodSeconds: 10
#     failureThreshold: 3

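# 2. AWS Lambda auto-remediation via EventBridge (Python sketch):
# Trigger: CloudWatch Alarm "ECS task count < desired"
# Assumption: the EventBridge rule uses an input transformer to place clusterArn
# and serviceArn under "detail". A raw alarm state-change event does not carry them.
import boto3

# Create the client once per execution environment, not on every invocation.
ecs = boto3.client('ecs')

def handler(event, context):
    cluster = event['detail']['clusterArn']
    service = event['detail']['serviceArn']
    # Force a new deployment to replace unhealthy tasks
    ecs.update_service(
        cluster=cluster,
        service=service,
        forceNewDeployment=True
    )
    # The print lands in CloudWatch Logs; return a summary for the invocation record.
    print(f"Forced redeployment: {service}")
    return {"cluster": cluster, "service": service, "action": "forceNewDeployment"}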

# 3. AWS Config auto-remediation — enforce S3 bucket versioning:
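# (The role ARN below is a placeholder; automatic remediation needs a role that
#  SSM Automation can assume to run the document.)
# aws configservice put-remediation-configurations \
#   --remediation-configurations '[{
#     "ConfigRuleName": "s3-bucket-versioning-enabled",
#     "TargetType": "SSM_DOCUMENT",
#     "TargetId": "AWS-ConfigureS3BucketVersioning",
#     "Parameters": {
#       "AutomationAssumeRole": {"StaticValue": {"Values": ["arn:aws:iam::ACCOUNT_ID:role/YOUR_REMEDIATION_ROLE"]}},
#       "BucketName": {"ResourceValue": {"Value": "RESOURCE_ID"}},
#       "VersioningState": {"StaticValue": {"Values": ["Enabled"]}}
#     },
#     "Automatic": true,
#     "MaximumAutomaticAttempts": 3,
#     "RetryAttemptSeconds": 60
#   }]'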
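
The idea behind pattern 4 is that a policy is a pure function over resource state, so it can be evaluated continuously and its violations fixed mechanically. The sketch below shows that shape in plain Python. It is not how AWS Config or OPA is implemented: the `Bucket` record and the `enable_versioning` callback are hypothetical stand-ins for real resource inventory and a real API call.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Bucket:
    """Hypothetical inventory record for one storage bucket."""
    name: str
    versioning_enabled: bool


def evaluate(buckets: Iterable[Bucket]) -> List[str]:
    """Policy check: return the names of buckets that violate the rule."""
    return [b.name for b in buckets if not b.versioning_enabled]


def remediate(
    buckets: Iterable[Bucket], enable_versioning: Callable[[str], None]
) -> List[str]:
    """Apply the fix to every violation and report what was changed."""
    fixed = []
    for name in evaluate(buckets):
        enable_versioning(name)  # stand-in for the real remediation call
        fixed.append(name)
    return fixed


inventory = [
    Bucket("logs-archive", True),
    Bucket("user-uploads", False),
    Bucket("build-artifacts", False),
]
calls: List[str] = []
fixed = remediate(inventory, calls.append)
print("remediated:", fixed)
```

Keeping `evaluate` separate from `remediate` lets the same rule run in report-only mode first, which is how most policy engines are rolled out.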

The Automation Trap: When Automation Creates New Toil

Automation is not free. Badly designed automation introduces its own class of toil: the automation breaks in unexpected ways, fires at the wrong time, requires manual intervention to reset, generates noisy false-positive alerts, and becomes a system that itself needs to be operated. Big-tech SREs guard against this with three practices:

Automation must have a dry-run mode. Every remediation bot should support a --dry-run flag that logs what it would do without taking action. This allows safe testing before enabling live execution. A bot that ships without a dry-run path gets its first real test against production.
Automation must be idempotent. If the remediation fires three times in rapid succession because of a flapping alert, the end state should be the same as if it had fired once. Non-idempotent automation (e.g., appending to a config file on each run) is worse than the toil it replaced.
Every automated action must be logged and auditable. When an automated system touches production, the audit trail must be at least as complete as it would be for a human. Log the action, the trigger, the timestamp, and the outcome to a structured log sink. On AWS, CloudTrail additionally records the underlying API calls.
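
The three practices fit together in a small amount of code. Here is a minimal sketch, assuming a hypothetical remediation that makes sure a setting is present in a config (modeled as a list of lines). The function name, the audit record fields, and the in-memory audit list are all illustrative; a real bot would write to a durable log sink.

```python
import json
import time
from typing import List


def ensure_line(
    lines: List[str],
    wanted: str,
    *,
    dry_run: bool,
    audit_log: List[str],
    trigger: str,
) -> str:
    """Ensure `wanted` is present in `lines` exactly once.

    Idempotent: checks for the line before acting instead of blindly appending.
    Dry-run: reports what would change without mutating `lines`.
    Audited: every call appends one structured record, including no-ops.
    """
    if wanted in lines:
        outcome = "noop"
    elif dry_run:
        outcome = "would_add"
    else:
        lines.append(wanted)
        outcome = "added"
    audit_log.append(json.dumps({
        "ts": time.time(),
        "trigger": trigger,
        "action": "ensure_line",
        "target": wanted,
        "dry_run": dry_run,
        "outcome": outcome,
    }, sort_keys=True))
    return outcome


config = ["max_conns=100"]
audit: List[str] = []

# Dry run first: nothing changes, but the intent is logged.
preview = ensure_line(config, "log_rotate=daily", dry_run=True,
                      audit_log=audit, trigger="manual-test")

# A flapping alert fires three times; the end state matches a single firing.
results = [
    ensure_line(config, "log_rotate=daily", dry_run=False,
                audit_log=audit, trigger="disk-alert")
    for _ in range(3)
]
print(preview, results, config)
```

The naive version of this bot would call `lines.append(wanted)` unconditionally, which is exactly the append-on-each-run failure described above.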

Do not automate a broken process. A runbook that says "when disk fills, delete the oldest log files" is a symptom of a misconfigured log rotation. Automating that runbook does not fix the underlying problem — it just makes the problem invisible until it bites you in a more complex way. Before automating any toil task, ask: should this task exist at all? Can the root cause be fixed instead? Eliminating toil by fixing the root cause is always better than eliminating toil by automating the workaround.

Measuring Toil Reduction: Closing the Loop

Automation investments need to be justified and validated. After deploying an automation, measure: how many manual interventions per week did this eliminate? What is the estimated engineering-hours saved per quarter? Did the toil-to-engineering ratio improve? These numbers go into the SRE team's quarterly review and justify further investment.
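
Those three questions reduce to simple arithmetic. The sketch below uses assumed inputs (15 interventions per week at 12 minutes each, a 13-week quarter, and a team at exactly 20 toil hours out of 40 before the automation shipped); substitute figures from your own toil audit.

```python
def hours_saved_per_quarter(
    interventions_per_week: float,
    minutes_per_intervention: float,
    weeks_per_quarter: int = 13,
) -> float:
    """Engineering hours returned to the team per quarter by one automation."""
    return interventions_per_week * minutes_per_intervention * weeks_per_quarter / 60


def toil_ratio(toil_hours: float, total_hours: float) -> float:
    """Fraction of working time spent on toil."""
    return toil_hours / total_hours


saved = hours_saved_per_quarter(15, 12)      # 39.0 hours per quarter
saved_per_week = saved / 13                  # 3.0 hours per week

before = toil_ratio(20.0, 40.0)                    # 0.5, right at the cap
after = toil_ratio(20.0 - saved_per_week, 40.0)    # 0.425

print(f"saved per quarter: {saved:.1f} h")
print(f"toil ratio: {before:.3f} -> {after:.3f}")
```

This counts hands-on minutes only. Adding the context-switch recovery time avoided per intervention would raise the estimate.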

A practical Prometheus metric pattern for tracking automation efficacy:

# Prometheus metrics to track toil reduction over time.
# Expose these from your remediation bots / on-call tooling.

# Counter: how many times the automation took action (replacing a manual task)
# oncall_automation_actions_total{action="pod_restart", service="payments", result="success"} 47
# oncall_automation_actions_total{action="pod_restart", service="payments", result="failure"} 2

# Gauge: current toil ratio (updated weekly from audit data)
# oncall_toil_ratio{team="sre-platform"} 0.38

# Counter: cumulative engineer-minutes saved by automated actions
# oncall_toil_minutes_saved_total{action="pod_restart"} 564

# Alerting rule (Prometheus 2.x rule-file YAML): alert if toil ratio exceeds 50% over a 2-week window
# groups:
#   - name: toil
#     rules:
#       - alert: ToilCapExceeded
#         expr: avg_over_time(oncall_toil_ratio[14d]) > 0.50
#         for: 0m
#         labels:
#           severity: warning
#         annotations:
#           summary: "SRE team {{ $labels.team }} toil ratio {{ $value | humanizePercentage }} exceeds 50% cap"

The toil-and-automation discipline is the feedback loop that makes an SRE team self-improving. Every incident that results in a manual action is a candidate for automation. Every automation deployed is time returned to the team for reliability engineering. Run this loop consistently and the team's operational burden decreases even as the services under their care grow.