Ansible at Scale
Running a playbook against ten servers feels effortless. Running the same playbook against ten thousand servers — across multiple datacenters, cloud accounts, and network segments, under strict change-window constraints — is a fundamentally different engineering problem. This lesson covers the strategies, tuning knobs, and orchestration tooling that separate a hobby Ansible setup from a production-grade fleet automation platform.
Understanding Forks: Parallelism in Ansible
By default, Ansible processes only 5 hosts in parallel (the forks setting). That default is deliberately conservative and is wrong for most production fleets. Forks control how many SSH connections the control node opens simultaneously.
Three settings do the most work at scale:
- forks: raise to 50-200 for cloud fleets. The practical ceiling is the control node's open-file limit (ulimit -n) and available memory (~10 MB per fork).
- pipelining = True: bundles the module upload and execution into one SSH call instead of three. On a 500-host playbook this can cut total run time by 30-40%.
- fact_caching: avoids re-running the gather_facts task on every play. With Redis, facts survive across playbook runs for fact_caching_timeout seconds.
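The three settings above could be combined in ansible.cfg roughly as follows. This is a sketch under stated assumptions: the fork count, the Redis address, and the 24-hour timeout are illustrative values, and the Redis cache plugin additionally needs the redis Python package on the control node.

```ini
# ansible.cfg (illustrative values, tune for your own control node)
[defaults]
# Number of hosts processed in parallel; keep below what ulimit -n and RAM allow
forks = 100
# "smart" gathering skips fact collection for hosts that already have cached facts
gathering = smart
# Store gathered facts in Redis so they survive across playbook runs
fact_caching = redis
# Assumed Redis location, written as host:port:db
fact_caching_connection = localhost:6379:0
# Cached facts expire after 24 hours (value is in seconds)
fact_caching_timeout = 86400

[ssh_connection]
# Send the module over the existing SSH session instead of copying it first
pipelining = True
```

Pipelining with privilege escalation only works if requiretty is disabled in sudoers on the managed hosts, so it is worth checking that before enabling it fleet-wide.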
Execution Strategies
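Ansible ships several built-in execution strategies that change how tasks are distributed across hosts. The default, linear, runs each task on every host in the current batch before starting the next task; free lets each host work through the play at its own pace; debug behaves like linear but drops into an interactive debugger when a task fails.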
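As a minimal sketch of how a strategy combines with batching, the play below uses the default linear strategy together with a graduated serial list and max_fail_percentage: 0. The file name, the webservers group, the batch sizes, and the nginx package are illustrative assumptions, not values from a real rollout.

```yaml
# rolling_update.yml (hypothetical file name)
- name: Rolling update with canary batches
  hosts: webservers            # assumed inventory group
  strategy: linear             # each task completes on the whole batch before the next task starts
  serial:                      # batch sizes; the last entry repeats until all hosts are done
    - 1                        # single canary host
    - "10%"                    # then a tenth of the fleet
    - "50%"                    # then the rest in halves
  max_fail_percentage: 0       # any failure in a batch aborts the play
  become: true
  tasks:
    - name: Ensure the web server package is at the latest version
      ansible.builtin.package:
        name: nginx            # placeholder package name
        state: latest
```

Changing strategy to free would let fast hosts finish without waiting for slow ones, at the cost of less predictable ordering within a batch.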
Set max_fail_percentage: 0 for zero-tolerance rollouts (any failure stops the entire play). Use a graduated serial list to canary-test on a small batch before touching the bulk of the fleet.
AWX and Ansible Automation Platform (Controller)
The command-line ansible-playbook workflow does not scale organisationally. Who ran the last playbook? Against which hosts? With what variables? Did it succeed? Can a non-engineer trigger it safely? These questions are unanswerable without a control plane. That control plane is AWX (the open-source upstream) or Red Hat Ansible Automation Platform / Automation Controller (the enterprise product).
Key capabilities AWX adds over raw CLI:
- RBAC — teams get permissions to specific job templates, not shell access to the control node.
- Credential management — SSH keys, vault passwords, cloud credentials stored encrypted in the AWX database; never exposed to operators running jobs.
- Job templates — a named, versioned combination of playbook + inventory + credentials + extra vars. Anyone with access can launch it; no CLI knowledge required.
- Surveys — web forms that prompt operators for variables (e.g. target environment, version) before launching a job. Safe, auditable variable injection.
- Workflow job templates — directed acyclic graphs (DAGs) of job templates: "run hardening, then deploy, then smoke tests; if smoke tests fail, run rollback."
- Audit log — every job stores its full stdout, the user who launched it, timestamps, and outcome. Essential for compliance (SOC 2, PCI-DSS).
- Scheduling — cron-driven execution for nightly compliance enforcement.
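Job templates can themselves be managed as code. The sketch below uses the job_template module from the awx.awx collection to define one; the organization, project, inventory, credential, and playbook names are hypothetical and must already exist in AWX, and the connection and authentication settings for the AWX API are omitted.

```yaml
# define_job_template.yml (hypothetical file name)
- name: Define a production deploy job template in AWX
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Create or update the job template
      awx.awx.job_template:
        name: Deploy web tier          # hypothetical template name
        organization: Default          # assumed organization
        project: fleet-playbooks       # assumed AWX project holding the playbooks
        inventory: production          # assumed AWX inventory
        playbook: deploy.yml           # assumed playbook path inside the project
        credentials:
          - prod-ssh-key               # assumed machine credential
        job_type: run
        forks: 100                     # per-template parallelism
        ask_limit_on_launch: true      # let operators narrow the target hosts at launch
        state: present
```

Keeping template definitions in version control gives the same review trail for the control plane that the playbooks already have.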
Dynamic Inventory at Scale
Static inventory/hosts.ini files are unmaintainable beyond a few dozen hosts. At scale, inventory must be pulled dynamically from the source of truth — your cloud provider, CMDB, or service registry.
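For example, if the source of truth is AWS, an inventory plugin configuration along these lines could replace the static file. This is a sketch that assumes the amazon.aws collection is installed and that instances carry Environment and Role tags; the regions, tag names, and cache path are illustrative.

```yaml
# inventory/prod.aws_ec2.yml (the file name must end in aws_ec2.yml or aws_ec2.yaml)
plugin: amazon.aws.aws_ec2
regions:                          # assumed regions
  - us-east-1
  - eu-west-1
filters:
  instance-state-name: running    # ignore stopped and terminated instances
  tag:Environment: production     # assumed tagging convention
keyed_groups:
  - key: tags.Role                # builds groups such as role_web, role_db
    prefix: role
  - key: placement.region         # builds groups such as region_us_east_1
    prefix: region
compose:
  ansible_host: private_ip_address   # connect over the private address
cache: true                       # avoid calling the cloud API on every run
cache_plugin: jsonfile
cache_connection: /tmp/aws_inventory_cache   # assumed cache directory
cache_timeout: 3600               # seconds
```

Running ansible-inventory -i inventory/prod.aws_ec2.yml --graph shows the resulting groups without touching any host.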
When Ansible Is the Wrong Tool: Immutable Infrastructure
Ansible excels at configuring mutable servers — machines that live long enough to warrant ongoing management. But the modern cloud trend is immutable infrastructure: bake a machine image once, deploy it, and replace rather than modify it when a change is needed.
Use Ansible for configuration management when:
- You manage long-lived VMs (database servers, legacy on-prem nodes, bare metal).
- Startup time matters and baking a new AMI for every change is too slow.
- The system cannot be replaced without data migration (stateful services).
Prefer immutable images (Packer + Ansible, Docker, AMI-based ASG) when:
- You run stateless application tiers — web servers, API nodes, workers.
- You want identical behaviour in dev, staging, and prod (the image is the artefact).
- Your threat model requires that production hosts have no SSH access whatsoever.
- You already use Kubernetes or ECS — containers are the immutable unit; host config is minimal.
Never run ansible-playbook without a change-management gate in production. An untested playbook can fire against thousands of hosts in seconds. Always test against a staging inventory group first, use --check (dry run) together with --diff to preview changes, and either restrict production job template launch permissions to senior engineers or require a second approval in AWX.
Performance Profiling
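When a large playbook is slow, instrument it before tuning blindly. The profile_tasks and profile_roles callback plugins (part of the ansible.posix collection, which is bundled with the full ansible package) report how long each task and role took, and they add negligible overhead to a run.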
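Enabling the two profiling callbacks could look like this in ansible.cfg. The sketch assumes ansible-core 2.11 or later, where the setting is named callbacks_enabled (older releases call it callback_whitelist); the output limit of 20 is an arbitrary choice.

```ini
# ansible.cfg
[defaults]
# Print per-task and per-role timings at the end of each run
callbacks_enabled = ansible.posix.profile_tasks, ansible.posix.profile_roles

[callback_profile_tasks]
# Show only the 20 slowest tasks, slowest first
task_output_limit = 20
sort_order = descending
```

The summary printed after the play recap usually makes the few dominant tasks obvious, which is where tuning effort pays off first.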
At very large scale, the Mitogen strategy plugin is sometimes adopted for its speed improvement. It replaces the SSH-then-shell-then-Python bootstrap with a persistent Python channel that is reused across tasks, cutting per-task overhead from roughly 200 ms to roughly 5 ms per host. The trade-off is an additional third-party dependency and occasional compatibility issues, both with community modules that make unusual assumptions about their Python environment and with Ansible releases that Mitogen does not yet support. Always test in staging first.
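Switching a control node to Mitogen is a configuration change rather than a playbook change. The sketch below assumes Mitogen has been unpacked under /opt/mitogen, which is a hypothetical path, and that the installed Ansible version is one that the Mitogen release supports.

```ini
# ansible.cfg
[defaults]
# Directory containing Mitogen's strategy plugins (hypothetical install location)
strategy_plugins = /opt/mitogen/ansible_mitogen/plugins/strategy
# Mitogen-backed drop-in for the default linear strategy
strategy = mitogen_linear
```

Because the setting is global, removing these two lines is enough to fall back to the stock linear strategy if a module misbehaves.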