Why Configuration Management?
You have shipped code through Git, built images with Docker, deployed workloads to Kubernetes, and managed cloud resources with Terraform. But every one of those systems assumes that the machines underneath it are in a known, consistent state. In practice they rarely are, and the gap between what you think a server looks like and what it actually looks like is where outages, security incidents, and deployment failures are born.
Configuration management is the discipline — and the tooling — that closes that gap. It answers a deceptively simple question: How do you guarantee that every machine in your fleet is configured exactly as intended, at all times, even as engineers make changes, packages update, and incidents leave behind ad-hoc fixes?
Ansible is one of the most widely adopted tools for operating-system-level configuration. Before you write a single playbook, you need to feel the pain it solves.
Configuration Drift: The Silent Killer
Configuration drift is the gradual divergence of a live system from its intended baseline. It happens continuously, in small increments, and is nearly invisible until it causes a production incident.
Here is a realistic sequence of events on a production fleet with no configuration management:
- Week 1: an engineer hot-fixes a broken service by editing /etc/nginx/nginx.conf directly on the server. The fix is never committed to any repository.
- Week 4: a security team member disables TLS 1.0 by adding a line to /etc/ssl/openssl.cnf on one node to test it, then forgets to apply it to the other eleven nodes in the pool.
- Week 9: a kernel update is applied to nine out of twelve servers during a maintenance window; a network blip causes the remaining three to be missed. The fleet now runs two different kernel versions.
- Week 15: a junior engineer runs pip install --upgrade requests on app-server-07 to debug a library issue. The version bump silently breaks a dependency on that one node.
- Week 16: the service starts throwing 500 errors intermittently. The errors affect only 25% of requests because only 25% of traffic happens to land on the drifted nodes. Debugging takes six hours.
None of those individual changes were malicious. Each one seemed reasonable in isolation. But together they turned a fleet of twelve nominally identical servers into twelve unique organisms — each with its own history, its own quirks, its own failure modes.
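Drift becomes detectable once the intended state is written down. The following is a minimal Python sketch, not how any particular tool works: it compares a few facts per host against a declared baseline and reports every mismatch. The fact names, kernel versions, and twelve-host sample fleet are hypothetical; in practice the facts would be gathered from the hosts themselves (for example by Ansible's fact gathering) rather than hard-coded.

```python
import hashlib


def sha256(text: str) -> str:
    """Return the hex SHA-256 digest of a config file's contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical baseline: the state every node is supposed to be in.
BASELINE = {
    "kernel": "6.1.0-18",
    "nginx_conf_sha256": sha256("worker_processes auto;\n"),
    "requests_version": "2.31.0",
}


def find_drift(
    fleet: dict[str, dict[str, str]], baseline: dict[str, str]
) -> dict[str, dict[str, tuple[str, str]]]:
    """Map each drifted host to {fact: (expected, actual)} for every mismatch."""
    report = {}
    for host, facts in sorted(fleet.items()):
        diffs = {
            key: (expected, facts.get(key, "<missing>"))
            for key, expected in baseline.items()
            if facts.get(key) != expected
        }
        if diffs:
            report[host] = diffs
    return report


if __name__ == "__main__":
    # Twelve nominally identical nodes, then some of the drift from the timeline.
    fleet = {f"app-server-{i:02d}": dict(BASELINE) for i in range(1, 13)}
    fleet["app-server-01"]["nginx_conf_sha256"] = sha256("worker_processes 4;\n")  # hot-fix
    for i in (10, 11, 12):
        fleet[f"app-server-{i:02d}"]["kernel"] = "6.1.0-17"  # missed kernel update
    fleet["app-server-07"]["requests_version"] = "2.32.3"  # ad-hoc pip upgrade

    for host, diffs in find_drift(fleet, BASELINE).items():
        for fact, (expected, actual) in diffs.items():
            print(f"{host}: {fact} expected {expected[:12]}, got {actual[:12]}")
```

Detection is only half the job: a configuration management tool also converges the drifted hosts back to the baseline.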
Snowflake Servers: The Anti-Pattern
A snowflake server is a host that has drifted so far from any documented baseline that it has become irreplaceable. Like an actual snowflake, no two are alike — and like a snowflake, they are fragile.
Symptoms of snowflake servers in production:
- "Only Bob knows how to configure that box." If Bob is unavailable, no one can reproduce what he built.
- Disaster recovery drills fail because the runbook does not reflect reality. You discover discrepancies mid-drill.
- Scaling out is impossible. You cannot clone a node because you do not know its exact state. You spin up a new instance and it behaves differently from existing ones.
- Security audits find unexpected packages, open ports, or modified system files with no change record.
- The deployment pipeline works on eight of twelve nodes and silently fails (or produces wrong results) on the other four.
The antidote to snowflakes is treating servers as cattle, not pets: every node is interchangeable, reproducible, and disposable. If a node misbehaves, you do not SSH in and debug it — you terminate it and let the provisioning system replace it with a known-good one. Configuration management is the toolchain that makes this possible at the OS level.
Push vs Pull: The Two Configuration Models
Configuration management tools are broadly split into two architectural models. Understanding the difference determines which tool fits which environment — and directly informs why Ansible made the architectural choices it did.
Push Model: Ansible's Approach
In a push model, a central control node connects out to managed nodes over SSH (or WinRM for Windows), transfers a small module payload (Python on Linux, PowerShell on Windows), executes it, and collects the results. Managed Linux nodes need no permanently running agent, only Python and an SSH daemon, both of which ship by default on most mainstream Linux server distributions.
Ansible is the canonical push-model tool. When you run a playbook, Ansible:
- Reads your inventory (a list of hosts and groups).
- Opens SSH connections to the targeted hosts in parallel, five hosts at a time by default (controlled by the forks setting).
- Sends compressed Python modules to a temp directory on each host.
- Executes the modules; they make the necessary changes and report back a JSON result.
- Cleans up the temp files and closes the connections.
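The control loop above can be modeled in Python with bounded parallelism. This is a simplified sketch, not Ansible's implementation: run_module is a hypothetical stand-in for copying a module over SSH and executing it, and the FORKS constant mirrors the role of Ansible's forks setting (default 5).

```python
import json
from concurrent.futures import ThreadPoolExecutor

FORKS = 5  # mirrors Ansible's default "forks" value


def run_module(host: str, module: str, args: dict) -> str:
    """Hypothetical stand-in for sending and executing a module over SSH.
    Returns a canned no-op JSON result so the sketch runs without real hosts."""
    return json.dumps({"host": host, "module": module, "args": args, "changed": False})


def push(inventory: list[str], module: str, args: dict, forks: int = FORKS) -> dict[str, dict]:
    """Run one module against every host in the inventory, at most `forks`
    hosts at a time, and parse the JSON result each host reports back."""
    with ThreadPoolExecutor(max_workers=forks) as pool:
        # pool.map keeps results in inventory order.
        raw_results = list(pool.map(lambda host: run_module(host, module, args), inventory))
    # Leaving the with-block waits for and shuts down all workers,
    # loosely analogous to the cleanup step.
    parsed = [json.loads(raw) for raw in raw_results]
    return {result["host"]: result for result in parsed}


if __name__ == "__main__":
    hosts = [f"web-{i:02d}" for i in range(1, 13)]
    results = push(hosts, "ping", {})
    print(f"{len(results)} hosts reported back")  # 12 hosts reported back
```

Note that nothing in this loop runs unless someone calls push: that is the defining trait of the push model.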
The critical implication: Ansible only runs when you invoke it. If an engineer manually changes a file on a node an hour after you ran the playbook, Ansible has no idea. You must schedule playbook runs (via cron, AWX, or Ansible Automation Platform) to periodically re-enforce your desired state — the enforcement is periodic, not continuous.
Pull Model: Puppet, Chef, and CFEngine
In a pull model, every managed node runs a persistent agent daemon. The agent periodically contacts a central policy server (every 30 minutes by default in Puppet), retrieves its current desired state (a compiled "catalog" in Puppet, a run-list of cookbook recipes in Chef), compares it against local reality, and applies any corrective changes, all without human intervention.
Pull tools like Puppet and Chef were the dominant configuration management paradigm at large-scale companies through the 2000s and early 2010s. They excel at continuous enforcement: if a file is changed manually on a node, the agent reverts it automatically on its next run, within 30 minutes under Puppet's default interval. The trade-off is operational complexity: you must maintain a highly available policy server, manage TLS certificates for agent-server authentication, and install and maintain the agent on every node you manage. Initial bootstrapping becomes a chicken-and-egg problem, because something other than the agent has to install the agent before it can configure anything.
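A pull agent's core loop can be sketched in Python as follows. It is a simplified model under stated assumptions: fetch_catalog is a hypothetical callable standing in for the request to the policy server, node state is a plain dictionary, and the runs parameter exists only to keep the sketch finite. Real agents such as Puppet's also handle certificates, reporting, and randomized splay. What the sketch does show is the defining property of the pull model: the loop runs on the managed node itself, on a timer, with no one invoking it.

```python
import time
from typing import Callable

RUN_INTERVAL_SECONDS = 30 * 60  # Puppet's default run interval is 30 minutes


def enforce(local: dict[str, str], catalog: dict[str, str]) -> list[str]:
    """Compare the desired catalog with local state and correct any drift.
    Returns the keys that had to be changed on this run."""
    corrected = []
    for key, desired in catalog.items():
        if local.get(key) != desired:
            local[key] = desired  # apply the corrective change
            corrected.append(key)
    return corrected


def agent_loop(
    local: dict[str, str],
    fetch_catalog: Callable[[], dict[str, str]],
    runs: int,
    interval: float = RUN_INTERVAL_SECONDS,
) -> list[list[str]]:
    """Run `runs` enforcement cycles, sleeping `interval` seconds between them.
    A real agent loops forever; `runs` keeps this sketch finite and testable."""
    history = []
    for i in range(runs):
        history.append(enforce(local, fetch_catalog()))
        if i < runs - 1:
            time.sleep(interval)
    return history


if __name__ == "__main__":
    node = {"/etc/nginx/nginx.conf": "v1", "ntp": "enabled"}
    catalog = {"/etc/nginx/nginx.conf": "v2", "ntp": "enabled"}
    print(agent_loop(node, lambda: catalog, runs=2, interval=0))
    # [['/etc/nginx/nginx.conf'], []]
```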
What Ansible Manages: The Scope of Configuration
Before writing a single task, it helps to enumerate what configuration management actually controls at the OS level. Ansible can manage:
- Packages: install, remove, or pin specific versions of OS packages via apt, yum, or dnf, and of Python packages via pip.
- Files and templates: deploy configuration files from Jinja2 templates and set permissions, ownership, and SELinux contexts.
- Services: ensure daemons are running (or stopped), enabled on boot, and restarted when their configuration changes.
- Users and groups: create service accounts, set SSH authorized keys, and manage sudo rules.
- Firewall rules: manage iptables, firewalld, or ufw rules declaratively.
- Kernel parameters: set sysctl values (e.g., net.core.somaxconn for high-connection workloads).
- Mounts and storage: format partitions, configure LVM, and manage /etc/fstab entries.
- Cloud resources: through provider modules, Ansible can also provision AWS EC2, Azure VM, and GCP instances, though Terraform is generally preferred for that layer.
Together these primitives let you describe a server's entire intended state as code — versionable, reviewable, and reproducible.
Consider just two tasks: one ensuring the nginx package is installed, and one ensuring the nginx service is running and enabled at boot. Applied to a hundred servers in parallel, they guarantee nginx is in the correct state on every one of them, regardless of what was done to those servers manually before you ran the playbook. That is the core value proposition of configuration management: desired state wins over accumulated history.
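The property that makes this work is idempotence: each task inspects the current state and changes something only if it differs from the desired state, so a second run reports nothing to do. The following Python sketch models the two nginx tasks against an in-memory host record. The Host class and the ensure_* functions are hypothetical illustrations of the pattern that modules such as ansible.builtin.package and ansible.builtin.service follow, not their actual code.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical in-memory model of the parts of a server we care about."""
    name: str
    packages: set[str] = field(default_factory=set)
    running: set[str] = field(default_factory=set)
    enabled: set[str] = field(default_factory=set)


def ensure_package_present(host: Host, pkg: str) -> bool:
    """Install pkg only if it is missing; return True if anything changed."""
    if pkg in host.packages:
        return False
    host.packages.add(pkg)
    return True


def ensure_service_started_enabled(host: Host, svc: str) -> bool:
    """Start and enable svc only if needed; return True if anything changed."""
    changed = svc not in host.running or svc not in host.enabled
    host.running.add(svc)
    host.enabled.add(svc)
    return changed


def converge(host: Host) -> list[bool]:
    """Apply both tasks in order and return their 'changed' flags."""
    return [
        ensure_package_present(host, "nginx"),
        ensure_service_started_enabled(host, "nginx"),
    ]


if __name__ == "__main__":
    # Three hosts with different histories all end up in the same state.
    fleet = [
        Host("fresh"),
        Host("half-done", packages={"nginx"}),
        Host("stopped", packages={"nginx"}, enabled={"nginx"}),
    ]
    for h in fleet:
        print(h.name, "first run:", converge(h), "second run:", converge(h))
```

Each host's first run reports a different set of changes depending on its history, but every second run reports no changes at all.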
The Ansible Ecosystem in 2025
Ansible itself is the open-source command-line tooling (ansible, ansible-playbook, ansible-galaxy). Red Hat backs two management layers built on top of it:
- AWX: the open-source web UI, REST API, and job scheduler for Ansible. Self-hosted.
- Ansible Automation Platform (AAP): Red Hat's supported commercial product, which can be installed on Red Hat Enterprise Linux hosts or on OpenShift, and is aimed at enterprises running Ansible at scale.
In organizations that prefer a self-hosted open-source stack, AWX commonly serves as the control plane, providing role-based access control, job templates, scheduled runs, credential storage, and an audit log of every playbook execution across the fleet.
The rest of this tutorial takes you from zero to a capable Ansible practitioner: inventory design, modules, playbooks, variables and templates, roles, secrets management with Ansible Vault, scaling to hundreds of hosts, and a capstone project that configures a realistic multi-node fleet. By the end, you will be able to replace any snowflake server in your fleet with a reproducible, version-controlled configuration, and do it in minutes rather than hours.