Configuration Management with Ansible

Why Configuration Management?

You have shipped code through Git, built images with Docker, deployed workloads to Kubernetes, and managed cloud resources with Terraform. But every one of those systems assumes the underlying machines it runs on are in a known, consistent state. In practice they rarely are, and the gap between what you think a server looks like and what it actually looks like is where outages, security incidents, and deployment failures are born.

Configuration management is the discipline — and the tooling — that closes that gap. It answers a deceptively simple question: How do you guarantee that every machine in your fleet is configured exactly as intended, at all times, even as engineers make changes, packages update, and incidents leave behind ad-hoc fixes?

Ansible is one of the most widely used answers for operating-system-level configuration. Before you write a single playbook, you need to feel the pain it solves.

Configuration Drift: The Silent Killer

Configuration drift is the gradual divergence of a live system from its intended baseline. It happens continuously, in small increments, and is nearly invisible until it causes a production incident.

Here is a realistic sequence of events on a production fleet with no configuration management:

  1. Week 1: an engineer hot-fixes a broken service by editing /etc/nginx/nginx.conf directly on the server. The fix is never committed to any repository.
  2. Week 4: a security team member disables TLS 1.0 by adding a line to /etc/ssl/openssl.cnf on one node to test it, then forgets to apply it to the other eleven nodes in the pool.
  3. Week 9: a kernel update is applied to nine out of twelve servers during a maintenance window; a network blip causes the remaining three to be missed. The twelve nodes now run different kernel versions.
  4. Week 15: a junior engineer runs pip install --upgrade requests on app-server-07 to debug a library issue. The version bump silently breaks a dependency on that one node.
  5. Week 16: the service starts throwing 500 errors intermittently. The errors affect only 25% of requests because only 25% of traffic happens to land on the drifted nodes. Debugging takes six hours.

None of those individual changes were malicious. Each one seemed reasonable in isolation. But together they turned a fleet of twelve nominally identical servers into twelve unique organisms — each with its own history, its own quirks, its own failure modes.

Drift compounds over time. A fleet that has been running for six months without configuration management is not twelve servers — it is twelve separate snowflakes, each one impossible to reproduce or reason about. The longer you wait to enforce consistency, the harder it becomes, because no one has a complete record of what was changed, when, or why.
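
Drift of this kind can be surfaced before it causes an incident. The following is a minimal sketch of a read-only audit playbook, assuming you have recorded a baseline kernel version and a known-good checksum for nginx.conf. The file name drift-audit.yml, both variable names, and both baseline values are hypothetical placeholders, not values taken from the scenario above.

```yaml
# drift-audit.yml (hypothetical name): reports drift, changes nothing on the hosts.
- name: Audit every host against a recorded baseline
  hosts: all
  gather_facts: true          # populates ansible_facts, including the running kernel
  vars:
    # Assumed baseline values; replace them with the ones your fleet should have.
    expected_kernel: "5.15.0-105-generic"
    expected_nginx_conf_sha256: "REPLACE_WITH_KNOWN_GOOD_SHA256"
  tasks:
    - name: Check that the running kernel matches the baseline
      ansible.builtin.assert:
        that:
          - ansible_facts['kernel'] == expected_kernel
        fail_msg: "Kernel drift: running {{ ansible_facts['kernel'] }}, expected {{ expected_kernel }}"

    - name: Read the checksum of the live nginx.conf
      ansible.builtin.stat:
        path: /etc/nginx/nginx.conf
        checksum_algorithm: sha256
      register: nginx_conf

    - name: Check that nginx.conf still matches the reviewed copy
      ansible.builtin.assert:
        that:
          - nginx_conf.stat.exists
          - nginx_conf.stat.checksum == expected_nginx_conf_sha256
        fail_msg: "nginx.conf differs from the reviewed copy"
```

Run it with ansible-playbook against your inventory. Every host that fails an assertion shows up as failed in the play recap, which turns "some nodes are different" into a concrete list. A host that fails the first check is dropped from the remaining tasks, so a real audit would likely collect all findings before failing.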

Snowflake Servers: The Anti-Pattern

A snowflake server is a host that has drifted so far from any documented baseline that it has become irreplaceable. Like an actual snowflake, no two are alike — and like a snowflake, they are fragile.

Symptoms of snowflake servers in production:

  • "Only Bob knows how to configure that box." If Bob is unavailable, no one can reproduce what he built.
  • Disaster recovery drills fail because the runbook does not reflect reality. You discover discrepancies mid-drill.
  • Scaling out is impossible. You cannot clone a node because you do not know its exact state. You spin up a new instance and it behaves differently from existing ones.
  • Security audits find unexpected packages, open ports, or modified system files with no change record.
  • The deployment pipeline works on eight of twelve nodes and silently fails (or produces wrong results) on the other four.

The antidote to snowflakes is treating servers as cattle, not pets: every node is interchangeable, reproducible, and disposable. If a node misbehaves, you do not SSH in and debug it — you terminate it and let the provisioning system replace it with a known-good one. Configuration management is the toolchain that makes this possible at the OS level.

The pets vs cattle metaphor is usually credited to Bill Baker, then at Microsoft, and was popularized in cloud operations by Randy Bias. Pets have names, are nursed back to health when sick, and are irreplaceable. Cattle are numbered, interchangeable, and replaced rather than repaired when they get sick. Configuration management is what makes cattle possible: it defines what a healthy animal looks like.

Push vs Pull: The Two Configuration Models

Configuration management tools are broadly split into two architectural models. Understanding the difference determines which tool fits which environment — and directly informs why Ansible made the architectural choices it did.

[Figure: Push vs pull configuration management models. Push model (Ansible, top) vs pull model (Puppet/Chef, bottom): key architectural trade-offs at a glance.]

Push model (Ansible): a control node running ansible-playbook pushes over SSH to web-01, web-02, and db-01, none of which runs an agent.

  • Advantages: agentless (SSH only), low footprint on nodes, simple bootstrapping.
  • Drawbacks: runs only on demand, drift is not auto-corrected, the control node is a single point of failure.

Pull model (Puppet / Chef): a puppet agent on each of web-01, web-02, and db-01 polls the policy server (Puppet Master) every 30 minutes.

  • Advantages: continuous enforcement, automatic drift correction, scales to thousands of nodes.
  • Drawbacks: an agent on every node, the policy server is a single point of failure, complex bootstrapping.

Push Model: Ansible's Approach

In a push model, a central control node connects out to managed nodes over SSH (or WinRM for Windows), transfers a small Python payload, executes it, and reports results back. The managed nodes require no permanently running agent, only an SSH daemon and a Python interpreter. Both are present by default on most mainstream Linux server distributions, although minimal images may need Python installed first.

Ansible is the canonical push-model tool. When you run a playbook, Ansible:

  1. Reads your inventory (a list of hosts and groups).
  2. Opens SSH connections to the targeted hosts in parallel, in batches limited by the forks setting (5 by default).
  3. Sends compressed Python modules to a temp directory on each host.
  4. Executes the modules; they make the necessary changes and report back a JSON result.
  5. Cleans up the temp files and closes the connections.
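
Step 1 starts from an inventory file. As a rough illustration, the sketch below describes the three hosts from the figure in Ansible's YAML inventory format. The file name inventory.yml, the group names web and db, and the deploy login are assumptions made for this example.

```yaml
# inventory.yml (hypothetical): the three hosts from the figure, grouped by role.
all:
  vars:
    ansible_user: deploy        # assumed SSH login; use whatever account your hosts accept
  children:
    web:
      hosts:
        web-01:
        web-02:
    db:
      hosts:
        db-01:
```

With this file in place, running ansible -i inventory.yml web -m ansible.builtin.ping walks through all five steps against the two web hosts. The ping module makes no changes; it only confirms that each host is reachable over SSH and has a usable Python.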

The critical implication: Ansible only runs when you invoke it. If an engineer manually changes a file on a node an hour after you ran the playbook, Ansible has no idea. You must schedule playbook runs (via cron, AWX, or Ansible Automation Platform) to periodically re-enforce your desired state — the enforcement is periodic, not continuous.
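
One way to get that periodic enforcement without AWX is a cron entry on the control node, which Ansible can install for itself. This is a sketch under stated assumptions: the ansible service account, the /opt/ansible checkout, and the inventory.yml and site.yml file names are all hypothetical.

```yaml
# schedule-enforcement.yml (hypothetical): run once, against the control node itself.
- name: Schedule periodic enforcement runs
  hosts: localhost
  connection: local
  gather_facts: false
  become: true                  # needed to manage another user's crontab
  tasks:
    - name: Re-apply the site playbook every 30 minutes
      ansible.builtin.cron:
        name: "periodic ansible enforcement"
        user: ansible           # assumed service account that owns the SSH keys
        minute: "*/30"
        job: "cd /opt/ansible && ansible-playbook -i inventory.yml site.yml >> $HOME/ansible-enforce.log 2>&1"
```

A manual change made just after one run survives until the next one, so the interval is effectively your maximum drift window. If a run can take longer than the interval, you would also want a lock around the job so that runs do not overlap.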

Pull Model: Puppet, Chef, and CFEngine

In a pull model, every managed node runs a persistent agent daemon. The agent periodically contacts a central policy server (every 30 minutes by default in Puppet), retrieves the current desired state (a "catalog" or "cookbook"), compares it against local reality, and applies any corrective changes — all without any human intervention.

Pull tools like Puppet and Chef were the dominant configuration management paradigm at large-scale companies through the 2000s and early 2010s. They excel at continuous enforcement: if a file is changed manually on a node, the agent reverts it automatically at its next run, which at Puppet's default interval is at most 30 minutes away. The trade-off is operational complexity: you must maintain a highly available policy server, manage TLS certificates for agent-server authentication, and install and maintain the agent on every node you manage. That includes initial bootstrapping, which becomes a chicken-and-egg problem.
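
Ansible can approximate this loop with its ansible-pull command, which reverses the direction: each node clones a playbook repository and applies it to itself. The sketch below assumes Ansible and git are already installed on every node and that ansible-pull is on cron's PATH. The repository URL is a placeholder, and local.yml is assumed to be a playbook at the root of that repository.

```yaml
# pull-mode.yml (hypothetical): gives each node its own 30-minute enforcement loop.
- name: Configure nodes to pull and apply their own configuration
  hosts: all
  become: true
  tasks:
    - name: Run ansible-pull every 30 minutes from root's crontab
      ansible.builtin.cron:
        name: "ansible-pull enforcement"
        minute: "*/30"
        # Placeholder repository URL; local.yml is assumed to exist in that repository.
        job: "ansible-pull -U https://git.example.com/infra/ansible-config.git local.yml >> /var/log/ansible-pull.log 2>&1"
```

This trades one cost for another. There is no policy server to keep highly available, but every node now needs Ansible installed and read access to the repository, which is the same per-node footprint the pull model is criticized for.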

Production reality at big-tech scale: Many large organizations use Ansible for ad-hoc changes, bootstrapping new nodes, and orchestrating multi-tier deployments, while pairing it with a pull-based tool (or a GitOps loop that re-runs Ansible on a schedule via AWX/AAP) for continuous drift enforcement. Pure Ansible shops at scale typically run playbooks every 15–60 minutes via AWX/Ansible Automation Platform to approximate continuous enforcement. The choice between push and pull is rarely absolute; it is a spectrum.

What Ansible Manages: The Scope of Configuration

Before writing a single task, it helps to enumerate what configuration management actually controls at the OS level. Ansible can manage:

  • Packages — install, remove, or pin specific versions of OS packages via apt, yum, dnf, or pip.
  • Files and templates — deploy configuration files from Jinja2 templates, set permissions, ownership, and SELinux contexts.
  • Services — ensure daemons are running (or stopped), enabled on boot, and restarted when configuration changes.
  • Users and groups — create service accounts, set SSH authorized keys, manage sudo rules.
  • Firewall rules — manage iptables, firewalld, or ufw rules declaratively.
  • Kernel parameters — set sysctl values (e.g., net.core.somaxconn for high-connection workloads).
  • Mounts and storage — format partitions, configure LVM, manage /etc/fstab entries.
  • Cloud resources — via provider modules, Ansible can also provision AWS EC2, Azure VMs, GCP instances — though Terraform is preferred for that layer.

Together these primitives let you describe a server's entire intended state as code — versionable, reviewable, and reproducible.

```yaml
# A trivial taste: ensure nginx is installed, running, and starts on boot.
# These are two Ansible "tasks"; you will write full playbooks in Lesson 4.
- name: Ensure nginx is installed
  ansible.builtin.package:
    name: nginx
    state: present

- name: Ensure nginx is running and enabled
  ansible.builtin.service:
    name: nginx
    state: started
    enabled: true
```

Those two tasks, applied across a hundred servers, ensure nginx is installed, running, and enabled at boot on every one of them, regardless of what was done to those servers manually before you ran the playbook. That is the core value proposition of configuration management: desired state wins over accumulated history.
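
Several of the primitives listed earlier can sit together in one play. The sketch below is illustrative only: it assumes an inventory group named web, a deploy service account, a template at templates/nginx.conf.j2 next to the playbook, and the ansible.posix collection being installed for the sysctl module. None of those names come from this lesson.

```yaml
# site.yml (hypothetical): one play combining users, kernel parameters, and templates.
- name: Configure web servers
  hosts: web                    # assumed inventory group
  become: true
  tasks:
    - name: Ensure the deploy service account exists
      ansible.builtin.user:
        name: deploy
        shell: /bin/bash
        state: present

    - name: Raise the listen backlog for high-connection workloads
      ansible.posix.sysctl:
        name: net.core.somaxconn
        value: "4096"           # illustrative value, not a recommendation
        state: present

    - name: Render nginx.conf from a Jinja2 template
      ansible.builtin.template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/nginx.conf
        owner: root
        group: root
        mode: "0644"
      notify: Restart nginx     # fires only when the rendered file actually changed

  handlers:
    - name: Restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted
```

The handler is what ties files to services: on a second run with an unchanged template, the template task reports no change and nginx is not restarted.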

The Ansible Ecosystem in 2025

Ansible itself is the open-source command-line tool (ansible, ansible-playbook, ansible-galaxy). Two layers build on top of it, one community-run and one commercial:

  • AWX: the open-source web UI, REST API, and job scheduler for Ansible, sponsored by Red Hat. Self-hosted.
  • Ansible Automation Platform (AAP): Red Hat's supported commercial product (AAP 2.x can be deployed on Red Hat OpenShift or on RHEL hosts). Widely used by enterprises running Ansible at scale.

At companies that run open-source stacks, AWX is a common control plane: it provides role-based access control, job templates, scheduled runs, encrypted credential storage, and an audit log for every playbook execution across the fleet.

Why learn Ansible in a Kubernetes-first world? Kubernetes manages containerized workloads beautifully — but it does not configure the nodes those containers run on. Someone must install the container runtime, configure kernel parameters, harden SSH, rotate system certificates, manage log shipping agents, and enforce OS-level security policies on every node in the cluster. That someone is Ansible. Even in fully managed Kubernetes environments (EKS, GKE, AKS), Ansible manages the workstations, build servers, bastion hosts, and the surrounding non-K8s infrastructure. Configuration management is not replaced by containers — it is complemented by them.
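
As a rough sketch of that node-level work, the play below installs a container runtime and disables SSH password logins. It assumes Debian or Ubuntu hosts, where the package is named containerd and the SSH service is named ssh, and an inventory group called k8s_nodes. Those names are assumptions for illustration.

```yaml
# k8s-node-baseline.yml (hypothetical): OS-level prep that Kubernetes itself does not do.
- name: Prepare cluster nodes at the OS level
  hosts: k8s_nodes              # assumed inventory group
  become: true
  tasks:
    - name: Ensure the container runtime is installed
      ansible.builtin.package:
        name: containerd        # package name on Debian/Ubuntu; other distributions differ
        state: present

    - name: Ensure the container runtime is running and enabled
      ansible.builtin.service:
        name: containerd
        state: started
        enabled: true

    - name: Disable SSH password authentication
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^#?\s*PasswordAuthentication\s'
        line: PasswordAuthentication no
        validate: /usr/sbin/sshd -t -f %s   # refuse to write a config that sshd rejects
      notify: Restart SSH daemon

  handlers:
    - name: Restart SSH daemon
      ansible.builtin.service:
        name: ssh               # "sshd" on RHEL-family hosts
        state: restarted
```

The validate step matters on remote hosts: a broken sshd_config would lock out the very connection Ansible depends on, so the file is only replaced when the check passes.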

The rest of this tutorial builds you from zero to a complete Ansible practitioner: inventory design, modules, playbooks, variables and templates, roles, secrets management with Ansible Vault, scaling to hundreds of hosts, and a capstone project that configures a realistic multi-node fleet. By the end, you will be able to replace any snowflake server in your fleet with a reproducible, version-controlled configuration — and do it in minutes, not hours.