Configuration Management with Ansible

Roles & Collections

At a certain scale, a flat playbook becomes unmanageable. A single site.yml that deploys a web tier, configures a database cluster, sets up monitoring agents, and manages TLS certificates is not infrastructure-as-code — it is a maintenance hazard. Ansible solves this with roles: a standardised directory layout that turns a logical unit of configuration (a web server, a Postgres replica, a Vault agent) into a self-contained, reusable, testable artifact. At an even larger scale, collections package roles, modules, and plugins together so entire cross-team capability packages can be versioned and distributed through Ansible Galaxy or a private Automation Hub.

This lesson covers how to structure roles correctly, how to consume and publish collections, and the reuse standards that distinguish code that survives team turnover from code that only its author understands.

Role Directory Layout

Every role follows a directory convention that Ansible relies on: it automatically loads main.yml from each well-known subdirectory. Running ansible-galaxy role init <role_name> scaffolds the skeleton:

    # Create a new role skeleton
    ansible-galaxy role init nginx_vhost

    # Resulting structure (with example files added under templates/ and files/):
    nginx_vhost/
    ├── defaults/
    │   └── main.yml          # Low-precedence variable defaults (overridable)
    ├── vars/
    │   └── main.yml          # High-precedence variables (not meant to be overridden)
    ├── tasks/
    │   └── main.yml          # Task entry point; include_tasks for sub-files
    ├── handlers/
    │   └── main.yml          # Handlers triggered by notify:
    ├── templates/
    │   └── vhost.conf.j2     # Jinja2 templates
    ├── files/
    │   └── mime.types        # Static files copied verbatim
    ├── meta/
    │   └── main.yml          # Role metadata: author, license, galaxy_info, dependencies
    ├── tests/
    │   ├── inventory         # Minimal test inventory
    │   └── test.yml          # Test playbook for molecule or direct ansible-playbook
    └── README.md

defaults/ vs vars/: the most misunderstood distinction in Ansible. Variables in defaults/main.yml have the lowest precedence of any variable source and are designed to be overridden by inventory variables, group_vars, host_vars, or playbook vars. Variables in vars/main.yml have high precedence: they beat inventory, group_vars, host_vars, and play-level vars, and lose only to a few higher sources such as block and task vars, include_vars, set_fact, role parameters, and command-line -e. Use defaults/ for tunable parameters (ports, package versions, paths). Use vars/ for internal role constants that callers should never touch.
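
To make the precedence difference concrete, the sketch below places three files side by side. The variables nginx_worker_connections and nginx_config_file and the file group_vars/web_servers.yml are assumptions introduced for illustration; they are not part of the role shown in this lesson.

```yaml
# nginx_vhost/defaults/main.yml (tunable; callers may override)
---
nginx_worker_connections: 1024              # hypothetical variable

# nginx_vhost/vars/main.yml (internal constant)
---
nginx_config_file: /etc/nginx/nginx.conf    # hypothetical variable

# group_vars/web_servers.yml (inventory-level values)
---
nginx_worker_connections: 4096              # takes effect: inventory beats role defaults
nginx_config_file: /opt/nginx/nginx.conf    # ignored: role vars beat inventory
```

With these files in place, tasks inside the role would see nginx_worker_connections as 4096 and nginx_config_file as /etc/nginx/nginx.conf.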

A Production-Grade Role: nginx_vhost

Here is what a complete, production-ready role tasks file looks like, demonstrating include_tasks for modular sub-files, notify for handlers, and role-relative file references:

    # nginx_vhost/tasks/main.yml
    ---
    - name: Install Nginx
      ansible.builtin.package:
        name: "{{ nginx_package_name }}"
        state: present
      notify: Reload nginx

    - name: Ensure vhost config directory exists
      ansible.builtin.file:
        path: "{{ nginx_vhost_dir }}"
        state: directory
        mode: "0755"

    - name: Deploy vhost configs
      ansible.builtin.template:
        src: vhost.conf.j2          # Relative path inside role templates/
        dest: "{{ nginx_vhost_dir }}/{{ item.server_name }}.conf"
        owner: root
        group: root
        mode: "0644"
      loop: "{{ nginx_vhosts }}"
      notify: Reload nginx

    # A vhost file is only a fragment, so "validate: nginx -t -c %s" would reject it.
    # Test the assembled configuration instead; a failure here stops the play
    # for this host before the reload handler runs.
    - name: Validate the full Nginx configuration
      ansible.builtin.command: nginx -t
      changed_when: false

    - name: Manage TLS certificates
      ansible.builtin.include_tasks: tls.yml
      when: nginx_tls_enabled | bool

    # nginx_vhost/defaults/main.yml
    ---
    nginx_package_name: nginx
    nginx_vhost_dir: /etc/nginx/conf.d
    nginx_vhosts: []
    nginx_tls_enabled: false

    # nginx_vhost/handlers/main.yml
    ---
    - name: Reload nginx
      ansible.builtin.service:
        name: nginx
        state: reloaded
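
The tasks file includes tls.yml, but the lesson does not show it. The following is one possible shape for that sub-file, offered as a sketch only. The variables nginx_tls_dir and nginx_tls_certs (a list of dicts with name, cert_src, and key_src keys) are hypothetical and would also need entries in defaults/main.yml.

```yaml
# nginx_vhost/tasks/tls.yml (hypothetical sub-file; every nginx_tls_* variable
# other than nginx_tls_enabled is an assumption made for this sketch)
---
- name: Ensure TLS directory exists
  ansible.builtin.file:
    path: "{{ nginx_tls_dir }}"
    state: directory
    owner: root
    group: root
    mode: "0750"

- name: Install certificates
  ansible.builtin.copy:
    src: "{{ item.cert_src }}"      # Relative paths resolve against the role's files/
    dest: "{{ nginx_tls_dir }}/{{ item.name }}.crt"
    owner: root
    group: root
    mode: "0644"
  loop: "{{ nginx_tls_certs }}"
  loop_control:
    label: "{{ item.name }}"        # Keep loop output short
  notify: Reload nginx

- name: Install private keys
  ansible.builtin.copy:
    src: "{{ item.key_src }}"
    dest: "{{ nginx_tls_dir }}/{{ item.name }}.key"
    owner: root
    group: root
    mode: "0600"
  loop: "{{ nginx_tls_certs }}"
  loop_control:
    label: "{{ item.name }}"
  no_log: true                      # Avoid leaking key material into logs
  notify: Reload nginx
```

Because the file is pulled in with include_tasks, none of these tasks is evaluated when nginx_tls_enabled is false, and the handler in handlers/main.yml is still reachable from the included tasks.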

Calling a Role from a Playbook

Roles are referenced with the roles: key or the ansible.builtin.include_role task. The latter is preferred when you need conditional role inclusion or when you need to pass variables dynamically:

    # site.yml: clean separation of concerns
    ---
    - name: Configure web tier
      hosts: web_servers
      become: true

      roles:
        - role: common            # Applied first: base OS hardening
        - role: nginx_vhost       # Role with default variables
          vars:                   # Override defaults for this play
            nginx_tls_enabled: true
            nginx_vhosts:
              - server_name: api.example.com
                upstream: "127.0.0.1:8080"
        - role: datadog_agent     # Monitoring

      # Dynamic role inclusion with include_role. It is a task, so it lives
      # under tasks: and runs after everything listed under roles:.
      tasks:
        - name: Conditionally add Redis
          ansible.builtin.include_role:
            name: redis
          when: "'cache_servers' in group_names"
          vars:
            redis_maxmemory: "{{ ansible_memtotal_mb // 2 }}mb"
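
At task level there is also a static counterpart, ansible.builtin.import_role. The sketch below contrasts the two, since they treat when: and tags: differently. The tag names are illustrative assumptions; the role names are reused from the playbook above.

```yaml
# Static vs dynamic role usage at task level (illustrative sketch)
---
- name: Compare import_role and include_role
  hosts: web_servers
  become: true
  tasks:
    # Static: the role's tasks are inlined when the playbook is parsed, so any
    # tags: or when: on this task are copied onto every task inside the role.
    - name: Apply common hardening
      ansible.builtin.import_role:
        name: common
      tags: [common]

    # Dynamic: the role is loaded only when this task runs. when: is evaluated
    # once for the include itself, and tags on the include do not reach the
    # role's tasks unless they are passed through apply:.
    - name: Conditionally add Redis
      ansible.builtin.include_role:
        name: redis
        apply:
          tags: [redis]
      when: "'cache_servers' in group_names"
      tags: [redis]
```

A reasonable rule of thumb is to use import_role when the role should always be part of the play and include_role when the decision depends on runtime data.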

Role Dependencies and meta/main.yml

A role can declare other roles as dependencies in its meta/main.yml. Ansible ensures dependencies are applied before the dependent role. This is the correct way to model "the application role requires the common hardening role":

    # nginx_vhost/meta/main.yml
    ---
    galaxy_info:
      role_name: nginx_vhost
      author: platform-eng
      description: Manage Nginx virtual hosts with TLS
      license: Apache-2.0
      min_ansible_version: "2.14"
      platforms:
        - name: Ubuntu
          versions: ["22.04", "24.04"]
        - name: EL
          versions: ["9"]
      galaxy_tags:
        - nginx
        - web
        - tls

    dependencies:
      - role: common
        vars:
          common_firewall_ports:
            - 80
            - 443

[Figure: Ansible roles and collections dependency diagram. The collection myorg.platform contains roles (common: OS hardening; nginx_vhost: depends on common; postgres: HA replica setup; datadog_agent: monitoring; vault_agent: secrets sidecar), modules and plugins (myorg_deploy: custom module; vault_lookup: lookup plugin; inventory_aws: inventory plugin; filter_semver: filter plugin), and playbooks (site.yml, teardown.yml).]
An Ansible Collection bundles roles, custom modules, plugins, and playbooks into a single versioned distribution unit.

Ansible Collections

A collection is the distribution unit for Ansible content. Where a role solves one operational concern, a collection solves an entire domain — all the roles, custom modules, inventory plugins, lookup plugins, and playbooks that a team needs to manage a technology stack. The Ansible community ships collections for AWS (amazon.aws), Kubernetes (kubernetes.core), HashiCorp Vault (community.hashi_vault), and hundreds more. Your platform team ships its own internal collection (myorg.platform) from a private Automation Hub or from a Git repository.

    # Install a collection from Ansible Galaxy
    ansible-galaxy collection install community.hashi_vault

    # Install a specific version
    ansible-galaxy collection install amazon.aws:==7.3.0

    # Install from a private Automation Hub (the server can also be defined
    # once in a [galaxy_server.<name>] section of ansible.cfg)
    ansible-galaxy collection install myorg.platform:==2.1.0 \
      --server https://hub.internal.example.com \
      --api-key "${AUTOMATION_HUB_TOKEN}"

    # Install from a Git repository directly (useful during development)
    ansible-galaxy collection install \
      git+https://github.com/myorg/ansible-collection-platform.git,main

    # Use a requirements.yml for reproducible installs in CI/CD.
    # "ansible-galaxy install -r" processes both the collections: and roles: keys;
    # "ansible-galaxy collection install -r" would skip the roles.
    ansible-galaxy install -r requirements.yml

    # requirements.yml
    ---
    collections:
      - name: amazon.aws
        version: "7.3.0"
      - name: community.hashi_vault
        version: "6.2.0"
      - name: community.general
        version: ">=9.0.0,<10.0.0"   # Range syntax; prefer an exact pin outside prototyping
      - name: myorg.platform
        source: https://hub.internal.example.com
        version: "2.1.0"

    roles:
      - name: geerlingguy.docker
        version: "7.4.0"

Scaffold and Build a Collection

Building your own collection follows a strict naming structure. A collection's fully qualified name is always namespace.collection_name (here, the namespace is myorg and the collection is platform):

    # Scaffold a new collection
    ansible-galaxy collection init myorg.platform

    # Layout after populating the skeleton:
    myorg/
    └── platform/
        ├── galaxy.yml            # Collection manifest (name, version, deps)
        ├── README.md
        ├── roles/                # Roles live here, same layout as standalone roles
        │   ├── common/
        │   ├── nginx_vhost/
        │   └── postgres/
        ├── plugins/
        │   ├── modules/          # Custom Python modules
        │   ├── lookup/           # Lookup plugins
        │   ├── filter/           # Filter plugins
        │   └── inventory/        # Dynamic inventory plugins
        ├── playbooks/
        │   └── site.yml
        └── tests/
            └── integration/

    # galaxy.yml: the collection manifest
    namespace: myorg
    name: platform
    version: 2.1.0
    readme: README.md
    description: "Platform Engineering Ansible collection"
    license:
      - Apache-2.0
    authors:
      - Platform Engineering <platform@example.com>
    dependencies:
      amazon.aws: ">=7.0.0"
      community.hashi_vault: ">=6.0.0"

    # Build and publish
    ansible-galaxy collection build      # Creates myorg-platform-2.1.0.tar.gz
    ansible-galaxy collection publish myorg-platform-2.1.0.tar.gz \
      --server https://hub.internal.example.com \
      --api-key "${AUTOMATION_HUB_TOKEN}"
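
A collection can also carry a meta/runtime.yml file, which the layout above omits. A minimal sketch follows, assuming the collection supports the same minimum Ansible version that the nginx_vhost role metadata declares; the version specifier is an assumption, not something the lesson states for the collection.

```yaml
# myorg/platform/meta/runtime.yml
---
# Version specifier for the ansible-core releases this collection supports.
# ">=2.14.0" is an assumed value chosen to match min_ansible_version in the
# nginx_vhost role metadata.
requires_ansible: ">=2.14.0"
```

Declaring the supported range here lets consumers and tooling see the collection's compatibility without reading each role's metadata.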

Using Collection Content in Playbooks

Once installed, reference collection content with its fully-qualified collection name (FQCN). Using FQCNs everywhere — not just short module names — is a hard requirement at big-tech scale because it eliminates ambiguity when multiple collections provide identically-named modules:

    # Always use FQCN in production playbooks
    - name: Create S3 bucket
      amazon.aws.s3_bucket:
        name: myorg-artifacts
        region: us-east-1
        versioning: true

    - name: Read secret from Vault
      community.hashi_vault.vault_kv2_get:
        engine_mount_point: secret
        path: app/db_password              # KV v2 path relative to the mount, without "data/"
        url: "{{ vault_addr }}"
        auth_method: aws_iam
        role_id: "{{ vault_role_id }}"
      register: db_creds

    - name: Apply role from internal collection
      ansible.builtin.include_role:
        name: myorg.platform.nginx_vhost   # FQCN for roles too
      vars:
        nginx_tls_enabled: true

    # ansible.cfg: set collections path for project isolation
    [defaults]
    collections_path = ./collections:~/.ansible/collections
    roles_path = ./roles:~/.ansible/roles

Big-tech reuse standard: Pin every external collection to an exact version in requirements.yml. Use a dependency bot such as Renovate to open PRs when new versions are available, just as you would for Python packages or Helm charts. Never use latest or unpinned ranges in any environment except rapid prototyping: a breaking change in a new major release of a collection such as amazon.aws can break production pipelines without any change to your own code.

Testing Roles with Molecule

Molecule is the standard tool for testing Ansible roles. It spins up a container or VM, runs your role against it, and executes verifiers (Testinfra or Ansible asserts) to prove the role did what it claims. Every role in a production collection should have a Molecule scenario:

    # Install Molecule with the Docker driver (quotes protect the brackets from the shell)
    pip install molecule "molecule-plugins[docker]" ansible-lint

    # Initialise a Molecule scenario for an existing role
    cd nginx_vhost/
    molecule init scenario --driver-name docker

    # molecule/default/molecule.yml
    ---
    driver:
      name: docker
    platforms:
      - name: instance
        image: "geerlingguy/docker-ubuntu2404-ansible:latest"
        pre_build_image: true
        command: ""
        privileged: false
        volumes:
          - /sys/fs/cgroup:/sys/fs/cgroup:rw
        cgroupns_mode: host
    provisioner:
      name: ansible
      playbooks:
        converge: converge.yml
      inventory:
        host_vars:
          instance:
            nginx_tls_enabled: false
            nginx_vhosts:
              - server_name: test.example.com
                upstream: "127.0.0.1:8080"
    verifier:
      name: ansible

    # molecule/default/verify.yml: Ansible-native assertions
    ---
    - name: Verify
      hosts: all
      gather_facts: false
      tasks:
        - name: Gather service facts
          ansible.builtin.service_facts:

        - name: Assert Nginx is active
          ansible.builtin.assert:
            that: ansible_facts.services['nginx.service'].state == 'running'

        - name: Check vhost config exists
          ansible.builtin.stat:
            path: /etc/nginx/conf.d/test.example.com.conf
          register: vhost_file

        - name: Assert vhost config is present
          ansible.builtin.assert:
            that: vhost_file.stat.exists

    # Run the full Molecule test cycle
    molecule test        # create → converge → idempotence → verify → destroy
    molecule converge    # Just run the role (fast iteration)
    molecule verify      # Just run assertions

Production pitfall: the idempotence check. During molecule test, the idempotence step runs the converge playbook a second time and fails the scenario if any task reports changed. This catches the most common role authoring mistake: tasks that always claim to make a change even when the system is already in the desired state. Fix idempotence failures by adding creates: on command tasks, using changed_when: false when a task truly never modifies state, or rewriting the task to use an idempotent Ansible module instead of a raw shell command.
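
The three fixes named above can be sketched as follows. The file paths, commands, and limit values are illustrative assumptions rather than part of the nginx_vhost role.

```yaml
# Idempotence fixes (illustrative sketch; paths and commands are assumptions)
---
- name: Demonstrate idempotent task patterns
  hosts: all
  become: true
  tasks:
    # Problem: a bare command reports "changed" on every run, so the
    # idempotence step fails.
    # - name: Generate DH parameters
    #   ansible.builtin.command: openssl dhparam -out /etc/nginx/dhparam.pem 2048

    # Fix 1: creates: skips the command once the output file exists.
    - name: Generate DH parameters
      ansible.builtin.command:
        cmd: openssl dhparam -out /etc/nginx/dhparam.pem 2048
        creates: /etc/nginx/dhparam.pem

    # Fix 2: a read-only command never modifies state, so declare that.
    - name: Read the installed Nginx version
      ansible.builtin.command: nginx -v
      register: nginx_version
      changed_when: false

    # Fix 3: replace a shell append (echo ... >> file) with a module that
    # only reports "changed" when the line is actually missing.
    - name: Raise the open-file limit for Nginx
      ansible.builtin.lineinfile:
        path: /etc/security/limits.conf
        line: "nginx soft nofile 65536"
```

Of the three, the module-based fix is usually the most robust, because the module itself checks current state instead of relying on a hint supplied by the author.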

Role Reuse Standards at Scale

The following conventions are enforced at big-tech platform teams that maintain collections serving hundreds of engineers:

  • One role, one concern. A role configures a single technology or operational concern. A role named app_server that installs Nginx, deploys app code, configures Redis, and sets up cron jobs is four roles that were never separated. When the Redis team needs to change the Redis config, they should not need to touch a role that owns Nginx.
  • All public variables documented in defaults/main.yml with inline comments. Every variable a caller can legitimately set must appear in defaults/ with a comment explaining its purpose, type, and allowed values. Undocumented variables are bugs waiting to happen.
  • Tasks use FQCN module names. ansible.builtin.copy, not copy. This makes it unambiguous which collection provides the module and prevents silent breakage when a custom module shadows a built-in.
  • Roles are tag-aware. Add meaningful tags: to task groups so callers can run only the subset they need: ansible-playbook site.yml --tags tls. Tag everything under a role with the role name as a base tag.
  • Semantic versioning for collections. Increment the patch version for bug fixes, minor version for new backward-compatible features (new role, new variable), major version for breaking changes (renamed variable, removed role). Treat your collection changelog like a public API.
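
Two of these conventions, documented defaults and tag-aware tasks, can be sketched against the nginx_vhost role from earlier in the lesson. The comment layout is one possible convention, not an Ansible requirement, and the key descriptions for nginx_vhosts are inferred from the examples in this lesson rather than from a published spec.

```yaml
# nginx_vhost/defaults/main.yml with documentation comments
---
# Package to install. Type: string. Override for distro-specific package names.
nginx_package_name: nginx

# Directory that receives one .conf file per vhost. Type: absolute path.
nginx_vhost_dir: /etc/nginx/conf.d

# Vhosts to configure. Type: list of dicts.
#   server_name: string, for example api.example.com
#   upstream:    string in "host:port" form, for example "127.0.0.1:8080"
nginx_vhosts: []

# Whether to run the TLS tasks in tls.yml. Type: boolean.
nginx_tls_enabled: false

# nginx_vhost/tasks/main.yml excerpt: role name as base tag plus a concern tag
---
- name: Manage TLS certificates
  ansible.builtin.include_tasks:
    file: tls.yml
    apply:
      tags: [nginx_vhost, tls]    # Propagate the tags to the included tasks
  when: nginx_tls_enabled | bool
  tags: [nginx_vhost, tls]        # Let --tags select the include itself
```

With tags arranged this way, ansible-playbook site.yml --tags tls runs only the TLS tasks, while --tags nginx_vhost runs everything the role owns.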

Mastering roles and collections is the point at which Ansible transitions from a scripting tool to a platform capability. The next lesson covers Ansible Vault — how to encrypt sensitive variables and manage secrets so that credentials never appear in plaintext in your repository.