Configuration Management with Ansible

Conditionals, Loops & Error Handling

18 min Lesson 6 of 30

Conditionals, Loops & Error Handling

Ansible playbooks that only run straight-line tasks are rare in production. Real infrastructure has heterogeneous OS families, optional feature flags, retryable external calls, and partial-failure scenarios where aborting an entire play would cause more harm than recovering gracefully. This lesson covers the four mechanisms that give Ansible playbooks their expressive power: when for branching, loop for iteration, block/rescue/always for structured exception handling, and failed_when/changed_when for overriding Ansible's built-in success and change detection logic.

Conditionals with `when`

The when directive accepts a Jinja2 expression that evaluates to a boolean. When the expression is false, Ansible skips the task and reports it as skipped — not failed, not changed. This is distinct from a task that runs and does nothing; a skipped task did not execute at all.

Common patterns for when in production playbooks:

OS-family branching — use ansible_os_family or ansible_distribution facts to select the correct package manager or service name. This is the single most common use of when in fleet automation.
Registered variable tests — run a task, register its output, then conditionally act based on whether something was found, installed, or returned a particular exit code.
Variable truthiness — gate entire configuration blocks on a boolean variable (enable_tls: true) that operators pass at runtime via -e or inventory group vars.
Combining conditions — when accepts a list, which Ansible ANDs together; for OR logic use inline Jinja2 or.

# --- Conditional examples ---

# 1. OS-family branching — install nginx with the correct package manager
- name: Install nginx (Debian/Ubuntu)
  ansible.builtin.apt:
    name: nginx
    state: present
  when: ansible_os_family == "Debian"

- name: Install nginx (RHEL/CentOS/Amazon Linux)
  ansible.builtin.dnf:
    name: nginx
    state: present
  when: ansible_os_family == "RedHat"

# 2. Register + conditional — only reload if config changed
- name: Validate nginx config
  ansible.builtin.command: nginx -t
  register: nginx_test
  changed_when: false   # validation never "changes" anything

- name: Reload nginx only if validation passed and config was changed
  ansible.builtin.service:
    name: nginx
    state: reloaded
  when:
    - nginx_test.rc == 0
    - nginx_config.changed        # nginx_config registered from a template task

# 3. Variable truthiness — gate TLS configuration
- name: Deploy TLS certificates
  ansible.builtin.copy:
    src: "certs/{{ inventory_hostname }}.pem"
    dest: /etc/nginx/ssl/
    mode: '0640'
  when: enable_tls | bool

# 4. Combined conditions with OR
- name: Restart service on change or first run
  ansible.builtin.service:
    name: myapp
    state: restarted
  when: myapp_config.changed or myapp_binary.changed

Jinja2 filter tip: Always apply the | bool filter when testing a variable that might be the string "true" or "false" (common when values come from environment variables or YAML files loaded with include_vars). Without the filter, the string "false" is truthy in Python and the condition will pass unexpectedly.

Iteration with `loop`

The modern iteration directive is loop, which replaced the older with_items / with_dict / with_fileglob family (still functional but deprecated). loop accepts any list — of scalars, dicts, or the output of a lookup plugin. Inside the task body, the current iteration value is accessed as item. When iterating over dicts, access fields as item.key and item.value (or any arbitrary key you defined).

Production patterns for loop:

Creating multiple users — loop over a list of dicts, each with name, shell, groups keys.
Deploying multiple config files from templates — loop over a list of service names, rendering a distinct config per iteration.
Applying firewall rules — loop over a list of ports or CIDR blocks.
Controlling loop output — use loop_control.label to display a human-readable summary instead of the full dict in Ansible output. Critical for dicts with passwords or tokens.

# --- Loop examples ---

# 1. Create multiple system users from a list of dicts
- name: Ensure service accounts exist
  ansible.builtin.user:
    name: "{{ item.name }}"
    shell: "{{ item.shell | default('/bin/bash') }}"
    groups: "{{ item.groups | default([]) }}"
    append: true
    system: "{{ item.system | default(false) }}"
    state: present
  loop:
    - { name: deploy,   shell: /bin/bash,    groups: [docker, sudo] }
    - { name: monitor,  shell: /usr/sbin/nologin, system: true }
    - { name: backup,   shell: /usr/sbin/nologin, system: true }
  loop_control:
    label: "{{ item.name }}"   # show only the name in output, not the full dict

# 2. Open firewall ports (loop over a mixed list)
- name: Open required ports in firewalld
  ansible.posix.firewalld:
    port: "{{ item }}/tcp"
    permanent: true
    state: enabled
    immediate: true
  loop:
    - 80
    - 443
    - 8080

# 3. Deploy per-service configs from a single template
- name: Deploy microservice configs
  ansible.builtin.template:
    src: templates/service.conf.j2
    dest: "/etc/myapp/{{ item.name }}.conf"
    owner: deploy
    mode: '0644'
  loop: "{{ microservices }}"   # microservices is a list var from group_vars
  loop_control:
    label: "{{ item.name }}"
  notify: Reload myapp

# 4. Loop with index — useful when order matters
- name: Write ordered config snippets
  ansible.builtin.copy:
    content: "{{ item.content }}"
    dest: "/etc/myapp/conf.d/{{ '%02d' | format(ansible_loop.index0) }}-{{ item.name }}.conf"
  loop: "{{ config_snippets }}"
  loop_control:
    extended: true    # enables ansible_loop.index0, ansible_loop.first, .last

Ansible task control flow: when gates execution, loop iterates, and block/rescue/always structures error handling — all composable on a single task or group of tasks.

Structured Error Handling with `block`, `rescue`, and `always`

Ansible's block/rescue/always construct maps directly to try/except/finally in Python. This is the correct tool for any situation where a task failure should trigger compensating actions rather than halting the play. At big-tech scale, this pattern appears everywhere: database schema migrations that need rollback on failure, service deployments that must deregister from a load balancer before and after regardless of outcome, and API calls that need cleanup tokens released even if the main operation aborts.

# --- block / rescue / always example: deploy a service with rollback ---
- name: Deploy application with automatic rollback
  hosts: app_servers
  tasks:
    - name: Application deployment with recovery
      block:
        # Tasks inside block run normally; if ANY fails, rescue runs instead
        - name: Stop service for maintenance
          ansible.builtin.service:
            name: myapp
            state: stopped

        - name: Deploy new binary
          ansible.builtin.copy:
            src: "dist/myapp-{{ version }}"
            dest: /usr/local/bin/myapp
            mode: '0755'

        - name: Run database migrations
          ansible.builtin.command: /usr/local/bin/myapp migrate --yes
          register: migration_result

        - name: Start updated service
          ansible.builtin.service:
            name: myapp
            state: started

      rescue:
        # Runs ONLY if a task in block failed
        - name: Log the failure for incident tracking
          ansible.builtin.uri:
            url: "{{ ops_webhook_url }}"
            method: POST
            body_format: json
            body:
              event: deploy_failed
              host: "{{ inventory_hostname }}"
              version: "{{ version }}"
              error: "{{ ansible_failed_result.msg | default('unknown') }}"
          delegate_to: localhost

        - name: Restore previous binary
          ansible.builtin.copy:
            src: /usr/local/bin/myapp.prev
            dest: /usr/local/bin/myapp
            remote_src: true
            mode: '0755'

        - name: Start service on previous version
          ansible.builtin.service:
            name: myapp
            state: started

      always:
        # Runs EVERY time, success OR failure — ideal for cleanup
        - name: Re-enable health checks in load balancer
          ansible.builtin.uri:
            url: "{{ lb_api }}/hosts/{{ inventory_hostname }}/enable"
            method: PUT
            headers:
              Authorization: "Bearer {{ lb_token }}"
          delegate_to: localhost

        - name: Record deployment attempt in audit log
          ansible.builtin.lineinfile:
            path: /var/log/deployments.log
            line: "{{ lookup('pipe', 'date -Iseconds') }} host={{ inventory_hostname }} version={{ version }} result={{ ansible_failed_result is defined | ternary('FAILED', 'OK') }}"
          delegate_to: localhost

Pro practice — ansible_failed_result: Inside a rescue block, Ansible automatically sets the ansible_failed_result magic variable to the result object of the task that failed. Always log this to your alerting system or incident tracker so on-call engineers have the exact error without needing to SSH into hosts. This is the primary source of structured failure data in Ansible-managed infrastructure.

Overriding Success Detection: `failed_when` and `changed_when`

Ansible decides whether a task succeeded or changed based on module-specific logic. For the command and shell modules, any non-zero exit code is a failure and any execution is a change — but that default is wrong for many real-world scripts. failed_when and changed_when let you inject your own logic using the registered result.

failed_when — override the failure condition. Common use cases:

A CLI tool exits non-zero for "not found" (exit code 1) but that is a valid state, not an error.
A script prints "ERROR" to stdout but exits 0 (always false-succeeds).
A check command should only fail if output contains a specific pattern.

changed_when — override the changed condition. Common use cases:

Idempotent scripts that print "already up to date" when nothing changed.
Validation or check commands that never mutate state (set to false).
Scripts that print "Applied N changes" — parse N from stdout to set changed accurately.

# --- failed_when and changed_when examples ---

# 1. Service check: exit 3 ("not running") is fine for our purposes
- name: Check if legacy cron job is running
  ansible.builtin.command: systemctl is-active legacy-cron
  register: cron_status
  failed_when: cron_status.rc not in [0, 3]   # 0=active, 3=inactive — both OK
  changed_when: false                          # reading state never changes anything

# 2. Script that embeds its own change signalling in stdout
- name: Run idempotent database seeder
  ansible.builtin.command: python3 /opt/scripts/seed_db.py
  register: seed_result
  changed_when: "'rows inserted' in seed_result.stdout"
  failed_when:
    - seed_result.rc != 0
    - "'already seeded' not in seed_result.stdout"   # exit 1 + this message = OK

# 3. Custom CLI that reports errors in stderr even on success
- name: Run data sync script
  ansible.builtin.command: /usr/local/bin/sync-data --dry-run={{ dry_run | bool }}
  register: sync_out
  failed_when:
    - sync_out.rc != 0
    - "'WARN' not in sync_out.stderr"   # WARNs in stderr are acceptable
  changed_when:
    - not (dry_run | bool)              # dry run mode is never a real change
    - "'Synced 0 records' not in sync_out.stdout"

# 4. Package check — "not installed" (rc=1) is not an error here
- name: Check if legacy package exists before removing
  ansible.builtin.command: rpm -q old-package-name
  register: rpm_check
  failed_when: rpm_check.rc not in [0, 1]
  changed_when: false

- name: Remove legacy package if present
  ansible.builtin.dnf:
    name: old-package-name
    state: absent
  when: rpm_check.rc == 0

Production pitfall — overusing ignore_errors: true: A common shortcut is to add ignore_errors: true to a task that sometimes fails and "just move on." This is almost always wrong. It silently swallows real failures that should abort the play, and it prevents block/rescue from triggering because the task is considered succeeded. Use failed_when to precisely define what a failure means for your script, and use block/rescue for compensating actions. Reserve ignore_errors only for truly optional, best-effort tasks — and always register the result and log a warning message afterward so the ignored failure is visible in your output.

Combining All Four: A Production-Grade Pattern

In real playbooks these four mechanisms compose naturally. A loop iterates over servers; when gates OS-specific steps inside the loop; a block/rescue wraps the mutation steps for rollback capability; and failed_when/changed_when normalize the exit semantics of custom scripts. This is the pattern used in Google SRE-style runbooks automated with Ansible — the playbook itself is the audit trail, and every exit state is deterministic rather than relying on the human operator to know which non-zero exit codes are acceptable.

Summary

Use when for branching on facts, registered results, and boolean variables — always apply | bool when the source might be a string. Use loop with loop_control.label to iterate cleanly over lists of dicts without leaking sensitive values to stdout. Wrap mutation tasks in block/rescue/always for structured rollback and guaranteed cleanup, and capture ansible_failed_result in rescue for rich incident context. Override Ansible's default pass/fail and changed logic with failed_when/changed_when rather than silencing errors with ignore_errors. Together these patterns produce playbooks that are both deterministic and self-recovering — a prerequisite for trusting automation at production scale.

Conditionals, Loops & Error Handling

Conditionals, Loops & Error Handling

Conditionals with when

Iteration with loop

Structured Error Handling with block, rescue, and always

Overriding Success Detection: failed_when and changed_when

Combining All Four: A Production-Grade Pattern

Summary

Conditionals with `when`

Iteration with `loop`

Structured Error Handling with `block`, `rescue`, and `always`

Overriding Success Detection: `failed_when` and `changed_when`