Configuration Management with Ansible

Conditionals, Loops & Error Handling

Ansible playbooks that only run straight-line tasks are rare in production. Real infrastructure has heterogeneous OS families, optional feature flags, retryable external calls, and partial-failure scenarios where aborting an entire play would cause more harm than recovering gracefully. This lesson covers the four mechanisms that give Ansible playbooks their expressive power: when for branching, loop for iteration, block/rescue/always for structured exception handling, and failed_when/changed_when for overriding Ansible's built-in success and change detection logic.

Conditionals with when

The when directive accepts a Jinja2 expression that evaluates to a boolean. When the expression is false, Ansible skips the task and reports it as skipped — not failed, not changed. This is distinct from a task that runs and does nothing; a skipped task did not execute at all.

Common patterns for when in production playbooks:

  • OS-family branching — use ansible_os_family or ansible_distribution facts to select the correct package manager or service name. This is the single most common use of when in fleet automation.
  • Registered variable tests — run a task, register its output, then conditionally act based on whether something was found, installed, or returned a particular exit code.
  • Variable truthiness — gate entire configuration blocks on a boolean variable (enable_tls: true) that operators pass at runtime via -e or inventory group vars.
  • Combining conditions: when accepts a list, which Ansible ANDs together; for OR logic, use the Jinja2 or operator inside a single expression.

    # --- Conditional examples ---

    # 1. OS-family branching: install nginx with the correct package manager
    - name: Install nginx (Debian/Ubuntu)
      ansible.builtin.apt:
        name: nginx
        state: present
      when: ansible_os_family == "Debian"

    - name: Install nginx (RHEL/CentOS/Amazon Linux)
      ansible.builtin.dnf:
        name: nginx
        state: present
      when: ansible_os_family == "RedHat"

    # 2. Register + conditional: only reload if config changed
    - name: Validate nginx config
      ansible.builtin.command: nginx -t
      register: nginx_test
      changed_when: false   # validation never "changes" anything

    - name: Reload nginx only if validation passed and config was changed
      ansible.builtin.service:
        name: nginx
        state: reloaded
      when:
        - nginx_test.rc == 0
        - nginx_config.changed   # nginx_config registered from a template task

    # 3. Variable truthiness: gate TLS configuration
    - name: Deploy TLS certificates
      ansible.builtin.copy:
        src: "certs/{{ inventory_hostname }}.pem"
        dest: /etc/nginx/ssl/
        mode: '0640'
      when: enable_tls | bool

    # 4. Combined conditions with OR
    - name: Restart service on change or first run
      ansible.builtin.service:
        name: myapp
        state: restarted
      when: myapp_config.changed or myapp_binary.changed

Jinja2 filter tip: Always apply the | bool filter when testing a variable that might be the string "true" or "false" (common when values arrive as -e key=value extra vars, environment lookups, or quoted YAML scalars). Without the filter, the non-empty string "false" is truthy under Python and Jinja2 truthiness rules, so the condition can pass unexpectedly.
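
The difference between the raw value and the coerced value can be made visible with a short sketch. The variable name enable_tls comes from the examples above; the command line in the comment and the debug tasks are illustrative assumptions, not part of the original lesson.

```yaml
# Assumed invocation (values passed as key=value arrive as strings):
#   ansible-playbook site.yml -e enable_tls=false
- name: Show the raw value, its type, and the coerced boolean
  ansible.builtin.debug:
    msg: "raw={{ enable_tls }} type={{ enable_tls | type_debug }} coerced={{ enable_tls | bool }}"

- name: Act only when the flag is really true
  ansible.builtin.debug:
    msg: "TLS would be configured here"
  when: enable_tls | bool   # the filter turns "false", "no", and "0" into a real False
```

Running the first task with different ways of setting the flag (inventory YAML, -e key=value, -e with JSON) is a quick way to check which sources deliver strings in your environment.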

Iteration with loop

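The recommended iteration directive is loop. The older with_items / with_dict / with_fileglob family still works and has not been formally deprecated, but loop is preferred for most new playbooks. loop accepts any list: scalars, dicts, or the list returned by a lookup plugin through query(). Inside the task body, the current iteration value is accessed as item. loop does not accept a dictionary directly; convert it with the dict2items filter, after which each element exposes item.key and item.value. When you loop over a list of dicts, use whatever keys you defined (item.name, item.shell, and so on).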
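
As a minimal sketch of iterating over a dictionary, the task below converts a mapping into a list with dict2items before handing it to loop. The sysctl_settings variable and its values are hypothetical and would normally live in group_vars.

```yaml
# Hypothetical variable, for illustration only:
# sysctl_settings:
#   net.core.somaxconn: 1024
#   vm.swappiness: 10
- name: Apply kernel parameters from a dictionary
  ansible.posix.sysctl:
    name: "{{ item.key }}"      # dictionary key after dict2items
    value: "{{ item.value }}"   # dictionary value after dict2items
    state: present
  loop: "{{ sysctl_settings | dict2items }}"
  loop_control:
    label: "{{ item.key }}={{ item.value }}"
```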

Production patterns for loop:

  • Creating multiple users — loop over a list of dicts, each with name, shell, groups keys.
  • Deploying multiple config files from templates — loop over a list of service names, rendering a distinct config per iteration.
  • Applying firewall rules — loop over a list of ports or CIDR blocks.
  • Controlling loop output: use loop_control.label to display a human-readable summary instead of the full dict in Ansible output. label only tidies the display; for dicts that carry passwords or tokens, also set no_log: true on the task.

    # --- Loop examples ---

    # 1. Create multiple system users from a list of dicts
    - name: Ensure service accounts exist
      ansible.builtin.user:
        name: "{{ item.name }}"
        shell: "{{ item.shell | default('/bin/bash') }}"
        groups: "{{ item.groups | default([]) }}"
        append: true
        system: "{{ item.system | default(false) }}"
        state: present
      loop:
        - { name: deploy, shell: /bin/bash, groups: [docker, sudo] }
        - { name: monitor, shell: /usr/sbin/nologin, system: true }
        - { name: backup, shell: /usr/sbin/nologin, system: true }
      loop_control:
        label: "{{ item.name }}"   # show only the name in output, not the full dict

    # 2. Open firewall ports (loop over a list of port numbers)
    - name: Open required ports in firewalld
      ansible.posix.firewalld:
        port: "{{ item }}/tcp"
        permanent: true
        state: enabled
        immediate: true
      loop:
        - 80
        - 443
        - 8080

    # 3. Deploy per-service configs from a single template
    - name: Deploy microservice configs
      ansible.builtin.template:
        src: templates/service.conf.j2
        dest: "/etc/myapp/{{ item.name }}.conf"
        owner: deploy
        mode: '0644'
      loop: "{{ microservices }}"   # microservices is a list var from group_vars
      loop_control:
        label: "{{ item.name }}"
      notify: Reload myapp

    # 4. Loop with index: useful when order matters
    - name: Write ordered config snippets
      ansible.builtin.copy:
        content: "{{ item.content }}"
        dest: "/etc/myapp/conf.d/{{ '%02d' | format(ansible_loop.index0) }}-{{ item.name }}.conf"
      loop: "{{ config_snippets }}"
      loop_control:
        extended: true   # enables ansible_loop.index0, ansible_loop.first, .last

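
When loop items contain secrets, a label alone is not a safeguard, because it only changes what the console prints per item. The sketch below pairs it with no_log. The app_users variable and its password_hash key are hypothetical names chosen for illustration.

```yaml
# Hypothetical variable: app_users is a list of dicts with name and password_hash keys
- name: Ensure application users exist without exposing password hashes
  ansible.builtin.user:
    name: "{{ item.name }}"
    password: "{{ item.password_hash }}"
    state: present
  loop: "{{ app_users }}"
  loop_control:
    label: "{{ item.name }}"   # readable label; does not protect data by itself
  no_log: true                 # keeps the task's arguments and results out of logs and output
```

no_log also hides output that is useful for debugging, so it is usually applied only to tasks that actually handle sensitive values.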
[Figure: Ansible when + loop + block/rescue control flow. Task execution starts; when evaluates the condition (false: the task is skipped as a no-op; true: continue); loop either iterates over items or runs the task once; block/rescue/always runs tasks, handles failure, and performs cleanup; the result is ok, changed, or failed.]
Ansible task control flow: when gates execution, loop iterates, and block/rescue/always structures error handling. when and loop can be combined on a single task; a block accepts when but not loop, so loops inside a block are set on its individual tasks.

Structured Error Handling with block, rescue, and always

Ansible's block/rescue/always construct is analogous to try/except/finally in Python. It is the right tool whenever a task failure should trigger compensating actions rather than halting the play. The pattern is common in large production environments: database schema migrations that need rollback on failure, service deployments that must deregister a host from a load balancer beforehand and re-register it afterward regardless of outcome, and API calls whose tokens or locks must be released even if the main operation aborts.

    # --- block / rescue / always example: deploy a service with rollback ---
    - name: Deploy application with automatic rollback
      hosts: app_servers
      tasks:
        - name: Application deployment with recovery
          block:
            # Tasks in block run in order; if any fails, the rest are skipped and rescue runs
            - name: Stop service for maintenance
              ansible.builtin.service:
                name: myapp
                state: stopped

            # Added so the rescue section has a previous binary to restore
            - name: Keep a copy of the current binary for rollback
              ansible.builtin.copy:
                src: /usr/local/bin/myapp
                dest: /usr/local/bin/myapp.prev
                remote_src: true
                mode: '0755'

            - name: Deploy new binary
              ansible.builtin.copy:
                src: "dist/myapp-{{ version }}"
                dest: /usr/local/bin/myapp
                mode: '0755'

            - name: Run database migrations
              ansible.builtin.command: /usr/local/bin/myapp migrate --yes
              register: migration_result

            - name: Start updated service
              ansible.builtin.service:
                name: myapp
                state: started

          rescue:
            # Runs ONLY if a task in block failed
            - name: Log the failure for incident tracking
              ansible.builtin.uri:
                url: "{{ ops_webhook_url }}"
                method: POST
                body_format: json
                body:
                  event: deploy_failed
                  host: "{{ inventory_hostname }}"
                  version: "{{ version }}"
                  error: "{{ ansible_failed_result.msg | default('unknown') }}"
              delegate_to: localhost

            - name: Restore previous binary
              ansible.builtin.copy:
                src: /usr/local/bin/myapp.prev
                dest: /usr/local/bin/myapp
                remote_src: true
                mode: '0755'

            - name: Start service on previous version
              ansible.builtin.service:
                name: myapp
                state: started

          always:
            # Runs EVERY time, success OR failure; ideal for cleanup
            - name: Re-enable health checks in load balancer
              ansible.builtin.uri:
                url: "{{ lb_api }}/hosts/{{ inventory_hostname }}/enable"
                method: PUT
                headers:
                  Authorization: "Bearer {{ lb_token }}"
              delegate_to: localhost

            - name: Record deployment attempt in audit log
              ansible.builtin.lineinfile:
                path: /var/log/deployments.log
                # Parentheses make the test-then-filter order explicit
                line: "{{ lookup('pipe', 'date -Iseconds') }} host={{ inventory_hostname }} version={{ version }} result={{ (ansible_failed_result is defined) | ternary('FAILED', 'OK') }}"
              delegate_to: localhost

Pro practice — ansible_failed_result: Inside a rescue block, Ansible automatically sets the ansible_failed_result magic variable to the result object of the task that failed. Always log this to your alerting system or incident tracker so on-call engineers have the exact error without needing to SSH into hosts. This is the primary source of structured failure data in Ansible-managed infrastructure.
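
To see what the rescue section receives, the following self-contained sketch forces a failure with the fail module and prints the failure details. The play runs against localhost, and the task names and message are invented for the demonstration. ansible_failed_task is the companion magic variable that holds the task that failed.

```yaml
- name: Demonstrate the failure details available in rescue
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Guarded operation
      block:
        - name: Simulate a failing step
          ansible.builtin.fail:
            msg: "simulated failure"
      rescue:
        - name: Report which task failed and why
          ansible.builtin.debug:
            msg:
              - "failed task: {{ ansible_failed_task.name }}"
              - "message: {{ ansible_failed_result.msg | default('n/a') }}"
```

In a real playbook the debug task would typically be replaced by a call to your alerting or ticketing endpoint, with the same two variables in the payload.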

Overriding Success Detection: failed_when and changed_when

Ansible decides whether a task succeeded or changed based on module-specific logic. For the command and shell modules, any non-zero exit code is a failure and any execution is a change — but that default is wrong for many real-world scripts. failed_when and changed_when let you inject your own logic using the registered result.

failed_when — override the failure condition. Common use cases:

  • A CLI tool exits non-zero for "not found" (exit code 1) but that is a valid state, not an error.
  • A script prints "ERROR" to stdout but exits 0, so Ansible would otherwise report a false success.
  • A check command should only fail if output contains a specific pattern.

changed_when — override the changed condition. Common use cases:

  • Idempotent scripts that print "already up to date" when nothing changed.
  • Validation or check commands that never mutate state (set to false).
  • Scripts that print "Applied N changes" — parse N from stdout to set changed accurately.

    # --- failed_when and changed_when examples ---

    # 1. Service check: exit 3 ("not running") is fine for our purposes
    - name: Check if legacy cron job is running
      ansible.builtin.command: systemctl is-active legacy-cron
      register: cron_status
      failed_when: cron_status.rc not in [0, 3]   # 0=active, 3=inactive; both OK
      changed_when: false                         # reading state never changes anything

    # 2. Script that embeds its own change signalling in stdout
    - name: Run idempotent database seeder
      ansible.builtin.command: python3 /opt/scripts/seed_db.py
      register: seed_result
      changed_when: "'rows inserted' in seed_result.stdout"
      failed_when:   # list items are ANDed: fail only if both are true
        - seed_result.rc != 0
        - "'already seeded' not in seed_result.stdout"   # exit 1 + this message = OK

    # 3. Custom CLI whose non-zero exit is acceptable when stderr carries WARN lines
    - name: Run data sync script
      ansible.builtin.command: /usr/local/bin/sync-data --dry-run={{ dry_run | bool }}
      register: sync_out
      failed_when:
        - sync_out.rc != 0
        - "'WARN' not in sync_out.stderr"   # WARNs in stderr are acceptable
      changed_when:
        - not (dry_run | bool)              # dry run mode is never a real change
        - "'Synced 0 records' not in sync_out.stdout"

    # 4. Package check: "not installed" (rc=1) is not an error here
    - name: Check if legacy package exists before removing
      ansible.builtin.command: rpm -q old-package-name
      register: rpm_check
      failed_when: rpm_check.rc not in [0, 1]
      changed_when: false

    - name: Remove legacy package if present
      ansible.builtin.dnf:
        name: old-package-name
        state: absent
      when: rpm_check.rc == 0

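
Two of the use cases listed above, a script that prints "ERROR" while exiting 0 and a script that reports "Applied N changes", could be handled roughly as follows. The script paths and output formats are hypothetical. The regex_search call with a '\\1' group reference returns a list of captured groups, and the default filter covers the case where nothing matches.

```yaml
# Hypothetical scripts and output formats, used only for illustration
- name: Run import script that exits 0 even when it logs an error
  ansible.builtin.command: /opt/scripts/import_data.sh
  register: import_result
  failed_when: import_result.rc != 0 or 'ERROR' in import_result.stdout
  changed_when: "'already up to date' not in import_result.stdout"

- name: Apply settings and derive "changed" from the reported count
  ansible.builtin.command: /opt/scripts/apply_settings.sh
  register: apply_result
  # Extract N from "Applied N changes"; treat a missing match as 0
  changed_when: >-
    (apply_result.stdout
    | regex_search('Applied ([0-9]+) changes', '\\1')
    | default(['0'], true)
    | first | int) > 0
```

Matching on a stable, machine-oriented line of output is more robust than matching free-form log text, so it helps if the script's output format is under your control.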
Production pitfall: overusing ignore_errors: true. A common shortcut is to add ignore_errors: true to a task that sometimes fails and "just move on." This is almost always wrong. It silently swallows real failures that should abort the play, and it prevents block/rescue from triggering because an ignored failure does not count as a failure of the block. Use failed_when to define precisely what a failure means for your script, and use block/rescue for compensating actions. Reserve ignore_errors for truly optional, best-effort tasks, and always register the result and log a warning afterward so the ignored failure stays visible in your output.
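
For the rare best-effort case, a minimal sketch of "ignore, but stay visible" looks like this. The warm-cache script is a hypothetical stand-in for any optional step.

```yaml
- name: Warm the application cache (best effort; hypothetical script)
  ansible.builtin.command: /usr/local/bin/warm-cache
  register: warm_cache
  changed_when: false
  ignore_errors: true

- name: Surface the ignored failure in the play output
  ansible.builtin.debug:
    msg: "WARNING: cache warm-up failed on {{ inventory_hostname }} (rc={{ warm_cache.rc | default('n/a') }})"
  when: warm_cache is failed   # the registered result still records the failure
```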

Combining All Four: A Production-Grade Pattern

In real playbooks these four mechanisms compose naturally. The play itself runs against each host in the inventory; when gates OS-specific steps; loop repeats a task over packages, files, or ports; a block/rescue wraps the mutation steps for rollback capability; and failed_when/changed_when normalize the exit semantics of custom scripts. The result resembles an SRE-style runbook expressed as code: the playbook itself is the audit trail, and every exit state is deterministic rather than relying on the human operator to know which non-zero exit codes are acceptable.
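
A compact sketch of the four mechanisms working together is shown below. The host group, script path, package list, and the meaning of exit code 2 are assumptions made for the example, not details from a real deployment.

```yaml
# All hostnames, paths, scripts, and exit-code meanings below are hypothetical
- name: Roll out application changes with guarded steps
  hosts: app_servers
  become: true
  tasks:
    - name: Configure RedHat-family hosts
      when: ansible_os_family == "RedHat"   # evaluated for every task in the block
      block:
        - name: Install required packages
          ansible.builtin.dnf:
            name: "{{ item }}"
            state: present
          loop:                             # loop is set per task, not on the block
            - nginx
            - python3

        - name: Apply schema changes with a custom script
          ansible.builtin.command: /opt/scripts/apply_schema.sh
          register: schema_result
          failed_when: schema_result.rc not in [0, 2]   # assume rc=2 means "nothing to do"
          changed_when: schema_result.rc == 0
      rescue:
        - name: Report the failure
          ansible.builtin.debug:
            msg: "Rollout failed in task '{{ ansible_failed_task.name }}'"
      always:
        - name: Note that the rollout attempt finished
          ansible.builtin.debug:
            msg: "Rollout attempt finished on {{ inventory_hostname }}"
```

To repeat a whole group of tasks per item, one option is to move the group into its own file and call it with include_tasks plus loop.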

Summary

Use when for branching on facts, registered results, and boolean variables, and apply | bool whenever the source might be a string. Use loop with loop_control.label to keep output readable when iterating over lists of dicts, and add no_log: true when those dicts contain secrets. Wrap mutation tasks in block/rescue/always for structured rollback and guaranteed cleanup, and capture ansible_failed_result in rescue for rich incident context. Override Ansible's default pass/fail and changed logic with failed_when/changed_when rather than silencing errors with ignore_errors. Together these patterns produce playbooks that are both deterministic and self-recovering, a prerequisite for trusting automation at production scale.