Conditionals, Loops & Error Handling
Ansible playbooks that only run straight-line tasks are rare in production. Real infrastructure has heterogeneous OS families, optional feature flags, retryable external calls, and partial-failure scenarios where aborting an entire play would cause more harm than recovering gracefully. This lesson covers the four mechanisms that give Ansible playbooks their expressive power: when for branching, loop for iteration, block/rescue/always for structured exception handling, and failed_when/changed_when for overriding Ansible's built-in success and change detection logic.
Conditionals with when
The when directive accepts a Jinja2 expression that evaluates to a boolean. When the expression is false, Ansible skips the task and reports it as skipped — not failed, not changed. This is distinct from a task that runs and does nothing; a skipped task did not execute at all.
Common patterns for when in production playbooks:
- OS-family branching: use ansible_os_family or ansible_distribution facts to select the correct package manager or service name. This is the single most common use of when in fleet automation.
- Registered variable tests: run a task, register its output, then conditionally act based on whether something was found, installed, or returned a particular exit code.
- Variable truthiness: gate entire configuration blocks on a boolean variable (enable_tls: true) that operators pass at runtime via -e or inventory group vars.
- Combining conditions: when accepts a list, which Ansible ANDs together; for OR logic use inline Jinja2 or.
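The patterns above can be sketched in a single play. This is a minimal illustration, not a reference implementation: the webservers group, the certificate path, and the enable_tls value are assumptions made for the example. It uses ansible_facts['os_family'], which holds the same fact as ansible_os_family.

```yaml
---
# Sketch only: the host group, paths, and variable values are hypothetical.
- name: Conditional task examples
  hosts: webservers                  # hypothetical inventory group
  become: true
  vars:
    enable_tls: "true"               # deliberately a string, as it would be from -e enable_tls=true
  tasks:
    - name: Install nginx on Debian-family hosts
      ansible.builtin.apt:
        name: nginx
        state: present
      when: ansible_facts['os_family'] == "Debian"

    - name: Install nginx on RedHat-family hosts
      ansible.builtin.dnf:
        name: nginx
        state: present
      when: ansible_facts['os_family'] == "RedHat"

    - name: Check whether a TLS certificate already exists
      ansible.builtin.stat:
        path: /etc/ssl/certs/example.pem   # hypothetical path
      register: cert_stat

    - name: Warn about a missing certificate only when TLS is enabled
      ansible.builtin.debug:
        msg: "TLS is enabled but no certificate was found"
      when:                          # list entries are ANDed together
        - enable_tls | bool          # converts the string "true"/"false" to a real boolean
        - not cert_stat.stat.exists

    - name: Match either of two distributions (OR inside one expression)
      ansible.builtin.debug:
        msg: "apt-based distribution detected"
      when: ansible_facts['distribution'] == "Debian" or ansible_facts['distribution'] == "Ubuntu"
```

Tasks whose condition is false appear as skipped in the play recap rather than ok or changed.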
Always apply the | bool filter when testing a variable that might be the string "true" or "false" (common when values come from environment variables, -e key=value arguments, or quoted values in YAML files loaded with include_vars). Without the filter, any non-empty string, including "false", is truthy in Python, so the condition can pass unexpectedly.
Iteration with loop
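The modern iteration directive is loop, added in Ansible 2.5 as the recommended replacement for most uses of the older with_items / with_dict / with_fileglob family. The with_* forms still work and have not been formally deprecated. loop requires a list: of scalars, of dicts, or the output of a lookup that returns a list. Inside the task body, the current iteration value is accessed as item. When iterating over a list of dicts, access fields by the keys you defined (for example item.name). To iterate over a single dict, convert it with the dict2items filter first, then use item.key and item.value.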
Production patterns for loop:
- Creating multiple users: loop over a list of dicts, each with name, shell, groups keys.
- Deploying multiple config files from templates: loop over a list of service names, rendering a distinct config per iteration.
- Applying firewall rules: loop over a list of ports or CIDR blocks.
- Controlling loop output: use loop_control.label to display a human-readable summary instead of the full dict in Ansible output. This matters most for dicts with passwords or tokens, but label only changes what is displayed; set no_log: true on the task when the loop data is genuinely sensitive.
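A short sketch of these loop patterns follows. The app_users and service_ports variables are hypothetical data invented for the example, and the groups named in them are assumed to exist on the target hosts. The second task shows the dict2items filter, which turns a dict into a list of key/value pairs so that loop can iterate over it.

```yaml
---
# Sketch only: user names, groups, and ports are hypothetical.
- name: Loop examples
  hosts: all
  become: true
  vars:
    app_users:
      - { name: deploy, shell: /bin/bash, groups: users }
      - { name: metrics, shell: /usr/sbin/nologin, groups: users }
    service_ports:
      web: 8080
      metrics: 9100
  tasks:
    - name: Create application users from a list of dicts
      ansible.builtin.user:
        name: "{{ item.name }}"
        shell: "{{ item.shell }}"
        groups: "{{ item.groups }}"
        append: true
      loop: "{{ app_users }}"
      loop_control:
        label: "{{ item.name }}"     # print only the user name per iteration

    - name: Iterate over a dict by converting it to a list first
      ansible.builtin.debug:
        msg: "{{ item.key }} listens on port {{ item.value }}"
      loop: "{{ service_ports | dict2items }}"
```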
Structured Error Handling with block, rescue, and always
Ansible's block/rescue/always construct maps closely to try/except/finally in Python: rescue tasks run only if a task in the block fails, and always tasks run regardless of outcome. This is the correct tool for any situation where a task failure should trigger compensating actions rather than halting the play. The pattern recurs throughout large production environments: database schema migrations that need rollback on failure, service deployments that must deregister from a load balancer beforehand and re-register afterward regardless of outcome, and API calls whose tokens or locks must be released even if the main operation aborts.
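The deployment scenario described above might look roughly like this. The lb-ctl and migrate commands and the appservers group are hypothetical placeholders, not real tools. The sketch also reads ansible_failed_task and ansible_failed_result, the variables Ansible sets for tasks in a rescue section.

```yaml
---
# Sketch only: every command path below is a hypothetical placeholder.
- name: Deploy with rollback and guaranteed cleanup
  hosts: appservers                  # hypothetical inventory group
  tasks:
    - name: Deploy new release
      block:
        - name: Drain this host from the load balancer
          ansible.builtin.command: "/usr/local/bin/lb-ctl drain {{ inventory_hostname }}"

        - name: Run the schema migration
          ansible.builtin.command: /opt/app/bin/migrate up

      rescue:
        - name: Report which task failed and why
          ansible.builtin.debug:
            msg: >-
              Task '{{ ansible_failed_task.name }}' failed:
              {{ ansible_failed_result.msg | default('no message available') }}

        - name: Roll back the schema migration
          ansible.builtin.command: /opt/app/bin/migrate down

        - name: Keep the host marked as failed after the rollback
          ansible.builtin.fail:
            msg: "Deployment failed and was rolled back"

      always:
        - name: Re-register this host with the load balancer
          ansible.builtin.command: "/usr/local/bin/lb-ctl enable {{ inventory_hostname }}"
```

If every rescue task succeeds, Ansible treats the host as recovered and continues the play. The explicit fail task is a design choice for cases where a rolled-back deployment should still count as a failure; the always section runs either way.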
ansible_failed_result: Inside a rescue block, Ansible automatically sets the ansible_failed_result magic variable to the result object of the task that failed. Always log this to your alerting system or incident tracker so on-call engineers have the exact error without needing to SSH into hosts. This is the primary source of structured failure data in Ansible-managed infrastructure.
Overriding Success Detection: failed_when and changed_when
Ansible decides whether a task succeeded or changed based on module-specific logic. For the command and shell modules, any non-zero exit code is a failure and any execution is a change — but that default is wrong for many real-world scripts. failed_when and changed_when let you inject your own logic using the registered result.
failed_when — override the failure condition. Common use cases:
- A CLI tool exits non-zero for "not found" (exit code 1) but that is a valid state, not an error.
- A script prints "ERROR" to stdout but exits 0, so by default it always appears to succeed.
- A check command should only fail if output contains a specific pattern.
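The cases above could be expressed as follows. This is a sketch under assumptions: sync-data.sh is a hypothetical script, and the grep task relies on grep's documented convention of exiting 1 when nothing matches and 2 on a real error.

```yaml
---
# Sketch only: the script path is a hypothetical placeholder.
- name: failed_when examples
  hosts: all
  tasks:
    - name: Look for a risky SSH option (exit 1 just means "not found")
      ansible.builtin.command: grep -c "PermitRootLogin yes" /etc/ssh/sshd_config
      register: grep_result
      failed_when: grep_result.rc not in [0, 1]   # only rc >= 2 is a real error
      changed_when: false                         # read-only check

    - name: Run a script that reports errors on stdout but exits 0
      ansible.builtin.command: /usr/local/bin/sync-data.sh
      register: sync_result
      # failed_when replaces the default rc check, so the rc test is repeated here
      failed_when: sync_result.rc != 0 or 'ERROR' in sync_result.stdout
```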
changed_when — override the changed condition. Common use cases:
- Idempotent scripts that print "already up to date" when nothing changed.
- Validation or check commands that never mutate state (set to false).
- Scripts that print "Applied N changes": parse N from stdout to set changed accurately.
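A sketch of the three changed_when cases follows. The update-app.sh and apply-settings.sh scripts and their output formats are assumptions made for the example. The last task uses the regex_findall filter to extract N and reports a change only when N is greater than zero.

```yaml
---
# Sketch only: script paths and their output formats are hypothetical.
- name: changed_when examples
  hosts: all
  tasks:
    - name: Validate nginx configuration (never changes anything)
      ansible.builtin.command: nginx -t
      changed_when: false

    - name: Run an update script that prints "already up to date" on a no-op
      ansible.builtin.command: /usr/local/bin/update-app.sh
      register: update_result
      changed_when: "'already up to date' not in update_result.stdout"

    - name: Run a script that prints "Applied N changes"
      ansible.builtin.command: /usr/local/bin/apply-settings.sh
      register: apply_result
      # regex_findall returns the captured digits as a list of strings;
      # an empty list sums to 0, so missing output counts as "not changed"
      changed_when: >-
        (apply_result.stdout
         | regex_findall('Applied ([0-9]+) changes')
         | map('int') | sum) > 0
```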
ignore_errors: true: A common shortcut is to add ignore_errors: true to a task that sometimes fails and "just move on." This is almost always wrong. It silently swallows real failures that should abort the play, and it prevents block/rescue from triggering because the task is considered succeeded. Use failed_when to precisely define what a failure means for your script, and use block/rescue for compensating actions. Reserve ignore_errors only for truly optional, best-effort tasks, and always register the result and log a warning message afterward so the ignored failure is visible in your output.
Combining All Four: A Production-Grade Pattern
In real playbooks these four mechanisms compose naturally. A loop iterates over the items to configure; when gates OS-specific steps; a block/rescue wraps the mutation steps for rollback capability; and failed_when/changed_when normalize the exit semantics of custom scripts. This suits SRE-style runbooks automated with Ansible: the playbook itself is the audit trail, and every exit state is deterministic rather than relying on the human operator to know which non-zero exit codes are acceptable.
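One possible composition is sketched below. The service names, template files, myapp-apply script, its exit code convention (2 meaning "nothing to do"), and the lock file path are all hypothetical. Because loop cannot be attached to a block itself, the loop sits on a task inside the block, while when is set on the block and so applies to each task in it.

```yaml
---
# Sketch only: services, templates, script, and lock path are hypothetical.
- name: Roll out service configuration with recovery
  hosts: appservers                        # hypothetical inventory group
  become: true
  vars:
    services: [api, worker]
  tasks:
    - name: Configure services on Debian-family hosts
      when: ansible_facts['os_family'] == "Debian"
      block:
        - name: Render one config file per service
          ansible.builtin.template:
            src: "{{ item }}.conf.j2"
            dest: "/etc/myapp/{{ item }}.conf"
            backup: true
          loop: "{{ services }}"

        - name: Apply the configuration with a custom script
          ansible.builtin.command: /usr/local/bin/myapp-apply
          register: apply_result
          failed_when: apply_result.rc not in [0, 2]   # assumed: 2 = nothing to do
          changed_when: apply_result.rc == 0

      rescue:
        - name: Record the failure details
          ansible.builtin.debug:
            var: ansible_failed_result

        - name: Mark the host as failed
          ansible.builtin.fail:
            msg: "Configuration rollout failed on {{ inventory_hostname }}"

      always:
        - name: Release the deployment lock
          ansible.builtin.file:
            path: /var/lock/myapp-deploy.lock
            state: absent
```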
Summary
Use when for branching on facts, registered results, and boolean variables, and always apply | bool when the source might be a string. Use loop with loop_control.label to iterate cleanly over lists of dicts and keep output readable, adding no_log: true when the loop data contains sensitive values. Wrap mutation tasks in block/rescue/always for structured rollback and guaranteed cleanup, and capture ansible_failed_result in rescue for rich incident context. Override Ansible's default pass/fail and changed logic with failed_when/changed_when rather than silencing errors with ignore_errors. Together these patterns produce playbooks that are both deterministic and self-recovering, a prerequisite for trusting automation at production scale.