Python for DevOps Automation

Project: An Ops Automation CLI

Every concept taught in this tutorial (argument parsing, subprocess calls, JSON/YAML handling, HTTP clients, error handling, logging, cloud SDKs, concurrency, and testing) converges in this final lesson into a single, shippable product: a command-line tool named infracheck. It audits infrastructure health by querying live APIs and produces a structured, human-readable report. This is a common shape for the internal tools that SRE teams build, maintain, and run from CI pipelines.

What infracheck Does

infracheck is a small but complete CLI tool that accepts a target environment (dev / staging / prod), queries a configurable set of endpoints and cloud resources, scores each check as PASS / WARN / FAIL, and prints a formatted report — with an optional JSON output mode suitable for piping into dashboards or alerting systems. The design mirrors real internal tools: composable checks, pluggable backends, machine-readable output, and a non-zero exit code on failure so CI breaks correctly.

[Figure: infracheck architecture. The CLI (main() / Click) hands the checks to CheckRunner (ThreadPoolExecutor), which runs the HTTP check (requests / status code), AWS check (boto3 / EC2 / ELB), DNS check (socket / resolve), and TLS check (ssl / cert expiry) in parallel, then collects, scores, and aggregates the results. A Reporter formats the aggregated results as text or JSON.]

Project Layout

Structure the project as a proper Python package from day one. Even a small tool earns a pyproject.toml and a tests/ directory — it pays dividends the moment a second engineer touches it.

    infracheck/
    ├── pyproject.toml
    ├── README.md
    ├── infracheck/
    │   ├── __init__.py
    │   ├── cli.py          # Click entry point
    │   ├── runner.py       # CheckRunner with thread pool
    │   ├── checks/
    │   │   ├── __init__.py
    │   │   ├── http.py
    │   │   ├── aws.py
    │   │   ├── dns.py
    │   │   └── tls.py
    │   ├── reporter.py     # Text + JSON formatters
    │   └── config.py       # Load env YAML config
    └── tests/
        ├── test_checks.py
        └── test_reporter.py

The Core Abstractions

Every check is a callable that returns a CheckResult dataclass. This uniform interface is what lets the runner collect results from HTTP, AWS, DNS, and TLS checks without knowing the implementation details of any of them — a classic strategy pattern.

    # infracheck/runner.py
    from __future__ import annotations

    import logging
    from concurrent.futures import ThreadPoolExecutor, as_completed
    from dataclasses import dataclass, field
    from enum import Enum
    from typing import Callable, List

    logger = logging.getLogger(__name__)


    class Status(str, Enum):
        PASS = "PASS"
        WARN = "WARN"
        FAIL = "FAIL"


    @dataclass
    class CheckResult:
        name: str
        status: Status
        message: str
        duration_ms: float = 0.0
        metadata: dict = field(default_factory=dict)


    # A check is any zero-argument callable that returns a CheckResult.
    CheckFn = Callable[[], CheckResult]


    class CheckRunner:
        def __init__(self, checks: List[CheckFn], max_workers: int = 10):
            self._checks = checks
            self._max_workers = max_workers

        def run(self) -> List[CheckResult]:
            results: List[CheckResult] = []
            with ThreadPoolExecutor(max_workers=self._max_workers) as pool:
                # Map each future back to its check name for error reporting.
                futures = {pool.submit(fn): fn.__name__ for fn in self._checks}
                for future in as_completed(futures):
                    name = futures[future]
                    try:
                        # as_completed() only yields finished futures, so result()
                        # returns immediately; each check enforces its own timeout.
                        results.append(future.result())
                    except Exception as exc:
                        # logger.exception records the traceback along with the message.
                        logger.exception("Check %s raised: %s", name, exc)
                        results.append(CheckResult(
                            name=name,
                            status=Status.FAIL,
                            message=f"Unhandled exception: {exc}",
                        ))
            # Sort by name so the report order is stable across runs.
            return sorted(results, key=lambda r: r.name)

Writing a Check: HTTP Health Endpoint

The HTTP check is the simplest and most common: hit a health endpoint, verify the status code and optionally a body key. Notice the pattern — measure wall-clock time, catch at the right level of granularity, return a typed result instead of raising.

    # infracheck/checks/http.py
    from __future__ import annotations

    import time
    from typing import Callable

    import requests

    from infracheck.runner import CheckResult, Status


    def make_http_check(name: str, url: str, timeout: int = 10,
                        expected_status: int = 200,
                        body_contains: str | None = None) -> Callable[[], CheckResult]:
        """Factory: returns a zero-argument check function."""
        def check() -> CheckResult:
            t0 = time.monotonic()
            try:
                resp = requests.get(url, timeout=timeout,
                                    headers={"User-Agent": "infracheck/1.0"})
                duration_ms = (time.monotonic() - t0) * 1000

                # Wrong status code is a hard failure.
                if resp.status_code != expected_status:
                    return CheckResult(
                        name=name, status=Status.FAIL, duration_ms=duration_ms,
                        message=f"Expected {expected_status}, got {resp.status_code}",
                    )
                # Right status but unexpected body is only a warning.
                if body_contains and body_contains not in resp.text:
                    return CheckResult(
                        name=name, status=Status.WARN, duration_ms=duration_ms,
                        message=f"Body missing '{body_contains}'",
                    )
                # Healthy but slow (over 2 seconds) is downgraded to WARN.
                latency_status = Status.WARN if duration_ms > 2000 else Status.PASS
                return CheckResult(
                    name=name, status=latency_status, duration_ms=duration_ms,
                    message=f"{resp.status_code} in {duration_ms:.0f}ms",
                )
            except requests.Timeout:
                return CheckResult(name=name, status=Status.FAIL,
                                   duration_ms=(time.monotonic() - t0) * 1000,
                                   message=f"Timed out after {timeout}s")
            except requests.ConnectionError as exc:
                return CheckResult(name=name, status=Status.FAIL,
                                   duration_ms=(time.monotonic() - t0) * 1000,
                                   message=f"Connection error: {exc}")

        # The runner uses __name__ to label results, so expose the check name.
        check.__name__ = name
        return check
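The same factory pattern extends to the other check modules in the layout. The lesson does not show dns.py, so the following is only a sketch of what a DNS check might look like, assuming the same CheckResult shape. The function name make_dns_check and the injectable resolver parameter are hypothetical, and Status and CheckResult are redefined here as stand-ins so the example runs on its own.

```python
# Hypothetical sketch of infracheck/checks/dns.py; not taken from the lesson.
from __future__ import annotations

import socket
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Status(str, Enum):  # stand-in for infracheck.runner.Status
    PASS = "PASS"
    WARN = "WARN"
    FAIL = "FAIL"


@dataclass
class CheckResult:  # stand-in for infracheck.runner.CheckResult
    name: str
    status: Status
    message: str
    duration_ms: float = 0.0


def make_dns_check(name: str, hostname: str,
                   resolver: Callable = socket.getaddrinfo) -> Callable[[], CheckResult]:
    """Factory: returns a zero-argument check that resolves `hostname`.

    `resolver` is injectable (an assumption of this sketch) so tests can
    avoid real network lookups.
    """
    def check() -> CheckResult:
        t0 = time.monotonic()
        try:
            infos = resolver(hostname, None)
        except socket.gaierror as exc:
            return CheckResult(name, Status.FAIL,
                               f"Resolution failed: {exc}",
                               (time.monotonic() - t0) * 1000)
        duration_ms = (time.monotonic() - t0) * 1000
        # Each getaddrinfo entry ends with a sockaddr tuple; its first item is the IP.
        addresses = sorted({info[4][0] for info in infos})
        if not addresses:
            return CheckResult(name, Status.FAIL, "No addresses returned", duration_ms)
        return CheckResult(name, Status.PASS,
                           f"Resolved to {', '.join(addresses)}", duration_ms)

    check.__name__ = name
    return check
```

Because the factory returns a plain zero-argument callable, a check built this way could be appended to the same list the runner receives, alongside the HTTP checks.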

The CLI Entry Point

The CLI layer, built with Click (from Lesson 6), reads an environment config from YAML, assembles the check list, calls the runner, and delegates formatting to the reporter. Exit code 1 on any FAIL is critical — it is what causes a CI pipeline step to break and page an on-call engineer.

    # infracheck/cli.py
    import sys

    import click
    import yaml

    from infracheck.runner import CheckRunner, Status
    from infracheck.checks.http import make_http_check
    from infracheck.reporter import TextReporter, JsonReporter


    @click.command()
    @click.option("--env", required=True,
                  type=click.Choice(["dev", "staging", "prod"]),
                  help="Target environment to audit.")
    @click.option("--config", "config_path", default="infracheck.yaml",
                  show_default=True, help="Path to YAML config file.")
    @click.option("--output", default="text",
                  type=click.Choice(["text", "json"]),
                  help="Output format.")
    @click.option("--fail-on-warn", is_flag=True, default=False,
                  help="Exit 1 on WARN as well as FAIL.")
    def main(env: str, config_path: str, output: str, fail_on_warn: bool) -> None:
        """infracheck: audit infrastructure health and report."""
        with open(config_path) as fh:
            # safe_load returns None for an empty file, so fall back to {}.
            cfg = yaml.safe_load(fh) or {}

        env_cfg = cfg.get("environments", {}).get(env, {})

        checks = []
        for item in env_cfg.get("http_checks", []):
            checks.append(make_http_check(
                name=item["name"],
                url=item["url"],
                timeout=item.get("timeout", 10),
                expected_status=item.get("expected_status", 200),
                body_contains=item.get("body_contains"),
            ))

        # An environment with no checks would otherwise exit 0 and look healthy.
        if not checks:
            raise click.ClickException(f"No checks configured for environment '{env}'")

        runner = CheckRunner(checks, max_workers=cfg.get("max_workers", 10))
        results = runner.run()

        reporter = JsonReporter() if output == "json" else TextReporter()
        reporter.print(results)

        has_fail = any(r.status == Status.FAIL for r in results)
        has_warn = any(r.status == Status.WARN for r in results)
        if has_fail or (fail_on_warn and has_warn):
            sys.exit(1)


    if __name__ == "__main__":
        main()
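The CLI imports TextReporter and JsonReporter from reporter.py, which the lesson does not show. One possible shape for those classes is sketched below, assuming only what the CLI code requires: a print(results) method on each. The output layout, the summary field, and the optional stream argument are assumptions of this sketch, and Status and CheckResult are redefined as stand-ins so the example runs on its own.

```python
# Hypothetical sketch of infracheck/reporter.py; the lesson does not show this file.
from __future__ import annotations

import json
import sys
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import List, Optional, TextIO


class Status(str, Enum):  # stand-in for infracheck.runner.Status
    PASS = "PASS"
    WARN = "WARN"
    FAIL = "FAIL"


@dataclass
class CheckResult:  # stand-in for infracheck.runner.CheckResult
    name: str
    status: Status
    message: str
    duration_ms: float = 0.0
    metadata: dict = field(default_factory=dict)


def _summary(results: List[CheckResult]) -> dict:
    """Count results per status, e.g. {"PASS": 3, "WARN": 0, "FAIL": 1}."""
    return {s.value: sum(1 for r in results if r.status is s) for s in Status}


class TextReporter:
    def __init__(self, stream: Optional[TextIO] = None):
        # Resolve sys.stdout lazily so output redirection in tests still works.
        self._stream = stream if stream is not None else sys.stdout

    def print(self, results: List[CheckResult]) -> None:
        for r in results:
            self._stream.write(
                f"[{r.status.value:<4}] {r.name:<20} {r.duration_ms:>7.0f}ms  {r.message}\n"
            )
        counts = _summary(results)
        self._stream.write(
            f"{len(results)} checks: {counts['PASS']} passed, "
            f"{counts['WARN']} warnings, {counts['FAIL']} failed\n"
        )


class JsonReporter:
    def __init__(self, stream: Optional[TextIO] = None):
        self._stream = stream if stream is not None else sys.stdout

    def print(self, results: List[CheckResult]) -> None:
        payload = {
            "summary": _summary(results),
            # Convert the enum to its plain string value for serialization.
            "results": [{**asdict(r), "status": r.status.value} for r in results],
        }
        json.dump(payload, self._stream, indent=2)
        self._stream.write("\n")
```

Both classes consume the same list of results, so adding another format (for example, a hypothetical Markdown reporter) would not touch the runner or the checks.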

Config File and pyproject.toml

The YAML config decouples the tool logic from environment-specific values. Never hard-code URLs, timeouts, or credentials in source code. With external config, the same installed package runs against dev and prod without any code changes.

    # infracheck.yaml
    max_workers: 10

    environments:
      prod:
        http_checks:
          - name: api-health
            url: https://api.example.com/health
            expected_status: 200
            body_contains: "\"status\":\"ok\""
            timeout: 10
          - name: cdn-health
            url: https://cdn.example.com/ping
            expected_status: 200
            timeout: 5
      staging:
        http_checks:
          - name: staging-api
            url: https://staging.api.example.com/health
            expected_status: 200
            timeout: 15

    # pyproject.toml
    [build-system]
    requires = ["hatchling"]
    build-backend = "hatchling.build"

    [project]
    name = "infracheck"
    version = "1.0.0"
    requires-python = ">=3.11"
    dependencies = [
        "click>=8.1",
        "requests>=2.31",
        "boto3>=1.34",
        "pyyaml>=6.0",
    ]

    [project.optional-dependencies]
    dev = ["pytest", "pytest-mock", "responses", "ruff", "mypy"]

    [project.scripts]
    infracheck = "infracheck.cli:main"
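Because the config file is edited by hand, it can help to validate it before building any checks. The sketch below is one way to do that for the YAML structure shown above. The function name validate_config and the error wording are hypothetical, and it operates on the dictionary that yaml.safe_load would return rather than on the file itself.

```python
# Hypothetical config validation helper; not part of the lesson's code.
from __future__ import annotations

from typing import Any, List

# Keys the CLI reads with item["..."], so their absence would raise KeyError.
REQUIRED_HTTP_KEYS = ("name", "url")


def validate_config(cfg: Any, env: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means usable."""
    if not isinstance(cfg, dict):
        return ["config root must be a mapping"]

    envs = cfg.get("environments")
    if not isinstance(envs, dict) or env not in envs:
        return [f"no 'environments.{env}' section"]

    env_cfg = envs[env]
    if not isinstance(env_cfg, dict):
        return [f"'environments.{env}' must be a mapping"]

    http_checks = env_cfg.get("http_checks") or []
    if not isinstance(http_checks, list):
        return [f"'environments.{env}.http_checks' must be a list"]

    errors: List[str] = []
    if not http_checks:
        errors.append(f"'environments.{env}' defines no http_checks")

    for i, item in enumerate(http_checks):
        where = f"environments.{env}.http_checks[{i}]"
        if not isinstance(item, dict):
            errors.append(f"{where}: must be a mapping")
            continue
        for key in REQUIRED_HTTP_KEYS:
            if key not in item:
                errors.append(f"{where}: missing '{key}'")
    return errors
```

A caller could print each returned message and exit non-zero when the list is not empty, which keeps a malformed config from being reported as a healthy run.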

Running in CI

Once installed (pip install -e .), infracheck runs as a first-class CLI command from any GitHub Actions step or Jenkins stage. The non-zero exit code on failure automatically fails the step, which blocks a deployment or pages the on-call team via the CI alerting integration.

    # .github/workflows/infra-audit.yml
    name: Infrastructure Audit

    on:
      schedule:
        - cron: "*/15 * * * *"   # every 15 minutes
      workflow_dispatch:

    jobs:
      audit:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-python@v5
            with:
              python-version: "3.12"
          - run: pip install -e ".[dev]"
          - name: Run infracheck (prod)
            env:
              AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
              AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
              AWS_DEFAULT_REGION: us-east-1
            run: |
              # Without pipefail, the pipe through tee hides infracheck's exit code,
              # so the second (text) run decides whether the step fails.
              infracheck --env prod --output json | tee audit-report.json
              infracheck --env prod --fail-on-warn
          - uses: actions/upload-artifact@v4
            if: always()
            with:
              name: audit-report
              path: audit-report.json
Key design principle: the same check logic feeds two outputs. The workflow above runs infracheck once with --output json to produce a machine-readable artifact for downstream automation, and once in text mode for the engineer reading the CI log. Separating "run logic" from "render logic" (the Reporter classes) is what makes this possible without duplicating code.
Production practice: pin your tool's dependencies in a requirements-lock.txt generated by pip-compile (pip-tools). A floating requests>=2.31 in pyproject.toml is fine for the published package, but CI must install from the lock file so a new upstream release never silently breaks your audit job at 2 AM.
Common failure mode: tools that exit 0 on unhandled exceptions. If a boto3 call raises a NoCredentialsError and your top-level exception handler swallows it and exits 0, your CI pipeline will report "infrastructure healthy" while your AWS checks never ran. Always log the traceback, emit a FAIL result, and propagate a non-zero exit code. The CheckRunner above handles the second step: the try/except around future.result() in run() converts any exception raised inside a check into a FAIL result rather than silently discarding it, and the CLI then exits 1 because a FAIL is present.
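The failure mode above can be exercised in isolation. The sketch below uses a condensed stand-in for CheckRunner.run() and a hypothetical exit_code() helper that mirrors the exit logic at the end of main(). Neither function is part of the lesson's code, and the two toy checks exist only for illustration.

```python
# Condensed, self-contained illustration; run_checks and exit_code are hypothetical.
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Status(str, Enum):  # stand-in for infracheck.runner.Status
    PASS = "PASS"
    WARN = "WARN"
    FAIL = "FAIL"


@dataclass
class CheckResult:  # stand-in for infracheck.runner.CheckResult
    name: str
    status: Status
    message: str


def run_checks(checks: List[Callable[[], CheckResult]]) -> List[CheckResult]:
    """Run checks in a thread pool; any exception becomes a FAIL result."""
    results: List[CheckResult] = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(fn): fn.__name__ for fn in checks}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception as exc:
                results.append(CheckResult(futures[future], Status.FAIL,
                                           f"Unhandled exception: {exc}"))
    return sorted(results, key=lambda r: r.name)


def exit_code(results: List[CheckResult], fail_on_warn: bool = False) -> int:
    """Return 1 on any FAIL (or WARN when fail_on_warn is set), else 0."""
    has_fail = any(r.status is Status.FAIL for r in results)
    has_warn = any(r.status is Status.WARN for r in results)
    return 1 if has_fail or (fail_on_warn and has_warn) else 0


def healthy() -> CheckResult:
    return CheckResult("healthy", Status.PASS, "ok")


def broken() -> CheckResult:
    # Simulates a check that crashes instead of returning a result.
    raise RuntimeError("credentials not found")


results = run_checks([healthy, broken])
code = exit_code(results)  # non-zero because the crashed check was recorded as FAIL
```

A test along these lines in tests/ would catch a regression where a crashing check is dropped from the results and the run is reported as healthy.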

What You Have Built

Across this tutorial you went from a blank Python environment to a CLI tool that brings together the core practices of day-to-day DevOps automation: configuration-driven behavior, parallel execution, typed results, formatted output, proper error handling, installability, and CI integration. infracheck is not a toy. With additional check modules (database connectivity, Kubernetes pod readiness, certificate expiry, DNS propagation), it becomes the kind of internal tool that SRE teams run continuously against every environment.