Every concept taught in this tutorial (argument parsing, subprocess calls, JSON/YAML handling, HTTP clients, error handling, logging, cloud SDKs, concurrency, and testing) converges in this final lesson into a single, shippable product: a command-line tool named infracheck. It audits infrastructure health by querying live APIs and produces a structured, human-readable report. Tools of this shape are what SRE teams commonly build, maintain, and run from CI pipelines.
What infracheck Does
infracheck is a small but complete CLI tool that accepts a target environment (dev / staging / prod), queries a configurable set of endpoints and cloud resources, scores each check as PASS / WARN / FAIL, and prints a formatted report — with an optional JSON output mode suitable for piping into dashboards or alerting systems. The design mirrors real internal tools: composable checks, pluggable backends, machine-readable output, and a non-zero exit code on failure so CI breaks correctly.
Figure: infracheck architecture. The CLI dispatches checks in parallel via a thread pool, and a Reporter formats the aggregated results.
Project Layout
Structure the project as a proper Python package from day one. Even a small tool earns a pyproject.toml and a tests/ directory — it pays dividends the moment a second engineer touches it.
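A layout consistent with the module paths used in this lesson might look like the following. The package modules, pyproject.toml, and tests/ come from the lesson itself; the __init__.py files, the test file names, and the location of infracheck.yaml are assumptions for illustration.

```text
infracheck/
├── pyproject.toml
├── infracheck.yaml            # assumed location of the default config
├── infracheck/
│   ├── __init__.py
│   ├── cli.py
│   ├── runner.py
│   ├── reporter.py
│   └── checks/
│       ├── __init__.py
│       └── http.py
└── tests/
    ├── test_runner.py         # hypothetical file names
    └── test_http_check.py
```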
Every check is a zero-argument callable that returns a CheckResult instance. This uniform interface is what lets the runner collect results from HTTP, AWS, DNS, and TLS checks without knowing the implementation details of any of them: a classic strategy pattern.
# infracheck/runner.py
from __future__ import annotations

import logging
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List

logger = logging.getLogger(__name__)


class Status(str, Enum):
    PASS = "PASS"
    WARN = "WARN"
    FAIL = "FAIL"


@dataclass
class CheckResult:
    name: str
    status: Status
    message: str
    duration_ms: float = 0.0
    metadata: dict = field(default_factory=dict)


# A check is any zero-argument callable that returns a CheckResult.
CheckFn = Callable[[], CheckResult]


class CheckRunner:
    def __init__(self, checks: List[CheckFn], max_workers: int = 10):
        self._checks = checks
        self._max_workers = max_workers

    def run(self) -> List[CheckResult]:
        results: List[CheckResult] = []
        with ThreadPoolExecutor(max_workers=self._max_workers) as pool:
            # Map each future to its check name so failures can be attributed.
            futures = {
                pool.submit(fn): getattr(fn, "__name__", repr(fn))
                for fn in self._checks
            }
            for future in as_completed(futures):
                name = futures[future]
                try:
                    # The future is already done here, so result() does not
                    # block; each check must enforce its own timeout.
                    results.append(future.result())
                except Exception as exc:
                    # logger.exception records the traceback with the message.
                    logger.exception("Check %s raised", name)
                    results.append(CheckResult(
                        name=name,
                        status=Status.FAIL,
                        message=f"Unhandled exception: {exc}",
                    ))
        # Sort by name so the report order is deterministic.
        return sorted(results, key=lambda r: r.name)
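To see the runner's error handling in isolation, the following self-contained sketch condenses run() into a single function and feeds it one healthy check and one that raises. The names run_checks, api_ok, and db_broken are hypothetical, and the CheckResult here is a trimmed stand-in that uses plain strings for status.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass


@dataclass
class CheckResult:  # trimmed stand-in for the lesson's CheckResult
    name: str
    status: str
    message: str


def run_checks(checks, max_workers=4):
    """Condensed, illustrative version of CheckRunner.run()."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn): fn.__name__ for fn in checks}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:
                # A crashing check becomes a FAIL result instead of vanishing.
                results.append(
                    CheckResult(name, "FAIL", f"Unhandled exception: {exc}"))
    return sorted(results, key=lambda r: r.name)


def api_ok():  # hypothetical check that succeeds
    return CheckResult("api_ok", "PASS", "200 in 12ms")


def db_broken():  # hypothetical check that raises instead of returning
    raise RuntimeError("connection refused")


results = run_checks([db_broken, api_ok])
for r in results:
    print(f"{r.status:<4} {r.name}: {r.message}")
# PASS api_ok: 200 in 12ms
# FAIL db_broken: Unhandled exception: connection refused
```

Both checks appear in the output regardless of completion order, because the results are sorted by name after collection.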
Writing a Check: HTTP Health Endpoint
The HTTP check is the simplest and most common: hit a health endpoint, verify the status code and optionally a body key. Notice the pattern — measure wall-clock time, catch at the right level of granularity, return a typed result instead of raising.
# infracheck/checks/http.py
from __future__ import annotations

import time

import requests

from infracheck.runner import CheckFn, CheckResult, Status


def make_http_check(name: str, url: str, timeout: int = 10,
                    expected_status: int = 200,
                    body_contains: str | None = None) -> CheckFn:
    """Factory: returns a zero-argument check function."""

    def check() -> CheckResult:
        t0 = time.monotonic()
        try:
            resp = requests.get(url, timeout=timeout,
                                headers={"User-Agent": "infracheck/1.0"})
            duration_ms = (time.monotonic() - t0) * 1000
            # Wrong status code is a hard failure.
            if resp.status_code != expected_status:
                return CheckResult(
                    name=name, status=Status.FAIL, duration_ms=duration_ms,
                    message=f"Expected {expected_status}, got {resp.status_code}",
                )
            # Right status but unexpected body is only a warning.
            if body_contains and body_contains not in resp.text:
                return CheckResult(
                    name=name, status=Status.WARN, duration_ms=duration_ms,
                    message=f"Body missing '{body_contains}'",
                )
            # Healthy but slow (over 2 seconds) is downgraded to WARN.
            latency_status = Status.WARN if duration_ms > 2000 else Status.PASS
            return CheckResult(
                name=name, status=latency_status, duration_ms=duration_ms,
                message=f"{resp.status_code} in {duration_ms:.0f}ms",
            )
        except requests.Timeout:
            return CheckResult(name=name, status=Status.FAIL,
                               duration_ms=(time.monotonic() - t0) * 1000,
                               message=f"Timed out after {timeout}s")
        except requests.ConnectionError as exc:
            return CheckResult(name=name, status=Status.FAIL,
                               duration_ms=(time.monotonic() - t0) * 1000,
                               message=f"Connection error: {exc}")

    # The runner labels results and log lines with the function's __name__.
    check.__name__ = name
    return check
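The same factory pattern carries over to other check types. As a rough sketch, a DNS resolution check could look like the following. The function make_dns_check and its resolver parameter are hypothetical and not part of the lesson's code; the resolver is injectable only so the check can be exercised without network access, and CheckResult is a trimmed stand-in with a plain-string status.

```python
import socket
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:  # trimmed stand-in for infracheck.runner.CheckResult
    name: str
    status: str
    message: str
    duration_ms: float = 0.0


def make_dns_check(name: str, hostname: str,
                   resolver: Callable = socket.getaddrinfo) -> Callable[[], CheckResult]:
    """Hypothetical factory: returns a check that resolves `hostname`."""

    def check() -> CheckResult:
        t0 = time.monotonic()
        try:
            infos = resolver(hostname, None)
        except socket.gaierror as exc:
            return CheckResult(name, "FAIL", f"Resolution failed: {exc}",
                               (time.monotonic() - t0) * 1000)
        duration_ms = (time.monotonic() - t0) * 1000
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the IP address for both IPv4 and IPv6.
        addresses = sorted({info[4][0] for info in infos})
        if not addresses:
            return CheckResult(name, "FAIL", "No addresses returned", duration_ms)
        return CheckResult(name, "PASS",
                           f"Resolved to {', '.join(addresses)}", duration_ms)

    check.__name__ = name
    return check


# Fake resolvers stand in for real DNS so the example runs offline.
def fake_resolver(host, port):
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("192.0.2.10", 0))]


def failing_resolver(host, port):
    raise socket.gaierror(-2, "Name or service not known")


ok = make_dns_check("dns-api", "api.example.internal", resolver=fake_resolver)()
bad = make_dns_check("dns-missing", "missing.example.internal",
                     resolver=failing_resolver)()
print(ok.status, ok.message)   # PASS Resolved to 192.0.2.10
print(bad.status)              # FAIL
```

Like the HTTP check, it returns a typed result for the expected failure (a resolution error) and leaves anything unexpected to the runner's exception handling.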
The CLI Entry Point
The CLI layer, built with Click (from Lesson 6), reads an environment config from YAML, assembles the check list, calls the runner, and delegates formatting to the reporter. Exit code 1 on any FAIL is critical — it is what causes a CI pipeline step to break and page an on-call engineer.
# infracheck/cli.py
import sys

import click
import yaml

from infracheck.checks.http import make_http_check
from infracheck.reporter import JsonReporter, TextReporter
from infracheck.runner import CheckRunner, Status


@click.command()
@click.option("--env", required=True,
              type=click.Choice(["dev", "staging", "prod"]),
              help="Target environment to audit.")
@click.option("--config", "config_path", default="infracheck.yaml",
              show_default=True, help="Path to YAML config file.")
@click.option("--output", default="text",
              type=click.Choice(["text", "json"]),
              help="Output format.")
@click.option("--fail-on-warn", is_flag=True, default=False,
              help="Exit 1 on WARN as well as FAIL.")
def main(env: str, config_path: str, output: str, fail_on_warn: bool) -> None:
    """infracheck: audit infrastructure health and report."""
    with open(config_path, encoding="utf-8") as fh:
        # safe_load returns None for an empty file, so fall back to {}.
        cfg = yaml.safe_load(fh) or {}
    env_cfg = cfg.get("environments", {}).get(env) or {}

    # Build one zero-argument check function per configured endpoint.
    checks = []
    for item in env_cfg.get("http_checks", []):
        checks.append(make_http_check(
            name=item["name"],
            url=item["url"],
            timeout=item.get("timeout", 10),
            expected_status=item.get("expected_status", 200),
            body_contains=item.get("body_contains"),
        ))
    if not checks:
        # Running zero checks must not look like a healthy environment.
        raise click.ClickException(
            f"No checks configured for environment '{env}' in {config_path}")

    runner = CheckRunner(checks, max_workers=cfg.get("max_workers", 10))
    results = runner.run()

    reporter = JsonReporter() if output == "json" else TextReporter()
    reporter.print(results)

    # A non-zero exit code is what makes the CI step fail.
    has_fail = any(r.status == Status.FAIL for r in results)
    has_warn = any(r.status == Status.WARN for r in results)
    if has_fail or (fail_on_warn and has_warn):
        sys.exit(1)


if __name__ == "__main__":
    main()
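The exit-code rule at the end of main() is small enough to pull into a pure function, which makes it easy to unit-test without invoking Click. A minimal sketch, assuming a hypothetical helper named exit_code that is not part of the lesson's code:

```python
from enum import Enum
from typing import Iterable


class Status(str, Enum):  # same values as the lesson's Status enum
    PASS = "PASS"
    WARN = "WARN"
    FAIL = "FAIL"


def exit_code(statuses: Iterable[Status], fail_on_warn: bool = False) -> int:
    """Hypothetical helper: map check statuses to a process exit code."""
    statuses = list(statuses)
    if Status.FAIL in statuses:
        return 1
    if fail_on_warn and Status.WARN in statuses:
        return 1
    return 0


print(exit_code([Status.PASS, Status.WARN]))                     # 0
print(exit_code([Status.PASS, Status.WARN], fail_on_warn=True))  # 1
print(exit_code([Status.PASS, Status.FAIL]))                     # 1
```

With a helper like this, main() would reduce to sys.exit(exit_code(...)), and the rule could be tested with plain assertions.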
Config File and pyproject.toml
The YAML config decouples the tool logic from environment-specific values. Never hard-code URLs, timeouts, or credentials in source code: external config lets the same installed package run against dev and prod without code changes.
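A config file matching the keys that cli.py reads (environments, http_checks, name, url, timeout, expected_status, body_contains, and a top-level max_workers) might look like this. The check names and example.com URLs are placeholders, not real endpoints.

```yaml
# infracheck.yaml (illustrative values; hostnames are placeholders)
max_workers: 10

environments:
  dev:
    http_checks:
      - name: api-health
        url: https://api.dev.example.com/healthz
        timeout: 5
  prod:
    http_checks:
      - name: api-health
        url: https://api.example.com/healthz
        timeout: 5
        expected_status: 200
        body_contains: "ok"
      - name: web-frontend
        url: https://www.example.com/
```

Only name and url are required per check; the other keys fall back to the defaults passed in cli.py.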
Once installed (pip install -e .), infracheck runs as a first-class CLI command from any GitHub Actions step or Jenkins stage. The non-zero exit code on failure automatically fails the step, which blocks a deployment or pages the on-call team via the CI alerting integration.
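For pip install -e . to expose an infracheck command, the package has to declare a console-script entry point. A minimal pyproject.toml sketch follows; the setuptools backend, the version number, and the Python version floor are assumptions, while the dependency list reflects the imports used in this lesson.

```toml
# pyproject.toml (sketch; backend, version, and Python floor are assumptions)
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "infracheck"
version = "0.1.0"
requires-python = ">=3.9"
dependencies = [
    "click",
    "PyYAML",
    "requests>=2.31",
]

[project.scripts]
# Installs an `infracheck` executable that calls main() in infracheck/cli.py.
infracheck = "infracheck.cli:main"

[tool.setuptools.packages.find]
include = ["infracheck*"]
```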
Key design principle: infracheck can render the same results in two ways: human-readable text for the engineer reading the CI log, or machine-readable JSON (--output json) for downstream automation. Separating "run logic" from "render logic" (the Reporter classes) is what makes this possible without duplicating code.
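The lesson's cli.py imports TextReporter and JsonReporter and calls print(results) on whichever one is selected, but their bodies are not shown. One possible shape, offered as a sketch only: the render method, the column layout, and the JSON envelope are assumptions, and CheckResult is a stand-in with a plain-string status.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class CheckResult:  # stand-in mirroring the lesson's fields
    name: str
    status: str
    message: str
    duration_ms: float = 0.0
    metadata: dict = field(default_factory=dict)


class TextReporter:
    def render(self, results: List[CheckResult]) -> str:
        # One aligned line per check, then a summary line.
        lines = [f"[{r.status:<4}] {r.name:<20} {r.message}" for r in results]
        counts = {s: sum(r.status == s for r in results)
                  for s in ("PASS", "WARN", "FAIL")}
        lines.append(f"{counts['PASS']} passed, {counts['WARN']} warned, "
                     f"{counts['FAIL']} failed")
        return "\n".join(lines)

    def print(self, results: List[CheckResult]) -> None:
        print(self.render(results))


class JsonReporter:
    def render(self, results: List[CheckResult]) -> str:
        # asdict() turns each dataclass into a JSON-serializable dict.
        return json.dumps({"results": [asdict(r) for r in results]}, indent=2)

    def print(self, results: List[CheckResult]) -> None:
        print(self.render(results))


results = [
    CheckResult("api-health", "PASS", "200 in 42ms", 42.0),
    CheckResult("web-frontend", "FAIL", "Expected 200, got 503", 130.5),
]
TextReporter().print(results)
JsonReporter().print(results)
```

Because both reporters take the same list of results, adding a third format would mean adding one class and one --output choice, with no change to the runner or the checks.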
Production practice: pin your tool's dependencies in a requirements-lock.txt generated by pip-compile (pip-tools). A floating requests>=2.31 in pyproject.toml is fine for the published package, but CI must install from the lock file so a new upstream release never silently breaks your audit job at 2 AM.
Common failure mode: tools that exit 0 on unhandled exceptions. If a boto3 API call raises NoCredentialsError and your top-level exception handler swallows it and exits 0, your CI pipeline will report "infrastructure healthy" while your AWS checks never ran. Always log the traceback, emit a FAIL result, and propagate a non-zero exit code. The CheckRunner above handles the conversion: the try/except in run() turns any exception raised by a check into a FAIL result rather than silently discarding it, and the CLI then exits 1 because a FAIL is present.
What You Have Built
Across this tutorial you went from a blank Python environment to a production-grade CLI tool that brings together the core practices of professional DevOps work: configuration-driven behavior, parallel execution, typed results, formatted output, proper error handling, installability, and CI integration. infracheck is not a toy. With additional check modules (database connectivity, Kubernetes pod readiness, certificate expiry, DNS propagation), it becomes the kind of internal tool that SRE teams run continuously against every environment.