Python for DevOps Automation

Error Handling & Logging for Ops Scripts


An ops script that crashes silently at 3 AM — while a deploy pipeline waits for its exit code — is worse than no script at all. Production-grade automation must fail loudly with context, recover where it can, and leave a structured trace that an on-call engineer can parse without reading source code. This lesson covers the three pillars that make that possible: Python's exception model, the logging module, and structured (JSON) log output.

The Exception Hierarchy You Actually Need

Python's exception tree is large, but ops scripts interact with a small, predictable subset. Understanding the hierarchy tells you which handlers to write and which to let propagate.

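OSError (IOError is a plain alias for it; FileNotFoundError, PermissionError, and TimeoutError are subclasses) — covers file system and network socket failures, including a missing executable when launching a subprocess. Catch it around file I/O and subprocess calls, and catch the specific subclass first when you can recover from it.
subprocess.CalledProcessError — raised by subprocess.run(..., check=True) when the child process exits non-zero. Its .returncode, .stdout, and .stderr attributes are your first debugging surface; the last two are populated only when the output was captured (for example with capture_output=True).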
KeyError / ValueError — nearly always a config or API response parsing bug. Surface these immediately; catching and silencing them hides real defects.
requests.exceptions.RequestException — the base class for every requests HTTP error (connection refused, timeout, bad status after .raise_for_status()). One handler covers the whole family.
Exception — the catch-all. Use it only at the top level of a script as a last resort, and always log the full traceback before exiting non-zero.
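
The ordering rule implied by the hierarchy above (specific subclasses before their base class) can be sketched as follows. This is a minimal illustration, not a prescribed pattern: the function name load_optional_config and the choice to treat a missing file as recoverable are assumptions made for the example.

```python
from typing import Optional

# IOError is the same object as OSError; the specific errors are subclasses.
assert IOError is OSError
assert issubclass(FileNotFoundError, OSError)


def load_optional_config(path: str) -> Optional[str]:
    """Return the file contents, or None when the file does not exist.

    Hypothetical helper: a missing file is treated as recoverable,
    a permission problem is not.
    """
    try:
        with open(path, encoding="utf-8") as fh:
            return fh.read()
    except FileNotFoundError:
        # Recoverable here: the caller falls back to defaults.
        return None
    except PermissionError as exc:
        # Not recoverable here: add context and let it propagate.
        raise RuntimeError(f"cannot read {path}: permission denied") from exc
    # Any other OSError (for example IsADirectoryError) propagates unchanged.
```

Listing FileNotFoundError before a general OSError handler matters because Python uses the first matching except clause; a leading except OSError would hide the more specific branches.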

Key idea: Never use a bare except: clause. It catches SystemExit and KeyboardInterrupt, preventing clean shutdown. Always catch at least Exception, or better, the specific exception type you expect.
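
A small demonstration of why the bare clause is dangerous, using only the standard library. The two function names are made up for this sketch; the behaviour shown follows from SystemExit and KeyboardInterrupt deriving from BaseException rather than Exception.

```python
def run_with_bare_except() -> str:
    try:
        raise SystemExit(3)        # what sys.exit(3) raises under the hood
    except:                        # deliberately wrong: also traps SystemExit
        return "swallowed"


def run_with_except_exception() -> str:
    try:
        raise SystemExit(3)
    except Exception:              # SystemExit is not an Exception subclass
        return "swallowed"


print(run_with_bare_except())      # the exit request is silently lost

try:
    run_with_except_exception()
except SystemExit as exc:
    # The exit request passes through the handler untouched.
    print("exit code preserved:", exc.code)
```

With except Exception, a sys.exit() call or a Ctrl-C inside the protected block still terminates the script as intended.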

Writing Robust Exception Handlers

The pattern below is a sound foundation for ops scripts. Three things distinguish it from amateur error handling: it logs the full traceback, it sets a meaningful exit code, and it never silently swallows an error it cannot recover from.

#!/usr/bin/env python3
"""restart_service.py — safely restart a systemd service with retries."""

import subprocess
import logging
import sys
import time

log = logging.getLogger(__name__)

MAX_RETRIES = 3
RETRY_DELAY = 5  # seconds


def restart_service(name: str) -> None:
    """Restart a systemd service; raise CalledProcessError on failure."""
    subprocess.run(
        ["systemctl", "restart", name],
        check=True,          # raises CalledProcessError on non-zero exit
        capture_output=True,
        text=True,
    )
    log.info("service_restarted", extra={"service": name})


def restart_with_retries(name: str) -> None:
    last_exc: Exception | None = None
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            restart_service(name)
            return
        except subprocess.CalledProcessError as exc:
            last_exc = exc
            log.warning(
                "restart_failed",
                extra={
                    "service": name,
                    "attempt": attempt,
                    "returncode": exc.returncode,
                    "stderr": exc.stderr.strip(),
                },
            )
            if attempt < MAX_RETRIES:
                time.sleep(RETRY_DELAY)
    raise RuntimeError(
        f"Service {name!r} failed to restart after {MAX_RETRIES} attempts"
    ) from last_exc


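if __name__ == "__main__":
    # Configure logging before anything logs: with no handler attached,
    # INFO records are dropped and only WARNING+ reaches stderr.
    # basicConfig's default format does not print extra= fields; the
    # dictConfig setup in the next section does.
    logging.basicConfig(level=logging.INFO)
    try:
        restart_with_retries(sys.argv[1])
    except (IndexError, ValueError) as exc:
        # IndexError here means the service name argument was missing.
        log.error("bad_arguments", extra={"error": str(exc)})
        sys.exit(2)      # exit 2 = usage error (distinct from exit 1 = runtime error)
    except Exception:
        log.critical("unhandled_exception", exc_info=True)
        sys.exit(1)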

Pro practice: Exit code conventions matter. Exit 0 = success. Exit 1 = runtime error. Exit 2 = bad arguments (same convention as many Unix tools). CI/CD systems and monitoring scripts can branch on these codes without parsing stderr.
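
As a rough sketch of what branching on these codes can look like from the calling side, the wrapper below runs a child process and maps its exit code to a label. The function names and the label strings are assumptions for illustration; only the 0/1/2 convention comes from the text above.

```python
import subprocess
import sys
from typing import Sequence


def classify_exit(returncode: int) -> str:
    """Map an exit code to a coarse outcome label (labels are illustrative)."""
    if returncode == 0:
        return "success"
    if returncode == 2:
        return "usage_error"
    # 1, any other non-zero code, or a negative code (child killed by a signal)
    return "runtime_error"


def run_and_classify(cmd: Sequence[str]) -> str:
    # No check=True here: we want the code itself, not an exception.
    result = subprocess.run(list(cmd), capture_output=True, text=True)
    return classify_exit(result.returncode)


if __name__ == "__main__":
    # Stand-in child process that exits with code 2.
    child = [sys.executable, "-c", "import sys; sys.exit(2)"]
    print(run_and_classify(child))
```

A pipeline step could retry on "runtime_error" but fail fast on "usage_error", since rerunning a command with bad arguments cannot succeed.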

The logging Module: Configuration That Scales

Python's built-in logging module is battle-tested and expressive enough for production use. Most ops engineers under-use it — they call logging.basicConfig(level=logging.INFO) and stop there. That approach loses structured context and makes log aggregation in tools like Datadog, Splunk, or CloudWatch Logs Insights painful.

The correct pattern is to configure the root logger once, at startup, using a dictConfig. This separates what to log (the library code) from how to format it (the entry point). Library modules never configure handlers — they only call logging.getLogger(__name__).

# logging_config.py — call setup_logging() once in __main__

import logging
import logging.config
import os


def setup_logging(level: str = "INFO") -> None:
    """Configure root logger. Call exactly once from the script entry point."""
    log_level = getattr(logging, level.upper(), logging.INFO)

    logging.config.dictConfig({
        "version": 1,
        "disable_existing_loggers": False,   # keep third-party loggers alive
        "formatters": {
            "human": {
                "format": "%(asctime)s  %(levelname)-8s  %(name)s  %(message)s",
                "datefmt": "%Y-%m-%dT%H:%M:%S",
            },
            "json": {
                "()": "logging_config.JsonFormatter",  # custom formatter (see below)
            },
        },
        "handlers": {
            "stderr": {
                "class": "logging.StreamHandler",
                "stream": "ext://sys.stderr",
                "formatter": "json" if os.getenv("LOG_FORMAT") == "json" else "human",
                "level": log_level,
            },
        },
        "root": {
            "handlers": ["stderr"],
            "level": log_level,
        },
    })
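
To make the library/entry-point split concrete, here is a reduced sketch under stated assumptions: the logger name inventory.sync, the functions sync_inventory and configure, and the trimmed-down format string are all invented for the example. The "library" side never attaches a handler; its records reach the output only because they propagate to the root logger that the entry point configured.

```python
import io
import logging
import logging.config

# "Library" side: obtains a named logger and nothing else.
lib_log = logging.getLogger("inventory.sync")  # hypothetical module name


def sync_inventory() -> None:
    lib_log.info("sync_finished")


# Entry-point side: the only place that knows about handlers and formats.
def configure(stream) -> None:
    logging.config.dictConfig({
        "version": 1,
        "disable_existing_loggers": False,
        "formatters": {
            "human": {"format": "%(levelname)s %(name)s %(message)s"},
        },
        "handlers": {
            # A stream object is passed straight to StreamHandler(stream=...).
            "out": {
                "class": "logging.StreamHandler",
                "stream": stream,
                "formatter": "human",
            },
        },
        "root": {"handlers": ["out"], "level": "INFO"},
    })


if __name__ == "__main__":
    buffer = io.StringIO()
    configure(buffer)
    sync_inventory()
    print(buffer.getvalue(), end="")
```

Because the library logger has no handlers of its own, swapping the output format later means editing only configure().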

Structured Logs: The Production Standard

Human-readable log lines are pleasant on a developer laptop. In production they are a liability. Log aggregation platforms ingest JSON, index every field, and let you run queries like service:nginx status:500 | stats count by region. If your logs are plain text, you pay for parsing — and parsing is brittle.

The extra parameter of every log.* call is the mechanism. Pass a dict of key-value pairs there; a custom Formatter serialises the entire LogRecord (including those extra fields) to JSON. The pattern is invisible to the caller but transforms every log statement into a structured event.
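
What extra= does mechanically can be checked in a few lines: each key becomes an attribute on the LogRecord, which is what a formatter later reads back. The CaptureHandler class and the logger name extra_demo are assumptions for this sketch.

```python
import logging


class CaptureHandler(logging.Handler):
    """Test helper (hypothetical): keep records in a list instead of printing."""

    def __init__(self) -> None:
        super().__init__()
        self.records = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


demo_log = logging.getLogger("extra_demo")
demo_log.setLevel(logging.INFO)
demo_log.propagate = False          # keep the demo isolated from the root logger
capture = CaptureHandler()
demo_log.addHandler(capture)

demo_log.info("deploy_started", extra={"service": "payments-api", "region": "us-east-1"})
record = capture.records[0]
print(record.service, record.region)   # extra keys are plain record attributes

# Keys that collide with built-in record attributes are rejected outright.
try:
    demo_log.info("oops", extra={"name": "clobber"})
    rejected = False
except KeyError:
    rejected = True
print("reserved key rejected:", rejected)
```

The collision check is worth knowing about when choosing field names: keys such as name, message, or module cannot be used in extra=.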

# logging_config.py (continued) — JSON formatter

import json
import logging
import traceback


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record, suitable for log aggregation."""

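    # Attribute names every LogRecord carries; anything else came from extra=.
    # "message" and "asctime" are set later by the stock Formatter, so they are
    # listed explicitly in case another handler formatted the record first.
    RESERVED = frozenset(logging.LogRecord(
        "", 0, "", 0, "", (), None
    ).__dict__.keys()) | {"message", "asctime"}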

    def format(self, record: logging.LogRecord) -> str:
        payload: dict = {
            "ts":      self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level":   record.levelname,
            "logger":  record.name,
            "msg":     record.getMessage(),
        }

        # Merge any extra= fields the caller provided
        for key, value in record.__dict__.items():
            if key not in self.RESERVED and not key.startswith("_"):
                payload[key] = value

        # Attach exception info when present
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
            payload["traceback"] = traceback.format_exception(*record.exc_info)

        return json.dumps(payload, default=str)


# --- Usage in any module ---
log = logging.getLogger(__name__)

log.info("deploy_started", extra={
    "service": "payments-api",
    "version": "v2.4.1",
    "region": "us-east-1",
    "triggered_by": "github_actions",
})
# Output (one JSON line; wrapped here for readability, assuming the module is named "deploy"):
# {"ts": "2025-03-15T14:22:01", "level": "INFO", "logger": "deploy",
#  "msg": "deploy_started", "service": "payments-api", "version": "v2.4.1",
#  "region": "us-east-1", "triggered_by": "github_actions"}

The same logger emits JSON to a log aggregation platform in production and human-readable text locally — controlled by a single environment variable.

Capturing Context with LoggerAdapter

When a script manages multiple resources (five EC2 instances, ten services), every log line should carry the resource identifier without requiring you to repeat it in every extra= call. logging.LoggerAdapter solves this cleanly by injecting a fixed context dict into every record emitted through it.

Create one adapter per resource: log = logging.LoggerAdapter(base_logger, {"instance_id": iid, "region": region})
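Call log.info("health_check_passed"): the adapter injects its context into the record automatically. Be aware that before Python 3.13 the stock adapter replaces any per-call extra= with its own dict instead of merging the two; on 3.13+ pass merge_extra=True, and on older versions override process().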
This avoids the cognitive load of tracking which resource a generic log line belongs to when scanning thousands of entries in a log aggregator.
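
The steps above can be sketched as follows. The stock LoggerAdapter.process() overwrites a per-call extra= with the adapter's context on Python versions before 3.13, so this example uses a small subclass that merges the two. ContextAdapter, ListHandler, the logger name fleet, and the instance IDs are all made up for illustration.

```python
import logging


class ContextAdapter(logging.LoggerAdapter):
    """Merge per-call extra= into the adapter's fixed context (sketch)."""

    def process(self, msg, kwargs):
        merged = dict(self.extra or {})
        merged.update(kwargs.get("extra") or {})   # per-call fields win on conflict
        kwargs["extra"] = merged
        return msg, kwargs


class ListHandler(logging.Handler):
    """Collect records in memory so the result can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.records = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


base_logger = logging.getLogger("fleet")
base_logger.setLevel(logging.INFO)
base_logger.propagate = False
sink = ListHandler()
base_logger.addHandler(sink)

for iid in ("i-0aaa", "i-0bbb"):                   # hypothetical instance IDs
    log = ContextAdapter(base_logger, {"instance_id": iid, "region": "us-east-1"})
    log.info("health_check_passed", extra={"latency_ms": 12})

print([(r.instance_id, r.latency_ms) for r in sink.records])
# [('i-0aaa', 12), ('i-0bbb', 12)]
```

Letting per-call fields override the fixed context is a design choice made here; reversing the two update steps would make the adapter's context authoritative instead.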

What Not to Log

Structured logging is powerful enough to accidentally exfiltrate secrets. Enforce these rules at code review:

Never log AWS credentials, API keys, or tokens — even partially. Use a redaction wrapper or pass only identifiers (key ID, not secret).
Never log user PII (email, IP, name) unless your data classification policy explicitly permits it and you have the correct retention controls.
Never log full request/response bodies from external APIs — they routinely contain secrets embedded by callers.

Production pitfall: log.debug("response body: %s", response.text) looks harmless in development but can dump megabytes of sensitive JSON into your log aggregation platform per second under load. Audit every DEBUG statement before shipping. Many teams set the production log level to INFO and strip DEBUG calls in CI with a linter rule.
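
One possible shape for the redaction wrapper mentioned above is a logging.Filter attached to the handler, so it sees every record before formatting. This is a sketch under assumptions: the key list in SENSITIVE_KEYS and the class names are invented, and the filter only masks extra= fields by key name. It does not detect secrets embedded in the message text itself.

```python
import logging

# Hypothetical deny-list; a real one would come from your security policy.
SENSITIVE_KEYS = frozenset({"password", "token", "api_key", "secret", "authorization"})


class RedactExtraFilter(logging.Filter):
    """Mask the value of any record attribute whose name is on the deny-list."""

    def filter(self, record: logging.LogRecord) -> bool:
        for key in list(record.__dict__):
            if key.lower() in SENSITIVE_KEYS:
                setattr(record, key, "[REDACTED]")
        return True     # never drop the record, only scrub it


class MemoryHandler(logging.Handler):
    """Collect records in memory so the effect can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.records = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


audit_log = logging.getLogger("redaction_demo")
audit_log.setLevel(logging.INFO)
audit_log.propagate = False
memory = MemoryHandler()
memory.addFilter(RedactExtraFilter())   # handler-level: applies to every record it receives
audit_log.addHandler(memory)

audit_log.info("api_call", extra={"service": "billing", "api_key": "sk-not-a-real-key"})
scrubbed = memory.records[0]
print(scrubbed.service, scrubbed.api_key)
# billing [REDACTED]
```

Attaching the filter to the handler rather than to a single logger means records from child loggers and third-party libraries are scrubbed too, as long as they flow through that handler.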

Bringing It Together: A Complete Ops Script Template

The pattern below is the skeleton every new ops script at a professional DevOps shop should start from. It wires together everything in this lesson: dictConfig logging, a top-level exception handler, and structured log events with context.

#!/usr/bin/env python3
"""template.py — production-ready ops script skeleton."""

import argparse
import logging
import sys

# Import your logging_config module (from earlier in this lesson)
# from logging_config import setup_logging, JsonFormatter

log = logging.getLogger(__name__)


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--region", default="us-east-1")
    parser.add_argument("--log-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"])
    return parser.parse_args()


def main(args: argparse.Namespace) -> int:
    """Return exit code: 0 on success, 1 on error, 2 on bad input."""
    log.info("script_start", extra={"region": args.region})

    try:
        # --- your logic here ---
        pass
    except ValueError as exc:
        log.error("invalid_input", extra={"error": str(exc)})
        return 2
    except OSError as exc:
        log.error("io_error", extra={"error": str(exc), "errno": exc.errno})
        return 1
    except Exception:
        log.critical("unhandled_exception", exc_info=True)
        return 1

    log.info("script_complete")
    return 0


if __name__ == "__main__":
    args = parse_args()
    # setup_logging(args.log_level)   # uncomment with real logging_config
    logging.basicConfig(level=args.log_level)
    sys.exit(main(args))

Pro practice: Many teams standardise their internal ops tooling on a variant of this template. The key insight is that main() returns an integer and never calls sys.exit() directly; that is left to the if __name__ == "__main__" block. This makes the function unit-testable: a test can call main(args) and assert on the return code without forking a process.
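
A minimal sketch of such a test, assuming a stand-in main() whose only logic is validating the region. The set of allowed regions and the test function names are hypothetical; a real script would import its own main() instead of redefining one.

```python
import argparse

ALLOWED_REGIONS = {"us-east-1", "eu-west-1"}   # hypothetical allow-list


def main(args: argparse.Namespace) -> int:
    """Stand-in for the template's main(): 0 on success, 2 on bad input."""
    try:
        if args.region not in ALLOWED_REGIONS:
            raise ValueError(f"unknown region: {args.region}")
    except ValueError:
        return 2
    return 0


def test_main_succeeds_for_known_region() -> None:
    assert main(argparse.Namespace(region="us-east-1")) == 0


def test_main_rejects_unknown_region() -> None:
    assert main(argparse.Namespace(region="mars-1")) == 2


if __name__ == "__main__":
    test_main_succeeds_for_known_region()
    test_main_rejects_unknown_region()
    print("ok")
```

Building the Namespace by hand keeps the test independent of argparse parsing, so argument handling and business logic can be tested separately.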