Python for DevOps Automation

Building CLI Tools

18 min Lesson 6 of 28

At some point every ops script graduates from a one-liner you type manually to a shareable tool that teammates install and run daily. The moment that happens, a raw script breaks down: it has no --help text, fails cryptically when a required argument is missing, exits 0 even on error, and cannot be piped cleanly into other commands. A proper CLI tool solves all of these — and in the DevOps world, building CLIs is a core engineering skill. Tools like the AWS CLI, kubectl, gh, and terraform are all CLI programs that operators trust with production systems every day.

This lesson covers two approaches — the standard-library argparse for simple tools, and the third-party click for anything you expect to grow — plus exit-code conventions and the pyproject.toml packaging step that turns a script into a distributable command.

argparse: The Standard-Library Baseline

argparse is part of the Python standard library. It parses sys.argv, converts and validates argument types, generates --help automatically, and, when a required argument is missing or malformed, prints a usage message to stderr and exits with code 2. For an internal one-off tool that will not grow beyond a single command, argparse is the right choice because it has zero dependencies.

#!/usr/bin/env python3
"""ops-check: verify that a set of services are reachable before a deploy."""

import argparse
import sys
import subprocess

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        prog="ops-check",
        description="Pre-deploy service reachability checker",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  ops-check --hosts db.internal cache.internal --port 6379
  ops-check --hosts api.example.com --timeout 5 --verbose
""",
    )
    parser.add_argument(
        "--hosts",
        nargs="+",            # one or more values
        required=True,
        metavar="HOST",
        help="Hostnames or IPs to check",
    )
    parser.add_argument(
        "--port",
        type=int,
        default=80,
        help="TCP port to probe (default: 80)",
    )
    parser.add_argument(
        "--timeout",
        type=float,
        default=3.0,
        help="Connection timeout in seconds (default: 3.0)",
    )
    parser.add_argument(
        "--verbose", "-v",
        action="store_true",
        help="Print per-host results",
    )
    return parser.parse_args()


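def check_host(host: str, port: int, timeout: float) -> bool:
    # nc's -w flag takes whole seconds. int() would truncate 0.5 to 0, which is
    # not a usable timeout, so round and clamp to at least one second.
    wait_seconds = max(1, round(timeout))
    result = subprocess.run(
        ["nc", "-z", "-w", str(wait_seconds), host, str(port)],
        capture_output=True,
    )
    return result.returncode == 0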


def main() -> None:
    args = parse_args()
    failures = []

    for host in args.hosts:
        ok = check_host(host, args.port, args.timeout)
        if args.verbose:
            status = "OK" if ok else "FAIL"
            print(f"  [{status}] {host}:{args.port}")
        if not ok:
            failures.append(host)

    if failures:
        print(f"UNREACHABLE: {', '.join(failures)}", file=sys.stderr)
        sys.exit(1)               # non-zero — CI pipeline sees this as a failure

    print(f"All {len(args.hosts)} host(s) reachable on port {args.port}.")
    sys.exit(0)


if __name__ == "__main__":
    main()

Key idea: always set the exit code explicitly. When main() simply returns, the interpreter exits with code 0, which is wrong for an ops tool that detected failures. Every code path in main() should end in an explicit sys.exit(code). CI/CD systems, shell scripts, and monitoring wrappers all gate on exit codes, not on output text.
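One common way to keep every path explicit is to have the work function return the exit code and call sys.exit() exactly once at the entry point. The sketch below is a minimal illustration under stated assumptions: run_checks, the EXIT_* constants, and the probe callable are hypothetical names standing in for the check_host logic above, not part of ops-check.

```python
import sys
from typing import Callable, Sequence

EXIT_OK = 0
EXIT_FAILURE = 1


def run_checks(hosts: Sequence[str], probe: Callable[[str], bool]) -> int:
    """Return the exit code instead of exiting, so tests can assert on it."""
    failures = [host for host in hosts if not probe(host)]
    if failures:
        print(f"UNREACHABLE: {', '.join(failures)}", file=sys.stderr)
        return EXIT_FAILURE
    print(f"All {len(hosts)} host(s) reachable.")
    return EXIT_OK


def main() -> int:
    # Hypothetical probe: every host except "down.internal" counts as reachable.
    hosts = ["db.internal", "cache.internal"]
    return run_checks(hosts, lambda host: host != "down.internal")


if __name__ == "__main__":
    sys.exit(main())  # the single place where the process exit code is set
```

Because run_checks returns an int rather than exiting, a unit test can call it directly and compare the result with the expected code, without catching SystemExit.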

Exit Code Conventions Every Ops Tool Must Follow

Exit codes are the Unix API between programs. A tool that exits 0 on failure is broken by definition — any shell script, pipeline stage, or monitoring check that calls it will silently move on. The conventions are simple and non-negotiable:

0 — success, everything worked as expected.
1 — general operational failure (service unreachable, validation failed, resource not found).
2 — misuse of the CLI itself (wrong arguments, missing required flag). argparse and click exit 2 automatically for argument errors — do not override this.
130 — script interrupted by the user (Ctrl+C, SIGINT). Catch KeyboardInterrupt and exit 130 so the shell knows the user stopped it intentionally.

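Never use exit codes above 125 for application-level errors. The shell assigns those values itself: 126 means the command was found but could not be executed, 127 means the command was not found, and 128+N means the process was terminated by signal N (which is why Ctrl+C maps to 130, that is 128 + 2 for SIGINT). Never exit non-zero after successfully completing the task, even if a warning was printed.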
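The conventions above can be collected in one small wrapper. This is a sketch only: ExitCode, OperationalError, and run are hypothetical names introduced for illustration, and a real tool would decide for itself which exceptions count as operational failures.

```python
import sys
from enum import IntEnum
from typing import Callable


class ExitCode(IntEnum):
    OK = 0
    FAILURE = 1        # general operational failure
    USAGE = 2          # CLI misuse; argparse and click already exit with this
    INTERRUPTED = 130  # 128 + 2 (SIGINT)


class OperationalError(Exception):
    """Hypothetical exception type for expected, reportable failures."""


def run(task: Callable[[], None]) -> int:
    """Run task and translate its outcome into a conventional exit code."""
    try:
        task()
    except KeyboardInterrupt:
        print("Interrupted.", file=sys.stderr)
        return ExitCode.INTERRUPTED
    except OperationalError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return ExitCode.FAILURE
    return ExitCode.OK
```

An entry point would then end with sys.exit(run(some_task)), where some_task is whatever callable performs the tool's work.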

click: The Production-Grade Choice for Multi-Command CLIs

click (Command Line Interface Creation Kit) composes commands, subcommands, and option groups in a way that argparse cannot match cleanly. Any internal tool that grows beyond one verb — ops deploy, ops rollback, ops status — should be built on click. It also integrates naturally with rich for colored terminal output, which matters when you are reading a wall of text in a dark terminal window at 02:00.

#!/usr/bin/env python3
"""ops-cli: multi-command internal infrastructure CLI built with click."""

import sys
import click

# ── top-level group ──────────────────────────────────────────────────────────
@click.group()
@click.version_option(version="1.0.0", prog_name="ops-cli")
def cli() -> None:
    """Internal ops automation CLI. Run a subcommand with --help for details."""


# ── deploy subcommand ────────────────────────────────────────────────────────
@cli.command()
@click.argument("service")
@click.option("--env", "-e",
              type=click.Choice(["staging", "production"], case_sensitive=False),
              required=True,
              help="Target environment")
@click.option("--image-tag", default="latest", show_default=True,
              help="Docker image tag to deploy")
@click.option("--dry-run", is_flag=True,
              help="Print what would happen without making changes")
@click.pass_context
def deploy(ctx: click.Context, service: str, env: str,
           image_tag: str, dry_run: bool) -> None:
    """Deploy SERVICE to the target environment."""
    if env == "production" and not dry_run:
        # Require explicit confirmation for production
        click.confirm(
            f"Deploy {service}:{image_tag} to PRODUCTION?",
            abort=True,       # raises click.Abort -> exits 1 on 'n'
        )

    action = "[DRY-RUN] Would deploy" if dry_run else "Deploying"
    click.echo(f"{action} {service}:{image_tag} -> {env}")

    if not dry_run:
        # Real deploy logic here (kubectl set image, etc.)
        click.secho("  Deploy complete.", fg="green", bold=True)

    sys.exit(0)


# ── status subcommand ────────────────────────────────────────────────────────
@cli.command()
@click.argument("service")
@click.option("--namespace", "-n", default="default", show_default=True,
              help="Kubernetes namespace")
def status(service: str, namespace: str) -> None:
    """Show running status for SERVICE."""
    click.echo(f"Checking {service} in namespace {namespace} ...")
    # Real implementation: kubectl / boto3 calls
    click.secho("  Running (3/3 replicas ready)", fg="green")
    sys.exit(0)


if __name__ == "__main__":
    cli()

Pro practice — use click.secho() for colored output, but only when writing to a terminal. click automatically disables color when stdout is not a TTY (e.g. when the output is piped to grep or redirected to a file). Never use ANSI escape codes manually — they pollute piped output and break log parsers. The same applies to progress bars: use click.progressbar() or rich.progress.Progress, not hand-rolled carriage-return tricks.
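The automatic color handling can be observed without a real pipe. The sketch below assumes click 8.x is installed and uses io.StringIO as a stand-in for a non-TTY destination such as a pipe or a log file.

```python
import io

import click

# click.style() returns the text wrapped in ANSI escape sequences.
styled = click.style("Deploy complete.", fg="green", bold=True)

# io.StringIO is not a TTY, so click.echo() strips the escape sequences,
# as it would when stdout is piped to another command or redirected to a file.
buffer = io.StringIO()
click.echo(styled, file=buffer)
piped_text = buffer.getvalue()

# Passing color=True overrides the auto-detection and keeps the sequences.
forced = io.StringIO()
click.echo(styled, file=forced, color=True)
forced_text = forced.getvalue()
```

Here piped_text contains only the plain message and a newline, while forced_text still carries the escape sequences. Forcing color is rarely what a pipeline wants, which is the reason to leave the detection to click.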

CLI Architecture Diagram

A click multi-command CLI: the root group handles global flags and routing; each subcommand owns its arguments and exits with a meaningful code.
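The same root-and-subcommand shape can be sketched with only the standard library, which makes the routing step visible. This is an illustrative argparse version, not the click implementation above; cmd_deploy, cmd_status, build_parser, and the handler attribute are hypothetical names.

```python
import argparse
from typing import Optional, Sequence


def cmd_deploy(args: argparse.Namespace) -> int:
    print(f"Deploying {args.service} -> {args.env}")
    return 0


def cmd_status(args: argparse.Namespace) -> int:
    print(f"Checking {args.service} ...")
    return 0


def build_parser() -> argparse.ArgumentParser:
    # Root parser: global flags and the subcommand table live here.
    parser = argparse.ArgumentParser(prog="ops")
    parser.add_argument("--verbose", "-v", action="store_true")
    subcommands = parser.add_subparsers(dest="command", required=True)

    # Each subcommand owns its own arguments and names its handler.
    deploy = subcommands.add_parser("deploy", help="Deploy a service")
    deploy.add_argument("service")
    deploy.add_argument("--env", choices=["staging", "production"], required=True)
    deploy.set_defaults(handler=cmd_deploy)

    status = subcommands.add_parser("status", help="Show service status")
    status.add_argument("service")
    status.set_defaults(handler=cmd_status)
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    return args.handler(args)  # routing: call whichever subcommand was chosen
```

Each handler returns its own exit code, and a missing --env on deploy makes argparse exit with code 2 before any handler runs.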

Packaging: From Script to Installable Command

A script becomes a real tool when you can install it with pip install . and run it as ops-cli from anywhere on the system — no python script.py prefix, no path juggling. This is done through pyproject.toml entry points, which you already saw in Lesson 1. The directory layout for a small CLI tool follows the src/ layout, which prevents the package from being accidentally imported from the project root without installing it first.

# Project layout for ops-cli
ops-cli/
├── pyproject.toml
├── README.md
├── src/
│   └── ops_cli/
│       ├── __init__.py
│       ├── main.py        # defines cli() with @click.group()
│       ├── deploy.py      # deploy subcommand
│       └── status.py      # status subcommand
└── tests/
    ├── test_deploy.py
    └── test_status.py

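# pyproject.toml: the critical [project.scripts] section
# The build backend is declared explicitly; setuptools is an assumption here,
# any PEP 517 backend that reads the [project] table works.
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "ops-cli"
version = "1.0.0"
requires-python = ">=3.11"
dependencies = ["click>=8.1", "rich>=13.7"]

[project.scripts]
ops-cli = "ops_cli.main:cli"   # pip generates an 'ops-cli' launcher script on PATH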

# Install in editable mode during development (changes take effect immediately)
pip install -e .

# Install from a release tag in CI or on a colleague's machine
pip install git+https://github.com/your-org/ops-cli.git@v1.0.0

# Verify the installed entry point works
ops-cli --help
ops-cli --version

Production pitfall — never ship a tool with hardcoded environment names or account IDs. A tool installed on 20 engineers' laptops must read its configuration from environment variables or a config file, never from literals in the source code. Use os.environ.get("OPS_ENVIRONMENT", "staging") with a safe default, or accept the value as a CLI option (preferred, because it is explicit and auditable in shell history). Hardcoded prod account IDs are a P0 incident waiting to happen — someone runs a "test" command against the wrong target.
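One possible way to combine the two sources is to let an explicit CLI option win and fall back to the environment variable only when the option is absent. The sketch uses argparse to stay dependency-free; resolve_environment and ALLOWED_ENVIRONMENTS are hypothetical names, and OPS_ENVIRONMENT is the variable named above.

```python
import argparse
from typing import Mapping, Sequence

ALLOWED_ENVIRONMENTS = ("staging", "production")


def resolve_environment(argv: Sequence[str], environ: Mapping[str, str]) -> str:
    """Pick the target environment: CLI flag first, then env var, then default."""
    parser = argparse.ArgumentParser(prog="ops-cli")
    parser.add_argument("--env", choices=ALLOWED_ENVIRONMENTS, default=None)
    args = parser.parse_args(argv)

    # The explicit flag wins; "staging" is the safe default when nothing is set.
    value = args.env or environ.get("OPS_ENVIRONMENT", "staging")
    if value not in ALLOWED_ENVIRONMENTS:
        # parser.error() prints a usage message and exits with code 2.
        parser.error(
            f"OPS_ENVIRONMENT must be one of {ALLOWED_ENVIRONMENTS}, got {value!r}"
        )
    return value
```

In a real entry point the call would be resolve_environment(sys.argv[1:], os.environ). Passing the mapping in as a parameter keeps the function testable without touching the real process environment.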

Testing CLI Tools with click.testing

click ships a CliRunner that invokes your CLI in-process without spawning a subprocess. This is the correct way to unit-test CLI commands — it captures stdout, stderr, and the exit code, and it works inside pytest with zero extra setup. Always test at least: the happy path, a missing-required-argument path (expect exit 2), and a failure path (expect exit 1 with a message on stderr).

# tests/test_cli.py
from click.testing import CliRunner
from ops_cli.main import cli


def test_deploy_dry_run_exits_zero():
    runner = CliRunner()
    result = runner.invoke(cli, ["deploy", "api-gateway", "--env", "staging",
                                 "--dry-run"])
    assert result.exit_code == 0
    assert "DRY-RUN" in result.output


def test_deploy_missing_env_exits_two():
    runner = CliRunner()
    result = runner.invoke(cli, ["deploy", "api-gateway"])
    assert result.exit_code == 2          # argparse/click usage error
    assert "Missing option" in result.output


def test_status_shows_service_name():
    runner = CliRunner()
    result = runner.invoke(cli, ["status", "payment-service"])
    assert result.exit_code == 0
    assert "payment-service" in result.output

# Run: pytest tests/ -v

Pro practice — ship a --output json flag on every command that produces structured data. Human-readable tables are fine for interactive use, but CI jobs and other scripts that call your tool need machine-parseable output. A single --output option that switches between table (default) and json makes your tool a first-class pipeline citizen. Use Python's json.dumps(data, indent=2) and always write it to stdout (not stderr) so callers can capture it with $(ops-cli status svc --output json).
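A minimal sketch of the rendering side of such a flag follows. The render function and its parameters are hypothetical; wiring it to a click option or an argparse argument with the choices table and json is left out.

```python
import json
import sys
from typing import Any, Mapping, Optional, Sequence, TextIO


def render(rows: Sequence[Mapping[str, Any]], output: str = "table",
           stream: Optional[TextIO] = None) -> None:
    """Write rows to stream (stdout by default) as JSON or as an aligned table."""
    stream = stream or sys.stdout
    if output == "json":
        stream.write(json.dumps([dict(row) for row in rows], indent=2) + "\n")
        return
    if not rows:
        return
    columns = list(rows[0].keys())
    # Column width: the longest of the header and every value in that column.
    widths = {
        col: max([len(col)] + [len(str(row[col])) for row in rows])
        for col in columns
    }
    header = "  ".join(col.upper().ljust(widths[col]) for col in columns)
    stream.write(header.rstrip() + "\n")
    for row in rows:
        line = "  ".join(str(row[col]).ljust(widths[col]) for col in columns)
        stream.write(line.rstrip() + "\n")
```

Keeping the data as plain dictionaries until the very last step means both formats are produced from the same source, so the JSON output cannot drift from what the table shows.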

Production Checklist for Every Ops CLI

Before handing a CLI to the rest of the team, verify these points. Teams that maintain a shared developer environment often run new internal tooling through a brief review covering much the same list:

--help on every command and subcommand — auto-generated by click if you write docstrings.
--version on the root group — use @click.version_option(); include this in every bug report template.
Meaningful exit codes on every code path — test with echo $? after each scenario.
No hardcoded credentials, account IDs, or hostnames — all configuration from environment variables or a config file.
Structured logging to stderr, results to stdout — stderr is for diagnostics; stdout is for machine-readable output. Never mix them.
Idempotent destructive operations — running ops-cli deploy twice should not cause an incident.
A --dry-run flag on every mutating command — this is the single best safety net for ops tools.
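The stdout/stderr split from the checklist can be sketched with the standard logging module. The function names and the result dictionary below are hypothetical placeholders; only the routing of the two streams is the point.

```python
import json
import logging
import sys


def configure_logging(verbose: bool = False) -> logging.Logger:
    """Send diagnostics to stderr so stdout stays machine-readable."""
    logger = logging.getLogger("ops-cli")
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    logger.propagate = False
    return logger


def report_status(service: str) -> None:
    log = configure_logging()
    log.info("checking %s", service)              # diagnostics go to stderr
    result = {"service": service, "ready": True}  # hypothetical placeholder result
    print(json.dumps(result))                     # the result goes to stdout
```

With this split, a caller can capture the JSON with $(...) and still see the log lines in the terminal, because command substitution captures only stdout.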