Python for DevOps Automation

Building CLI Tools

At some point every ops script graduates from a one-liner you type manually to a shareable tool that teammates install and run daily. The moment that happens, a raw script breaks down: it has no --help text, fails cryptically when a required argument is missing, exits 0 even on error, and cannot be piped cleanly into other commands. A proper CLI tool solves all of these — and in the DevOps world, building CLIs is a core engineering skill. Tools like the AWS CLI, kubectl, gh, and terraform are all CLI programs that operators trust with production systems every day.

This lesson covers two approaches — the standard-library argparse for simple tools, and the third-party click for anything you expect to grow — plus exit-code conventions and the pyproject.toml packaging step that turns a script into a distributable command.

argparse: The Standard-Library Baseline

argparse is part of the Python standard library. It parses sys.argv, converts and validates argument types, generates --help automatically, and, when a required argument is missing, prints a short usage message and exits with status 2 instead of dumping a traceback. For an internal one-off tool that will not grow beyond a single command, argparse is the right choice because it has zero dependencies.

```python
#!/usr/bin/env python3
"""ops-check: verify that a set of services are reachable before a deploy."""
import argparse
import math
import subprocess
import sys


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        prog="ops-check",
        description="Pre-deploy service reachability checker",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  ops-check --hosts db.internal cache.internal --port 6379
  ops-check --hosts api.example.com --timeout 5 --verbose
""",
    )
    parser.add_argument(
        "--hosts",
        nargs="+",  # one or more values
        required=True,
        metavar="HOST",
        help="Hostnames or IPs to check",
    )
    parser.add_argument(
        "--port",
        type=int,
        default=80,
        help="TCP port to probe (default: 80)",
    )
    parser.add_argument(
        "--timeout",
        type=float,
        default=3.0,
        help="Connection timeout in seconds (default: 3.0)",
    )
    parser.add_argument(
        "--verbose", "-v",
        action="store_true",
        help="Print per-host results",
    )
    return parser.parse_args()


def check_host(host: str, port: int, timeout: float) -> bool:
    # nc -w takes whole seconds; round up so 0.5 does not truncate to 0
    wait = max(1, math.ceil(timeout))
    result = subprocess.run(
        ["nc", "-z", "-w", str(wait), host, str(port)],
        capture_output=True,
    )
    return result.returncode == 0


def main() -> None:
    args = parse_args()
    failures = []

    for host in args.hosts:
        ok = check_host(host, args.port, args.timeout)
        if args.verbose:
            status = "OK" if ok else "FAIL"
            print(f" [{status}] {host}:{args.port}")
        if not ok:
            failures.append(host)

    if failures:
        print(f"UNREACHABLE: {', '.join(failures)}", file=sys.stderr)
        sys.exit(1)  # non-zero: the CI pipeline sees this as a failure

    print(f"All {len(args.hosts)} host(s) reachable on port {args.port}.")
    sys.exit(0)


if __name__ == "__main__":
    main()
```
Key idea — always call sys.exit() explicitly. When a function returns normally, Python exits with code 0. That is wrong for an ops tool that detected failures. Every code path in main() must end with an explicit sys.exit(code). CI/CD systems, shell scripts, and monitoring wrappers all gate on exit codes, not on output text.

Exit Code Conventions Every Ops Tool Must Follow

Exit codes are the Unix API between programs. A tool that exits 0 on failure is broken by definition — any shell script, pipeline stage, or monitoring check that calls it will silently move on. The conventions are simple and non-negotiable:

  • 0 — success, everything worked as expected.
  • 1 — general operational failure (service unreachable, validation failed, resource not found).
  • 2 — misuse of the CLI itself (wrong arguments, missing required flag). argparse and click exit 2 automatically for argument errors — do not override this.
  • 130 — script interrupted by the user (Ctrl+C, SIGINT). Catch KeyboardInterrupt and exit 130 so the shell knows the user stopped it intentionally.

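Never use exit codes above 125 for application-level errors. The shell gives those values its own meaning: 126 means the command was found but could not be executed, 127 means the command was not found, and 128+N means the process was killed by signal N, which is why Ctrl+C (SIGINT, signal 2) maps to 130. Never exit non-zero after successfully completing the task, even if a warning was printed.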
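The conventions above can be sketched in a few lines of argparse. This is a minimal illustration, not a prescribed layout: the tool name exit-demo, the --target flag, and the check() stand-in are all hypothetical. The one deliberate design choice is that run() returns the exit code instead of exiting, which keeps every code path easy to test.

```python
import argparse
import sys

EXIT_OK = 0
EXIT_FAILURE = 1
EXIT_INTERRUPTED = 130  # 128 + 2 (SIGINT)


def build_parser() -> argparse.ArgumentParser:
    # "exit-demo" and "--target" are hypothetical names used for illustration.
    parser = argparse.ArgumentParser(prog="exit-demo")
    parser.add_argument("--target", required=True, help="Thing to check")
    return parser


def check(target: str) -> bool:
    """Stand-in for real work: only the literal 'healthy' counts as success."""
    return target == "healthy"


def run(argv: list[str], check_fn=check) -> int:
    """Return the exit code so callers (and tests) decide when to exit."""
    # parse_args() exits with status 2 by itself on bad or missing arguments.
    args = build_parser().parse_args(argv)
    try:
        ok = check_fn(args.target)
    except KeyboardInterrupt:
        print("Interrupted.", file=sys.stderr)
        return EXIT_INTERRUPTED
    if not ok:
        print(f"FAILED: {args.target}", file=sys.stderr)
        return EXIT_FAILURE
    print(f"OK: {args.target}")
    return EXIT_OK


# In the script's entry point you would call: sys.exit(run(sys.argv[1:]))
```

Wiring it up with sys.exit(run(sys.argv[1:])) still ends every path with an explicit exit code, and echo $? after each scenario should show 0, 1, 2, or 130 accordingly.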

click: The Production-Grade Choice for Multi-Command CLIs

click (Command Line Interface Creation Kit) composes commands, subcommands, and option groups in a way that argparse cannot match cleanly. Any internal tool that grows beyond one verb — ops deploy, ops rollback, ops status — should be built on click. It also integrates naturally with rich for colored terminal output, which matters when you are reading a wall of text in a dark terminal window at 02:00.

```python
#!/usr/bin/env python3
"""ops-cli: multi-command internal infrastructure CLI built with click."""
import sys

import click


# ── top-level group ──────────────────────────────────────────────
@click.group()
@click.version_option(version="1.0.0", prog_name="ops-cli")
def cli() -> None:
    """Internal ops automation CLI. Run a subcommand with --help for details."""


# ── deploy subcommand ────────────────────────────────────────────
@cli.command()
@click.argument("service")
@click.option("--env", "-e",
              type=click.Choice(["staging", "production"], case_sensitive=False),
              required=True, help="Target environment")
@click.option("--image-tag", default="latest", show_default=True,
              help="Docker image tag to deploy")
@click.option("--dry-run", is_flag=True,
              help="Print what would happen without making changes")
@click.pass_context
def deploy(ctx: click.Context, service: str, env: str,
           image_tag: str, dry_run: bool) -> None:
    """Deploy SERVICE to the target environment."""
    if env == "production" and not dry_run:
        # Require explicit confirmation for production
        click.confirm(
            f"Deploy {service}:{image_tag} to PRODUCTION?",
            abort=True,  # raises click.Abort -> exits 1 on 'n'
        )

    action = "[DRY-RUN] Would deploy" if dry_run else "Deploying"
    click.echo(f"{action} {service}:{image_tag} -> {env}")

    if not dry_run:
        # Real deploy logic here (kubectl set image, etc.)
        click.secho(" Deploy complete.", fg="green", bold=True)

    sys.exit(0)


# ── status subcommand ────────────────────────────────────────────
@cli.command()
@click.argument("service")
@click.option("--namespace", "-n", default="default", show_default=True,
              help="Kubernetes namespace")
def status(service: str, namespace: str) -> None:
    """Show running status for SERVICE."""
    click.echo(f"Checking {service} in namespace {namespace} ...")
    # Real implementation: kubectl / boto3 calls
    click.secho(" Running (3/3 replicas ready)", fg="green")
    sys.exit(0)


if __name__ == "__main__":
    cli()
```
Pro practice — use click.secho() for colored output, but only when writing to a terminal. click automatically disables color when stdout is not a TTY (e.g. when the output is piped to grep or redirected to a file). Never use ANSI escape codes manually — they pollute piped output and break log parsers. The same applies to progress bars: use click.progressbar() or rich.progress.Progress, not hand-rolled carriage-return tricks.

CLI Architecture Diagram

[Figure: ops-cli architecture (click multi-command). A user or CI step runs ops-cli deploy svc -e prod. The @click.group() root, cli(), handles the version, global options, and --help, then routes to one of three subcommands: deploy (--env, --image-tag, --dry-run, confirmation), status (--namespace, SERVICE argument, kubectl / boto3 calls), and rollback (--revision, SERVICE argument, confirmation for prod). Every path ends in sys.exit(code): 0 = success, 1 = failure, 2 = bad args, 130 = Ctrl+C.]
A click multi-command CLI: the root group handles global flags and routing; each subcommand owns its arguments and exits with a meaningful code.

Packaging: From Script to Installable Command

A script becomes a real tool when you can install it with pip install . and run it as ops-cli from anywhere on the system — no python script.py prefix, no path juggling. This is done through pyproject.toml entry points, which you already saw in Lesson 1. The directory layout for a small CLI tool follows the src/ layout, which prevents the package from being accidentally imported from the project root without installing it first.

```text
# Project layout for ops-cli
ops-cli/
├── pyproject.toml
├── README.md
├── src/
│   └── ops_cli/
│       ├── __init__.py
│       ├── main.py      # defines cli() with @click.group()
│       ├── deploy.py    # deploy subcommand
│       └── status.py    # status subcommand
└── tests/
    ├── test_deploy.py
    └── test_status.py
```

```toml
# pyproject.toml: the critical part is the [project.scripts] section

# Assumption: setuptools as the build backend; any PEP 517 backend works.
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "ops-cli"
version = "1.0.0"
requires-python = ">=3.11"
dependencies = ["click>=8.1", "rich>=13.7"]

[project.scripts]
ops-cli = "ops_cli.main:cli"   # installs the 'ops-cli' command on PATH
```

```shell
# Install in editable mode during development (changes take effect immediately)
pip install -e .

# Install from a release tag in CI or on a colleague's machine
pip install git+https://github.com/your-org/ops-cli.git@v1.0.0

# Verify the installed entry point works
ops-cli --help
ops-cli --version
```
Production pitfall — never ship a tool with hardcoded environment names or account IDs. A tool installed on 20 engineers' laptops must read its configuration from environment variables or a config file, never from literals in the source code. Use os.environ.get("OPS_ENVIRONMENT", "staging") with a safe default, or accept the value as a CLI option (preferred, because it is explicit and auditable in shell history). Hardcoded prod account IDs are a P0 incident waiting to happen — someone runs a "test" command against the wrong target.
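One way to combine the two sources described above is to let an explicit --env option win and fall back to OPS_ENVIRONMENT only when the option is absent. The sketch below uses argparse so it stays dependency-free. The function name resolve_environment and the ALLOWED tuple are illustrative assumptions, not part of ops-cli.

```python
import argparse
import os
from collections.abc import Mapping

ALLOWED = ("staging", "production")  # assumed set of environments


def resolve_environment(argv: list[str],
                        environ: Mapping[str, str] = os.environ) -> str:
    """Pick the target environment: CLI option first, then env var, then 'staging'."""
    parser = argparse.ArgumentParser(prog="ops-cli")
    parser.add_argument(
        "--env",
        choices=ALLOWED,
        default=None,
        help="Target environment (falls back to OPS_ENVIRONMENT, then staging)",
    )
    args = parser.parse_args(argv)

    value = args.env or environ.get("OPS_ENVIRONMENT", "staging")
    if value not in ALLOWED:
        # argparse does not validate values that arrive via the environment,
        # so reject them here; parser.error() exits with status 2.
        parser.error(f"invalid OPS_ENVIRONMENT value: {value!r}")
    return value
```

Validating the environment variable matters as much as validating the flag: a typo in a shell profile should fail loudly as a usage error rather than silently select a target.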

Testing CLI Tools with click.testing

click ships a CliRunner that invokes your CLI in-process without spawning a subprocess. This is the correct way to unit-test CLI commands — it captures stdout, stderr, and the exit code, and it works inside pytest with zero extra setup. Always test at least: the happy path, a missing-required-argument path (expect exit 2), and a failure path (expect exit 1 with a message on stderr).

```python
# tests/test_cli.py
from click.testing import CliRunner

from ops_cli.main import cli


def test_deploy_dry_run_exits_zero():
    runner = CliRunner()
    result = runner.invoke(cli, ["deploy", "api-gateway", "--env", "staging", "--dry-run"])
    assert result.exit_code == 0
    assert "DRY-RUN" in result.output


def test_deploy_missing_env_exits_two():
    runner = CliRunner()
    result = runner.invoke(cli, ["deploy", "api-gateway"])
    assert result.exit_code == 2  # argparse/click usage error
    assert "Missing option" in result.output


def test_status_shows_service_name():
    runner = CliRunner()
    result = runner.invoke(cli, ["status", "payment-service"])
    assert result.exit_code == 0
    assert "payment-service" in result.output


# Run: pytest tests/ -v
```
Pro practice — ship a --output json flag on every command that produces structured data. Human-readable tables are fine for interactive use, but CI jobs and other scripts that call your tool need machine-parseable output. A single --output option that switches between table (default) and json makes your tool a first-class pipeline citizen. Use Python's json.dumps(data, indent=2) and always write it to stdout (not stderr) so callers can capture it with $(ops-cli status svc --output json).
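A rough sketch of the table/json switch follows. It assumes the command has already collected its results as a list of dictionaries with identical keys. The render() helper and the sample field names are hypothetical, and the table layout is deliberately simple.

```python
import json


def render(rows: list[dict[str, object]], output: str = "table") -> str:
    """Format rows as JSON for machines or as an aligned table for humans."""
    if output == "json":
        return json.dumps(rows, indent=2)
    if output != "table":
        raise ValueError(f"unsupported output format: {output!r}")
    if not rows:
        return ""

    headers = list(rows[0])
    # Each column is as wide as its longest cell or its header.
    widths = {
        h: max(len(h), *(len(str(row[h])) for row in rows)) for h in headers
    }
    lines = ["  ".join(h.upper().ljust(widths[h]) for h in headers).rstrip()]
    for row in rows:
        lines.append(
            "  ".join(str(row[h]).ljust(widths[h]) for h in headers).rstrip()
        )
    return "\n".join(lines)
```

A command would then finish with print(render(rows, output)), which writes to stdout in both modes, so a caller can capture the JSON form directly.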

Production Checklist for Every Ops CLI

Before handing a CLI to the rest of the team, verify these points. Engineering organizations that maintain shared internal tooling often run a brief review covering points like these before a tool is added to the common developer environment:

  • --help on every command and subcommand — auto-generated by click if you write docstrings.
  • --version on the root group — use @click.version_option(); include this in every bug report template.
  • Meaningful exit codes on every code path — test with echo $? after each scenario.
  • No hardcoded credentials, account IDs, or hostnames — all configuration from environment variables or a config file.
  • Structured logging to stderr, results to stdout — stderr is for diagnostics; stdout is for machine-readable output. Never mix them.
  • Idempotent destructive operations — running ops-cli deploy twice should not cause an incident.
  • A --dry-run flag on every mutating command — this is the single best safety net for ops tools.
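The stdout/stderr split from the checklist can be sketched with the standard logging module. This is one possible arrangement under stated assumptions: the logger name ops_cli, the report_status() function, and its result format are all hypothetical.

```python
import logging
import sys


def configure_logging(verbose: bool = False) -> logging.Logger:
    """Send all diagnostics to stderr so stdout carries only results."""
    logger = logging.getLogger("ops_cli")  # hypothetical logger name
    logger.handlers.clear()                # avoid duplicate handlers on re-configuration
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    logger.propagate = False               # keep the root logger from echoing messages
    return logger


def report_status(service: str, ready: int, total: int) -> None:
    log = configure_logging()
    log.info("checking %s", service)       # diagnostic: goes to stderr
    print(f"{service} {ready}/{total}")    # result: goes to stdout
```

With this split, redirecting stdout to a file or a pipe captures only the result line, while the diagnostics stay visible in the terminal or the CI log.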