Building CLI Tools
At some point every ops script graduates from a one-liner you type manually to a shareable tool that teammates install and run daily. The moment that happens, a raw script breaks down: it has no --help text, fails cryptically when a required argument is missing, exits 0 even on error, and cannot be piped cleanly into other commands. A proper CLI tool solves all of these — and in the DevOps world, building CLIs is a core engineering skill. Tools like the AWS CLI, kubectl, gh, and terraform are all CLI programs that operators trust with production systems every day.
This lesson covers two approaches — the standard-library argparse for simple tools, and the third-party click for anything you expect to grow — plus exit-code conventions and the pyproject.toml packaging step that turns a script into a distributable command.
argparse: The Standard-Library Baseline
argparse is part of the Python standard library. It parses sys.argv, converts and validates argument types, generates --help automatically, and, when a required argument is missing, prints a short usage error and exits with status 2. For an internal one-off tool that will not grow beyond a single command, argparse is the right choice because it has zero dependencies.
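A minimal sketch of such a single-command tool is shown here. The program name svc-check, the --timeout option, and the check_service() helper are hypothetical and stand in for whatever the real tool does; the point is only the shape: build a parser, parse, and return an explicit exit code.

```python
import argparse
import sys


def check_service(name: str, timeout: float) -> bool:
    """Hypothetical health check; a real tool would probe the service here."""
    return name != "broken-svc"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="svc-check",
        description="Check whether a service is healthy.",
    )
    parser.add_argument("service", help="name of the service to check")
    parser.add_argument(
        "--timeout",
        type=float,
        default=5.0,
        help="seconds to wait before giving up (default: 5.0)",
    )
    return parser


def main(argv=None) -> int:
    # parse_args() prints a usage error and exits with status 2 on its own
    # when "service" is missing or --timeout is not a number.
    args = build_parser().parse_args(argv)
    if check_service(args.service, args.timeout):
        print(f"{args.service}: healthy")
        return 0
    print(f"{args.service}: unhealthy", file=sys.stderr)
    return 1
```

When the file is run as a script, ending it with sys.exit(main()) under the usual __main__ guard hands the returned code to the shell. Returning an int from main() rather than calling sys.exit() deep inside it also keeps the function easy to call from tests.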
Call sys.exit() explicitly. When main() simply returns, the interpreter exits with code 0, which is wrong for an ops tool that detected failures. Every code path in main() must end with an explicit sys.exit(code). CI/CD systems, shell scripts, and monitoring wrappers all gate on exit codes, not on output text.
Exit Code Conventions Every Ops Tool Must Follow
Exit codes are the Unix API between programs. A tool that exits 0 on failure is broken by definition — any shell script, pipeline stage, or monitoring check that calls it will silently move on. The conventions are simple and non-negotiable:
- 0: success, everything worked as expected.
- 1: general operational failure (service unreachable, validation failed, resource not found).
- 2: misuse of the CLI itself (wrong arguments, missing required flag). argparse and click exit 2 automatically for argument errors; do not override this.
- 130: script interrupted by the user (Ctrl+C, SIGINT). Catch KeyboardInterrupt and exit 130 so the shell knows the user stopped it intentionally.
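Never use exit codes above 125 for application-level errors. The shell uses 126 and 127 for commands that cannot be executed or cannot be found, and 128+N to report termination by signal N, which is why an interrupted script exits 130 (128 plus 2, the signal number of SIGINT). Never exit non-zero after successfully completing the task, even if a warning was printed.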
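One way to keep these conventions in a single place is a small wrapper that translates outcomes into exit codes. This is a sketch under assumptions: OperationalError and run() are hypothetical names introduced for illustration, not part of any library.

```python
import sys

EXIT_OK = 0
EXIT_FAILURE = 1
EXIT_USAGE = 2          # argparse and click already use this for bad arguments
EXIT_INTERRUPTED = 130  # 128 + 2 (SIGINT)


class OperationalError(Exception):
    """Hypothetical exception for expected failures (service down, not found)."""


def run(task) -> int:
    """Run a zero-argument callable and translate its outcome into an exit code."""
    try:
        task()
    except KeyboardInterrupt:
        print("interrupted", file=sys.stderr)
        return EXIT_INTERRUPTED
    except OperationalError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return EXIT_FAILURE
    return EXIT_OK
```

A script would then finish with sys.exit(run(some_task)), so every path, including Ctrl+C, leaves with a deliberate code.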
click: The Production-Grade Choice for Multi-Command CLIs
click (Command Line Interface Creation Kit) composes commands, subcommands, and option groups in a way that argparse cannot match cleanly. Any internal tool that grows beyond one verb — ops deploy, ops rollback, ops status — should be built on click. It also integrates naturally with rich for colored terminal output, which matters when you are reading a wall of text in a dark terminal window at 02:00.
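The ops deploy / ops status shape described above could be sketched roughly as follows. The command bodies are placeholders, and the --environment option with its OPS_ENVIRONMENT fallback and staging default is an illustrative assumption about how such a tool might select its target.

```python
import click


@click.group()
@click.option(
    "--environment",
    "-e",
    envvar="OPS_ENVIRONMENT",  # used when the flag is not given
    default="staging",         # safe default if neither flag nor variable is set
    show_default=True,
    help="Target environment.",
)
@click.pass_context
def cli(ctx: click.Context, environment: str) -> None:
    """ops: a hypothetical multi-command operations tool."""
    ctx.ensure_object(dict)
    ctx.obj["environment"] = environment


@cli.command()
@click.argument("service")
@click.pass_context
def deploy(ctx: click.Context, service: str) -> None:
    """Deploy SERVICE to the selected environment."""
    click.echo(f"deploying {service} to {ctx.obj['environment']}")


@cli.command()
@click.argument("service")
@click.pass_context
def status(ctx: click.Context, service: str) -> None:
    """Show the status of SERVICE."""
    # secho adds color; click strips the styling when output is not a terminal.
    click.secho(f"{service}: ok ({ctx.obj['environment']})", fg="green")
```

Each function becomes a subcommand of the group, its docstring becomes the --help text, and adding a new verb such as rollback is one more decorated function rather than a change to a central parser.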
Use click.secho() for colored output, but only when writing to a terminal. click automatically strips color when stdout is not a TTY (for example when the output is piped to grep or redirected to a file). Never write ANSI escape codes by hand: they pollute piped output and break log parsers. The same applies to progress bars: use click.progressbar() or rich.progress.Progress, not hand-rolled carriage-return tricks.
[Figure: CLI architecture diagram]
Packaging: From Script to Installable Command
A script becomes a real tool when you can install it with pip install . and run it as ops-cli from anywhere on the system — no python script.py prefix, no path juggling. This is done through pyproject.toml entry points, which you already saw in Lesson 1. The directory layout for a small CLI tool follows the src/ layout, which prevents the package from being accidentally imported from the project root without installing it first.
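A pyproject.toml for this layout might look like the fragment below. It is a sketch, not the lesson's actual file: the package name ops_cli, the module cli.py, the function name cli, the version numbers, and the choice of setuptools as build backend are all assumptions.

```toml
# Assumed layout:
#   pyproject.toml
#   src/ops_cli/__init__.py
#   src/ops_cli/cli.py      (defines the click group "cli")

[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "ops-cli"
version = "0.1.0"
requires-python = ">=3.9"
dependencies = ["click>=8.1"]

[project.scripts]
# Installs an "ops-cli" executable that calls cli() in src/ops_cli/cli.py.
ops-cli = "ops_cli.cli:cli"

[tool.setuptools.packages.find]
where = ["src"]
```

After pip install . (or pip install -e . during development), the ops-cli command is on the PATH of the active environment and resolves to the function named in [project.scripts].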
Never hardcode the target environment. Read it from os.environ.get("OPS_ENVIRONMENT", "staging") with a safe default, or accept the value as a CLI option (preferred, because it is explicit and auditable in shell history). Hardcoded prod account IDs are a P0 incident waiting to happen: someone runs a "test" command against the wrong target.
Testing CLI Tools with click.testing
click ships a CliRunner that invokes your CLI in-process without spawning a subprocess. This is the correct way to unit-test CLI commands — it captures stdout, stderr, and the exit code, and it works inside pytest with zero extra setup. Always test at least: the happy path, a missing-required-argument path (expect exit 2), and a failure path (expect exit 1 with a message on stderr).
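The three recommended test cases can be sketched as follows. The status command here is a hypothetical stand-in for the command under test; the assertions check result.output rather than a separate stderr attribute, because how CliRunner exposes stderr has differed between click releases.

```python
import click
from click.testing import CliRunner


@click.command()
@click.argument("service")
def status(service: str) -> None:
    """Hypothetical command under test: report whether SERVICE is healthy."""
    if service == "broken-svc":
        # ClickException prints "Error: ..." to stderr and exits with code 1.
        raise click.ClickException(f"{service} is unreachable")
    click.echo(f"{service}: healthy")


def test_happy_path() -> None:
    result = CliRunner().invoke(status, ["api"])
    assert result.exit_code == 0
    assert "api: healthy" in result.output


def test_missing_argument_exits_2() -> None:
    result = CliRunner().invoke(status, [])
    assert result.exit_code == 2


def test_failure_exits_1() -> None:
    result = CliRunner().invoke(status, ["broken-svc"])
    assert result.exit_code == 1
    assert "unreachable" in result.output
```

Because invoke() runs the command in-process, these tests run in milliseconds and need no installed entry point; pytest collects them like any other test functions.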
Add an --output json flag to every command that produces structured data. Human-readable tables are fine for interactive use, but CI jobs and other scripts that call your tool need machine-parseable output. A single --output option that switches between table (default) and json makes your tool a first-class pipeline citizen. Use Python's json.dumps(data, indent=2) and always write it to stdout (not stderr) so callers can capture it with $(ops-cli status svc --output json).
Production Checklist for Every Ops CLI
Before handing a CLI to the rest of the team, verify these points. At companies like Stripe and Cloudflare, internal tooling goes through a brief review that checks exactly this list before it gets added to the shared developer environment:
- --help on every command and subcommand: auto-generated by click if you write docstrings.
- --version on the root group: use @click.version_option(); include this in every bug report template.
- Meaningful exit codes on every code path: test with echo $? after each scenario.
- No hardcoded credentials, account IDs, or hostnames: all configuration from environment variables or a config file.
- Structured logging to stderr, results to stdout: stderr is for diagnostics; stdout is for machine-readable output. Never mix them.
- Idempotent destructive operations: running ops-cli deploy twice should not cause an incident.
- A --dry-run flag on every mutating command: this is the single best safety net for ops tools.
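Several of the checklist items (--version, --dry-run, diagnostics on stderr, results on stdout in table or JSON form) can be seen together in one small command. This is an illustrative sketch: the deploy body is a placeholder, the version string is made up, and a real tool would attach --version to its root group rather than to a single command.

```python
import json

import click


@click.command()
@click.argument("service")
@click.option("--dry-run", is_flag=True, help="Show what would change without doing it.")
@click.option(
    "--output",
    "output_format",
    type=click.Choice(["table", "json"]),
    default="table",
    show_default=True,
    help="Output format.",
)
@click.version_option("0.1.0", prog_name="ops-cli")
def deploy(service: str, dry_run: bool, output_format: str) -> None:
    """Deploy SERVICE (hypothetical; the real deploy step is omitted)."""
    # Diagnostics go to stderr so they never corrupt captured stdout.
    if dry_run:
        click.echo(f"dry run: {service} would be deployed", err=True)
    else:
        click.echo(f"deploying {service}", err=True)  # real mutation would go here

    result = {"service": service, "action": "deploy", "applied": not dry_run}

    # Results go to stdout, in the format the caller asked for.
    if output_format == "json":
        click.echo(json.dumps(result, indent=2))
    else:
        for key, value in result.items():
            click.echo(f"{key:<10}{value}")
```

With this split, a caller can capture the JSON with $(ops-cli deploy api --dry-run --output json) while the diagnostic line still shows up in the terminal or the job log.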