Shell Scripting & Automation

Project: A Production Backup Script

35 min Lesson 10 of 28

Project: A Production Backup Script

Everything you have built across this tutorial — strict mode, variables, conditionals, loops, functions, redirection, text processing, error handling, and argument parsing — converges here. This capstone project builds a production-grade backup script from scratch, the kind you would find running nightly on database servers at mid-to-large engineering organizations. Backup scripts are deceptively simple to get wrong: they appear to work until the moment you actually need a restore, at which point you discover the archive was corrupt, the retention logic deleted everything, or the failure emails were swallowed silently.

We will build backup.sh in layers, explaining the production rationale at each step.

Layer 1: Safety Harness and Configuration Block

Every production backup script starts with the same two lines and a readonly configuration block. The configuration block must be the single source of truth — no magic strings buried deep in the logic.

#!/usr/bin/env bash
# =============================================================
# backup.sh — Production database + files backup
# Usage: ./backup.sh [--dry-run] [--verbose]
# Cron:  0 2 * * * /opt/scripts/backup.sh >> /var/log/backup.log 2>&1
# =============================================================
set -euo pipefail
IFS=$'\n\t'

# --- Configuration (override via environment variables) ------
readonly SCRIPT_NAME="$(basename "$0")"
readonly SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
readonly TIMESTAMP="$(date +%Y-%m-%d_%H-%M-%S)"

readonly BACKUP_ROOT="${BACKUP_ROOT:-/var/backups/myapp}"
readonly DB_NAME="${DB_NAME:-myapp_production}"
readonly DB_USER="${DB_USER:-backup_user}"
readonly DB_PASS_FILE="${DB_PASS_FILE:-/etc/myapp/db-backup.pass}"
readonly FILES_DIR="${FILES_DIR:-/var/www/myapp/storage}"
readonly RETENTION_DAYS="${RETENTION_DAYS:-30}"
readonly MIN_FREE_GB="${MIN_FREE_GB:-10}"
readonly SLACK_WEBHOOK="${SLACK_WEBHOOK:-}"
readonly LOG_FILE="${BACKUP_ROOT}/logs/backup_${TIMESTAMP}.log"

DRY_RUN=false
VERBOSE=false

Key idea — configuration at the top, secrets from files: Every tunable value lives in the readonly block. The database password is never hardcoded — it is read from a root-owned, mode-600 file (DB_PASS_FILE). This satisfies secret-scanning tools like GitGuardian and truffleHog, and matches the pattern used by PostgreSQL's .pgpass and MySQL's --defaults-extra-file.

Layer 2: Logging, Notification and Cleanup

A backup that fails silently is worse than no backup. Every log line must carry a timestamp, a severity level, and the script name so grep can find it instantly in a log aggregator like Datadog or Splunk. The trap on EXIT ensures partial archives are removed even if the script is killed mid-write.

_log() {
    local level="$1"; shift
    local msg="[${SCRIPT_NAME}] [${level}] $(date +%Y-%m-%dT%H:%M:%S) $*"
    echo "$msg" | tee -a "$LOG_FILE"
}
log_info()  { _log "INFO " "$@"; }
log_warn()  { _log "WARN " "$@" >&2; }
log_error() { _log "ERROR" "$@" >&2; }
log_debug() { $VERBOSE && _log "DEBUG" "$@" || true; }

die() {
    log_error "$*"
    notify_slack ":x: *Backup FAILED* on $(hostname): $*"
    exit 1
}

notify_slack() {
    [[ -z "$SLACK_WEBHOOK" ]] && return 0
    local payload
    payload="$(printf '{"text": "%s"}' "$1")"
    curl -sf --max-time 10 -X POST \
        -H 'Content-Type: application/json' \
        -d "$payload" "$SLACK_WEBHOOK" >/dev/null \
        || log_warn "Slack notification failed (non-fatal)"
}

# CURRENT_ARCHIVE is set later; cleanup removes it if the script dies mid-write
CURRENT_ARCHIVE=""

cleanup() {
    local exit_code=$?
    if [[ -n "$CURRENT_ARCHIVE" && -f "$CURRENT_ARCHIVE" && $exit_code -ne 0 ]]; then
        log_warn "Removing incomplete archive: $CURRENT_ARCHIVE"
        rm -f "$CURRENT_ARCHIVE"
    fi
    exit $exit_code
}

trap cleanup EXIT
trap 'die "Received SIGINT"'  INT
trap 'die "Received SIGTERM"' TERM

Layer 3: Preflight Checks

Fail fast and loud before touching any data. Preflight checks prevent the silent partial failures that give you an empty archive with a 0 exit code.

require_command() {
    command -v "$1" >/dev/null 2>&1 || die "Required command not found: $1"
}

check_free_disk() {
    local dir="$1"
    local required_gb="$2"
    local free_gb
    free_gb=$(df -BG "$dir" | awk 'NR==2 {gsub(/G/,""); print $4}')
    (( free_gb >= required_gb )) \
        || die "Insufficient disk on $dir: ${free_gb}G free, need ${required_gb}G"
    log_info "Disk OK: ${free_gb}G free on $dir"
}

preflight() {
    log_info "--- Preflight checks ---"
    require_command mysqldump
    require_command gzip
    require_command tar
    require_command openssl

    [[ -f "$DB_PASS_FILE" ]] \
        || die "DB password file not found: $DB_PASS_FILE"
    [[ "$(stat -c %a "$DB_PASS_FILE")" == "600" ]] \
        || die "DB password file has unsafe permissions (must be 600): $DB_PASS_FILE"

    mkdir -p "${BACKUP_ROOT}/db" "${BACKUP_ROOT}/files" "${BACKUP_ROOT}/logs"
    check_free_disk "$BACKUP_ROOT" "$MIN_FREE_GB"

    log_info "Preflight passed"
}

Production pitfall — unchecked permissions on credential files: Many teams create the password file but forget to lock it down. If backup_user.pass is world-readable, any process on the host can read your database password. The stat -c %a check enforces mode 600 at runtime, so a misconfigured deploy fails loudly rather than silently leaking credentials.

Layer 4: Backup Functions

Separate the database dump from the file archive. This lets each step be retried, tested, or monitored independently. The --single-transaction flag for MySQL is critical in production: it takes a consistent snapshot without acquiring table locks, so your application keeps serving traffic during the backup window.

backup_database() {
    log_info "--- Database backup ---"
    local db_archive="${BACKUP_ROOT}/db/db_${TIMESTAMP}.sql.gz"
    CURRENT_ARCHIVE="$db_archive"     # register for cleanup trap

    local db_pass
    db_pass="$(cat "$DB_PASS_FILE")"

    if $DRY_RUN; then
        log_info "[DRY-RUN] Would dump $DB_NAME to $db_archive"
        return 0
    fi

    # --single-transaction: consistent snapshot without table locks (InnoDB)
    # --routines --events: include stored procedures and event scheduler objects
    # MYSQL_PWD avoids the password appearing in `ps aux` output
    MYSQL_PWD="$db_pass" mysqldump \
        --user="$DB_USER" \
        --single-transaction \
        --routines \
        --events \
        --quick \
        "$DB_NAME" \
        | gzip -9 > "$db_archive"

    # Verify the archive is non-empty and passes gzip integrity check
    gzip -t "$db_archive" || die "DB archive failed integrity check: $db_archive"
    local size_mb
    size_mb=$(du -m "$db_archive" | awk '{print $1}')
    log_info "DB archive OK: $db_archive (${size_mb}MB)"
    CURRENT_ARCHIVE=""   # clear: archive is complete
}

backup_files() {
    log_info "--- Files backup ---"
    local files_archive="${BACKUP_ROOT}/files/files_${TIMESTAMP}.tar.gz"
    CURRENT_ARCHIVE="$files_archive"

    if $DRY_RUN; then
        log_info "[DRY-RUN] Would archive $FILES_DIR to $files_archive"
        return 0
    fi

    # --exclude caches and temp directories; preserve permissions
    tar -czf "$files_archive" \
        --exclude="$FILES_DIR/framework/cache" \
        --exclude="$FILES_DIR/framework/sessions" \
        --exclude="$FILES_DIR/logs" \
        -C "$(dirname "$FILES_DIR")" \
        "$(basename "$FILES_DIR")"

    tar -tzf "$files_archive" >/dev/null || die "Files archive failed integrity check: $files_archive"
    local size_mb
    size_mb=$(du -m "$files_archive" | awk '{print $1}')
    log_info "Files archive OK: $files_archive (${size_mb}MB)"
    CURRENT_ARCHIVE=""
}

Layer 5: Retention Enforcement

Retention logic is where backup scripts most often destroy themselves. The find -mtime +N -delete idiom is the industry standard; avoid hand-rolling loops that compare dates as strings — they break at month and year boundaries.

enforce_retention() {
    log_info "--- Enforcing ${RETENTION_DAYS}-day retention ---"

    local deleted_count=0
    while IFS= read -r -d '' old_file; do
        if $DRY_RUN; then
            log_info "[DRY-RUN] Would delete: $old_file"
        else
            rm -f "$old_file"
            log_info "Deleted old backup: $old_file"
        fi
        (( deleted_count++ )) || true
    done < <(find "$BACKUP_ROOT" \
        \( -name "db_*.sql.gz" -o -name "files_*.tar.gz" \) \
        -mtime "+${RETENTION_DAYS}" \
        -print0)

    log_info "Retention sweep complete: removed ${deleted_count} old archive(s)"
}

Pro practice — use -print0 with read -d '': Filenames can legally contain spaces and newlines. The find -print0 / read -d '' pair uses the null byte as a delimiter, which can never appear in a filename. This is the Google Shell Style Guide's mandatory pattern for iterating over find output safely.

Layer 6: Argument Parsing and main()

The script accepts --dry-run and --verbose flags. Dry-run mode lets operators validate the backup plan in a new environment without touching any data or wasting disk space.

parse_args() {
    while [[ $# -gt 0 ]]; do
        case "$1" in
            --dry-run)   DRY_RUN=true  ; shift ;;
            --verbose)   VERBOSE=true  ; shift ;;
            --help|-h)
                echo "Usage: $SCRIPT_NAME [--dry-run] [--verbose]"
                exit 0
                ;;
            *) die "Unknown argument: $1" ;;
        esac
    done
}

main() {
    parse_args "$@"

    # Ensure log directory exists before any tee calls
    mkdir -p "$(dirname "$LOG_FILE")"

    log_info "========================================="
    log_info "Backup started (host: $(hostname), user: $(whoami))"
    $DRY_RUN && log_warn "DRY-RUN mode: no files will be written"

    preflight
    backup_database
    backup_files
    enforce_retention

    local db_size files_size
    db_size=$(du -sh "${BACKUP_ROOT}/db/db_${TIMESTAMP}.sql.gz" 2>/dev/null | awk '{print $1}' || echo "N/A")
    files_size=$(du -sh "${BACKUP_ROOT}/files/files_${TIMESTAMP}.tar.gz" 2>/dev/null | awk '{print $1}' || echo "N/A")

    log_info "Backup complete. DB: $db_size  Files: $files_size"
    log_info "========================================="

    notify_slack ":white_check_mark: *Backup succeeded* on $(hostname). DB: ${db_size}, Files: ${files_size}"
}

if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
    main "$@"
fi

Execution flow of the production backup script. The trap safety net fires on any exit path, ensuring partial archives never accumulate.

Running and Scheduling

Make the script executable and test it in dry-run mode before scheduling:

# One-time setup
chmod +x /opt/scripts/backup.sh
chown root:root /opt/scripts/backup.sh
chmod 700 /opt/scripts/backup.sh

# Validate without writing data
/opt/scripts/backup.sh --dry-run --verbose

# Real run (test before adding to cron)
/opt/scripts/backup.sh --verbose

# Add to root's crontab — runs at 02:00 every day
# Append BOTH stdout and stderr to a persistent log
crontab -e
# Add this line:
# 0 2 * * * /opt/scripts/backup.sh >> /var/log/backup.log 2>&1

# Monitor recent runs
grep "Backup complete\|Backup FAILED" /var/log/backup.log | tail -20

Pro practice — verify restores on a schedule: A backup that has never been restored is a hypothesis, not a guarantee. At companies like Stripe, automated restore tests run weekly in a staging environment: spin up a temporary database, load the most recent dump, run a smoke query, then tear it down. This is the only reliable proof that your backup is usable.

This script embodies every lesson in the tutorial: set -euo pipefail catches silent failures, trap handles unexpected exits, functions isolate concerns, readonly prevents accidental mutation, --dry-run enables safe validation, and structured logging makes every run auditable. That is production-grade shell scripting.