Linux System Administration

Project: Harden & Operate a Production Server

45 min Lesson 10 of 28

Everything you have learned in this tutorial (systemd, journald, storage, monitoring, performance analysis, networking, SSH hardening, and scheduled tasks) converges here. This capstone project takes a freshly provisioned Ubuntu 24.04 LTS server from "root login allowed, all defaults" to a hardened, observable, self-maintaining system that is ready for production. The steps follow a typical day-one routine for a new server.

What this project covers: We will configure a non-root admin account with locked-down SSH, harden the kernel and service surface, set up a sample application service under systemd with resource limits, configure disk monitoring with automated alerts, set up persistent journald logging with log rotation, and wire up a cron-based maintenance schedule, then verify every layer.

Phase 1: Initial Access and User Setup

You receive a cloud instance with only root access. The first action is creating a dedicated admin account with sudo rights, hardening SSH, and disabling root login entirely.

# On the fresh server as root:

# 1. Create an operations user
useradd -m -s /bin/bash -G sudo opsadmin
passwd opsadmin    # Set a strong passphrase (only needed for sudo prompts)

# 2. Lock the root password (sudo is the only escalation path)
passwd -l root

# 3. Copy your public key to the new account
mkdir -p /home/opsadmin/.ssh
chmod 700 /home/opsadmin/.ssh
# Paste the public key from your workstation:
#   cat ~/.ssh/id_ed25519.pub   (run this on your LOCAL machine first)
echo "ssh-ed25519 AAAA...your-public-key... ops@workstation" \
    >> /home/opsadmin/.ssh/authorized_keys
chmod 600 /home/opsadmin/.ssh/authorized_keys
chown -R opsadmin:opsadmin /home/opsadmin/.ssh
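Before moving on, it can be worth confirming that the key directory and file carry the modes set above, since overly open permissions are a common reason key logins fail. The following is a minimal sketch, assuming GNU stat (coreutils); the function name check_ssh_perms is hypothetical and not part of OpenSSH.

```bash
#!/usr/bin/env bash
# check_ssh_perms DIR: succeed only if DIR is mode 700 and
# DIR/authorized_keys is mode 600 (the modes set in the steps above).
# Hypothetical helper; assumes GNU stat (coreutils), as shipped on Ubuntu.
check_ssh_perms() {
    local dir="$1" dir_mode key_mode
    dir_mode=$(stat -c '%a' "$dir") || return 2
    key_mode=$(stat -c '%a' "$dir/authorized_keys") || return 2
    if [[ "$dir_mode" != "700" ]]; then
        echo "FAIL: $dir has mode $dir_mode, expected 700"
        return 1
    fi
    if [[ "$key_mode" != "600" ]]; then
        echo "FAIL: $dir/authorized_keys has mode $key_mode, expected 600"
        return 1
    fi
    echo "OK: $dir"
}

# Example (on the server): check_ssh_perms /home/opsadmin/.ssh
```

Running it while the root session is still open leaves room to fix a wrong mode before root access is switched off.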

Phase 2: SSH Hardening

Edit /etc/ssh/sshd_config with the following block. These directives form a common production baseline; they are not optional paranoia:

# /etc/ssh/sshd_config: replace or append these directives.
# sshd uses the first value it reads for each keyword, and Ubuntu includes
# /etc/ssh/sshd_config.d/*.conf at the top of this file, so check those
# drop-ins for conflicting settings.

# Non-default port reduces automated scanner noise
Port 2222
# No "Protocol 2" line: SSHv1 support was removed from sshd in OpenSSH 7.4
# Never allow direct root SSH
PermitRootLogin no
# Key-only auth; passwords are brute-forceable
PasswordAuthentication no
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys
# Close the connection after 3 failed authentication attempts
MaxAuthTries 3
# Kill unauthenticated connections after 30 s
LoginGraceTime 30
# Server sends a keepalive probe every 5 min
ClientAliveInterval 300
# Drop the session after 2 unanswered probes (about 10 min unresponsive)
ClientAliveCountMax 2
# Whitelist: only explicitly named users can log in
AllowUsers opsadmin
# No GUI tunnelling on a server
X11Forwarding no
# No TCP port forwarding through SSH (enable only if needed)
AllowTcpForwarding no
# Show a legal warning before authentication
Banner /etc/ssh/ssh_banner
# Create a legal banner (shown BEFORE authentication; a liability requirement at many companies)
cat > /etc/ssh/ssh_banner <<'EOF'
***********************************************************************
Authorised access only. All connections are monitored and logged.
Disconnect immediately if you are not an authorised user.
***********************************************************************
EOF

# Open the non-default port in the firewall first (fail-safe: keep the old session open)
ufw allow 2222/tcp
ufw enable

# Validate the config and restart only if it parses cleanly
# (critical step: a typo here can lock you out)
sshd -t && systemctl restart ssh

# Ubuntu 24.04 starts sshd through ssh.socket. If the new port is not
# listening afterwards, regenerate and restart the socket unit:
#   systemctl daemon-reload && systemctl restart ssh.socket
ss -tlnp | grep ':2222'
Production pitfall: always validate before restarting sshd. Running sshd -t performs a dry-run parse of the configuration and exits non-zero on a typo. If you run systemctl restart ssh without this check, sshd will refuse to start. Your current session usually survives, but no new connection can be opened, and once you disconnect you need a recovery console or serial access from the cloud provider. Keep the existing session open and confirm a fresh login from a second terminal before closing it. This is a common self-inflicted outage among junior engineers.
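The validate-then-apply habit can be wrapped in a small helper so that an unchecked file never reaches the live path. This is a minimal sketch under stated assumptions: the function name apply_config and the candidate path /root/sshd_config.new are hypothetical, and the validator is passed in as arguments (for sshd that would be sshd -t -f, which tests the named file).

```bash
#!/usr/bin/env bash
# apply_config CANDIDATE TARGET VALIDATOR...
# Install CANDIDATE as TARGET only if "VALIDATOR... CANDIDATE" succeeds.
# The previous TARGET is kept as TARGET.bak so it can be restored by hand.
# Hypothetical helper; the validator command is supplied by the caller.
apply_config() {
    local candidate="$1" target="$2"
    shift 2
    if ! "$@" "$candidate"; then
        echo "validation failed; $target left untouched" >&2
        return 1
    fi
    [[ -f "$target" ]] && cp -p "$target" "$target.bak"
    cp "$candidate" "$target"
    echo "installed $target"
}

# Example (as root on the server; /root/sshd_config.new is an assumed path):
#   apply_config /root/sshd_config.new /etc/ssh/sshd_config sshd -t -f \
#     && systemctl restart ssh
```

Editing a copy and installing it only after validation means a broken draft never sits at the path sshd reads on its next start.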

Phase 3: Kernel and OS Hardening

Harden the kernel network stack and disable unused services. These settings follow the kind of recommendations found in the CIS Ubuntu Linux benchmarks, a widely used hardening reference:

# /etc/sysctl.d/99-hardening.conf
# Load with: sysctl --system (or reboot)
# Comments sit on their own lines; the sysctl.d format only documents
# comments that start a line.

# --- Network hardening ---
# Not a router; disable packet forwarding
net.ipv4.ip_forward = 0
# Reverse-path filtering (prevent IP spoofing)
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
# Ignore ICMP redirects (MITM vector)
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.default.accept_redirects = 0
net.ipv6.conf.all.accept_redirects = 0
# Ignore source-routed packets
net.ipv4.conf.all.accept_source_route = 0
# Ignore broadcast pings (smurf amplification)
net.ipv4.icmp_echo_ignore_broadcasts = 1
# SYN flood protection
net.ipv4.tcp_syncookies = 1

# --- Kernel hardening ---
# Full ASLR (address space layout randomisation)
kernel.randomize_va_space = 2
# Non-root users cannot read the kernel ring buffer
kernel.dmesg_restrict = 1
# Disable SysRq (not needed on a headless server)
kernel.sysrq = 0
fs.protected_hardlinks = 1
fs.protected_symlinks = 1
# Apply sysctl settings immediately (no reboot required)
sysctl --system

# Disable and mask unused services (attack surface reduction)
for svc in avahi-daemon cups bluetooth ModemManager; do
    systemctl disable --now "$svc" 2>/dev/null
    systemctl mask "$svc" 2>/dev/null
done

# Verify nothing unexpected is listening (show non-loopback listeners only;
# the pattern also hides 127.0.0.53, used by systemd-resolved)
ss -tulnp | grep -Ev '127\.0\.0\.[0-9]+|\[::1\]'
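To confirm that the running kernel matches the hardening file, the file can be compared key by key against /proc/sys. This is a minimal sketch, assuming simple "key = value" lines; the function name check_sysctl is hypothetical, and the optional second argument exists only so the check can be pointed at a different directory tree.

```bash
#!/usr/bin/env bash
# check_sysctl CONF [PROC_ROOT]
# Compare each "key = value" line in CONF with the live value under
# PROC_ROOT (default /proc/sys) and print every mismatch.
# Returns 0 if everything matches, 1 otherwise. Hypothetical helper.
check_sysctl() {
    local conf="$1" root="${2:-/proc/sys}" key want have path rc=0
    while IFS=$' \t=' read -r key want; do
        # Skip blank lines and comment lines (# or ;)
        case "$key" in ''|'#'*|';'*) continue ;; esac
        # sysctl keys use dots; /proc/sys uses slashes
        path="$root/$(printf '%s' "$key" | tr . /)"
        if ! have=$(cat "$path" 2>/dev/null); then
            echo "MISSING $key"; rc=1; continue
        fi
        if [[ "$have" != "$want" ]]; then
            echo "MISMATCH $key: want '$want', have '$have'"; rc=1
        fi
    done < "$conf"
    return "$rc"
}

# Example (on the server): check_sysctl /etc/sysctl.d/99-hardening.conf
```

A mismatch after sysctl --system usually points to a later file in /etc/sysctl.d overriding the value, or to a line the loader could not parse.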

Phase 4: Deploy and Manage a Service with systemd

A production service must run under a dedicated, unprivileged user, have hard resource limits, restart automatically on failure, and ship logs through journald. This example deploys a simple Python health-check API — the pattern applies to any daemon:

# Create an unprivileged service account (no login shell; home set to the app directory)
useradd -r -s /usr/sbin/nologin -d /opt/healthapi healthapi

# Install a minimal app
mkdir -p /opt/healthapi
cat > /opt/healthapi/server.py <<'EOF'
from http.server import HTTPServer, BaseHTTPRequestHandler
import json
import time


class Handler(BaseHTTPRequestHandler):
    def log_message(self, fmt, *args):
        print(fmt % args, flush=True)  # journald captures stdout

    def do_GET(self):
        if self.path == '/health':
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.end_headers()
            self.wfile.write(json.dumps({'status': 'ok', 'ts': time.time()}).encode())
        else:
            self.send_response(404)
            self.end_headers()


HTTPServer(('127.0.0.1', 8080), Handler).serve_forever()
EOF
chown -R healthapi:healthapi /opt/healthapi
# /etc/systemd/system/healthapi.service
# Comments sit on their own lines: systemd does not support trailing comments.
[Unit]
Description=Health Check API
Documentation=https://wiki.internal/healthapi
After=network.target

[Service]
Type=simple
User=healthapi
Group=healthapi
WorkingDirectory=/opt/healthapi
ExecStart=/usr/bin/python3 /opt/healthapi/server.py
# Restart if the service dies; systemd manages the lifecycle
Restart=on-failure
RestartSec=5s
# All stdout and stderr go to journald (searchable, rotated automatically)
StandardOutput=journal
StandardError=journal

# Resource limits (prevent a runaway process from killing the host)
# Max open file descriptors
LimitNOFILE=65536
# Kill the service if it exceeds 256 MB
MemoryMax=256M
# Never consume more than 20% of one CPU
CPUQuota=20%

# Security sandboxing
NoNewPrivileges=yes
# Service gets its own /tmp (not shared with the system)
PrivateTmp=yes
ProtectSystem=strict
ReadWritePaths=/opt/healthapi

[Install]
WantedBy=multi-user.target
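systemd treats "#" and ";" as comment markers only at the start of a line, so a note placed after a value becomes part of that value and the setting is typically ignored with a warning in the journal. The sketch below scans a unit or drop-in file for that pattern before it is loaded. The function name find_trailing_comments is hypothetical, and the regular expression is a heuristic rather than a full parser.

```bash
#!/usr/bin/env bash
# find_trailing_comments FILE
# Print "LINE_NUMBER:TEXT" for every setting line that is followed by
# whitespace and "#" on the same line.
# Returns 0 when the file is clean, 1 when at least one line is flagged.
# Hypothetical helper; heuristic only.
find_trailing_comments() {
    local file="$1"
    if grep -nE '^[[:space:]]*[^#;[:space:]][^#]*=[^#]*[[:space:]]#' "$file"; then
        return 1
    fi
    return 0
}

# Example (on the server):
#   find_trailing_comments /etc/systemd/system/healthapi.service
```

The same check applies to other systemd configuration files, such as journald drop-ins, because they share the same syntax.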
# Load, enable, start, verify
systemctl daemon-reload
systemctl enable --now healthapi
systemctl status healthapi

# Hit the endpoint from localhost to confirm it is alive
curl -s http://127.0.0.1:8080/health

# Tail the service log in real time (Ctrl-C to stop)
journalctl -u healthapi -f
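A service may need a moment after start before it answers requests, so a single curl right after enabling it can fail spuriously. The following is a minimal sketch of a bounded retry loop. The function name retry is hypothetical; it is not part of curl or systemd.

```bash
#!/usr/bin/env bash
# retry ATTEMPTS DELAY COMMAND...
# Run COMMAND until it succeeds, at most ATTEMPTS times, sleeping DELAY
# seconds between tries. Returns 0 on success, 1 if every attempt failed.
# Hypothetical helper.
retry() {
    local attempts="$1" delay="$2" i
    shift 2
    for ((i = 1; i <= attempts; i++)); do
        if "$@"; then
            echo "succeeded on attempt $i"
            return 0
        fi
        (( i < attempts )) && sleep "$delay"
    done
    echo "gave up after $attempts attempts" >&2
    return 1
}

# Example (on the server):
#   retry 10 1 curl -sf -o /dev/null http://127.0.0.1:8080/health
```

Bounding the number of attempts keeps a deployment script from hanging forever when the service never comes up.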

Phase 5: Disk Monitoring and Automated Alerts

Disk-full incidents can cause failed writes, corrupted files, database crashes, and application hangs, often without a clear error message. Set up automated monitoring to catch a filling disk before that happens:

#!/usr/bin/env bash
# /usr/local/bin/disk-alert.sh
# Send an alert (to a log file + journald) if any filesystem reaches THRESHOLD %
THRESHOLD=80
ALERT_LOG=/var/log/disk-alert.log

# Skip pseudo-filesystems and read-only snap images (squashfs always reports 100%).
# Percent comes first so mount points containing spaces stay intact.
df -x tmpfs -x devtmpfs -x squashfs --output=pcent,target | tail -n +2 |
while read -r pct mount; do
    usage="${pct%\%}"    # Strip the % sign
    [[ "$usage" =~ ^[0-9]+$ ]] || continue    # Skip entries without a percentage
    if [[ "$usage" -ge "$THRESHOLD" ]]; then
        msg="[DISK ALERT] $(date -Iseconds) ${mount} is at ${pct} (threshold: ${THRESHOLD}%)"
        echo "$msg" | tee -a "$ALERT_LOG"
        # In production: replace echo with curl to PagerDuty/Slack webhook, or sendmail
        # Example: curl -s -X POST "$SLACK_WEBHOOK" -d "{\"text\":\"$msg\"}"
        logger -t disk-alert -p user.crit "$msg"    # Also sends to journald via syslog
    fi
done
chmod +x /usr/local/bin/disk-alert.sh

# Wire it into cron (runs every 15 minutes):
echo "*/15 * * * * root /usr/local/bin/disk-alert.sh" \
    > /etc/cron.d/disk-alert

# Check the file cron will read. Files in /etc/cron.d do not appear in
# "crontab -l"; syntax errors are reported in the cron log instead:
cat /etc/cron.d/disk-alert
journalctl -u cron --since "10 minutes ago" --no-pager

# Test manually right now:
/usr/local/bin/disk-alert.sh && echo "Script executed cleanly"
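The threshold comparison is easier to reason about, and to test, when it is separated from the live df call. The sketch below reads percent-first lines on standard input and prints the mount points at or above a limit. The function name over_threshold is hypothetical, and the sample input is made up for illustration.

```bash
#!/usr/bin/env bash
# over_threshold LIMIT
# Read "USE% MOUNTPOINT" lines on stdin (the shape printed by
# `df --output=pcent,target | tail -n +2`) and print "MOUNTPOINT USE" for
# every mount point at or above LIMIT percent. Hypothetical helper.
over_threshold() {
    local limit="$1" pct mount usage
    while read -r pct mount; do
        usage="${pct%\%}"                        # "85%" -> "85"
        [[ "$usage" =~ ^[0-9]+$ ]] || continue   # skip "-" and blank lines
        if (( usage >= limit )); then
            echo "$mount $usage"
        fi
    done
}

# Made-up sample input instead of a live df call, so the result is reproducible
printf '%s\n' ' 42% /' ' 85% /var' '  - /proc' '100% /mnt/backup disk' \
    | over_threshold 80
# prints:
#   /var 85
#   /mnt/backup disk 100
```

Reading the percentage first lets the rest of the line go to the mount point, so a path with spaces is reported in full.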

Phase 6: Log Management and Audit Trail

Configure journald for persistent storage and sensible retention, then wire up logrotate for any plain-text logs your scripts write:

# /etc/systemd/journald.conf.d/99-production.conf
[Journal]
# Keep logs across reboots (the default "auto" does so only if /var/log/journal exists)
Storage=persistent
Compress=yes
# Journals never consume more than 2 GB on disk
SystemMaxUse=2G
# Always leave 500 MB free
SystemKeepFree=500M
# Auto-prune entries older than 90 days
MaxRetentionSec=90day
# Rotate individual journal files weekly
MaxFileSec=1week
# Do not double-write to rsyslog (saves I/O)
ForwardToSyslog=no
# Restart journald to apply
systemctl restart systemd-journald

# Inspect current journal disk usage
journalctl --disk-usage

# Manually vacuum to target size (useful after changing limits):
journalctl --vacuum-size=1G
journalctl --vacuum-time=90d

# logrotate config for our custom alert log
cat > /etc/logrotate.d/disk-alert <<'EOF'
/var/log/disk-alert.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 640 root adm
}
EOF

# Test logrotate config (dry-run)
logrotate -d /etc/logrotate.d/disk-alert

Phase 7: Scheduled Maintenance Tasks

Every production server needs a set of recurring housekeeping jobs. This project schedules them with cron, which suits simple commands; systemd timers are the alternative for jobs that need dependency tracking or resource limits:

# /etc/cron.d/server-maintenance
# Cron uses the system time zone. Keep servers on UTC (timedatectl set-timezone UTC);
# local time zones are a source of bugs.

# Security patches: unattended daily (only security updates, not all updates)
0 3 * * * root unattended-upgrades -d >> /var/log/unattended-upgrades/unattended-upgrades.log 2>&1

# Weekly full package update (review before applying on critical systems)
# The parentheses send the output of both commands to the log
0 4 * * 0 root (apt-get update -qq && apt-get upgrade -y -q) >> /var/log/apt-weekly.log 2>&1

# Disk check every 15 minutes: already installed as /etc/cron.d/disk-alert in Phase 5,
# so it is not repeated here (a second entry would raise every alert twice)

# Remove temp files older than 7 days (prevent /tmp from filling up)
0 2 * * * root find /tmp -type f -mtime +7 -delete 2>/dev/null

# Verify the healthapi service is responding (lightweight uptime probe)
*/5 * * * * root curl -sf http://127.0.0.1:8080/health >/dev/null || systemctl restart healthapi
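Files in /etc/cron.d differ from user crontabs in that every job line carries a user field between the schedule and the command, and leaving it out is an easy mistake. The sketch below counts fields per job line to catch that case. The function name check_cron_d is hypothetical, and the check does not validate the time expressions themselves.

```bash
#!/usr/bin/env bash
# check_cron_d FILE
# Flag job lines in an /etc/cron.d-style file that have too few fields:
# five time fields + user + command (7), or "@keyword" + user + command (3).
# Comment lines, blank lines and NAME=value lines are skipped.
# Hypothetical helper; returns 0 if no line is flagged, 1 otherwise.
check_cron_d() {
    awk '
        BEGIN { bad = 0 }
        /^[ \t]*#/ || /^[ \t]*$/               { next }   # comment or blank line
        /^[ \t]*[A-Za-z_][A-Za-z0-9_]*[ \t]*=/ { next }   # NAME=value setting
        {
            need = ($1 ~ /^@/) ? 3 : 7
            if (NF < need) {
                printf "line %d: expected at least %d fields, found %d\n", NR, need, NF
                bad = 1
            }
        }
        END { exit bad }
    ' "$1"
}

# Example (on the server): check_cron_d /etc/cron.d/server-maintenance
```

A line copied from a user crontab has only six fields, so cron would read the first word of the command as the user name; this check reports such lines.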

Phase 8: Final Verification Checklist

Before declaring a server production-ready, run through this verification sequence. This is the equivalent of a pre-flight checklist — every item must pass:

Production server hardening checklist layers

Control Layer                                   Verification Command
----------------------------------------------  ----------------------------------------------------------
SSH Hardening                                   sshd -T | grep -E 'permitrootlogin|passwordauthentication'
  Port 2222, key-only, no root, whitelist       ss -tlnp | grep ':2222'
Minimal Service Surface                         systemctl list-units --state=running
  Only required services running                ss -tulnp | grep -v 127
Kernel Hardening                                sysctl net.ipv4.tcp_syncookies
  sysctl settings applied and persistent        sysctl kernel.randomize_va_space
Application Service                             systemctl status healthapi
  Unprivileged user, resource limits, restarts  curl http://127.0.0.1:8080/health
Disk Monitoring & Logging                       /usr/local/bin/disk-alert.sh
  Alerts, rotation, journal limits              journalctl --disk-usage
Scheduled Tasks                                 ls -l /etc/cron.d/
  Cron jobs, unattended upgrades, probes        systemctl list-timers
Firewall (UFW)                                  ufw status verbose
  Only port 2222 (SSH) open externally          nmap -sV -p 1-65535 <server-ip>
The seven control layers of a hardened production server, with their corresponding verification commands.
# Run all verification checks in sequence; each should return the expected value
echo "=== SSH config ==="
sshd -T | grep -E '^(permitrootlogin|passwordauthentication|port|allowusers) '

echo "=== Listening ports (external) ==="
ss -tulnp | grep -Ev '127\.0\.0\.[0-9]+|\[::1\]'

echo "=== Kernel sysctl (key values) ==="
sysctl net.ipv4.tcp_syncookies net.ipv4.conf.all.rp_filter kernel.randomize_va_space

echo "=== healthapi service ==="
systemctl is-active healthapi && curl -sf http://127.0.0.1:8080/health

echo "=== Journal disk usage ==="
journalctl --disk-usage

echo "=== Cron jobs ==="
# /etc/cron.d entries are not listed by "crontab -l", so print the files directly
cat /etc/cron.d/disk-alert /etc/cron.d/server-maintenance

echo "=== Firewall ==="
ufw status numbered

echo "=== All services (running) ==="
systemctl list-units --type=service --state=running --no-pager
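Reading command output by eye works for one server, but a pass/fail tally makes the checklist usable in scripts. The following is a minimal sketch of such a harness. The function names check and summary are hypothetical, and the commented wiring lines assume the settings applied earlier in this project.

```bash
#!/usr/bin/env bash
# Minimal pass/fail harness. check LABEL EXPECTED ACTUAL compares two strings
# and keeps a failure count; summary prints the tally and returns non-zero
# if anything failed. Function names are hypothetical.
failures=0
total=0

check() {
    local label="$1" expected="$2" actual="$3"
    total=$((total + 1))
    if [[ "$actual" == "$expected" ]]; then
        echo "PASS  $label"
    else
        echo "FAIL  $label (expected '$expected', got '$actual')"
        failures=$((failures + 1))
    fi
}

summary() {
    echo "$((total - failures))/$total checks passed"
    (( failures == 0 ))
}

# Example wiring (on the server, as root):
#   check "root SSH login" "permitrootlogin no" "$(sshd -T | grep '^permitrootlogin ')"
#   check "ASLR" "2" "$(sysctl -n kernel.randomize_va_space)"
#   check "healthapi" "active" "$(systemctl is-active healthapi)"
#   summary || exit 1
```

Because summary returns non-zero on any failure, the script can gate a deployment step or a monitoring probe.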
Pro tip: infrastructure as code. Once you have built and verified this server manually, capture every configuration decision in Ansible playbooks or a Terraform + cloud-init template. The next server should take under ten minutes to reproduce identically. Organisations that run large fleets avoid configuring production servers by hand: everything is version-controlled, code-reviewed, and applied by automation. Manual configuration is technical debt that silently accumulates until someone is paged at 2 AM by a snowflake server that behaves differently from every other host in the fleet.

What You Have Built

At this point you have a server that follows common production practice:

  • Identity: No root SSH, key-only auth, dedicated unprivileged user, legal banner.
  • Network surface: Firewall default-deny; only port 2222 open; kernel hardened against common network attacks.
  • Service hygiene: Application runs as a least-privilege service account under systemd, with memory and CPU caps, automatic restart, and full journald logging.
  • Observability: Persistent journal with retention limits; disk alerts wired to cron every 15 minutes.
  • Self-maintenance: Unattended security updates, weekly package upgrades, temp-file cleanup, and automated service recovery all run on schedule — the server heals itself for common failure modes without on-call intervention.
Next steps in a real organisation: This project covers the OS-level baseline. In practice you would layer on: centralised log shipping (Loki, Splunk, or CloudWatch Logs), host-based intrusion detection (AIDE, Falco), external uptime monitoring (Prometheus Blackbox Exporter or a SaaS probe), secrets management (Vault), and configuration management automation (Ansible, Chef, or Puppet) so the entire setup is reproducible in code. Those topics are covered in later tutorials in this DevOps path.