Linux System Administration

Project: Harden & Operate a Production Server

45 min Lesson 10 of 28

Everything you have learned in this tutorial (systemd, journald, storage, monitoring, performance analysis, networking, SSH hardening, and scheduled tasks) converges here. This capstone project takes a freshly provisioned Ubuntu 24.04 LTS server from "root login allowed, all defaults" to a hardened, observable, self-maintaining system suitable for production. The steps follow a typical day-one checklist for a new server.

What this project covers: We will configure a non-root admin account with locked-down SSH, harden the kernel and service surface, set up a sample application service under systemd with resource limits, configure disk monitoring with automated alerts, set up persistent journald logging with log rotation, and wire up a cron-based maintenance schedule. The final phase verifies every layer.

Phase 1: Initial Access and User Setup

You receive a cloud instance with only root access. The first action is creating a dedicated admin account with sudo rights, hardening SSH, and disabling root login entirely.

# On the fresh server as root:

# 1. Create an operations user
useradd -m -s /bin/bash -G sudo opsadmin
passwd opsadmin                 # Set a strong passphrase — only needed for sudo prompts

# 2. Lock the root password (sudo is the only escalation path)
passwd -l root

# 3. Copy your public key to the new account
mkdir -p /home/opsadmin/.ssh
chmod 700 /home/opsadmin/.ssh
# Paste the public key from your workstation:
# cat ~/.ssh/id_ed25519.pub  (run this on your LOCAL machine first)
echo "ssh-ed25519 AAAA...your-public-key... ops@workstation" \
  >> /home/opsadmin/.ssh/authorized_keys
chmod 600 /home/opsadmin/.ssh/authorized_keys
chown -R opsadmin:opsadmin /home/opsadmin/.ssh
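
Before moving on, it is worth confirming that the key files really have the modes sshd expects, because a wrong mode makes key login fail silently. The following is a minimal sketch, not part of the original lesson: check_ssh_perms is a hypothetical helper name, and it assumes GNU stat as shipped on Ubuntu.

```shell
#!/usr/bin/env bash
# Sketch: verify ~/.ssh permissions for an account (hypothetical helper).
# Usage: check_ssh_perms /home/opsadmin
check_ssh_perms() {
    local home_dir="$1" rc=0
    local dir="$home_dir/.ssh" file="$home_dir/.ssh/authorized_keys"

    # stat -c '%a' prints the octal mode (GNU coreutils)
    if [[ "$(stat -c '%a' "$dir" 2>/dev/null)" != "700" ]]; then
        echo "FAIL: $dir should be mode 700"; rc=1
    fi
    if [[ "$(stat -c '%a' "$file" 2>/dev/null)" != "600" ]]; then
        echo "FAIL: $file should be mode 600"; rc=1
    fi
    # The file must contain at least one line that looks like a public key
    if ! grep -qE '^(ssh-ed25519|ssh-rsa|ecdsa-sha2-)' "$file" 2>/dev/null; then
        echo "FAIL: no public key found in $file"; rc=1
    fi
    if (( rc == 0 )); then
        echo "OK: $dir looks sane"
    fi
    return "$rc"
}
```

Run it as check_ssh_perms /home/opsadmin, then confirm sudo works with sudo -l -U opsadmin and log in as opsadmin from a second terminal before you close the root session.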

Phase 2: SSH Hardening

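Edit /etc/ssh/sshd_config with the following block. These directives form a widely used baseline; treat them as standard practice rather than optional paranoia. On Ubuntu, files under /etc/ssh/sshd_config.d/ are included at the top of sshd_config and sshd keeps the first value it reads for each directive, so check that no drop-in (cloud images often ship one) overrides these settings: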

# /etc/ssh/sshd_config  — replace or append these directives

Port 2222                          # Non-default port reduces automated scanner noise
# "Protocol 2" is omitted: OpenSSH removed SSHv1 support and the directive is deprecated
PermitRootLogin no                 # Never allow direct root SSH
PasswordAuthentication no          # Key-only auth; passwords are brute-forceable
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys
MaxAuthTries 3                     # Disconnect after 3 failed attempts per connection
LoginGraceTime 30                  # Kill unauthenticated connections after 30 s
ClientAliveInterval 300            # Server sends keepalive every 5 min
ClientAliveCountMax 2              # Drop after 2 unanswered keepalives (about 10 min unresponsive)
AllowUsers opsadmin                # Whitelist — only explicitly named users can log in
X11Forwarding no                   # No GUI tunnelling on a server
AllowTcpForwarding no              # Prevent using SSH as a SOCKS proxy (unless needed)
Banner /etc/ssh/ssh_banner         # Show a legal warning before authentication

# Create a legal banner (shown BEFORE authentication — a liability requirement at many companies)
cat > /etc/ssh/ssh_banner <<'EOF'
***********************************************************************
  Authorised access only. All connections are monitored and logged.
  Disconnect immediately if you are not an authorised user.
***********************************************************************
EOF

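# Open the new port in the firewall BEFORE restarting sshd (keep your current session open)
ufw allow 2222/tcp
ufw --force enable           # --force skips the interactive confirmation prompt

# Validate the config and restart only if it parses (a typo here can lock you out)
sshd -t && systemctl restart ssh

# Ubuntu 24.04 starts sshd through socket activation by default; if port 2222 is not
# listening after the restart, regenerate and restart the socket unit as well:
#   systemctl daemon-reload && systemctl restart ssh.socket

# From a SECOND terminal, confirm you can log in before closing this session:
#   ssh -p 2222 opsadmin@<server-ip>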

Production pitfall — always validate before restarting sshd: Running sshd -t performs a dry-run config parse. A typo in sshd_config will exit non-zero; if you systemctl restart ssh without this check, sshd will refuse to start and you will be locked out of the box — requiring a recovery console or cloud instance serial access. This is one of the most common self-inflicted outages among junior engineers.
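
The validate-then-apply habit can be made mechanical with a small wrapper. This is a sketch under stated assumptions rather than a standard tool: apply_if_valid is a hypothetical helper name, and both arguments are plain command strings.

```shell
#!/usr/bin/env bash
# Sketch: run a validator first and apply the change only if it passes.
# apply_if_valid is a hypothetical helper; both arguments are command strings.
apply_if_valid() {
    local validate_cmd="$1" apply_cmd="$2"
    if bash -c "$validate_cmd"; then
        bash -c "$apply_cmd"
    else
        echo "Validation failed; nothing was applied." >&2
        return 1
    fi
}

# Intended use on the server (as root), after taking a backup:
#   cp -a /etc/ssh/sshd_config /etc/ssh/sshd_config.bak
#   apply_if_valid "sshd -t" "systemctl restart ssh"
```

The same wrapper fits any service that offers a config check, so the restart never runs against a file that failed validation.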

Phase 3: Kernel and OS Hardening

Harden the kernel network stack and disable unused services. These settings follow widely used hardening guidance of the kind published in the CIS benchmarks for Ubuntu Linux:

# /etc/sysctl.d/99-hardening.conf
# Load with: sysctl --system  (or reboot)
# Comments sit on their own lines: sysctl.d only documents whole-line comments.

# --- Network hardening ---
# Not a router; disable packet forwarding
net.ipv4.ip_forward = 0
# Reverse-path filtering (prevent IP spoofing)
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
# Ignore ICMP redirects (MITM vector)
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.default.accept_redirects = 0
net.ipv6.conf.all.accept_redirects = 0
# Ignore source-routed packets
net.ipv4.conf.all.accept_source_route = 0
# Ignore broadcast pings (smurf amplification)
net.ipv4.icmp_echo_ignore_broadcasts = 1
# SYN flood protection
net.ipv4.tcp_syncookies = 1

# --- Kernel hardening ---
# Full ASLR (address space layout randomisation)
kernel.randomize_va_space = 2
# Non-root users cannot read the kernel ring buffer
kernel.dmesg_restrict = 1
# Disable SysRq (not needed on a headless server)
kernel.sysrq = 0
# Restrict hardlink and symlink tricks in world-writable directories
fs.protected_hardlinks = 1
fs.protected_symlinks = 1

# Apply sysctl settings immediately (no reboot required)
sysctl --system

# Disable and mask unused services (attack surface reduction)
for svc in avahi-daemon cups bluetooth ModemManager; do
    systemctl disable --now "$svc" 2>/dev/null
    systemctl mask "$svc" 2>/dev/null
done

# Verify nothing unexpected is listening
ss -tulnp | grep -Ev '127\.0\.0\.[0-9]+|\[::1\]'   # Show non-loopback listeners only
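
To confirm the kernel actually accepted each value, you can compare the file against the live settings under /proc/sys. The sketch below is illustrative only: check_sysctl_file and PROC_SYS_ROOT are hypothetical names, and PROC_SYS_ROOT exists purely so the function can be pointed at a test directory. It does not handle keys whose interface names contain dots.

```shell
#!/usr/bin/env bash
# Sketch: report settings in a sysctl.d file that differ from the running kernel.
# check_sysctl_file and PROC_SYS_ROOT are hypothetical names for this example.
check_sysctl_file() {
    local conf="$1" root="${PROC_SYS_ROOT:-/proc/sys}" rc=0
    local line key want have path
    while IFS= read -r line || [[ -n "$line" ]]; do
        line="${line%%#*}"                     # drop comments
        [[ "$line" == *=* ]] || continue       # skip blanks and non-assignments
        key="$(printf '%s' "${line%%=*}" | tr -d '[:space:]')"
        want="$(printf '%s' "${line#*=}" | tr -d '[:space:]')"
        # Keys map to paths: net.ipv4.ip_forward -> net/ipv4/ip_forward
        path="$root/${key//.//}"
        have=""
        if [[ -r "$path" ]]; then
            have="$(tr -d '[:space:]' < "$path")"
        fi
        if [[ "$have" != "$want" ]]; then
            echo "DRIFT: $key expected=$want actual=${have:-unreadable}"
            rc=1
        fi
    done < "$conf"
    return "$rc"
}

# Intended use on the server:
#   check_sysctl_file /etc/sysctl.d/99-hardening.conf
```

A non-zero exit status means at least one value did not take effect, which usually points to a later file overriding it or a line the loader rejected.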

Phase 4: Deploy and Manage a Service with systemd

A production service must run under a dedicated, unprivileged user, have hard resource limits, restart automatically on failure, and ship logs through journald. This example deploys a simple Python health-check API — the pattern applies to any daemon:

# Create an unprivileged system account (no login shell; home set to the app directory)
useradd -r -s /usr/sbin/nologin -d /opt/healthapi healthapi

# Install a minimal app
mkdir -p /opt/healthapi
cat > /opt/healthapi/server.py <<'EOF'
from http.server import HTTPServer, BaseHTTPRequestHandler
import json, time

class Handler(BaseHTTPRequestHandler):
    def log_message(self, fmt, *args):
        print(fmt % args, flush=True)   # journald captures stdout
    def do_GET(self):
        if self.path == '/health':
            self.send_response(200)
            self.send_header('Content-Type','application/json')
            self.end_headers()
            self.wfile.write(json.dumps({'status':'ok','ts':time.time()}).encode())
        else:
            self.send_response(404)
            self.end_headers()

HTTPServer(('127.0.0.1', 8080), Handler).serve_forever()
EOF

chown -R healthapi:healthapi /opt/healthapi

# /etc/systemd/system/healthapi.service
# Note: systemd treats a line as a comment only when it starts with # or ;
# so every comment below sits on its own line (trailing comments break the value).

[Unit]
Description=Health Check API
Documentation=https://wiki.internal/healthapi
After=network.target

[Service]
Type=simple
User=healthapi
Group=healthapi
WorkingDirectory=/opt/healthapi
ExecStart=/usr/bin/python3 /opt/healthapi/server.py
# Restart if the service dies; systemd manages the lifecycle
Restart=on-failure
RestartSec=5s
# All stdout and stderr go to journald (searchable, rotated automatically)
StandardOutput=journal
StandardError=journal

# Resource limits (prevent a runaway process from killing the host)
# Max open file descriptors
LimitNOFILE=65536
# Kill the service if it exceeds 256 MB
MemoryMax=256M
# Never consume more than 20 % of one CPU
CPUQuota=20%

# Security sandboxing
NoNewPrivileges=yes
# Service gets its own /tmp (not shared with the system)
PrivateTmp=yes
# File system is read-only for this service, except the paths listed below
ProtectSystem=strict
ReadWritePaths=/opt/healthapi

[Install]
WantedBy=multi-user.target

# Load, enable, start, verify
systemctl daemon-reload
systemctl enable --now healthapi
systemctl status healthapi

# Tail the service log in real time
journalctl -u healthapi -f

# Hit the endpoint from localhost to confirm it is alive
curl -s http://127.0.0.1:8080/health
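
A service can take a moment to bind its port, so a single curl immediately after starting it may fail even though the unit is healthy. A minimal retry sketch follows; wait_until is a hypothetical helper name, not a systemd or curl feature.

```shell
#!/usr/bin/env bash
# Sketch: retry a probe command a few times before declaring the service down.
# wait_until is a hypothetical helper: wait_until ATTEMPTS DELAY_SECONDS COMMAND...
wait_until() {
    local attempts="$1" delay="$2" i
    shift 2
    for (( i = 1; i <= attempts; i++ )); do
        if "$@" >/dev/null 2>&1; then
            echo "ready after $i attempt(s)"
            return 0
        fi
        sleep "$delay"
    done
    echo "not ready after $attempts attempt(s)" >&2
    return 1
}

# Intended use right after `systemctl enable --now healthapi`:
#   wait_until 10 1 curl -sf http://127.0.0.1:8080/health
```

To confirm that the resource limits were parsed, systemctl show healthapi -p MemoryMax -p CPUQuotaPerSecUSec prints the values systemd is enforcing.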

Phase 5: Disk Monitoring and Automated Alerts

Disk-full incidents can cause failed writes, database crashes, and application hangs, often without a clear error message. Set up automated monitoring to catch the problem before it happens:

# /usr/local/bin/disk-alert.sh
#!/usr/bin/env bash
# Send an alert (to a log file + journald) if any filesystem exceeds THRESHOLD %

THRESHOLD=80
ALERT_LOG=/var/log/disk-alert.log

# Exclude pseudo filesystems and snap squashfs images (the latter always report 100%)
df --output=pcent,target -x tmpfs -x devtmpfs -x squashfs -x overlay | tail -n +2 |
while read -r pct mount; do
    usage="${pct%\%}"       # Strip the % sign
    [[ "$usage" =~ ^[0-9]+$ ]] || continue    # Skip rows with no usage figure ("-")
    if (( usage >= THRESHOLD )); then
        msg="[DISK ALERT] $(date -Iseconds) ${mount} is at ${pct} (threshold: ${THRESHOLD}%)"
        echo "$msg" | tee -a "$ALERT_LOG"
        # In production: replace echo with curl to PagerDuty/Slack webhook, or sendmail
        # Example: curl -s -X POST "$SLACK_WEBHOOK" -d "{\"text\":\"$msg\"}"
        logger -t disk-alert -p user.crit "$msg"   # Also sends to journald via syslog
    fi
done

chmod +x /usr/local/bin/disk-alert.sh

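# Schedule it with cron (a systemd timer is an equivalent alternative; only cron is shown here)
# Runs every 15 minutes:
echo "*/15 * * * * root /usr/local/bin/disk-alert.sh" \
    > /etc/cron.d/disk-alert
chmod 644 /etc/cron.d/disk-alert

# Verify the job. Files in /etc/cron.d are not listed by `crontab -l`,
# so inspect the file and then look for the next run in cron's log:
cat /etc/cron.d/disk-alert
journalctl -u cron --since "20 min ago" --no-pager | grep disk-alert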

# Test manually right now:
/usr/local/bin/disk-alert.sh && echo "Script executed cleanly"
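
Running the script only proves it works on a disk that is not full. To exercise the threshold logic itself, it helps to isolate the comparison as a filter that can be fed fake df output. This is a sketch, and over_threshold is a hypothetical helper name; it expects "PCT MOUNT" lines, the layout printed by df --output=pcent,target | tail -n +2.

```shell
#!/usr/bin/env bash
# Sketch: the threshold comparison as a stdin filter, testable with fake input.
# over_threshold is a hypothetical helper; it reads "PCT MOUNT" lines on stdin.
over_threshold() {
    local threshold="$1" pct mount usage
    while read -r pct mount; do
        usage="${pct%\%}"                          # "85%" -> "85"
        [[ "$usage" =~ ^[0-9]+$ ]] || continue     # skip "-" and blank lines
        if (( usage >= threshold )); then
            echo "$mount $usage"
        fi
    done
}

# Example with synthetic input (no real disks involved):
#   printf '%s\n' ' 42% /' ' 85% /var' | over_threshold 80
```

Putting the percentage first means a mount point that contains spaces still ends up whole in the second variable.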

Phase 6: Log Management and Audit Trail

Configure journald for persistent storage and sensible retention, then wire up logrotate for any plain-text logs your scripts write:

# /etc/systemd/journald.conf.d/99-production.conf
# (create the directory first: mkdir -p /etc/systemd/journald.conf.d)
[Journal]
# Survive reboots ("auto" persists only if /var/log/journal already exists)
Storage=persistent
Compress=yes
# Journals never consume more than 2 GB on disk
SystemMaxUse=2G
# Always leave 500 MB free
SystemKeepFree=500M
# Auto-prune entries older than 90 days
MaxRetentionSec=90day
# Rotate individual journal files weekly
MaxFileSec=1week
# Do not double-write to rsyslog (saves I/O)
ForwardToSyslog=no

# Restart journald to apply
systemctl restart systemd-journald

# Inspect current journal disk usage
journalctl --disk-usage

# Manually vacuum to target size (useful after changing limits):
journalctl --vacuum-size=1G
journalctl --vacuum-time=90d

# logrotate config for our custom alert log
cat > /etc/logrotate.d/disk-alert <<'EOF'
/var/log/disk-alert.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 640 root adm
}
EOF

# Test logrotate config (dry-run)
logrotate -d /etc/logrotate.d/disk-alert
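
Once alerts accumulate, a quick per-mount summary of the plain-text log shows which filesystem is the repeat offender. The following is a sketch only: summarize_alerts is a hypothetical helper, and it assumes the line format written by disk-alert.sh and mount points without spaces.

```shell
#!/usr/bin/env bash
# Sketch: count alerts per mount point in the plain-text alert log.
# summarize_alerts is a hypothetical helper. Assumed line format:
#   [DISK ALERT] <timestamp> <mount> is at <pct> (threshold: N%)
summarize_alerts() {
    local log="${1:-/var/log/disk-alert.log}"
    awk '$1 == "[DISK" && $2 == "ALERT]" { count[$4]++ }
         END { for (m in count) print count[m], m }' "$log" |
        sort -k1,1nr -k2,2
}
```

The same events are also in the journal; journalctl -t disk-alert -p crit --since yesterday lists them with full metadata.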

Phase 7: Scheduled Maintenance Tasks

Every production server needs a set of recurring housekeeping jobs. This project schedules them all with cron, which suits simple commands; systemd timers are the better choice for jobs that need dependency tracking or resource limits:

# /etc/cron.d/server-maintenance
# cron uses the system time zone; these times assume the server runs in UTC
# (set with: timedatectl set-timezone UTC)

# Security patches: unattended daily (only security updates, not all updates)
# Note: Ubuntu also triggers this through apt-daily-upgrade.timer when unattended-upgrades is installed
0 3 * * * root unattended-upgrades -d >> /var/log/unattended-upgrades/unattended-upgrades.log 2>&1

# Weekly full package update (review before applying on critical systems)
# The parentheses make the redirect capture output from both commands
0 4 * * 0 root ( apt-get update -qq && DEBIAN_FRONTEND=noninteractive apt-get upgrade -y -q ) >> /var/log/apt-weekly.log 2>&1

# Disk check every 15 minutes: already scheduled in /etc/cron.d/disk-alert (Phase 5).
# It is not repeated here, because a second entry would run the script twice.

# Remove temp files older than 7 days (prevent /tmp from filling up)
0 2 * * * root find /tmp -type f -mtime +7 -delete 2>/dev/null

# Verify the healthapi service is responding (lightweight uptime probe)
*/5 * * * * root curl -sf http://127.0.0.1:8080/health >/dev/null || systemctl restart healthapi
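
Unlike a personal crontab, every job line in /etc/cron.d needs a user field between the schedule and the command, and cron skips malformed lines without telling you. A rough lint sketch follows; lint_cron_d is a hypothetical helper, not a full crontab parser, and it only checks that the user field names an account that exists.

```shell
#!/usr/bin/env bash
# Sketch: flag /etc/cron.d job lines whose user field is missing or unknown.
# lint_cron_d is a hypothetical helper, not a complete crontab parser.
lint_cron_d() {
    local file="$1" rc=0 n=0 line user user_idx
    local -a fields
    local comment_re='^[[:space:]]*(#|$)'
    local assign_re='^[[:space:]]*[A-Za-z_][A-Za-z0-9_]*[[:space:]]*='
    while IFS= read -r line || [[ -n "$line" ]]; do
        n=$(( n + 1 ))
        [[ "$line" =~ $comment_re ]] && continue    # comment or blank line
        [[ "$line" =~ $assign_re ]] && continue     # environment line, e.g. MAILTO=root
        read -r -a fields <<< "$line"
        # "@reboot user cmd" has one schedule field; normal lines have five
        if [[ "${fields[0]}" == @* ]]; then user_idx=1; else user_idx=5; fi
        user="${fields[user_idx]:-}"
        if (( ${#fields[@]} < user_idx + 2 )) || ! id -u "$user" >/dev/null 2>&1; then
            echo "line $n: expected '<schedule> <user> <command>' (user field: '${user}')"
            rc=1
        fi
    done < "$file"
    return "$rc"
}

# Intended use on the server:
#   lint_cron_d /etc/cron.d/server-maintenance
```

A clean run does not prove the schedule is what you intended, so it is still worth checking cron's log after the first expected run.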

Phase 8: Final Verification Checklist

Before declaring a server production-ready, run through this verification sequence. This is the equivalent of a pre-flight checklist — every item must pass:

Figure: The seven control layers of a hardened production server, with their corresponding verification commands.

# Run all verification checks in sequence — each should return the expected value

echo "=== SSH config ==="
sshd -T | grep -E "^(permitrootlogin|passwordauthentication|port|allowusers) "

echo "=== Listening ports (external) ==="
ss -tulnp | grep -Ev '127\.0\.0\.[0-9]+|\[::1\]'

echo "=== Kernel sysctl (key values) ==="
sysctl net.ipv4.tcp_syncookies net.ipv4.conf.all.rp_filter kernel.randomize_va_space

echo "=== healthapi service ==="
systemctl is-active healthapi && curl -sf http://127.0.0.1:8080/health

echo "=== Journal disk usage ==="
journalctl --disk-usage

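echo "=== Cron jobs ==="
# Jobs in /etc/cron.d do not appear in `crontab -l`, so read the files directly
grep -Ev '^(#|$)' /etc/cron.d/disk-alert /etc/cron.d/server-maintenance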

echo "=== Firewall ==="
ufw status numbered

echo "=== All services (running) ==="
systemctl list-units --type=service --state=running --no-pager
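
Reading the output by eye works for one server but does not scale. A small wrapper can turn each check into an explicit PASS or FAIL and give the whole run a meaningful exit status. This is a minimal sketch; check and FAILED are hypothetical names introduced for the example.

```shell
#!/usr/bin/env bash
# Sketch: wrap each verification step so the run ends with a clear pass/fail.
# check and FAILED are hypothetical names for this example.
FAILED=0
check() {
    local desc="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS  $desc"
    else
        echo "FAIL  $desc"
        FAILED=$(( FAILED + 1 ))
    fi
}

# Intended use on the server, mirroring the checklist:
#   check "healthapi is active"  systemctl is-active --quiet healthapi
#   check "health endpoint"      curl -sf http://127.0.0.1:8080/health
#   check "syncookies enabled"   test "$(sysctl -n net.ipv4.tcp_syncookies)" = 1
#   exit "$FAILED"
```

Exiting with the failure count lets a provisioning pipeline refuse to mark the host production-ready when any item fails.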

Pro tip (infrastructure as code): Once you have built and verified this server manually, capture every configuration decision in Ansible playbooks or a Terraform + cloud-init template. The next server should take under ten minutes to reproduce identically. Organisations that run large fleets generally avoid hand-configured production servers: configuration is version-controlled, code-reviewed, and applied by automation. Manual configuration is technical debt that silently accumulates until someone is paged at 2 AM by a snowflake server that behaves differently from every other host in the fleet.
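
A first step toward automation is making each manual command safe to run twice. For example, the echo ... >> authorized_keys step in Phase 1 appends a duplicate key on every run. The sketch below shows one idempotent alternative; ensure_line is a hypothetical helper name, and configuration management tools provide the same behaviour out of the box.

```shell
#!/usr/bin/env bash
# Sketch: append a line to a file only if that exact line is not already present.
# ensure_line is a hypothetical helper for illustration.
ensure_line() {
    local file="$1" line="$2"
    touch "$file"
    # -x: match the whole line, -F: fixed string (no regex), -q: quiet
    if ! grep -qxF -- "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Intended use (the key below is the placeholder from Phase 1):
#   ensure_line /home/opsadmin/.ssh/authorized_keys \
#       "ssh-ed25519 AAAA...your-public-key... ops@workstation"
```

Running the script a second time leaves the file unchanged, which is the property that makes a setup script repeatable across servers.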

What You Have Built

At this point you have a server that follows a common production baseline:

Identity: No root SSH, key-only auth, dedicated unprivileged user, legal banner.
Network surface: Firewall default-deny; only port 2222 open; kernel hardened against common network attacks.
Service hygiene: Application runs as a least-privilege service account under systemd, with memory and CPU caps, automatic restart, and full journald logging.
Observability: Persistent journal with retention limits; disk alerts wired to cron every 15 minutes.
Self-maintenance: Unattended security updates, weekly package upgrades, temp-file cleanup, and automated service recovery all run on schedule — the server heals itself for common failure modes without on-call intervention.

Next steps in a real organisation: This project covers the OS-level baseline. In practice you would layer on: centralised log shipping (Loki, Splunk, or CloudWatch Logs), host-based intrusion detection (AIDE, Falco), external uptime monitoring (Prometheus Blackbox Exporter or a SaaS probe), secrets management (Vault), and configuration management automation (Ansible, Chef, or Puppet) so the entire setup is reproducible in code. Those topics are covered in later tutorials in this DevOps path.