Everything you have learned in this tutorial — systemd, journald, storage, monitoring, performance analysis, networking, SSH hardening, and scheduled tasks — converges here. This capstone project walks you through standing up a freshly provisioned Ubuntu 24.04 LTS server from "root login allowed, all defaults" to a production-grade, hardened, observable, self-maintaining system. Every step mirrors what senior SREs do on day one of a new server at a real company.
What this project covers: We will configure a non-root admin account with locked-down SSH, harden the kernel and service surface, set up a sample application service under systemd with resource limits, configure disk monitoring with automated alerts, set up persistent, size-capped local logging, and wire up a full cron-based maintenance schedule, then verify every layer. Centralised log shipping is left as a next step.
Phase 1: Initial Access and User Setup
You receive a cloud instance with only root access. The first action is creating a dedicated admin account with sudo rights, hardening SSH, and disabling root login entirely.
# On the fresh server as root:
# 1. Create an operations user
useradd -m -s /bin/bash -G sudo opsadmin
passwd opsadmin # Set a strong passphrase — only needed for sudo prompts
# 2. Lock the root password (sudo is the only escalation path)
passwd -l root
# 3. Copy your public key to the new account
mkdir -p /home/opsadmin/.ssh
chmod 700 /home/opsadmin/.ssh
# Paste the public key from your workstation:
# cat ~/.ssh/id_ed25519.pub (run this on your LOCAL machine first)
echo "ssh-ed25519 AAAA...your-public-key... ops@workstation" \
>> /home/opsadmin/.ssh/authorized_keys
chmod 600 /home/opsadmin/.ssh/authorized_keys
chown -R opsadmin:opsadmin /home/opsadmin/.ssh
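Before moving on, it is worth confirming that the key material has the modes and ownership the commands above intended, because sshd (with its default StrictModes setting) can ignore an authorized_keys file whose permissions are too loose. The following is a minimal sketch; check_ssh_perms is a hypothetical helper name, and it assumes GNU stat as shipped on Ubuntu.

```shell
# check_ssh_perms DIR UID
# Hypothetical helper: returns 0 only if DIR is mode 700, DIR/authorized_keys
# is mode 600, and both are owned by the numeric user id UID.
check_ssh_perms() {
    local dir="$1" uid="$2" rc=0
    # GNU stat: %a prints the octal mode, %u the numeric owner id
    [ "$(stat -c '%a' "$dir" 2>/dev/null)" = "700" ] \
        || { echo "unexpected mode on $dir" >&2; rc=1; }
    [ "$(stat -c '%a' "$dir/authorized_keys" 2>/dev/null)" = "600" ] \
        || { echo "unexpected mode on $dir/authorized_keys" >&2; rc=1; }
    [ "$(stat -c '%u' "$dir" 2>/dev/null)" = "$uid" ] \
        || { echo "unexpected owner on $dir" >&2; rc=1; }
    [ "$(stat -c '%u' "$dir/authorized_keys" 2>/dev/null)" = "$uid" ] \
        || { echo "unexpected owner on $dir/authorized_keys" >&2; rc=1; }
    return "$rc"
}

# Usage on the server (as root):
#   check_ssh_perms /home/opsadmin/.ssh "$(id -u opsadmin)" && echo "SSH key permissions OK"
```

The check is read-only, so it is safe to re-run at any time, for example as part of the Phase 8 verification.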
Phase 2: SSH Hardening
Edit /etc/ssh/sshd_config with the following block. These directives form a widely used production baseline; treat them as the default, not as optional paranoia:
# /etc/ssh/sshd_config — replace or append these directives
Port 2222 # Non-default port reduces automated scanner noise
# (No "Protocol 2" line: current OpenSSH speaks only protocol 2 and the directive is deprecated)
PermitRootLogin no # Never allow direct root SSH
PasswordAuthentication no # Key-only auth; passwords are brute-forceable
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys
MaxAuthTries 3 # Disconnect after 3 failed attempts on a single connection
LoginGraceTime 30 # Kill unauthenticated connections after 30 s
ClientAliveInterval 300 # Server sends keepalive every 5 min
ClientAliveCountMax 2 # Drop after 2 unanswered keepalives (about 10 min unresponsive)
AllowUsers opsadmin # Whitelist — only explicitly named users can log in
X11Forwarding no # No GUI tunnelling on a server
AllowTcpForwarding no # Prevent using SSH as a SOCKS proxy (unless needed)
Banner /etc/ssh/ssh_banner # Show a legal warning before authentication
# Create a legal banner (shown BEFORE authentication — a liability requirement at many companies)
cat > /etc/ssh/ssh_banner <<'EOF'
***********************************************************************
Authorised access only. All connections are monitored and logged.
Disconnect immediately if you are not an authorised user.
***********************************************************************
EOF
# Validate config before restarting (critical step — a typo here locks you out)
sshd -t
echo "Exit code: $?" # Must be 0
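# Open the non-default port BEFORE applying the change (fail-safe: keep your current session open)
ufw allow 2222/tcp
ufw enable
# Apply. Ubuntu 24.04 starts sshd through ssh.socket, whose listen port is generated
# from sshd_config, so reload systemd and restart the socket as well as the service
systemctl daemon-reload
systemctl restart ssh.socket
systemctl restart ssh
# From a SECOND terminal, confirm the new port works before closing this session:
# ssh -p 2222 opsadmin@SERVER_IP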
Production pitfall: always validate before restarting sshd. Running sshd -t parses the configuration without starting the daemon and exits non-zero on any error. If you run systemctl restart ssh without this check and the file contains a typo, sshd will refuse to start. Your current session usually survives, but no new connection can be made, and recovery then requires a cloud recovery console or serial access. This is a common self-inflicted outage among junior engineers.
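One way to make the validate-first habit mechanical is to edit a candidate copy of the file and install it only if it parses. The sketch below assumes a hypothetical helper name (install_if_valid) and a VALIDATOR variable that defaults to sshd -t -f, which asks sshd to test the given file instead of the live configuration.

```shell
# install_if_valid CANDIDATE TARGET
# Hypothetical helper: validate CANDIDATE, then back up TARGET and replace it.
# Nothing is modified when validation fails.
install_if_valid() {
    local candidate="$1" target="$2"
    # VALIDATOR is an assumption of this sketch; by default it runs "sshd -t -f FILE"
    if ! ${VALIDATOR:-sshd -t -f} "$candidate"; then
        echo "validation failed; $target left untouched" >&2
        return 1
    fi
    if [ -f "$target" ]; then
        cp -p "$target" "$target.bak"   # keep the last known-good copy
    fi
    cp "$candidate" "$target"
}

# Usage on the server (as root), after editing /root/sshd_config.new:
#   install_if_valid /root/sshd_config.new /etc/ssh/sshd_config
```

Keeping the previous file as a .bak copy gives a quick rollback path from a recovery console if the new settings turn out to be valid but wrong.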
Phase 3: Kernel and OS Hardening
Harden the kernel network stack and disable unused services. The sysctl settings belong in a drop-in file under /etc/sysctl.d/ and follow the CIS Ubuntu Linux Benchmark, a widely used hardening reference:
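The contents of the sysctl drop-in are not listed in this section. As a minimal sketch, the helper below writes only the three keys that the Phase 8 checklist inspects; the file name 99-hardening.conf and the function name write_sysctl_hardening are assumptions, and a full benchmark profile contains many more settings.

```shell
# write_sysctl_hardening [PATH]
# Hypothetical helper: writes a small sysctl drop-in. PATH defaults to an
# assumed file name under /etc/sysctl.d/.
write_sysctl_hardening() {
    local conf="${1:-/etc/sysctl.d/99-hardening.conf}"
    cat > "$conf" <<'EOF'
# Enable SYN cookies to blunt SYN-flood attacks
net.ipv4.tcp_syncookies = 1
# Strict reverse-path filtering (drop packets with spoofed source addresses)
net.ipv4.conf.all.rp_filter = 1
# Full address-space layout randomisation
kernel.randomize_va_space = 2
EOF
}

# Usage on the server (as root), before running "sysctl --system":
#   write_sysctl_hardening
```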
# Apply sysctl settings immediately (no reboot required)
sysctl --system
# Disable and mask unused services (attack surface reduction)
for svc in avahi-daemon cups bluetooth ModemManager; do
systemctl disable --now "$svc" 2>/dev/null
systemctl mask "$svc" 2>/dev/null
done
# Verify nothing unexpected is listening
ss -tulnp | grep -Ev '127\.0\.0\.[0-9]+|\[::1\]' # Show non-loopback listeners only (also hides 127.0.0.53, the local DNS stub)
Phase 4: Deploy and Manage a Service with systemd
A production service must run under a dedicated, unprivileged user, have hard resource limits, restart automatically on failure, and ship logs through journald. This example deploys a simple Python health-check API — the pattern applies to any daemon:
# Create an unprivileged system account (no login shell; home points at the app directory and is not auto-created)
useradd -r -s /usr/sbin/nologin -d /opt/healthapi healthapi
# Install a minimal app
mkdir -p /opt/healthapi
cat > /opt/healthapi/server.py <<'EOF'
from http.server import HTTPServer, BaseHTTPRequestHandler
import json, time
class Handler(BaseHTTPRequestHandler):
    def log_message(self, fmt, *args):
        print(fmt % args, flush=True) # journald captures stdout
    def do_GET(self):
        if self.path == '/health':
            self.send_response(200)
            self.send_header('Content-Type','application/json')
            self.end_headers()
            self.wfile.write(json.dumps({'status':'ok','ts':time.time()}).encode())
        else:
            self.send_response(404)
            self.end_headers()
HTTPServer(('127.0.0.1', 8080), Handler).serve_forever()
EOF
chown -R healthapi:healthapi /opt/healthapi
# /etc/systemd/system/healthapi.service
[Unit]
Description=Health Check API
Documentation=https://wiki.internal/healthapi
After=network.target
# Note: systemd unit files do not support trailing comments, so every comment sits on its own line
[Service]
Type=simple
User=healthapi
Group=healthapi
WorkingDirectory=/opt/healthapi
ExecStart=/usr/bin/python3 /opt/healthapi/server.py
# Restart if the service dies; systemd manages the lifecycle
Restart=on-failure
RestartSec=5s
# All stdout and stderr go to journald (searchable, rotated automatically)
StandardOutput=journal
StandardError=journal
# Resource limits (prevent a runaway process from killing the host)
# Max open file descriptors
LimitNOFILE=65536
# Kill the service if it exceeds 256 MB
MemoryMax=256M
# Never consume more than 20% of one CPU
CPUQuota=20%
# Security sandboxing
NoNewPrivileges=yes
# Service gets its own /tmp (not shared with the system)
PrivateTmp=yes
# The file system is read-only for this service, except the path listed below
ProtectSystem=strict
ReadWritePaths=/opt/healthapi
[Install]
WantedBy=multi-user.target
# Load, enable, start, verify
systemctl daemon-reload
systemctl enable --now healthapi
systemctl status healthapi
# Show the most recent service log lines (add -f to follow in real time; Ctrl-C to stop)
journalctl -u healthapi -n 20 --no-pager
# Hit the endpoint from localhost to confirm it is alive
curl -s http://127.0.0.1:8080/health
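With Type=simple, systemd reports the unit as started as soon as the process is launched, which can be slightly before the Python server has bound its port. A deployment script may therefore want to poll the endpoint instead of probing it once. The following is a minimal sketch; wait_for_health and the FETCH override are hypothetical names, and the match pattern relies on the {"status": "ok", ...} shape produced by json.dumps in server.py.

```shell
# wait_for_health URL [TRIES] [DELAY_SECONDS]
# Hypothetical helper: polls URL until the body reports status "ok".
# FETCH is an assumption of this sketch; it defaults to "curl -sf".
wait_for_health() {
    local url="$1" tries="${2:-10}" delay="${3:-1}" i=1
    while [ "$i" -le "$tries" ]; do
        if ${FETCH:-curl -sf} "$url" 2>/dev/null | grep -q '"status": *"ok"'; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Usage on the server:
#   wait_for_health http://127.0.0.1:8080/health 10 1 && echo "healthapi is up"
```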
Phase 5: Disk Monitoring and Automated Alerts
Disk-full incidents can cause failed writes, database crashes, and application hangs, often without a clear error message. Set up automated monitoring to catch the problem before it happens:
#!/usr/bin/env bash
# /usr/local/bin/disk-alert.sh (the shebang above must be the first line of the file)
# Send an alert (to a log file + journald) if any filesystem reaches THRESHOLD %
THRESHOLD=80
ALERT_LOG=/var/log/disk-alert.log
# Skip tmpfs and squashfs: snap images are read-only and always report 100%
df --output=target,pcent -x tmpfs -x devtmpfs -x squashfs | tail -n +2 | while read -r mount pct; do
    usage="${pct%\%}" # Strip the % sign
    [[ "$usage" =~ ^[0-9]+$ ]] || continue # Skip entries with no usage figure ("-")
    if [[ "$usage" -ge "$THRESHOLD" ]]; then
        msg="[DISK ALERT] $(date -Iseconds) ${mount} is at ${pct} (threshold: ${THRESHOLD}%)"
        echo "$msg" | tee -a "$ALERT_LOG"
        # In production: replace echo with curl to PagerDuty/Slack webhook, or sendmail
        # Example: curl -s -X POST "$SLACK_WEBHOOK" -d "{\"text\":\"$msg\"}"
        logger -t disk-alert -p user.crit "$msg" # Also sends to journald via syslog
    fi
done
chmod +x /usr/local/bin/disk-alert.sh
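# Wire it into cron (a systemd timer would work equally well; one scheduler is enough)
# Cron approach (runs every 15 minutes):
echo "*/15 * * * * root /usr/local/bin/disk-alert.sh" \
> /etc/cron.d/disk-alert
# Verify the file: cron reads /etc/cron.d directly, so "crontab -l" will not list it
cat /etc/cron.d/disk-alert
run-parts --list /etc/cron.d/ # Lists file names that follow the run-parts naming rules (no dots)
# After the next scheduled run, confirm cron executed it:
journalctl -u cron --since "-20min" --no-pager | grep disk-alert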
# Test manually right now:
/usr/local/bin/disk-alert.sh && echo "Script executed cleanly"
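The threshold logic is easier to test when it is separated from df itself. As a sketch of the same comparison, the hypothetical function over_threshold below reads "mount percent" pairs on standard input and prints only the entries at or above the limit; it assumes mount points without spaces.

```shell
# over_threshold THRESHOLD
# Hypothetical helper: reads "MOUNT PCENT" lines (as printed by
# "df --output=target,pcent") and prints those at or above THRESHOLD.
over_threshold() {
    local threshold="$1" mount pct usage
    while read -r mount pct; do
        usage="${pct%\%}"                  # strip the trailing % sign
        case "$usage" in
            ''|*[!0-9]*) continue ;;       # skip the header and "-" entries
        esac
        if [ "$usage" -ge "$threshold" ]; then
            printf '%s %s\n' "$mount" "$pct"
        fi
    done
}

# Usage on the server:
#   df --output=target,pcent -x tmpfs -x devtmpfs -x squashfs | over_threshold 80
```

Because the function only reads standard input, it can be exercised with canned input before it is trusted with paging anyone.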
Phase 6: Log Management and Audit Trail
Configure journald for persistent storage and sensible retention, then wire up logrotate for any plain-text logs your scripts write:
# /etc/systemd/journald.conf.d/99-production.conf
# (create the directory first: mkdir -p /etc/systemd/journald.conf.d)
# systemd configuration files do not support trailing comments, so each comment sits on its own line
[Journal]
# Survive reboots (the default "auto" persists only if /var/log/journal exists)
Storage=persistent
Compress=yes
# Journals never consume more than 2 GB on disk
SystemMaxUse=2G
# Always leave 500 MB free
SystemKeepFree=500M
# Auto-prune entries older than 90 days
MaxRetentionSec=90day
# Rotate individual journal files weekly
MaxFileSec=1week
# Do not double-write to rsyslog (saves I/O)
ForwardToSyslog=no
# Restart journald to apply
systemctl restart systemd-journald
# Inspect current journal disk usage
journalctl --disk-usage
# Manually vacuum to the configured limits (useful after lowering them):
journalctl --vacuum-size=2G
journalctl --vacuum-time=90d
# logrotate config for our custom alert log
cat > /etc/logrotate.d/disk-alert <<'EOF'
/var/log/disk-alert.log {
daily
rotate 30
compress
delaycompress
missingok
notifempty
create 640 root adm
}
EOF
# Test logrotate config (dry-run)
logrotate -d /etc/logrotate.d/disk-alert
Phase 7: Scheduled Maintenance Tasks
Every production server needs a set of recurring housekeeping jobs. This project schedules them with cron, which suits simple commands; systemd timers are the better fit for jobs that need dependency tracking or resource limits:
# /etc/cron.d/server-maintenance
# All times are UTC (servers must always run in UTC — local time zones are a source of bugs)
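# Security patches: unattended daily (only security updates, not all updates)
# Ubuntu's apt-daily-upgrade.timer normally runs this already; the entry below pins an explicit time
0 3 * * * root unattended-upgrades -d >> /var/log/unattended-upgrades/unattended-upgrades.log 2>&1
# Weekly full package update (review before applying on critical systems)
# The braces send the output of BOTH commands to the log; noninteractive avoids prompts that would hang under cron
0 4 * * 0 root { apt-get update -qq && DEBIAN_FRONTEND=noninteractive apt-get upgrade -y -q; } >> /var/log/apt-weekly.log 2>&1
# Disk check every 15 minutes: already scheduled in /etc/cron.d/disk-alert (Phase 5), so it is not repeated here
# Remove temp files older than 7 days (prevent /tmp from filling up)
0 2 * * * root find /tmp -type f -mtime +7 -delete 2>/dev/null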
# Verify the healthapi service is responding (lightweight uptime probe)
*/5 * * * * root curl -sf http://127.0.0.1:8080/health >/dev/null || systemctl restart healthapi
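Files under /etc/cron.d fail quietly in two common ways: a file name that does not follow the run-parts naming rules (letters, digits, underscores, and hyphens only) is skipped, and a job line that omits the user field is misread. The sketch below checks both; the function names are hypothetical, and the field count is a heuristic, not a full crontab parser.

```shell
# cron_d_name_ok NAME
# Hypothetical helper: succeeds if NAME uses only letters, digits, "_" and "-".
cron_d_name_ok() {
    case "$1" in
        ''|*[!A-Za-z0-9_-]*) return 1 ;;
        *) return 0 ;;
    esac
}

# cron_d_lines_ok FILE
# Hypothetical helper: prints job lines that look too short to contain
# five time fields, a user, and a command; returns non-zero if any are found.
cron_d_lines_ok() {
    awk '
        BEGIN { bad = 0 }
        /^[[:space:]]*(#|$)/ { next }                                # comments, blank lines
        /^[[:space:]]*[A-Za-z_][A-Za-z0-9_]*[[:space:]]*=/ { next }  # VAR=value lines
        $1 ~ /^@/ { if (NF < 3) { print; bad = 1 }; next }           # @daily user command
        NF < 7 { print; bad = 1 }                                    # m h dom mon dow user command
        END { exit bad }
    ' "$1"
}

# Usage on the server:
#   cron_d_name_ok server-maintenance && cron_d_lines_ok /etc/cron.d/server-maintenance
```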
Phase 8: Final Verification Checklist
Before declaring a server production-ready, run through this verification sequence. This is the equivalent of a pre-flight checklist — every item must pass:
Figure: The seven control layers of a hardened production server, with their corresponding verification commands.
# Run all verification checks in sequence — each should return the expected value
echo "=== SSH config ==="
sshd -T | grep -E "^(permitrootlogin|passwordauthentication|port|allowusers) "
echo "=== Listening ports (external) ==="
ss -tulnp | grep -Ev '127\.0\.0\.[0-9]+|\[::1\]'
echo "=== Kernel sysctl (key values) ==="
sysctl net.ipv4.tcp_syncookies net.ipv4.conf.all.rp_filter kernel.randomize_va_space
echo "=== healthapi service ==="
systemctl is-active healthapi && curl -sf http://127.0.0.1:8080/health
echo "=== Journal disk usage ==="
journalctl --disk-usage
echo "=== Cron jobs ==="
cat /etc/cron.d/disk-alert /etc/cron.d/server-maintenance # "crontab -l" does not list /etc/cron.d files
echo "=== Firewall ==="
ufw status numbered
echo "=== All services (running) ==="
systemctl list-units --type=service --state=running --no-pager
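The commands above print values for a human to read. To get a single pass/fail result, for example from a provisioning script, each check can be wrapped so that failures are counted. This is a minimal sketch; the check function and the FAILS counter are hypothetical names, and the commented checks are illustrative examples, not a complete list.

```shell
# Number of failed checks so far
FAILS=0

# check DESCRIPTION COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND quietly and records whether it succeeded.
check() {
    local desc="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS  $desc"
    else
        echo "FAIL  $desc"
        FAILS=$((FAILS + 1))
    fi
}

# Usage on the server (as root):
#   check "sshd config parses"   sshd -t
#   check "healthapi is active"  systemctl is-active --quiet healthapi
#   check "health endpoint"      curl -sf http://127.0.0.1:8080/health
#   [ "$FAILS" -eq 0 ] && echo "All checks passed"
```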
Pro tip: infrastructure as code. Once you have built and verified this server manually, capture every configuration decision in Ansible playbooks or a Terraform + cloud-init template. The next server should take under ten minutes to reproduce identically. Large engineering organisations generally avoid configuring production servers by hand: everything is version-controlled, code-reviewed, and applied by automation. Manual configuration is technical debt that silently accumulates until someone is paged at 2 AM by a snowflake server that behaves differently from every other host in the fleet.
What You Have Built
At this point you have a server that follows a common production baseline:
Network surface: Firewall default-deny; only port 2222 open; kernel hardened against common network attacks.
Service hygiene: Application runs as a least-privilege service account under systemd, with memory and CPU caps, automatic restart, and full journald logging.
Observability: Persistent journal with retention limits; disk alerts wired to cron every 15 minutes.
Self-maintenance: Unattended security updates, weekly package upgrades, temp-file cleanup, and automated service recovery all run on schedule — the server heals itself for common failure modes without on-call intervention.
Next steps in a real organisation: This project covers the OS-level baseline. In practice you would layer on: centralised log shipping (Loki, Splunk, or CloudWatch Logs), host-based intrusion detection (AIDE, Falco), external uptime monitoring (Prometheus Blackbox Exporter or a SaaS probe), secrets management (Vault), and configuration management automation (Ansible, Chef, or Puppet) so the entire setup is reproducible in code. Those topics are covered in later tutorials in this DevOps path.