Cloud Fundamentals: AWS Core Services

CloudWatch Essentials

18 min Lesson 8 of 30

Amazon CloudWatch is the native observability service for AWS. It is not simply a place to store logs. It is the control loop that keeps production systems healthy: metrics flow in from every EC2 instance, RDS cluster, and Lambda function; alarms react when numbers cross thresholds; Logs Insights queries drill into raw log streams in seconds; and dashboards give your team a shared situational-awareness board. At big-tech scale, misconfigured CloudWatch is one of the most common root causes of incident blind spots. This lesson makes you dangerous with all four pillars: metrics, alarms, logs, and dashboards.

Metrics: The Foundation

Most AWS services publish metrics to CloudWatch automatically. A metric is identified by its namespace (e.g., AWS/EC2), a metric name (e.g., CPUUtilization), and zero or more dimensions (key-value pairs that scope the metric, such as InstanceId=i-0abc1234). AWS service metrics typically arrive at 1-minute or 5-minute intervals (EC2 basic monitoring reports every 5 minutes, detailed monitoring every minute). Custom metrics can be standard resolution (60 seconds) or high resolution (down to 1 second).
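To see how namespace, metric name, and dimensions combine to identify a metric, the following sketch wraps the list-metrics call in a small shell function. The function name list_metric and its argument order are illustrative assumptions, not part of the AWS CLI, and the instance ID in the example is the placeholder from the paragraph above.

```shell
# list_metric NAMESPACE METRIC_NAME DIMENSION_NAME DIMENSION_VALUE
# Lists the metrics CloudWatch has recorded for one namespace/name/dimension
# combination. list_metric is a hypothetical helper, not an AWS CLI command.
list_metric() {
  aws cloudwatch list-metrics \
    --namespace "$1" \
    --metric-name "$2" \
    --dimensions "Name=$3,Value=$4" \
    --query 'Metrics[].{Namespace:Namespace,Name:MetricName,Dimensions:Dimensions}' \
    --output json
}

# Example (requires AWS credentials and a real instance ID):
# list_metric AWS/EC2 CPUUtilization InstanceId i-0abc1234
```

An empty result usually means one part of the identity is wrong: a different dimension set is a different metric, even when the namespace and name match.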

The default EC2 metrics (CPU, network bytes, disk read/write ops) are collected by the hypervisor and are free at basic-monitoring granularity. They do not include memory utilization or disk fill percentage, because the hypervisor cannot see inside the guest OS. To get those, you must install the CloudWatch Agent and push custom metrics.

Teams that run EC2 at scale typically bake the CloudWatch Agent into the base AMI, or install it at boot via a UserData script or an SSM State Manager association. Treat memory and disk metrics as non-negotiable: skip them and you will eventually get paged for a full disk on a weekend.

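# Install and configure the CloudWatch Agent on Amazon Linux 2 / AL2023
# (the instance role needs permission to publish metrics, e.g. the
# CloudWatchAgentServerPolicy managed policy)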
sudo yum install -y amazon-cloudwatch-agent

# Write a minimal agent config (memory + disk metrics every 60 s)
sudo tee /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json <<'AGENTEOF'
{
  "metrics": {
    "namespace": "CWAgent",
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"],
        "metrics_collection_interval": 60
      },
      "disk": {
        "measurement": ["disk_used_percent"],
        "resources": ["/"],
        "metrics_collection_interval": 60
      }
    }
  }
}
AGENTEOF

# Start the agent (picks up the config above)
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config \
  -m ec2 \
  -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json \
  -s

# Verify the agent is running
sudo systemctl status amazon-cloudwatch-agent

You can also publish arbitrary application metrics (request latency, queue depth, business KPIs) using the AWS CLI or SDK. The put-metric-data call is the basic primitive for publishing custom metrics.

# Push a custom metric: order queue depth = 42
aws cloudwatch put-metric-data \
  --namespace "MyApp/Orders" \
  --metric-name QueueDepth \
  --value 42 \
  --unit Count \
  --dimensions Environment=production

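# Pull the last 5 minutes of that metric to verify ingestion
# (read APIs take dimensions as Name=<key>,Value=<value>, unlike the
# Key=Value shorthand of put-metric-data; new datapoints can take a minute
# or two to appear; date -d requires GNU date)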
aws cloudwatch get-metric-statistics \
  --namespace "MyApp/Orders" \
  --metric-name QueueDepth \
  --dimensions Name=Environment,Value=production \
  --start-time $(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time   $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Average Maximum

Alarms: Automated Reaction

An alarm watches a single metric (or a math expression over multiple metrics) and transitions between three states: OK, ALARM, and INSUFFICIENT_DATA. When an alarm enters ALARM state it can trigger an SNS notification, an Auto Scaling policy, an EC2 action (reboot, stop, terminate), or a Systems Manager OpsItem.

The alarm threshold is evaluated over a period (the aggregation window, minimum 10 s for high-resolution metrics, typically 60 s) and a datapoints-to-alarm / evaluation-periods pair. Setting datapoints-to-alarm=3 out of evaluation-periods=3 means the metric must exceed the threshold for three consecutive periods before the alarm fires — this eliminates single-spike false positives, a critical production pattern.
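The M-out-of-N rule described above can be sketched as a small shell function. This is a simplified model for building intuition, not CloudWatch's actual evaluator: it assumes integer datapoints, a strict greater-than comparison, and no missing data. The name evaluate_alarm is hypothetical.

```shell
# evaluate_alarm THRESHOLD M N DATAPOINT...
# Prints ALARM when at least M of the N most recent datapoints are strictly
# greater than THRESHOLD, otherwise OK. Simplified: integers only, and
# missing data is not modelled.
evaluate_alarm() {
  threshold=$1; m=$2; n=$3
  shift 3
  # Drop older datapoints until only the N most recent remain.
  while [ "$#" -gt "$n" ]; do shift; done
  breaching=0
  for point in "$@"; do
    if [ "$point" -gt "$threshold" ]; then
      breaching=$((breaching + 1))
    fi
  done
  if [ "$breaching" -ge "$m" ]; then echo ALARM; else echo OK; fi
}

evaluate_alarm 80 3 3 70 95 72 71   # one spike in the last 3 periods: OK
evaluate_alarm 80 3 3 70 85 90 95   # three breaching periods in a row: ALARM
evaluate_alarm 80 2 3 70 85 60 95   # 2 of the last 3 periods breach: ALARM
```

Setting M equal to N requires consecutive breaches, while M smaller than N tolerates a dip in the middle of a sustained problem.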

Figure: A CloudWatch alarm transitions between the OK, ALARM, and INSUFFICIENT_DATA states, triggering actions (SNS, Auto Scaling, EC2) on entry to ALARM.

# Alarm: average CPU above 80% for 3 consecutive 60-second periods -> SNS alert
# (60-second periods assume detailed monitoring is enabled on the instances)
aws cloudwatch put-metric-alarm \
  --alarm-name "web-cpu-high" \
  --alarm-description "EC2 CPU above 80% for 3 minutes" \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=AutoScalingGroupName,Value=web-asg \
  --statistic Average \
  --period 60 \
  --evaluation-periods 3 \
  --datapoints-to-alarm 3 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data breaching \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
  --ok-actions     arn:aws:sns:us-east-1:123456789012:ops-alerts

# Check alarm state
aws cloudwatch describe-alarms \
  --alarm-names "web-cpu-high" \
  --query "MetricAlarms[*].{Name:AlarmName,State:StateValue,Reason:StateReason}"

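treat-missing-data=breaching is the safe production choice for most alarms. It is not the API default: if you omit the flag, CloudWatch uses missing, which moves the alarm to INSUFFICIENT_DATA when the whole evaluation window is empty. If your metric source goes away (agent crash, instance terminated), you want the alarm to fire, not silently sit in INSUFFICIENT_DATA while your on-call misses the outage. Use notBreaching only for intentionally sparse metrics, such as scheduled jobs or counters that emit only when something happens.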
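The effect of each treat-missing-data value is easiest to compare in the extreme case where every datapoint in the evaluation window is missing. The sketch below covers only that case; partially missing windows follow more detailed rules that are not modelled here. The function name state_when_all_missing is hypothetical.

```shell
# state_when_all_missing MODE CURRENT_STATE
# Prints the state an alarm ends up in when the entire evaluation window has
# no data. Covers only the all-missing case.
state_when_all_missing() {
  mode=$1; current=$2
  case "$mode" in
    breaching)    echo ALARM ;;              # missing points count as bad
    notBreaching) echo OK ;;                 # missing points count as good
    ignore)       echo "$current" ;;         # current state is kept
    missing)      echo INSUFFICIENT_DATA ;;  # the API default
    *) echo "unknown treat-missing-data value: $mode" >&2; return 1 ;;
  esac
}

state_when_all_missing breaching OK      # prints ALARM
state_when_all_missing missing OK        # prints INSUFFICIENT_DATA
state_when_all_missing ignore ALARM      # prints ALARM
```

With breaching, a dead agent pages someone. With missing, the alarm leaves OK but pages no one unless an action is attached to the INSUFFICIENT_DATA state.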

Logs: CloudWatch Logs and Logs Insights

Application logs are shipped to CloudWatch Logs via the CloudWatch Agent, the AWS SDK, or native service integrations (Lambda, ECS, EKS via Fluent Bit). The hierarchy is: Log Group (one per application or service) → Log Stream (one per instance or container) → individual log events.

Set a retention policy on every log group. The default is indefinite retention, which is expensive at scale. 30 days covers most incident lookbacks; if compliance or audit rules require a longer window, archive to S3.

# Create a log group with a 30-day retention policy
aws logs create-log-group \
  --log-group-name /app/web/access

aws logs put-retention-policy \
  --log-group-name /app/web/access \
  --retention-in-days 30

# Follow a log group, starting from the last 5 minutes (useful during incidents; requires AWS CLI v2)
aws logs tail /app/web/access --since 5m --follow

# Logs Insights query: top 10 error messages in the last hour
# (start-query is asynchronous and returns a query ID)
QUERY_ID=$(aws logs start-query \
  --log-group-name /app/web/access \
  --start-time "$(date -d '1 hour ago' +%s)" \
  --end-time   "$(date +%s)" \
  --query-string \
    'fields @timestamp, @message
     | filter @message like /ERROR/
     | stats count(*) as errorCount by @message
     | sort errorCount desc
     | limit 10' \
  --query queryId --output text)

# Then retrieve results by query ID (repeat until the status is Complete):
aws logs get-query-results --query-id "$QUERY_ID"

Logs Insights is the ad-hoc query engine over log groups. Its syntax is a pipeline of commands that loosely resembles SQL: fields selects columns, filter is the WHERE clause, stats aggregates, sort orders, and limit caps results. The service executes queries in parallel, and they typically return in seconds to minutes depending on how much data they scan, which is far faster than grep over SSH. Narrow the time range where you can, because queries are billed by the amount of data scanned.

You can also create a Metric Filter to turn a pattern in a log stream into a metric, and then alarm on that metric. This is how you build error-rate alarms from application logs without any SDK changes.
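A minimal sketch of such a metric filter, wrapped in a shell function so that nothing runs until you call it. The filter name, metric name, and namespace are hypothetical, and the pattern assumes your error lines contain the literal term ERROR.

```shell
# create_error_metric_filter LOG_GROUP
# Publishes a count of log events containing the term ERROR as a custom
# metric. Filter name, metric name, and namespace are illustrative.
create_error_metric_filter() {
  aws logs put-metric-filter \
    --log-group-name "$1" \
    --filter-name "error-count" \
    --filter-pattern "ERROR" \
    --metric-transformations \
      metricName=ErrorCount,metricNamespace=MyApp/Web,metricValue=1,defaultValue=0
}

# Example (requires AWS credentials):
# create_error_metric_filter /app/web/access
```

Setting defaultValue=0 makes the filter report zero while logs arrive without a match, so an alarm on MyApp/Web ErrorCount sees regular datapoints instead of a sparse metric.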

Structured logging is a force multiplier. If your application emits JSON log lines (e.g., {"level":"error","status":500,"path":"/api/orders","duration_ms":1234}), Logs Insights discovers the JSON fields automatically, so you can reference them by name (for example, filter status = 500) instead of pattern-matching @message with a regular expression. Structured logs make the difference between a 5-minute Logs Insights query and a 2-hour grep session during a P1 incident.
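As an illustration, here is a query that uses the discovered JSON fields directly. It assumes log lines shaped like the sample above (level, status, path, duration_ms). The variable and function names are hypothetical.

```shell
# Query over JSON fields; field names are taken from the sample log line.
SLOW_ERRORS_QUERY='fields @timestamp, path, status, duration_ms
| filter level = "error" and status >= 500
| stats count(*) as errorCount, avg(duration_ms) as avgDurationMs by path
| sort errorCount desc
| limit 10'

# start_slow_errors_query LOG_GROUP START_EPOCH END_EPOCH
# Starts the query and prints its query ID (results are fetched separately
# with get-query-results).
start_slow_errors_query() {
  aws logs start-query \
    --log-group-name "$1" \
    --start-time "$2" \
    --end-time "$3" \
    --query-string "$SLOW_ERRORS_QUERY" \
    --query queryId \
    --output text
}

# Example (requires AWS credentials and GNU date):
# start_slow_errors_query /app/web/access "$(date -d '1 hour ago' +%s)" "$(date +%s)"
```

Grouping by path shows which endpoints are failing and how slow they are in one pass, which is hard to get from unstructured text.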

Dashboards: Shared Situational Awareness

A CloudWatch dashboard is a collection of widgets pinned to a shareable URL. Each widget displays a metric graph, an alarm status, a log query result, or a text note. Dashboards are defined as JSON — which means they can be version-controlled and deployed via Infrastructure as Code (CloudFormation or Terraform).

A production service should have at minimum a "golden signals" dashboard: latency, error rate, traffic (RPS), and saturation (CPU/memory). These four signals cover the vast majority of production failures and were formalized in Google's SRE book as the canonical starting point for service-level monitoring.

Figure: The CloudWatch observability stack. Sources push metrics and logs in; alarms, dashboards, and Logs Insights queries provide reactive and exploratory outputs.

# Create a dashboard from JSON (golden-signals template)
aws cloudwatch put-dashboard \
  --dashboard-name "WebService-GoldenSignals" \
  --dashboard-body '{
    "widgets": [
      {
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
          "title": "Request Latency (p99)",
          "metrics": [
            ["MyApp/API","Latency","Service","orders",{"stat":"p99","label":"p99"}]
          ],
          "region": "us-east-1",
          "period": 60,
          "view": "timeSeries"
        }
      },
      {
        "type": "metric",
        "x": 12, "y": 0, "width": 12, "height": 6,
        "properties": {
          "title": "5xx Error Count",
          "metrics": [
            ["AWS/ApplicationELB","HTTPCode_Target_5XX_Count",
             "LoadBalancer","app/web-alb/abc123",{"stat":"Sum"}]
          ],
          "region": "us-east-1",
          "period": 60,
          "view": "timeSeries"
        }
      }
    ]
  }'

Production Checklist

Install the CloudWatch Agent on all EC2 instances — memory and disk metrics are essential.
Set a retention policy on every log group. Logs without a retention policy mean runaway cost.
Use datapoints-to-alarm equal to evaluation-periods for most alarms to avoid false positives from transient spikes.
Set treat-missing-data=breaching on latency, error-rate, and heartbeat alarms.
Emit structured JSON logs from your application so Logs Insights can parse fields natively.
Build a golden-signals dashboard (latency, errors, traffic, saturation) before your service goes to production, not after the first incident.
Version-control dashboard JSON in your IaC repository alongside the infrastructure it monitors.

CloudWatch metrics are retained at 1-second resolution for 3 hours, 1-minute for 15 days, 5-minute for 63 days, and 1-hour for 15 months. Design your alarm periods accordingly — a 1-minute period alarm will still have data for two weeks, giving you the incident replay window you need during post-mortems.
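The retention tiers above can be turned into a small lookup that tells you the finest period still available for data of a given age, which is useful when choosing a period value for a post-mortem query. This is a sketch: the function name is hypothetical, the 15-month tier is approximated as 455 days, and the 1-second tier applies only to high-resolution custom metrics.

```shell
# finest_period_seconds AGE_HOURS
# Prints the smallest period (in seconds) at which datapoints of the given
# age are still retained, or "expired" if they are past the last tier.
finest_period_seconds() {
  age_hours=$1
  if   [ "$age_hours" -le 3 ];     then echo 1      # sub-minute data: 3 hours
  elif [ "$age_hours" -le 360 ];   then echo 60     # 1-minute data: 15 days
  elif [ "$age_hours" -le 1512 ];  then echo 300    # 5-minute data: 63 days
  elif [ "$age_hours" -le 10920 ]; then echo 3600   # 1-hour data: ~15 months (assumed 455 days)
  else echo expired
  fi
}

finest_period_seconds 48     # two days old: prints 60
finest_period_seconds 720    # thirty days old: prints 300
```

For an incident from last month, asking for 60-second datapoints returns nothing; the lookup says to request a period of 300 instead.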