Monitoring vs Observability
You have spent earlier tutorials building systems that run: containerized services on Kubernetes, infrastructure provisioned by Terraform, pipelines that promote code from commit to production. Now comes the question that every senior engineer eventually confronts: how do you actually know what those systems are doing — not just right now, but when something goes wrong at 2 AM in a way you have never seen before?
The answer depends on a distinction that drives everything in this tutorial. Monitoring and observability are related but fundamentally different ideas, and conflating them is one of the most expensive mistakes you can make in a large-scale system.
Monitoring: Known-Unknowns
Monitoring is the practice of collecting and alerting on a predefined set of signals that you already believe to be important. You decide in advance: "I care about CPU utilization, HTTP 5xx rate, queue depth, and p99 latency." You set thresholds. When a threshold is crossed, you get paged.
Monitoring answers questions you already know to ask. The mental model is a checklist: is CPU okay? Is error rate okay? Is disk okay? If every item on the checklist passes, you declare the system healthy. If one item fails, you know which dashboard to open.
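The checklist model can be sketched in a few lines of Python. This is a minimal illustration, not a real alerting system: the signal names, the threshold values, and the `failed_checks` helper are all assumptions made for the example.

```python
# Hypothetical thresholds; the numbers are illustrative, not recommendations.
THRESHOLDS = {
    "cpu_utilization_pct": 85.0,
    "http_5xx_rate_pct": 1.0,
    "queue_depth": 10_000.0,
    "p99_latency_ms": 500.0,
}


def failed_checks(snapshot: dict[str, float]) -> list[str]:
    """Return the names of signals whose current value exceeds its threshold."""
    return [name for name, limit in THRESHOLDS.items() if snapshot[name] > limit]


# Two made-up snapshots of the same system at different moments.
healthy = {
    "cpu_utilization_pct": 42.0,
    "http_5xx_rate_pct": 0.05,
    "queue_depth": 120.0,
    "p99_latency_ms": 340.0,
}
degraded = {**healthy, "p99_latency_ms": 910.0}

print(failed_checks(healthy))   # []
print(failed_checks(degraded))  # ['p99_latency_ms']
```

Every question this code can answer was written down before the incident. A signal that is missing from `THRESHOLDS` can never fail a check, however badly the system behaves.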
Monitoring is excellent at detecting known failure modes: the failures you have seen before and the ones you anticipated when you designed the system. Examples include an exhausted database connection pool, a memory leak that shows up as a steady RSS climb, and a downstream service that times out and causes latency spikes. These are your known-unknowns: you do not know when they will happen, but you know that they can happen, so you instrumented for them.
The Observability Gap: Unknown-Unknowns
Here is the hard truth: in a sufficiently complex distributed system, the failures that really hurt you are the ones you did not anticipate. A rare combination of input parameters that exposes a code path never hit in load testing. A network partition between two specific availability zones that only occurs under a particular traffic pattern. A third-party API that begins returning stale data silently, causing your recommendation engine to degrade without any error rate spike. These are unknown-unknowns — you did not know the failure mode existed, so you did not instrument for it, and your monitoring tells you nothing useful.
Observability is the property of a system that lets you understand its internal state by examining its external outputs, without needing to know in advance what questions you will ask. The term comes from control theory, where a system is "observable" if its internal state can be inferred from its outputs over time. Applied to software engineering, it means your system emits enough data (metrics, logs, traces) that you can reason backward from an unexpected symptom to its root cause, even for failure modes you have never seen before.
The critical difference: with monitoring you get an alert and then look at predefined dashboards. With observability you get an alert and then ask new questions of your telemetry data to explore what is actually happening. You slice by user ID, by region, by version, by request path — whatever the data leads you to — until you narrow the causal chain.
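Slicing telemetry by whatever dimension looks interesting amounts to a group-by over raw events. The sketch below assumes events are kept as dictionaries with arbitrary fields. The field names, the sample data, and the `failure_rate_by` helper are hypothetical; a real backend would run the equivalent query over stored telemetry.

```python
from collections import defaultdict

# Synthetic request events; field names and values are made up for illustration.
EVENTS = [
    {"region": "us-east", "version": "v41", "path": "/cart", "ok": True},
    {"region": "us-east", "version": "v42", "path": "/checkout", "ok": True},
    {"region": "eu-west", "version": "v42", "path": "/checkout", "ok": False},
    {"region": "eu-west", "version": "v42", "path": "/checkout", "ok": False},
    {"region": "eu-west", "version": "v41", "path": "/checkout", "ok": True},
    {"region": "us-east", "version": "v42", "path": "/cart", "ok": True},
]


def failure_rate_by(events, *fields):
    """Group events by the given fields and return {group_key: failure_rate}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for event in events:
        key = tuple(event[field] for field in fields)
        totals[key] += 1
        if not event["ok"]:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}


# First question: is one region worse than the others?
by_region = failure_rate_by(EVENTS, "region")
# Follow-up question, chosen only after seeing the first answer.
by_region_version = failure_rate_by(EVENTS, "region", "version")

print(by_region)
print(by_region_version)
```

The second query is the point: nobody defined a "failure rate by region and version" dashboard in advance, but the raw events carry enough fields to ask for it on the spot.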
A Concrete Production Example
Consider a checkout service at an e-commerce company. Your monitoring tells you: error rate is 0.05%, p99 latency is 340ms, CPU is at 42%. All green. But revenue is down 18% for the past six hours. This is a classic observability gap — no monitored metric crossed a threshold, yet the system is deeply broken.
With a fully observable system you can ask: which segment of users is failing? Break it down by country, and Brazil shows a checkout completion rate of 11% versus the usual 89%. Drill into traces for Brazilian requests, and you see the payment gateway call hitting a 28-second timeout whose exception is caught by a try/catch that returns a fake success. The exception is logged, but nobody built a monitor on the error count for that specific payment provider, so the failure never reaches the error-rate metric.
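The swallowed failure described above might look roughly like the first function below, and a more observable variant like the second. This is a sketch under stated assumptions: `GatewayTimeout`, `charge`, the provider name, and the event fields are all hypothetical stand-ins, not code from any real checkout service.

```python
import json
import logging

logger = logging.getLogger("checkout")


class GatewayTimeout(Exception):
    """Hypothetical error raised when the payment provider does not answer."""


def charge(country: str) -> None:
    # Stand-in for the real payment gateway call; fails for one segment only.
    if country == "BR":
        raise GatewayTimeout("no response from provider")


def checkout_swallowing(country: str) -> dict:
    """Anti-pattern: the failure is logged, then a success is reported anyway."""
    try:
        charge(country)
    except GatewayTimeout:
        logger.exception("payment failed")  # visible only to someone reading logs
    return {"status": "success"}  # error-rate metrics never see the failure


def checkout_observable(country: str, provider: str, emit=print) -> dict:
    """Report the true outcome and emit one event with queryable fields."""
    event = {"name": "checkout", "country": country, "provider": provider}
    try:
        charge(country)
        event["outcome"] = "success"
    except GatewayTimeout as exc:
        event["outcome"] = "payment_timeout"
        event["error"] = str(exc)
    emit(json.dumps(event))
    return {"status": event["outcome"]}
```

Calling `checkout_swallowing("BR")` returns a success status even though the charge failed, which is why the dashboards stay green. The second version attaches `country` and `provider` to every event, so a breakdown by either field can surface the affected segment.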
You found this in minutes because the data was there and you could query it freely. Without that data, you would have spent hours looking at the green dashboards wondering why revenue was down.
Why the Distinction Matters Now
In a monolith with ten servers, monitoring was often sufficient. You had a small number of components, well-understood failure modes, and engineers who knew every code path. In a microservices system on Kubernetes with 200 services, thousands of pods, and hundreds of deploys per day, the number of possible failure states is orders of magnitude larger, and no fixed set of checks can enumerate them in advance. Monitoring alone cannot keep up.
Two trends have made this non-optional at production scale:
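- Cardinality explosion: Modern systems emit high-cardinality data such as request IDs, user IDs, trace IDs, feature flags, and experiment variants. Traditional monitoring tools were not designed to query across these dimensions freely: check-based systems such as Nagios report status per host or service, and label-based time-series databases such as Prometheus store one series per unique label combination, so unbounded labels like user ID quickly become expensive. Purpose-built observability backends are designed for exactly this kind of query.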
- The shift-left of production ownership: As "you build it, you run it" becomes the norm, product engineers own their own on-call. They are not domain experts in every failure mode — they need tools that let them investigate freely, not just check dashboards built by someone else months ago.
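The cardinality problem from the first bullet is simple arithmetic: for a single metric, the number of distinct time series is bounded by the product of the number of values each label can take. The label names and counts below are hypothetical, chosen only to show the scale.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on distinct time series for one metric: the product of
    the number of values each label can take."""
    return prod(label_cardinalities.values())


# Hypothetical label sets; the counts are illustrative.
low = {"service": 200, "status_class": 5, "region": 4}
high = {**low, "user_id": 1_000_000}

print(series_count(low))   # 4000
print(series_count(high))  # 4000000000
```

Adding one unbounded label multiplies the series count by its cardinality. That is why a store that keeps one series per label combination handles the first case easily and struggles with the second.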
The Three Pillars (Preview)
Observability is typically implemented through three complementary signal types, which the next lesson covers in depth:
- Metrics — aggregated numeric measurements over time (request rate, error count, latency percentiles). Efficient to store, great for alerting, limited cardinality.
- Logs — structured event records emitted at runtime. High detail, arbitrary fields, expensive at scale without careful management.
- Traces — records of a single request's journey across multiple services. Essential for latency attribution and dependency mapping in distributed systems.
None of these alone is sufficient. A mature observability practice uses all three and correlates them — you get an alert from a metric, jump to the related traces, and inspect the log lines from the spans that look anomalous. The tooling that makes this correlation seamless (Honeycomb, Grafana + Tempo + Loki, Datadog APM) is what separates a production observability stack from a collection of dashboards.
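Correlation works because the signals share an identifier. As a minimal sketch, assume every span and log line carries the same trace ID. The in-memory lists, function names, service names, and durations below are hypothetical; production stacks typically propagate the ID through instrumentation libraries rather than passing it by hand.

```python
import uuid

# In-memory stand-ins for a trace store and a log store.
SPANS: list[dict] = []
LOGS: list[dict] = []


def record_span(trace_id: str, service: str, duration_ms: float) -> None:
    SPANS.append({"trace_id": trace_id, "service": service, "duration_ms": duration_ms})


def log_event(trace_id: str, service: str, message: str, **fields) -> None:
    LOGS.append({"trace_id": trace_id, "service": service, "message": message, **fields})


def slowest_span(trace_id: str) -> dict:
    """Return the span with the longest duration within one trace."""
    return max(
        (span for span in SPANS if span["trace_id"] == trace_id),
        key=lambda span: span["duration_ms"],
    )


def logs_for(trace_id: str, service: str) -> list[dict]:
    """Return the log lines one service emitted while handling one trace."""
    return [e for e in LOGS if e["trace_id"] == trace_id and e["service"] == service]


# One request flowing through three services (made-up names and timings).
trace_id = uuid.uuid4().hex
record_span(trace_id, "api-gateway", 12.0)
record_span(trace_id, "checkout", 35.0)
record_span(trace_id, "payments", 2800.0)
log_event(trace_id, "checkout", "order validated")
log_event(trace_id, "payments", "provider call timed out", provider="example-pay")

suspect = slowest_span(trace_id)                  # trace view: where did the time go?
related = logs_for(trace_id, suspect["service"])  # log view: what did that hop say?
print(suspect["service"], related)
```

The join key is the whole trick. Without a shared `trace_id`, the jump from a slow span to its log lines falls back to guessing by timestamp.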
Shifting the Mental Model
The practical takeaway is a shift in how you think about your systems and your relationship to their failure modes. Monitoring asks: is this component within expected parameters? Observability asks: what is this system actually doing, and why? Monitoring is a set of assertions. Observability is a capability for inquiry.
Building toward the latter requires making deliberate decisions at every layer of your stack — how your applications emit data, how your infrastructure is instrumented, what tooling you use to store and query telemetry, and how your team develops the practice of exploratory debugging against live data. The remaining nine lessons in this tutorial build each of those layers systematically.