The OpenTelemetry Collector
Every service you instrument eventually needs to get its telemetry somewhere useful — a tracing backend, a metrics store, a log aggregation system. You could wire each SDK directly to each backend, but that approach collapses under operational reality: credentials scattered across every pod, no way to enrich or filter data before it leaves the application, and a re-deploy every time you change backends. The OpenTelemetry Collector solves all three problems at once. It is a vendor-neutral, production-grade telemetry pipeline that receives signals from your applications, transforms them, and routes them to one or more backends — all without touching application code.
At Google-scale organisations, the Collector is not optional. It is the central nervous system of the observability stack: the single plane where data governance (sampling, PII scrubbing, cost control) is enforced before anything reaches a paid backend.
Architecture: Receivers, Processors, Exporters
The Collector is a composable pipeline. Every pipeline has three stages in order:
- Receivers: ingest telemetry from sources. The otlp receiver accepts OTLP over gRPC (port 4317) and HTTP (port 4318). Other receivers pull from Prometheus endpoints, Jaeger, Zipkin, Kafka, Fluent Bit, and more. A Collector instance can run many receivers simultaneously.
- Processors: transform, filter, batch, and enrich data in flight. Processors are the operational muscle of the pipeline: they drop spans you do not need, add Kubernetes metadata, cap attribute counts, and batch exports for throughput efficiency.
- Exporters: push transformed data to backends. The OTLP exporter speaks to Grafana Tempo, Honeycomb, and any OTel-native backend. The Prometheus exporter exposes a scrape endpoint. The debug exporter prints to stdout, which is invaluable during development.
Pipelines are declared per signal type (traces, metrics, logs) and can fan out to multiple exporters simultaneously. Referencing the same processor definition in multiple pipelines lets you enforce a single normalisation rule across all signal types.
The Collector also supports extensions, such as a health check endpoint (health_check), a pprof profiler endpoint (pprof), and a zPages debug UI (zpages). Extensions run alongside pipelines but are not part of the data flow.
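The three stages and the per-signal wiring can be sketched in a few lines of configuration. This is a minimal illustration, not a deployable file: the backend address backend.example.com:4317 is a hypothetical placeholder, and the choice of the batch processor is only an example of a processor shared across pipelines.

```yaml
# Minimal wiring sketch. The backend endpoint is a hypothetical placeholder.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}   # defined once, referenced by both pipelines below

exporters:
  otlp:
    endpoint: backend.example.com:4317   # hypothetical OTLP backend
  debug: {}   # prints telemetry to stdout

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp, debug]   # fan-out: every span goes to both exporters
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

Components are declared at the top level and only become active once a pipeline under service references them, so an unused receiver or exporter definition has no effect.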
A Production-Grade Collector Configuration
The following is a realistic otelcol-config.yaml that you would deploy as a DaemonSet or sidecar in Kubernetes. It covers the most important processors and a multi-backend export setup.
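A sketch of what such a file might contain is shown here, under stated assumptions: the Tempo address tempo.observability.svc:4317, the memory percentages, the batch sizes, and the choice of backends are all illustrative values, not recommendations from a specific deployment.

```yaml
# Sketch only: endpoints, limits, and backend names are illustrative assumptions.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:             # listed first in every pipeline below
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  k8sattributes:
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.deployment.name
  batch:
    send_batch_size: 8192
    timeout: 5s

exporters:
  otlp/tempo:
    endpoint: tempo.observability.svc:4317   # hypothetical Tempo address
    tls:
      insecure: true                         # assumes in-cluster plaintext traffic
  prometheus:
    endpoint: 0.0.0.0:8889                   # scrape endpoint exposed by the Collector
  debug:
    verbosity: basic

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: localhost:1777
  zpages:
    endpoint: localhost:55679

service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlp/tempo, debug]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [prometheus]
```

Processor order within a pipeline is the order of execution, which is why batch comes last: data is enriched first and then grouped for export.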
Put memory_limiter first in every pipeline. If a traffic spike outpaces what the exporters can send, data piles up in the Collector's internal queues and memory usage climbs. Without memory_limiter, the process is eventually OOM-killed and everything in its queues is lost. With it, the Collector starts refusing new data gracefully (returning a retryable error to the SDK) before it runs out of memory. Omitting this processor is one of the most common production Collector misconfigurations.
Deployment Patterns
How you deploy the Collector determines its operational characteristics. Three patterns dominate production environments:
- DaemonSet (Agent mode): one Collector pod per node. Each pod receives telemetry from all applications on that node. Low network hops, can enrich spans with node-level metadata, tolerates Collector restarts with minimal blast radius. The recommended default in Kubernetes. Managed by the OpenTelemetry Operator via the OpenTelemetryCollector CRD with mode: daemonset.
- Sidecar mode: one Collector container per application pod. Maximum isolation; ideal for multi-tenant clusters where teams must not share a pipeline. Higher resource overhead. Use for security-sensitive workloads or when you need per-service sampling policies.
- Gateway (Deployment) mode: a central, horizontally-scaled Collector fleet. All node-level Collectors forward to it via OTLP. The gateway enforces cluster-wide sampling, PII scrubbing, and fan-out to multiple backends. Enables stateful processors like tail_sampling that need to see all spans of a trace before making a sampling decision. In large clusters (100+ nodes), this two-tier topology (agent + gateway) is standard.
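The agent tier of the two-tier topology might be declared through the Operator roughly as follows. This is a sketch under assumptions: the resource names, the observability namespace, and the gateway Service address otel-gateway-collector.observability.svc:4317 are hypothetical, and the apiVersion depends on the Operator release installed (older releases use v1alpha1, where config is a single YAML string rather than a structured object).

```yaml
# Agent-tier sketch. Names, namespace, and gateway address are hypothetical.
apiVersion: opentelemetry.io/v1beta1   # assumption: an Operator release that serves v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-agent
  namespace: observability
spec:
  mode: daemonset          # one Collector pod per node
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 80
        spike_limit_percentage: 25
      batch: {}
    exporters:
      otlp:
        # Forward everything to the gateway fleet over OTLP.
        endpoint: otel-gateway-collector.observability.svc:4317
        tls:
          insecure: true   # assumes plaintext in-cluster traffic
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp]
```

The gateway would be a second OpenTelemetryCollector resource with mode: deployment, carrying the sampling and fan-out configuration so that agents stay small and stateless.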
Key Processors You Must Know
Beyond the basics, three processors define production-quality pipelines:
- k8sattributes: auto-enriches every span and log with k8s.pod.name, k8s.namespace.name, k8s.deployment.name, and labels like app.version. Requires a ClusterRole with get/list/watch on pods. Without this, correlating a Tempo trace to the Kubernetes workload that produced it requires painful manual cross-referencing.
- tail_sampling: makes sampling decisions after seeing the complete trace (unlike head-based sampling, which decides at the first span). Policy types include latency (for example, keep any trace over 200 ms), status_code (keep all traces with at least one error span), and probabilistic (keep 1% of healthy fast traces). Must run in Gateway mode so the Collector can buffer all spans of a trace before deciding. This is the most operationally powerful processor: it lets you sample intelligently without losing the traces you actually need.
- spanmetrics (a connector, not a processor): derives RED metrics (rate, error rate, duration histogram) directly from trace spans, without extra application instrumentation. Emits traces_spanmetrics_calls_total and traces_spanmetrics_duration_milliseconds. This is how large teams get service-level metrics for free from the tracing pipeline.
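A gateway-tier fragment combining tail_sampling and spanmetrics could look like the sketch below. The thresholds mirror the examples above; the policy names, the decision_wait value, the Tempo address, and the namespace setting on the connector are assumptions made for illustration. Emitted metric names depend on that namespace setting and on the exporter, so they may differ between Collector versions.

```yaml
# Gateway-tier fragment. Endpoints, names, and thresholds are illustrative assumptions.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  tail_sampling:
    decision_wait: 10s          # how long spans are buffered before a decision
    policies:                   # a trace is kept if any policy matches
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 200
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 1

connectors:
  spanmetrics:
    namespace: traces.spanmetrics   # assumed value; sets the metric name prefix

exporters:
  otlp/tempo:
    endpoint: tempo.observability.svc:4317   # hypothetical Tempo address
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces/red:                 # unsampled spans feed the RED metrics
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [spanmetrics]
    traces/sampled:             # only sampled traces reach the tracing backend
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling]
      exporters: [otlp/tempo]
    metrics/spanmetrics:        # the connector acts as the receiver here
      receivers: [spanmetrics]
      exporters: [prometheus]
```

The connector is fed from a separate, unsampled traces pipeline on purpose: if spanmetrics sat behind tail_sampling, request rates and error ratios would be computed only from the traces that survived sampling and would be skewed toward slow and failed requests.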
Run otelcol validate --config otelcol-config.yaml locally or in CI. The Collector will exit with a clear error message for typos, unknown component names, or pipeline wiring mistakes. Adding this as a CI step prevents rolling out a broken pipeline to production. A misconfigured Collector can silently drop all telemetry, which you may not discover until an incident, when you go looking for traces that are not there.
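As one possible way to wire this into CI, the sketch below assumes GitHub Actions, a config file at the repository root, and the public otel/opentelemetry-collector-contrib container image, whose entrypoint is the Collector binary. None of these choices come from the text above; adapt the workflow name, paths, and image tag to your own setup.

```yaml
# CI sketch. Assumes GitHub Actions and the contrib Collector image.
name: validate-collector-config
on:
  pull_request:
    paths:
      - otelcol-config.yaml
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate Collector config
        # The image entrypoint is the Collector binary, so the arguments
        # below are passed straight to its validate subcommand.
        run: |
          docker run --rm \
            -v "${PWD}/otelcol-config.yaml:/etc/otelcol/config.yaml:ro" \
            otel/opentelemetry-collector-contrib:latest \
            validate --config=/etc/otelcol/config.yaml
```

Pinning the image tag to the Collector version that runs in production, rather than latest, keeps validation results consistent with what will actually be deployed.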
The OpenTelemetry Collector is deceptively simple to start with — a single binary, a YAML file — and extremely powerful to operate at scale. Mastering its processor chain and deployment topology is a core DevOps skill: it is the difference between an observability stack that degrades under load and one that remains the last reliable source of truth exactly when you need it most.