Logging at Scale: ELK & Loki

Querying & Investigating with Logs

Knowing how to ingest and store logs is the foundation. Knowing how to query them under pressure — during an active incident at 3 AM, when the CEO is asking what broke — is the operational skill that separates senior engineers from everyone else. This lesson is a deep dive into effective log search: the query languages, the debugging workflows, and the habits that let you move from "something is wrong" to "root cause identified" in minutes rather than hours.

Two Query Languages: LogQL and KQL

Your choice of storage backend determines your query language. Loki uses LogQL; Elasticsearch is usually queried through Kibana with KQL (Kibana Query Language) or the full Lucene syntax. Both follow the same conceptual pattern: first restrict the search space with label or index filters, then apply content filters over the narrowed result set. Queries that skip the first step and scan all content without a narrowing label or index filter are a very common cause of slow queries at scale.

LogQL (Grafana Loki)

A LogQL query has two parts: a mandatory stream selector (curly-brace label matchers) and an optional log pipeline (pipe-delimited stages). The stream selector is evaluated against the index, so it is cheap. The pipeline stages run over the raw chunk data, so they are expensive and should be applied to as little data as possible.

# ── LogQL: progressive narrowing ─────────────────────────────────────────────

# 1. Stream selector only — returns all lines from payment-api in production
{app="payment-api", env="production"}

# 2. Add a content filter — scans only matching chunks for the keyword
{app="payment-api", env="production"} |= "timeout"

# 3. Parse JSON then filter on structured fields (more precise than matching raw text with regex)
{app="payment-api", env="production"}
  | json
  | level="error"
  | duration_ms > 2000

# 4. Aggregate: per-second request rate over 1-minute windows, broken down by HTTP status (metric query)
sum by (status_code) (
  rate(
    {app="payment-api", env="production"} | json | __error__="" [1m]
  )
)

# 5. Pattern extraction for unstructured NGINX logs
{app="nginx", env="production"}
  | pattern `<client> - - [<_>] "<method> <path> <_>" <status> <bytes>`
  | status >= 500

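Pro practice: In Loki, make the stream selector as selective as possible. If only one namespace is relevant, add namespace="checkout" alongside env="production" so that fewer streams and chunks are fetched. The order of matchers inside the curly braces does not change the result; what matters is how many streams the selector as a whole matches. Order does matter in the pipeline: put cheap line filters such as |= "timeout" before parsers such as | json, so that expensive stages run only on lines that survive the filter. This filter-early principle, often called filter pushdown in database systems, applies to every log query engine.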
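
The progressive narrowing shown in the LogQL examples above can be sketched as a small query builder. This is a minimal illustration, not part of any Loki client library: the function name build_logql and its parameters are hypothetical, and values are not escaped, so it assumes they contain no double quotes.

```python
def build_logql(labels, line_contains=None, parser=None, field_filters=()):
    """Assemble a LogQL log query in cheapest-first order.

    Order of stages: stream selector, line filter, parser, field filters.
    Hypothetical helper for illustration; performs no escaping.
    """
    if not labels:
        raise ValueError("at least one label matcher is required")
    selector = ", ".join(f'{key}="{value}"' for key, value in labels.items())
    stages = []
    if line_contains:
        # Cheap substring scan runs before any parsing.
        stages.append(f'|= "{line_contains}"')
    if parser:
        stages.append(f"| {parser}")
    # Filters on extracted fields only make sense after the parser stage.
    stages.extend(f"| {expr}" for expr in field_filters)
    return " ".join(["{" + selector + "}"] + stages)


query = build_logql(
    {"app": "payment-api", "env": "production"},
    line_contains="timeout",
    parser="json",
    field_filters=['level="error"', "duration_ms > 2000"],
)
print(query)
# {app="payment-api", env="production"} |= "timeout" | json | level="error" | duration_ms > 2000
```

Keeping the stage order inside one helper means nobody on call has to remember it at 3 AM.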

KQL (Kibana / Elasticsearch)

KQL is a simplified query language layered over Lucene. It maps intuitively to field-level searches, ranges, and boolean logic. For power users, the full Lucene syntax (enabled via the toggle in Kibana Discover) adds fuzzy matching, proximity searches, and wildcard patterns. Both are translated into Elasticsearch DSL queries internally.

# ── KQL / Lucene examples in Kibana Discover ──────────────────────────────────

# Simple keyword match in any field
timeout

# Field-specific match (KQL)
level: error AND service.name: payment-api

# Range query (KQL): slow server errors; the time window (e.g. last 15 minutes) is set in the Kibana time picker
level: error AND http.response.status_code >= 500
  AND http.response.time_ms > 2000

# Wildcard — find any DB-related error message
message: *database* AND level: error

# Phrase match — the words must appear in this exact order
message: "connection refused"

# Exists check — only logs that have a trace ID (correlated requests)
trace.id: *

# ── Elasticsearch Query DSL (used in alerts, API calls, Kibana Lens) ──────────
# Equivalent to: level = error AND response time >= 2 s within the last 5 minutes
POST /logs-production-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "level": "error" } },
        { "range": { "http.response.time_ms": { "gte": 2000 } } },
        { "range": { "@timestamp": { "gte": "now-5m" } } }
      ]
    }
  },
  "sort": [{ "@timestamp": "desc" }],
  "size": 100
}

Key idea: In the Elasticsearch Query DSL, clauses in filter context (as opposed to must) do not compute relevance scores, and frequently used filters can be cached by Elasticsearch, so they are usually faster. Prefer filter for all structured-field queries. Use must (which scores) only when you need full-text relevance ranking, which is rare in operational log search.
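
The filter-only search body from the DSL example above can also be assembled programmatically, for instance when the same query backs both an ad hoc search and a scheduled check. The following is a sketch that only builds the JSON body; the function name error_search_body is hypothetical, the field names are the ones used in the example above, and no request is sent.

```python
import json


def error_search_body(min_ms, window="now-5m", size=100):
    """Build a bool query that uses only filter context (no scoring)."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"level": "error"}},
                    {"range": {"http.response.time_ms": {"gte": min_ms}}},
                    {"range": {"@timestamp": {"gte": window}}},
                ]
            }
        },
        "sort": [{"@timestamp": "desc"}],
        "size": size,
    }


body = error_search_body(2000)
print(json.dumps(body, indent=2))
```

Because every clause sits under filter, there is no must clause to trigger relevance scoring.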

The Log-Driven Debugging Workflow

A structured debugging workflow is not optional at big-tech scale — it is what stops you from thrashing randomly through logs while MTTR climbs. Top-tier SRE teams follow a consistent pattern regardless of the logging backend:

Five-step log-driven debugging workflow used by SRE teams at production scale.

Scope: Set the time window to start just before the alert fired. Lock in the relevant service labels. Query cost grows roughly in proportion to the amount of data scanned, so broad time windows get expensive quickly; start narrow and expand only if needed.
Volume: Run an error-rate query aggregated over time (e.g., rate({app="checkout"} | json | level="error" [1m])). Identify the exact minute the error rate spiked. This prevents you from wasting time on unrelated noise in the same window.
Filter: Drill into the spike minute. Apply keyword and field filters to isolate the error class. Look at a handful of raw log lines — the actual error message, stack trace, or upstream host is almost always visible here.
Correlate: Take a trace_id, request_id, or user_id from one of the failing log lines. Query across all services for that identifier. This reconstructs the full request path and reveals which service actually injected the fault, versus which services are downstream victims.
RCA: Establish the timeline: when did the first anomalous log appear? What changed (deploy, config push, traffic spike, certificate expiry)? What is the blast radius (how many users/requests were affected)? Document this for the postmortem.
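
The Volume step above comes down to finding the first minute whose error count stands well above the baseline. A minimal sketch follows, assuming the per-minute counts have already been fetched from the logging backend; the function name first_spike, the median baseline, and the 3x factor are illustrative choices rather than an established standard, and the sample data is made up.

```python
from statistics import median


def first_spike(counts, factor=3.0):
    """Return the first minute whose error count exceeds factor x the median.

    counts is a list of (minute_label, error_count) pairs in time order.
    Returns None when no minute stands out.
    """
    if not counts:
        return None
    baseline = median(count for _, count in counts)
    # Floor the threshold at 1 so an all-zero baseline does not flag noise.
    threshold = max(baseline * factor, 1)
    for minute, count in counts:
        if count > threshold:
            return minute
    return None


per_minute = [
    ("03:00", 4), ("03:01", 5), ("03:02", 3),
    ("03:03", 41), ("03:04", 57), ("03:05", 6),
]
spike = first_spike(per_minute)
print(spike)  # 03:03
```

The returned minute becomes the narrow window for the Filter step.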

Cross-Service Correlation with Trace IDs

The most powerful capability in a modern observability stack is jumping from a log line to the full distributed trace and back. This works only if every log line emitted by every service carries a trace_id that matches the trace recorded by Tempo or Jaeger. In practice this means your application framework (OpenTelemetry SDK, Spring Cloud Sleuth, etc.) must inject the active trace context into the log MDC/context map, and your structured log format must emit it as a top-level JSON field.
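
To make the injection step concrete, here is a sketch using only the Python standard library. It assumes the trace ID is held in a context variable; in a real service a tracing SDK would supply it, and the names current_trace_id, TraceIdFilter, JsonFormatter, and make_logger are hypothetical. The JSON is written in compact form so that the field appears as "trace_id":"..." with no space after the colon.

```python
import contextvars
import io
import json
import logging

# Stand-in for the active trace context that a tracing SDK would manage.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Copy the active trace ID onto every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit each record as one compact JSON object with trace_id at top level."""

    def format(self, record):
        payload = {
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload, separators=(",", ":"))


def make_logger(stream):
    logger = logging.getLogger("payment-api-sketch")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = make_logger(buffer)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.error("upstream timeout")
line = buffer.getvalue().strip()
print(line)
```

A regex that expects "trace_id":"<32 hex chars>" matches this compact form but would miss the default json.dumps output, which puts a space after the colon.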

In Grafana, the Loki data source can be configured with a derived field that turns every trace_id value in a log line into a clickable link that opens Grafana Tempo at that exact trace. This eliminates the manual copy-paste step and shortens the path from a failing log line to the responsible span during complex distributed failures.

# ── Grafana Loki datasource derived field (datasource provisioning YAML) ──────
# This turns every trace_id in a log line into a clickable Tempo link
# ($$ escapes the $ so provisioning does not expand it as an environment variable)

apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"trace_id":"([a-f0-9]{32})"'
          url: '$${__value.raw}'
          datasourceUid: tempo   # Tempo datasource UID — enables jump-to-trace

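# ── LogQL: find all logs tied to a specific trace across every service ────────
# The |= line filter discards non-matching lines before the JSON parser runs
{env="production"} |= "4bf92f3577b34da6a3ce929d0e0e4736" | json | trace_id="4bf92f3577b34da6a3ce929d0e0e4736"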

# ── KQL equivalent in Kibana ─────────────────────────────────────────────────
trace.id: "4bf92f3577b34da6a3ce929d0e0e4736"
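
The same correlation can be reproduced offline on exported log lines. The sketch below reuses the matcherRegex from the derived-field configuration above to pull out trace IDs, then orders each trace's events by time to reconstruct the request path. The ts and service fields, the sample lines, and the function name group_by_trace are assumptions made for illustration.

```python
import json
import re
from collections import defaultdict

# Same pattern as the derived field's matcherRegex above.
TRACE_RE = re.compile(r'"trace_id":"([a-f0-9]{32})"')


def group_by_trace(lines):
    """Group JSON log lines by trace ID, each group sorted by timestamp."""
    grouped = defaultdict(list)
    for line in lines:
        match = TRACE_RE.search(line)
        if match:
            grouped[match.group(1)].append(json.loads(line))
    for events in grouped.values():
        # ISO-8601 UTC timestamps in one format sort correctly as strings.
        events.sort(key=lambda event: event["ts"])
    return dict(grouped)


TRACE = "4bf92f3577b34da6a3ce929d0e0e4736"
raw = [
    {"ts": "2024-01-01T03:00:02Z", "service": "payment-api",
     "trace_id": TRACE, "message": "upstream timeout"},
    {"ts": "2024-01-01T03:00:01Z", "service": "checkout",
     "trace_id": TRACE, "message": "calling payment-api"},
    {"ts": "2024-01-01T03:00:01Z", "service": "search",
     "message": "no trace context"},
]
lines = [json.dumps(record, separators=(",", ":")) for record in raw]
path = [event["service"] for event in group_by_trace(lines)[TRACE]]
print(path)  # ['checkout', 'payment-api']
```

Lines without a trace_id are dropped, which is the reason every service has to emit the field.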

Alerting from Log Queries

A log query that you run manually during an incident is only half the value. The same query, running on a schedule, becomes a proactive alert that pages you before a user files a ticket. Both Grafana and Kibana support log-based alerting rules. In Grafana, a LogQL metric query can back a standard alert rule; in Kibana, Alerting rules evaluate KQL/ES|QL queries on a configurable schedule.

Production pitfall: Log-based alert rules that are too broad (e.g., alerting on any single error log line) generate severe alert fatigue. Always express the alert as a rate or count threshold over a time window, for example "more than 50 errors per minute sustained for 3 minutes." In Grafana alert rules, set for: 3m (the pending period) to require the condition to hold continuously before the alert fires. This one setting filters out most pages caused by brief spikes.
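
The effect of a sustained-threshold rule can be shown with a simplified model. The sketch below treats the input as one error-count sample per minute and fires only after the threshold has been exceeded for a given number of consecutive samples. It approximates the idea behind a pending period and does not reproduce the exact evaluation timing of any alerting engine; the function name should_fire is hypothetical.

```python
def should_fire(errors_per_minute, threshold=50, pending_minutes=3):
    """Return True once the threshold is exceeded for pending_minutes samples in a row."""
    streak = 0
    for value in errors_per_minute:
        # Any sample at or below the threshold resets the streak.
        streak = streak + 1 if value > threshold else 0
        if streak >= pending_minutes:
            return True
    return False


brief_spikes = [10, 80, 12, 90, 20]
sustained = [10, 60, 75, 90, 20]
print(should_fire(brief_spikes))  # False
print(should_fire(sustained))     # True
```

Isolated one-minute spikes never page anyone, while three breaching minutes in a row do.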

Common Query Anti-Patterns

These are the mistakes that make incident investigations slow and logging systems expensive:

Querying without a time bound. An unbounded query in Loki has to consider every retained chunk of the matching streams. Always set a time range in the Grafana time picker or pass start/end to the Loki HTTP API explicitly.
Using regex when a string match suffices. |= "error" is a simple substring scan; |~ "err.*r" has to run a regular-expression engine (Go's RE2-syntax regexp package) on every line. On gigabytes of logs this can make a query noticeably slower. Only use regex when the pattern genuinely requires it.
High-cardinality labels in Loki. Adding user_id or request_id as Loki stream labels creates millions of streams and collapses query performance. Put high-cardinality data in the log line body (parsed with | json), not the label set.
Forgetting to check for parser errors. When LogQL parses a JSON log line that is malformed, it sets the __error__ label. Including | __error__="" in rate queries ensures you are counting real events, not parse failures masquerading as gaps in your data.
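
For the first anti-pattern, an explicit time bound on the Loki HTTP API can look like the sketch below. It builds a query_range URL with start and end as Unix timestamps in nanoseconds and sends no request. The base URL http://loki:3100 is taken from the datasource example earlier in this lesson; the function name query_range_url and the 15-minute lookback are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlencode, urlsplit


def query_range_url(base_url, query, end, lookback, limit=100):
    """Build a bounded /loki/api/v1/query_range URL (whole-second precision)."""
    start = end - lookback
    params = {
        "query": query,
        # Integer math avoids float rounding at nanosecond scale.
        "start": int(start.timestamp()) * 1_000_000_000,
        "end": int(end.timestamp()) * 1_000_000_000,
        "limit": limit,
    }
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"


alert_time = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
url = query_range_url(
    "http://loki:3100",
    '{app="payment-api", env="production"} |= "timeout"',
    end=alert_time,
    lookback=timedelta(minutes=15),
)
print(url)
```

Deriving start from the alert time and a fixed lookback keeps the window narrow by default; widen the lookback only if the first query comes back empty.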

Key idea: Every query you write during an incident should answer a specific question. Before hitting Enter, ask yourself: "What am I expecting to see, and what will I do if I see something different?" This discipline — forming a hypothesis before running a query — is what keeps investigation time bounded. Random exploration of logs without a hypothesis can consume hours without producing an answer.