Asynchronous Processing & Messaging

Stream Processing

18 min Lesson 8 of 10

Every system you have designed so far processes data in one of two ways: you either store it first and query it later (batch processing), or you act on it the moment it arrives (stream processing). Batch processing is straightforward — run a SQL query at 2 a.m., generate a report, done. But when a fraud signal must be acted on within 200 milliseconds of a transaction, or when a dashboard must reflect the last 60 seconds of user activity, batching is simply too slow. Stream processing is the answer.

At its core, stream processing means treating data as an infinite, continuous flow of events and computing results over that flow in near real-time — without waiting for the stream to end (it never does). The computations happen on data in motion, not data at rest.

Batch vs Stream: The Key Trade-Off

Batch and stream processing are not competing philosophies — they are tools chosen by latency requirements. The dividing line is roughly:

  • Batch: acceptable latency measured in minutes to hours. High throughput, simple fault recovery, easy to re-run. Classic examples: nightly ETL pipelines, monthly billing runs, offline machine-learning training.
  • Stream: required latency measured in milliseconds to seconds. Continuous output, complex fault recovery, harder to debug. Classic examples: fraud detection, live leaderboards, real-time recommendation updates, alerting systems.

Many modern architectures use both: a stream layer feeds a real-time view, while a batch layer periodically recomputes the authoritative historical view. This is the core idea behind the Lambda architecture (covered in Lesson 7).

[Figure: Batch processing vs stream processing data flow. Batch: Event Source → Storage (accumulate hours) → Batch Job → Result; latency: hours. Stream: Event Source → Stream Processor (each event) → Live Result / Continuous Output; latency: ms–sec.]
Batch processing accumulates data then computes; stream processing emits results continuously as each event arrives.
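
To make the contrast concrete, here is a minimal sketch in plain Python rather than any real framework. The function names and the sample amounts are made up for illustration. The batch version needs the full data set before it can answer; the streaming version keeps a small amount of state and emits an updated answer after every event.

```python
from typing import Iterable, Iterator


def batch_mean(stored: list[float]) -> float:
    """Batch: all data is at rest, so compute once over the full set."""
    return sum(stored) / len(stored)


def streaming_mean(events: Iterable[float]) -> Iterator[float]:
    """Stream: emit an updated mean after every event, keeping O(1) state."""
    count, total = 0, 0.0
    for value in events:
        count += 1
        total += value
        yield total / count


amounts = [10.0, 20.0, 30.0, 40.0]               # hypothetical transaction amounts
batch_result = batch_mean(amounts)               # one answer, after everything is stored
stream_results = list(streaming_mean(amounts))   # one answer per event
```

The final streaming value equals the batch result; the difference is that the stream produced three useful intermediate answers before the last event arrived.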

Core Concepts

Windows

You can rarely compute a meaningful result over a single event in isolation. Stream processors apply windows — bounded slices of the infinite stream — to aggregate groups of events:

  • Tumbling window: Fixed, non-overlapping intervals. "Count clicks every 60 seconds." Simple, but a spike at the boundary of two windows is split across both.
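  • Sliding window: A window of fixed size that advances by a slide interval shorter than its length (in the limit, with every event). "Count clicks in the last 60 seconds at all times." More accurate, but more expensive: the windows overlap, so each event is counted in several of them.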
  • Session window: Groups events by activity gap. "Aggregate all clicks from this user until they are idle for 30 minutes." Natural for user-session analytics.
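
The three window types can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not how any particular framework implements them: timestamps are integer seconds, the input is a finite list, and the sliding window advances by a fixed `slide` interval. All function names are hypothetical.

```python
from collections import defaultdict


def tumbling_counts(timestamps, size):
    """Count events per fixed, non-overlapping window, keyed by window start."""
    counts = defaultdict(int)
    for ts in timestamps:
        counts[ts - ts % size] += 1
    return dict(counts)


def sliding_counts(timestamps, size, slide):
    """Count events per overlapping window [start, start + size), one start every `slide`."""
    counts = defaultdict(int)
    for ts in timestamps:
        start = ts - ts % slide        # latest window start that contains ts
        while start > ts - size:       # walk back over every window still covering ts
            counts[start] += 1
            start -= slide
    return dict(counts)


def session_windows(timestamps, gap):
    """Group timestamps into sessions; a new session starts after more than `gap` of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


clicks = [5, 58, 61, 62, 130]   # hypothetical click times in seconds
tumbling = tumbling_counts(clicks, size=60)
sliding = sliding_counts(clicks, size=60, slide=30)
sessions = session_windows(clicks, gap=30)
```

With this input, the burst at 58, 61, and 62 seconds is split across the tumbling windows starting at 0 and 60, while the sliding window starting at 30 sees all three clicks together.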

State

Many stream operations are stateful: a count, a running average, or a join against another stream requires remembering previous events. Stream processors store this state locally (e.g., RocksDB inside each Flink task) and periodically checkpoint it to durable storage together with the source offsets it corresponds to. If a node fails, state is restored from the last checkpoint and processing resumes from the Kafka offsets recorded in that checkpoint, so every event is reflected in the state exactly once. Extending that guarantee end-to-end additionally requires a sink that is transactional or idempotent.
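
The checkpoint-and-replay idea can be sketched with a toy operator. This is a simplified model, not Flink's implementation: the "log" is a Python list standing in for a replayable Kafka partition, state lives in a dictionary rather than RocksDB, and the class and method names are hypothetical. The essential point is that state and source offset are snapshotted together.

```python
import copy


class CountingOperator:
    """Keyed running count with checkpoint and restore (illustrative only)."""

    def __init__(self):
        self.counts = {}   # local keyed state
        self.offset = 0    # next position to read from the replayable log

    def process(self, log, upto=None):
        """Consume log entries from the current offset up to `upto` (exclusive)."""
        end = len(log) if upto is None else upto
        while self.offset < end:
            key = log[self.offset]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1

    def checkpoint(self):
        """Snapshot state and source offset as one consistent unit."""
        return {"counts": copy.deepcopy(self.counts), "offset": self.offset}

    def restore(self, snapshot):
        """Reset state and offset to what the snapshot recorded."""
        self.counts = copy.deepcopy(snapshot["counts"])
        self.offset = snapshot["offset"]


log = ["card-a", "card-b", "card-a", "card-a", "card-b"]

op = CountingOperator()
op.process(log, upto=2)
snapshot = op.checkpoint()   # durable snapshot taken after two events
op.process(log, upto=4)      # two more events processed, then the node "crashes"

recovered = CountingOperator()
recovered.restore(snapshot)  # state and offset rewind together
recovered.process(log)       # events 2..4 are replayed against the restored state
```

Because the offset rewinds with the state, the replayed events are not double-counted. Any output the failed operator had already sent downstream is a separate problem, which is why the sink matters for end-to-end guarantees.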

Event Time vs Processing Time

Events have two timestamps: event time (when it happened in the real world) and processing time (when the processor received it). A mobile app operating offline for 10 minutes will deliver a burst of late-arriving events when it reconnects. If you window by processing time, those events land in the wrong window. Windowing by event time is more accurate, but requires the processor to wait for watermarks — a heuristic that says "all events up to time T have now arrived." After the watermark, late events are either dropped or trigger a window recomputation.

Key idea: In real distributed systems, clocks drift and networks delay packets. Always design your stream pipeline around event time with a configurable allowed lateness. Processing-time windows are simpler but will silently mis-attribute late-arriving events.
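
A minimal sketch of event-time windowing with a watermark and allowed lateness follows. It assumes a simple heuristic watermark (highest event time seen minus a fixed out-of-orderness bound) and tumbling windows; the class name, parameters, and sample timestamps are hypothetical and do not correspond to any framework's API.

```python
class EventTimeWindower:
    """Tumbling event-time counts with a heuristic watermark (illustrative only)."""

    def __init__(self, size, max_out_of_orderness, allowed_lateness):
        self.size = size
        self.max_out_of_orderness = max_out_of_orderness
        self.allowed_lateness = allowed_lateness
        self.watermark = float("-inf")
        self.counts = {}    # window start -> event count
        self.dropped = []   # events that arrived after their window expired

    def on_event(self, event_time):
        start = event_time - event_time % self.size
        window_end = start + self.size
        if window_end + self.allowed_lateness <= self.watermark:
            # The window is past its allowed lateness: the event is too late.
            self.dropped.append(event_time)
        else:
            # On time, or late but within allowed lateness: update the window.
            self.counts[start] = self.counts.get(start, 0) + 1
        # Advance the watermark: "events up to this time have probably arrived".
        self.watermark = max(self.watermark, event_time - self.max_out_of_orderness)


windower = EventTimeWindower(size=60, max_out_of_orderness=5, allowed_lateness=30)
for event_time in [10, 50, 70, 55, 200, 20]:   # 55 is slightly late, 20 is very late
    windower.on_event(event_time)
```

The event at time 55 arrives after the watermark has passed the end of its window, but within the allowed lateness, so the window is recomputed. The event at time 20 arrives after the watermark has moved far ahead and is dropped.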

Major Stream Processing Frameworks

Three frameworks dominate production stream processing today:

  • Apache Flink: widely regarded as the reference engine for stateful stream processing. True event-time semantics, exactly-once end-to-end (with Kafka), millisecond latency, and a rich windowing and join API. Used in production at Alibaba, Netflix, and Uber. The mental model is a directed acyclic graph (DAG) of operators; each operator manages its own state.
  • Apache Spark Structured Streaming: extends the batch Spark API to streams by processing data in micro-batches of 100 ms–1 s. Easier to adopt if you already use Spark for batch. Latency is higher than Flink's because of the micro-batch interval, but the unified batch/stream API reduces complexity in Lambda-architecture pipelines.
  • Kafka Streams / ksqlDB: Kafka Streams is a lightweight library (not a separate cluster) that runs inside your own application JVMs, reading from and writing to Kafka, so there is no separate processing infrastructure to operate. ksqlDB adds a SQL layer on top of it, deployed as its own server. Excellent for per-service stream transformations; less suitable for massive cross-service aggregations that Flink handles elegantly.
[Figure: Apache Flink stream processing pipeline architecture. Kafka Source Topics → Apache Flink Job (DAG of operators: Parse & Filter → Tumbling Window (60s) → Aggregate (stateful) → Enrich (join), backed by a RocksDB state store) → Kafka Sink Topic, Dashboard (WebSocket), and Alert Service (threshold).]
A Flink job reads from Kafka, applies a DAG of stateful operators (parse, window, aggregate, enrich), and fans results to multiple sinks simultaneously.
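
The shape of such a pipeline can be mimicked with chained Python generators. This is only a sketch of the operator-graph idea, not Flink code: the event fields (`type`, `ts`, `page`), the sink functions, and the alert threshold are all assumptions, and the window operator emits its results only once the finite input ends, whereas a real job would emit as watermarks advance.

```python
import json
from collections import defaultdict


def parse_and_filter(raw_lines):
    """Operator 1: parse JSON, skip malformed lines, keep only click events."""
    for line in raw_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if event.get("type") == "click":
            yield event


def window_aggregate(events, size=60):
    """Operators 2 and 3: assign tumbling windows and count clicks per (window, page)."""
    counts = defaultdict(int)
    for event in events:
        start = event["ts"] - event["ts"] % size
        counts[(start, event["page"])] += 1
    for (start, page), clicks in sorted(counts.items()):
        yield {"window_start": start, "page": page, "clicks": clicks}


def fan_out(results, sinks):
    """Send every result to every sink."""
    for result in results:
        for sink in sinks:
            sink(result)


raw = [
    '{"type": "click", "ts": 5, "page": "/home"}',
    "not json",
    '{"type": "view", "ts": 6, "page": "/home"}',
    '{"type": "click", "ts": 30, "page": "/home"}',
    '{"type": "click", "ts": 65, "page": "/home"}',
]

dashboard, alerts = [], []


def alert_sink(result, threshold=2):
    """Hypothetical alert rule: flag windows with at least `threshold` clicks."""
    if result["clicks"] >= threshold:
        alerts.append(result)


fan_out(window_aggregate(parse_and_filter(raw)), [dashboard.append, alert_sink])
```

Each stage only knows about its input and output, which is what lets a real engine parallelize and checkpoint the operators independently.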

Real-World Example: Fraud Detection at Scale

A payment processor like Stripe handles roughly 250 million transactions per day — about 2,900 per second at average load, with peaks far higher. Detecting fraud synchronously in the payment path is not feasible: the ML scoring model needs context from the last 30 transactions by the same card, the last 10 minutes of activity on the merchant, and historical velocity patterns. Fetching all of that synchronously on every transaction introduces unacceptable latency.

A stream-processing architecture takes a different route:

  1. Every transaction event is published to Kafka immediately after authorization.
  2. A Flink job maintains keyed state per card and per merchant — running counts, velocity windows, anomaly scores — updated with each incoming event.
  3. When a threshold is crossed (e.g., 5 transactions from the same card in 2 minutes to different merchants), Flink emits a fraud signal to a downstream Kafka topic in under 50 ms.
  4. A separate service consumes that topic and triggers a hold on the card.

The key insight: the expensive per-card state is maintained continuously by the stream processor, so scoring any individual transaction is cheap — a single state lookup, not a multi-table query.
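
The velocity rule from step 3 can be sketched as keyed state in plain Python. This is an illustration of the idea, not a production detector: it assumes transactions for a given card arrive in timestamp order, keeps state in memory, and uses hypothetical names and thresholds taken from the example rule (5 distinct merchants within 2 minutes).

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 120          # "in 2 minutes"
MAX_DISTINCT_MERCHANTS = 5    # "5 transactions ... to different merchants"


class VelocityDetector:
    """Per-card state updated on every event, so each check is a cheap lookup."""

    def __init__(self):
        self.recent = defaultdict(deque)   # card_id -> deque of (ts, merchant_id)

    def on_transaction(self, card_id, merchant_id, ts):
        """Update the card's state and return True if the rule fires."""
        window = self.recent[card_id]
        window.append((ts, merchant_id))
        # Evict transactions that have fallen out of the 2-minute window.
        while window and window[0][0] <= ts - WINDOW_SECONDS:
            window.popleft()
        distinct_merchants = {merchant for _, merchant in window}
        return len(distinct_merchants) >= MAX_DISTINCT_MERCHANTS


detector = VelocityDetector()

# Five different merchants within 80 seconds: the fifth transaction trips the rule.
burst = [detector.on_transaction("card-1", f"merchant-{i}", ts=i * 20) for i in range(5)]

# Five transactions at the same merchant: only one distinct merchant, no signal.
same = [detector.on_transaction("card-2", "merchant-0", ts=i * 20) for i in range(5)]

# Five different merchants spread over ten minutes: older ones are evicted, no signal.
slow = [detector.on_transaction("card-3", f"merchant-{i}", ts=i * 150) for i in range(5)]
```

Only the state for the card in the incoming event is touched, which is the "single state lookup" the text describes.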

Exactly-Once Semantics in Practice

Delivering exactly-once results in a distributed system is hard. There are three delivery guarantees, and the two weaker ones are much simpler to provide:

  • At-most-once: Drop events on failure. Simplest, but you lose data.
  • At-least-once: Replay on failure. You get duplicates, which downstream must handle (idempotency — covered in Lesson 6).
  • Exactly-once: No loss, no duplicates. Requires coordinated checkpointing between source, processor, and sink. Flink achieves this via Chandy-Lamport-style distributed snapshots combined with Kafka transactions. Cost: roughly 10–20% throughput overhead for snapshotting, depending on state size and checkpoint interval. Well worth it for financial or inventory systems.

Practical tip: For most analytics pipelines, at-least-once with idempotent sinks is the right default: it is simpler to operate and recover, with near-zero impact on dashboard accuracy. Reserve exactly-once for financial ledgers, inventory updates, and any pipeline where duplicate writes have real-world consequences.

Common pitfall: Confusing Kafka delivery guarantees with end-to-end exactly-once. Kafka's idempotent and transactional producers can guarantee exactly-once writes to the broker, but after a failure your stream processor rewinds to its last checkpoint and re-emits everything it produced since then. If the sink is neither transactional nor idempotent, those replays show up as duplicates. Always validate the entire pipeline, not just the broker.
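
The difference an idempotent sink makes under at-least-once delivery can be shown with a small simulation. Both sink classes and the replay function are hypothetical stand-ins: the idempotent sink behaves like an upsert keyed by a unique event ID, and the replay imitates a crash that happens before the consumer's progress is committed.

```python
class AppendSink:
    """Naive sink: every write appends a row, so replays create duplicates."""

    def __init__(self):
        self.rows = []

    def write(self, event_id, amount):
        self.rows.append((event_id, amount))

    def total(self):
        return sum(amount for _, amount in self.rows)


class IdempotentSink:
    """Upsert keyed by event ID: writing the same event twice has no extra effect."""

    def __init__(self):
        self.rows = {}

    def write(self, event_id, amount):
        self.rows[event_id] = amount

    def total(self):
        return sum(self.rows.values())


def deliver_with_replay(sink, events):
    """At-least-once delivery: the last two events are redelivered after a simulated crash."""
    for event_id, amount in events:
        sink.write(event_id, amount)
    for event_id, amount in events[-2:]:   # progress was not committed, so these replay
        sink.write(event_id, amount)


events = [("e1", 10), ("e2", 20), ("e3", 30)]

append_sink, idempotent_sink = AppendSink(), IdempotentSink()
deliver_with_replay(append_sink, events)
deliver_with_replay(idempotent_sink, events)
```

The append-only sink over-counts the replayed events, while the idempotent sink ends up with the correct total even though delivery itself was not exactly-once.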

Choosing the Right Tool

The decision framework is simple. Ask three questions:

  1. What latency do you need? Sub-100 ms → Flink or Kafka Streams. Sub-second is fine → Spark Structured Streaming.
  2. How much state do you carry? Stateless or tiny state → Kafka Streams runs in your service. Heavy cross-stream joins and large keyed state → Flink on a dedicated cluster.
  3. Do you already use Spark for batch? If yes, Spark Structured Streaming reduces operational complexity by reusing the same cluster and the same engineers.
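
The three questions can be written down as a small decision function. The precedence among the questions is an assumption made here for illustration (the text does not rank them), and the function name and parameters are hypothetical. Treat it as a way to make the trade-offs explicit, not as a rule to apply mechanically.

```python
def choose_framework(latency_ms: int, heavy_state: bool, uses_spark_batch: bool) -> str:
    """Map the three questions to a suggestion (precedence order is an assumption)."""
    # Question 3: an existing Spark estate wins if sub-second latency is acceptable.
    if uses_spark_batch and latency_ms >= 100:
        return "Spark Structured Streaming"
    # Question 2: heavy joins or large keyed state point to a dedicated Flink cluster.
    if heavy_state:
        return "Flink"
    # Question 1: low latency with little state fits a library inside your service.
    if latency_ms < 100:
        return "Kafka Streams"
    return "Spark Structured Streaming"


suggestions = {
    "low latency, tiny state": choose_framework(50, heavy_state=False, uses_spark_batch=False),
    "low latency, heavy state": choose_framework(50, heavy_state=True, uses_spark_batch=False),
    "sub-second ok, Spark shop": choose_framework(500, heavy_state=False, uses_spark_batch=True),
}
```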

Stream processing is not magic — it trades the simplicity of batch for the power of low-latency continuous output. Master the windowing model, understand exactly-once semantics, and know your latency requirements, and you will be able to design pipelines that turn raw event streams into real-time business intelligence.