Traditionally, enterprises processed large amounts of data using batch processing. This was, and in many cases still is, achieved by having ETL jobs move all the data into operational databases or data warehouses and then running analytical (including machine learning) tools over them.
Batch processing in this traditional sense has some disadvantages: high latency (processing cannot start until all the data has been collected), limited flexibility (it is difficult to make a single global schema work for every use-case) and high operational cost (it is resource intensive). Even highly successful batch processing systems such as Hadoop suffer from these limitations.
More and more enterprises, decision makers and end-users are seeking actionable insights in real or near-real time (think sub-second latency between when something happens and when we get to act on it). Given the high volume, variety and velocity of modern data, batch processing systems simply cannot provide such real-time guarantees. This is where streaming architectural patterns come in: systems designed from the ground up to provide the real/near-real-time guarantees that modern applications demand.
In this post we will look at the distinction between batch and stream processing systems from the perspective of how data flows through the system: ingress (how datasets get into the system), processing (what we do with datasets) and egress (how datasets get out of the system).
Terminology
For the purposes of this post, we adopt the terminology outlined in Tyler Akidau’s The world beyond batch: Streaming 101 post. In particular, the terms “batch processing” and “stream processing” refer to the use of a particular type of “execution engine” for data processing, regardless of whether the dataset is bounded or unbounded.
Ingress
Batch Processing
- typically works with bounded data
- traditionally runs less time-critical workloads, so ingestion latency is less important (it is generally good enough if all the data has been collected by the time the batch job starts)
Stream Processing
- typically works with unbounded data
- designed for low-latency applications, so ingestion latency is more critical: processing is delayed if ingestion is slow
It is common to use platforms such as Apache Kafka and AWS Kinesis that are designed from the ground up to provide high-throughput, low-latency ingestion and horizontal scalability (e.g. AWS Kinesis scales horizontally by adding shards, with each shard supporting ingestion of up to 1,000 data records per second, or 1 MB/sec; with Kafka it is possible to achieve millions of writes per second). Furthermore, such platforms optimize for high-throughput ingestion via techniques such as compression, buffering and load balancing.
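As a minimal sketch of such ingestion with the kafka-python client (the broker address, the `events` topic and the particular settings are assumptions for illustration, not prescriptions):

```python
# Minimal Kafka ingestion sketch; assumes a broker on localhost:9092
# and a topic named "events" (both hypothetical).
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    compression_type="gzip",  # compress record batches for higher throughput
    linger_ms=50,             # buffer up to 50 ms so records batch together
    batch_size=64 * 1024,     # larger batches amortize per-request overhead
)

for i in range(1000):
    # Records with the same key land on the same partition, so keys
    # spread load across partitions.
    producer.send("events", key=str(i % 8).encode(), value={"event_id": i})

producer.flush()  # block until all buffered records have been sent
```

The compression, buffering (linger_ms/batch_size) and key-based partitioning shown here correspond to the throughput optimizations mentioned above.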
Processing
Batch Processing
- batch jobs consume datasets from large operational databases or data warehouses (see the batch sketch after this list)
- processing waits until all data has been collected
- processing tends to be slow; for instance, Hadoop reads from and writes to HDFS in every map/reduce cycle
- less flexibility in terms of running iterative or interactive algorithms
- less flexible data model due to global schema
- limited programming model
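To make the contrast with the streaming sketch below concrete, here is a minimal PySpark batch job; the input/output paths and the `user_id` column are hypothetical:

```python
# Minimal PySpark batch sketch: read a bounded dataset, aggregate, write out.
# The paths and the "user_id" column are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BatchAggregation").getOrCreate()

# Nothing can be computed until the whole bounded dataset is available here.
events = spark.read.parquet("warehouse/events/")
counts = events.groupBy("user_id").count()  # runs over all the data at once
counts.write.mode("overwrite").parquet("warehouse/event_counts/")

spark.stop()
```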
Stream Processing
- stream execution engine processes datasets as they arrive
- processing starts as soon as data enters the streaming execution engine
- processing tends to be fast (e.g. Apache Spark keeps as much data in memory as possible, and reduces disk I/O)
- more flexibility in terms of running iterative and interactive algorithms
- more flexible data model
- richer programming model, e.g. Apache Spark provides unified, concise, high-level APIs for batch analytics (RDD API), SQL queries (Dataset API), real-time analysis (DStream API), machine learning (ML Pipeline API) and graph processing (GraphX API); see the streaming sketch after this list
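Here is a minimal sketch of the DStream API, assuming a text source on localhost:9999 (e.g. started with `nc -lk 9999`); it keeps word counts over a 30-second window that slides every 10 seconds:

```python
# Minimal PySpark DStream sketch: windowed word counts over a socket source.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "WindowedWordCount")
ssc = StreamingContext(sc, batchDuration=1)  # micro-batch arriving data every second
ssc.checkpoint("checkpoint")                 # windowed state needs checkpointing

lines = ssc.socketTextStream("localhost", 9999)
pairs = lines.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))

# 30-second window sliding every 10 seconds; the inverse function lets Spark
# subtract counts leaving the window instead of recomputing the whole window.
windowed = pairs.reduceByKeyAndWindow(lambda a, b: a + b,
                                      lambda a, b: a - b, 30, 10)
windowed.pprint()  # results appear continuously, not only at job end

ssc.start()
ssc.awaitTermination()
```

Note how processing starts as soon as data arrives on the socket, in contrast to the batch sketch above.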
Egress
Batch Processing
- results not available until batch job finishes execution
- downstream reporting, analytics and other machine learning tasks are carried out after batch processing is complete
- results could be saved in data warehouses or other long-term storage
Stream Processing
- results available on an on-going basis
- downstream reporting, analytics and other machine learning tasks can be carried out in near-real time
- results could be saved in data warehouses or other long-term storage
Stream processing platforms also make real-time analysis possible, which opens up more possibilities for egress. For instance, one could have a dashboard that consumes datasets directly from the stream processing platform and shows the results of analysis in near-real time, as in the sketch below.
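One way to sketch this pattern with Spark Structured Streaming: results are maintained in an in-memory table (named `live_counts` here purely for illustration) that a dashboard process could poll with SQL. The memory sink is a debugging-oriented sink, so treat this as a sketch of the idea rather than a production design:

```python
# Minimal egress sketch: stream word counts into an in-memory table that a
# dashboard could poll. Assumes a text source on localhost:9999.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("LiveDashboardFeed").getOrCreate()

lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
         .outputMode("complete")    # emit the full updated result table
         .format("memory")          # keep results queryable in memory
         .queryName("live_counts")  # the table name the dashboard would query
         .start())

# A dashboard could periodically run, e.g.:
#   spark.sql("SELECT * FROM live_counts ORDER BY count DESC").show()
query.awaitTermination()
```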
Conclusion
- we compared and contrasted batch and stream processing platforms from the perspective of how the data flows through the system
- as an example of streaming with Apache Spark, check out this repository, which provides a small, simple, fully functional application that computes aggregates over a sliding window of datasets as they arrive over a socket
- if you are new to streaming, you might find these useful:
- How to beat the CAP theorem, by Nathan Marz, creator of Apache Storm
- Questioning the Lambda Architecture, by Jay Kreps, co-founder of Confluent
- The world beyond batch: Streaming 101, by Tyler Akidau, Google
- The world beyond batch: Streaming 102, by Tyler Akidau, Google
- The Dataflow Model paper (Google), which inspired many modern streaming platforms, including commercial ones such as Hazelcast
- ETL Is Dead, Long Live Streams: real-time streams with Apache Kafka