Copyright © 2018 DataScience.US All Rights Reserved.
The Lord of the Things: Spark or Hadoop?
Are people in your data analytics organization contemplating the impending data avalanche from the internet of things and thus asking this question: “Spark or Hadoop?” That’s the wrong question!
The internet of things (IOT) will generate massive quantities of data. In most cases, these will be streaming data from ubiquitous sensors and devices. Often, we will need to make real-time (or near real-time) decisions based on this tsunami of data inputs. How will we efficiently manage all of this, make effective use of it, and become lord over it before it becomes lord over us?
The answer is already rising in our midst.
1. The Fellowship of Things
Organizations have already been collecting large quantities of heterogeneous data from numerous sources (web, social, mobile, call centers, customer contacts, databases, news sources, networks, etc.). This “fellowship” of data collections is already being tapped (or it should be) to derive competitive intelligence and business insights. Our data infrastructures and analytics ecosystems are evolving through the acquisition of data scientists and big data science capabilities, to allow us to explore and exploit this rich fellowship of data sources. We can start now to use our existing data analytic assets (or to build them up) in order to become lord of the things before the IOT overruns Middle Earth (I mean… our middleware environments).
2. The Two Powers
Hadoop and Spark are not opposed to one another. In fact, they are complementary in ways that are essential for dealing with IOT’s big data and fast analytics requirements. Specifically:
Hadoop is a distributed data infrastructure (for clustering the data), while Spark is a data processing package (for cluster computing).
Clustering the data – Apache Hadoop distributes massive data collections across many nodes within a cluster of commodity servers. This is critical for today’s huge datasets, since otherwise we would need to buy and maintain hugely expensive custom hardware. Hadoop indexes and keeps track of where every chunk of data resides, thus enabling big data operations (processing and analytics) far more effectively than earlier centralized data management infrastructures. The Hadoop cluster is easily extended by adding more commodity servers.
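To make the idea of "distribute the chunks and keep an index of where each one lives" concrete, here is a toy sketch in plain Python. It does not use any Hadoop API. The function names (`place_blocks`, `read_back`), the node names, the round-robin placement, and the replication factor of 2 are all assumptions chosen for illustration, and Hadoop's real placement policy is more sophisticated.

```python
from collections import defaultdict

def place_blocks(data, block_size, nodes, replication=2):
    """Split data into fixed-size blocks and copy each block to `replication` nodes.

    Returns (index, storage): index maps block id -> list of node names,
    storage maps node name -> {block id: bytes}.
    """
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    index = {}
    storage = defaultdict(dict)
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for block_id, block in enumerate(blocks):
        # Round-robin placement: a simplification, not Hadoop's actual policy.
        holders = [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        index[block_id] = holders
        for node in holders:
            storage[node][block_id] = block
    return index, storage

def read_back(index, storage):
    """Reassemble the data by asking the index which node holds each block."""
    return b"".join(storage[index[b][0]][b] for b in sorted(index))

index, storage = place_blocks(
    b"sensor-readings-0123456789",
    block_size=8,
    nodes=["node-a", "node-b", "node-c"],
)
print(index)
print(read_back(index, storage))
```

Because every block lives on more than one node and the index records all holders, losing a single node in this toy model would not lose data, and adding a node only means adding a name to the `nodes` list.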
Cluster computing on the data – Apache Spark is a fast data processing package that operates on the distributed data collections that reside on the Hadoop cluster. Spark can hold the data in memory and carry out analytics much more quickly than MapReduce, which is the processing tool that traditionally came with Hadoop. The big difference is this: MapReduce operates in atomic processing steps, while Spark operates on a dataset en masse.
The MapReduce workflow looks like this: read data from the cluster, perform an operation, write results (updated data) to the cluster, read updated data from the cluster, perform the next operation, write the next results to the cluster, and so on. This is fine if your data operations and reporting requirements are mostly static, and you can wait for data processing to execute in batch mode (which is the antithesis of IOT workflows).
Conversely, Spark completes the full set of data analytics operations in memory and in near real-time (interactively, with an arbitrary number of additional analyst-generated queries and operations). The Spark workflow looks like this: read data from the cluster, perform all of the requisite analytic operations, write results to the cluster, done!
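The two workflows can be compared with a small simulation that simply counts trips to storage. This is a plain-Python sketch, not Spark or MapReduce code. `ToyCluster`, `run_step_by_step`, `run_in_memory`, and the three sample operations on sensor readings are hypothetical names and data invented for the illustration.

```python
class ToyCluster:
    """Stands in for distributed storage and counts every read and write."""

    def __init__(self, records):
        self.records = list(records)
        self.reads = 0
        self.writes = 0

    def read(self):
        self.reads += 1
        return list(self.records)

    def write(self, records):
        self.writes += 1
        self.records = list(records)

def run_step_by_step(cluster, operations):
    """MapReduce-style: every operation reads from and writes back to storage."""
    for op in operations:
        data = cluster.read()
        cluster.write(op(data))
    return cluster.records

def run_in_memory(cluster, operations):
    """Spark-style: read once, chain all operations in memory, write once."""
    data = cluster.read()
    for op in operations:
        data = op(data)
    cluster.write(data)
    return cluster.records

operations = [
    lambda rows: [r for r in rows if r > -100],  # drop sentinel "no reading" values
    lambda rows: [r - 2 for r in rows],          # apply a calibration offset
    lambda rows: [max(rows)],                    # keep the peak reading
]
readings = [20, -999, 22, 24]

step_cluster = ToyCluster(readings)
memory_cluster = ToyCluster(readings)
step_result = run_step_by_step(step_cluster, operations)
memory_result = run_in_memory(memory_cluster, operations)

print(step_result, step_cluster.reads, step_cluster.writes)
print(memory_result, memory_cluster.reads, memory_cluster.writes)
```

Both runs produce the same answer. The step-by-step run touches storage once per operation in each direction, while the in-memory run touches it once in total, and that gap widens with every operation added to the chain.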
The Spark workflow is excellent for streaming data analytics and for applications that require multiple operations. This is very important for data science applications since most machine learning algorithms do require multiple operations: train the model, test and validate the model, refine and update the model, test and validate, refine and update, etc. Similarly, Spark is very adaptable in allowing you to do repeated ad hoc “what if” queries of the same data in memory, or to perform the same analytic operations on streaming data as they flow through memory. All of those operations require fast full access to the data. Consequently, if the data had to be re-read from the distributed data cluster at every step, this would be very time-consuming (and a complete non-starter for IOT applications). Spark skips all of those intermediate time-consuming read-process-write-index operations: it performs all of the analytic operations at the minimal cost of one read operation.
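The train, validate, refine loop described above can be sketched as a tiny iterative fit that counts how often the data are fetched. This is plain Python under stated assumptions: `load_from_cluster` is a hypothetical stand-in for an expensive distributed read, the three (x, y) points are made-up data lying on y = 2x, and the one-parameter gradient descent is only the simplest iterative algorithm that shows the effect. In Spark itself, the comparable step is keeping a dataset in memory (for example with `cache()`) before iterating over it.

```python
def load_from_cluster(counter):
    """Hypothetical expensive read; the counter records how often it happens."""
    counter["reads"] += 1
    return [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # made-up points on y = 2x

def fit_slope(get_data, iterations=50, learning_rate=0.05):
    """Fit y = w * x by gradient descent, fetching the data on every pass."""
    w = 0.0
    for _ in range(iterations):
        data = get_data()
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= learning_rate * grad
    return w

# Re-read from storage on every iteration (the step-by-step pattern).
reread_counter = {"reads": 0}
w_reread = fit_slope(lambda: load_from_cluster(reread_counter))

# Read once, keep the data in memory, and iterate over the cached copy.
cached_counter = {"reads": 0}
cached = load_from_cluster(cached_counter)
w_cached = fit_slope(lambda: cached)

print(w_reread, reread_counter["reads"])
print(w_cached, cached_counter["reads"])
```

Both fits converge to the same slope, close to 2. The only difference is the number of reads: one per iteration in the first case, one in total in the second.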
Applications of Spark include real-time marketing campaigns, real-time ad selection and placement, online customer product recommendations, cybersecurity analytics, video surveillance analytics, machine log monitoring, and more (including IOT applications that we are only beginning to anticipate and envision). Some experiments have shown that Spark can be 10 times faster for batch processing and up to 100 times faster for in-memory analytics operations, compared to traditional MapReduce operations.