Developed in response to perceived shortcomings in the performance of Hadoop MapReduce clusters, Apache Spark is an open source cluster computing framework that can run big data analytics up to 10 times faster than MapReduce from disk, or up to 100 times faster from main memory. As apache.org puts it, Spark is “a fast and general engine for large-scale data processing.”
Originally developed at the University of California, Berkeley’s AMPLab in 2009, open sourced under a BSD license in 2010, and donated to the Apache Software Foundation in 2013, Apache Spark is now a top-level Apache project. It has acquired a significant following among both developers and users as it approaches the stable release of version 2.0, expected within the next few weeks.
The killer combo
After more than 10 years in use, Hadoop has become the best-recognized solution for processing the large data sets that define today’s big data environment. The MapReduce processing component is the most common way that data stored in the Hadoop Distributed File System (HDFS) is manipulated. But MapReduce works through sequential, step-by-step stages, materializing each stage’s output before the next begins, while Spark operates on the data set as a whole, which is why it can work through data so much faster, especially when the data fits in memory.
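The contrast between the two processing models can be illustrated with a stdlib-only Python sketch (this is a conceptual illustration, not actual Hadoop or Spark code): a word count done MapReduce-style materializes each stage in full, while a Spark-style follow-up query simply reuses the in-memory result.

```python
from collections import defaultdict

lines = ["spark is fast", "hadoop is stable", "spark is general"]

# MapReduce style: each stage produces its complete output (on a real
# cluster, written to disk) before the next stage can start.
def map_phase(lines):
    return [(word, 1) for line in lines for word in line.split()]

def shuffle_phase(pairs):
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle_phase(map_phase(lines)))

# Spark style: the result stays in memory as a whole (analogous to
# calling cache() on an RDD), so a second analysis reuses it directly
# instead of re-reading and re-processing everything from disk.
cached = counts
top_word = max(cached, key=cached.get)
```

The key point is the last two lines: in MapReduce, a second question about the same data means another full pass through all the stages, while Spark answers it from memory.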
Since Spark doesn’t include its own file management system, it needs to be used with a good distributed file system, of which HDFS is an excellent example. This means that Spark can fit right into an existing Hadoop environment and provide the performance benefit, if the data calls for it. In fact, a standalone deployment of Spark can be done across HDFS, or it can run on top of YARN, the Hadoop v2 cluster management technology, with no pre-installation or administrative access necessary. If YARN is not yet deployed, users can make use of SIMR (Spark in MapReduce).
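As an illustration of how little setup this requires, a Spark application can be handed to an existing YARN cluster with the standard spark-submit tool (the application jar, class name, and HDFS path below are hypothetical placeholders):

```shell
# Submit a (hypothetical) Spark application to an existing Hadoop/YARN
# cluster; --master yarn tells Spark to request its executors from YARN,
# so no separate Spark cluster needs to be installed.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  my-app.jar hdfs:///data/input
```

With --deploy-mode cluster, the driver itself also runs inside the YARN cluster rather than on the submitting machine.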
Spark and Hadoop can both be run in Docker containers, on clusters of virtual machines, or even in clusters of containers running on clusters of virtual machines. Because neither needs to be installed directly on bare metal, very lightweight, on-demand configurations are possible.
If the data doesn’t need to be analyzed in near real time, then moving from MapReduce to Spark is probably not necessary. So Spark and Hadoop can be complementary, but neither is dependent on the other. Spark can run standalone, on Hadoop, on Mesos, or on public cloud services like Amazon EC2 and Microsoft Azure. Data can be accessed from any Hadoop data source, as well as Cassandra, HDFS, HBase, Hive, and Tachyon.
The Big Blue bet
A year ago, IBM made a major commitment to Spark, calling it “potentially the most significant open source project of the next decade.” This commitment involved a $300 million investment, contributions from more than 3000 researchers at IBM labs worldwide, availability of Spark-as-a-Service on its cloud platform, and donation of the IBM SystemML machine learning technology to the open source ecosystem.
A year later, IBM has doubled down on Spark, announcing the Data Science Experience, which it hailed as the first cloud-based development environment for near real-time, high-performance analytics. Running on the IBM Bluemix cloud platform, it pairs open source tools with a collaborative workspace and gives researchers access to 250 curated data sets, all aimed at improving the exchange of information between data scientists and software developers.
Bob Picciano, senior vice president of IBM Analytics, was quoted as saying: “IBM’s Data Science Experience is the killer enterprise app for Apache Spark, and gives data scientists new opportunities to deliver insight-driven models to developers, and opens the door for unprecedented innovation from the open source community.”
Spark is a major part of IBM’s future, and it is being built into the core of the company’s platforms, including Watson, Commerce, Analytics, Systems, and Cloud, as well as more than 30 other offerings including IBM BigInsights for Apache Hadoop, Spark with Power Systems, and IBM Stream Computing.
Here to stay
It’s not just well-known vendors like IBM that are betting on Spark. ClearStory Data discovered Spark at the AMPLab back in 2011 and quickly understood the value of Spark-driven analysis to business. Its latest announcement is a new Spark-based technology called Infinite Data Overlap Detection (IDOD), which combines the company’s Intelligent Data Harmonization technology (a measure of how well any two data sets can be combined) with data inference. In theory, this should allow non-technical users to blend any type of data from multiple sources and analyze it.
IDOD uses the underlying technology to automate the preparation and blending of the desired data, delivering results in minutes rather than the days or weeks that manual modeling normally takes. This lets business technology users apply analytics to see where disparate data sets overlap and blend, actions that would otherwise require specialized expertise that is both expensive and hard to come by.
With the forthcoming release of Spark 2.0, it looks like the framework will become even more popular. According to the Databricks blog, Spark 2.0 can be described as easier, faster, and smarter: easier due to expanded SQL support and streamlined APIs, faster due to significant optimizations and the latest generation of the Tungsten engine, and smarter with the introduction of the Structured Streaming APIs. Structured Streaming, one of the release’s most high-profile features, will allow real-time querying against live data, among other things.
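The core idea behind Structured Streaming — treating a live stream as an ever-growing table and keeping a query’s result continuously up to date — can be sketched in plain Python (this illustrates the concept only; it is not the Spark API):

```python
# Illustrative only: model a stream as micro-batches arriving over time
# and keep a running aggregate current, the way Structured Streaming
# maintains the result of a query over an unbounded input table.
running_counts = {}

def process_batch(batch):
    """Incrementally fold a new micro-batch into the standing result."""
    for word in batch:
        running_counts[word] = running_counts.get(word, 0) + 1

# Each list simulates a micro-batch of live data arriving on the stream.
for batch in [["spark", "streaming"], ["spark", "sql"], ["spark"]]:
    process_batch(batch)

# The "query result" is always current, so it can be read in real time
# at any point while data is still arriving.
```

Because the standing result is updated as each batch arrives, a query against it reflects the live data without waiting for the stream to end.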
The added capabilities went a long way towards convincing Wikibon Research analyst George Gilbert that by 2020, close to 40 percent of big data analytics spending will be on Spark, meaning that it will remain a significant presence in the data center. Given that it can be deployed on bare metal, in virtualized environments, and in cloud and hybrid cloud configurations, Spark and Spark-enabled technologies should be available in most service catalogs.