The system that has made it feasible to process big data at scales beyond what any single storage volume can hold is MapReduce. It’s a method created by Google engineers, based on research dating back to the Middle Ages of big data (2004), for subdividing a massive data processing job into smaller components, and then working through all of those components in parallel. Because this subdivision spanned several disparate locations and file systems in tandem, each one run by a machine with its own operating system, Google created an entirely new distributed file system to support it. Then Yahoo, building on Google’s published research, created an open source framework for executing such jobs on a file system of the same design: Hadoop.
The big data industry, which has matured tremendously fast, was born around this central theme of partnership: since we all need to solve the same problem, why not solve it together and get it done? MapReduce is one of Hadoop’s fundamental components; nearly every big data analytics system developed for Hadoop until just last year has used MapReduce at its core.
But “open source” does not necessarily imply open communication. So when Google’s own developers, at the company’s I/O developers conference last Wednesday, told attendees that Google had already ceased to use MapReduce in-house and had long since replaced it with an entirely different system, those attendees were genuinely shocked.
And members of the tech press, busy inventing new superlatives for wearable watches (as though watches had never been wearable before), were evidently confused as to what the attendees found so shocking. (This despite the fact that Google had trumpeted the impending replacement of MapReduce at the very same conference a year earlier.)
“Dataflow is a system for building big and fast data analysis pipelines,” explained Google software developer Reuven Lax in a session to attendees Thursday. “The way it works is, you write a sequence of logical data transformations that are easy to write and very intuitive, that specify the analysis you want to do. You submit this to the Dataflow service, and it runs for you. It’s fully managed, so it runs for you on as many machines as are necessary in order to run your pipeline. All the configuration, all the parallelism, all the tuning is taken care of for you.”
The pipeline to which Lax refers is the sequence of acquisitions, aggregations, calculations, and transformations that describes how the entire contents of a data store are changed into the results we’re actually looking for. With declarative database logic such as SQL, you typically use a SELECT instruction to have an interpreter extract the subset of records from a table that match a given set of criteria (for example, everyone from Cardiff who owes more than £1,000).
A pipeline in Hadoop doesn’t really have the same purpose; rather, it details the logical steps that would be required to sift a mass of unstructured data (not a regular table) in such a way that extracting such a list of everyone from Cardiff becomes possible. That mass could be a stored copy of Twitter or Facebook interactions, or the texts of conversations between customers and your company’s support agents.
While a SQL SELECT statement solves a problem, a Hadoop pipeline engineers a method with which a problem of that nature may be solved. What Lax demonstrated Thursday was a way he and his Google colleagues created to represent the logical steps of such a method as a language-agnostic set of instructions, exposed through an API library. This way, one developer could implement a Dataflow pipeline in Ruby, and another in Node.js. Lax wrote his examples in Java, as explicit classes.
Dataflow pipelines, Lax demonstrated, enable the insertion of user-defined code: effectively, compiled forms of methods written in so discrete and abstract a fashion that many instances of those methods can be executed in parallel. The command for this insertion is ParDo. It’s similar in concept to the creation of a mapper function in MapReduce, though, as Lax explained, it’s much more general in scope.
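To give a sense of what that looks like on the page, here is a minimal sketch in Java using the open source Apache Beam SDK, the publicly available descendant of the Cloud Dataflow SDK Lax was describing. The bucket paths, the Cardiff filter, and the class names are illustrative assumptions, not Lax’s actual demonstration code.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class CardiffPipeline {
  // The user-defined code: a DoFn that ParDo can run, in parallel, over every record.
  static class KeepCardiff extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      String record = c.element();
      if (record.contains("Cardiff")) {
        c.output(record);
      }
    }
  }

  public static void main(String[] args) {
    // The pipeline is just a sequence of logical transformations; there is no
    // cluster configuration or parallelism tuning anywhere in it.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadRecords", TextIO.read().from("gs://example-bucket/records-*.txt"))
     .apply("KeepCardiff", ParDo.of(new KeepCardiff()))
     .apply("WriteResults", TextIO.write().to("gs://example-bucket/output/cardiff"));

    // Submitting the pipeline; the runner decides how many machines to use
    // and how the execution is tuned.
    p.run();
  }
}
```

The DoFn expresses only the per-record logic; how many copies of it run, and where they run, is left to the service.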
These functions are “sculpted,” if you will, locally; once they’ve been debugged and tweaked, they’re submitted to Google’s Cloud Dataflow service. Here is where Google appears to be adopting a strategy similar to its (successful, thus far) approach for winning over customers to its Android mobile platform: gathering everyone together in one big boat in international waters, and then suddenly steering that boat in a new direction into Google’s exclusive territory.
While multiple sources walked away from Google I/O’s Wednesday keynote with the impression that Cloud Dataflow would replace MapReduce in Hadoop (a rather tall order given that Hadoop is an Apache project), Reuven Lax clearly explained on Thursday that, for now, Cloud Dataflow is a service running on Google’s cloud, replacing MapReduce only in that context.
Big data experts in the field recognized Cloud Dataflow’s profile immediately as fitting that of a 2010 Google project called FlumeJava (PDF available here). The Google developers who launched that project described it as “a pure Java library that provides a few simple abstractions for programming data-parallel computations. These abstractions are higher-level than those provided by MapReduce and provide better support for pipelines.
“FlumeJava’s internal use of a form of deferred evaluation enables the pipeline to be optimized prior to execution,” the team continued, “achieving performance close to that of hand-optimized MapReduces. FlumeJava’s run-time executor can select among alternative implementation strategies, allowing the same program to execute completely locally when run on small test inputs and using many parallel machines when run on large inputs.”
Put another way: FlumeJava was created to enable a more abstract way of describing the assembly of a big data pipeline, one whose user did not have to tune that assembly specifically for the platform on which it was running. Certainly FlumeJava was designed to substitute for MapReduce. But in the context in which Google’s developers introduced it in 2010, it was not intended to alter the course of Hadoop.
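The deferred evaluation the FlumeJava team describes survives in the open source Apache Beam SDK, which descends from this line of work. Below is a rough illustration under that assumption (the runner choice and the toy data are inventions for the example): applying transforms merely records them in a pipeline graph, and nothing executes until that graph is handed to a runner, which can be a local one for small test inputs or a distributed one for large inputs.

```java
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;

public class DeferredExample {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.create();
    // Run the very same graph locally on small test inputs...
    options.setRunner(DirectRunner.class);
    // ...or swap in a distributed runner (such as DataflowRunner) for large
    // inputs, without rewriting the pipeline itself.

    Pipeline p = Pipeline.create(options);

    // Deferred evaluation: these apply() calls only record transforms in a
    // graph, so the whole pipeline can be optimized before anything runs.
    p.apply(Create.of("cardiff", "swansea", "cardiff"))
     .apply(Count.perElement());

    // Only here is the optimized graph actually executed by the chosen runner.
    p.run().waitUntilFinish();
  }
}
```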
More to the point, Google appears to be engineering the big data operations in its cloud to compete against Hadoop, characterizing Hadoop along the way as something Yahoo started and that others have since improved upon.
As product manager Marwa Mabrouk explained to attendees Thursday, Cloud Dataflow aims to relieve developers of the burden of considering the infrastructure that supports data pipeline jobs before they design those jobs. “You don’t have to go hunting through logs that are distributed all over the place, or understand what is happening in the system by purchasing very expensive logging systems,” said Mabrouk, in a passing reference to Apache’s Flume log collection service. “You can simply rely on the monitoring that we provide... and simply just focus on the logic of your application.”
Some big data developers this week compared Cloud Dataflow to Spark, a data processing engine that the Apache Hadoop community is testing as a successor to Hadoop’s own MapReduce implementation. While the two components are not interchangeable, Cloud Dataflow clearly has some of the same goals as Spark: improving development time, abstracting pipeline creation, and expediting execution.
And as such, Cloud Dataflow could put Google in a better position to compete against Apache Hadoop implementations that incorporate Spark, on what appears to be Spark’s platform of choice for now: Amazon EC2.