They sound like comic book characters: Flume, Oozie, Sqoop, HBase, MapReduce, Pig, Hive...  In fact, they are the components of the software framework at the heart of the so-called "big data" revolution, Hadoop. And Hadoop itself is literally named after a toy elephant.

Hadoop-enabled databases can span multiple volumes, extending past previous size limits and distributing processing across clusters of commodity hardware. In the process, they remove one of the principal reasons data warehouses were designed the way they were.

When database experts were first asked to distinguish “big data” from relational databases, they began by saying big data was not only big but unstructured, raw, unrefined.  It took relational refinement, tabular reorganization, to make data answerable to queries and to enable it to be explored.  That excuse has also ceased to exist with the introduction of a component with a surprisingly normal-sounding name: Drill, a SQL query engine for Hadoop’s “unstructured” data.

Drilling down
In mid-September, MapR, one of the commercial distributors of Hadoop, began including Apache Drill with its latest distribution.  In so doing, as MapR’s chief marketing officer Jack Norris tells DatacenterDynamics, Hadoop’s open source engineers have struck down one more reason why data warehouses must retain their current complexity.

“With these new data models and data types, the requirement to define the schema ahead of time is often onerous,” says Norris.  “Some of the machine-generated data sources are high-volume, and they tend to change quickly.  You can update a Web-based application, and that could change the contents of the file and the schema.  So what we’re looking at is a sort of self-service data exploration capability that makes it much faster for business analysts and developers to get access to the data, to see what’s there, and gain insight.”
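
What that self-service exploration looks like in practice: Drill can query a raw, nested JSON log whose schema was never declared to any catalog, discovering the structure at read time.  A minimal sketch, with a hypothetical file path and field names:

    -- Query a raw JSON log in place; no schema is defined ahead of time.
    -- dfs is Drill's file-system storage plugin; path and fields are hypothetical.
    SELECT t.device.os AS operating_system,
           COUNT(*)    AS event_count
    FROM dfs.`/data/logs/app_events.json` t
    GROUP BY t.device.os;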

Here’s the problem:  Analytics applications deal with huge volumes of data at one time.  The insights these applications provide are statistical observations and trends.  They’re not transactional, in that they don’t make additions or updates to individual records.

In the past, for any application whatsoever to gain access to data through the DBMS, it had to be processed and refined.  Data warehouse architects call this process “landing”; it’s the act of bringing data from the outside world into the hangar, if you will, and lining it all up.  Ironically, it’s in this lined-up, organized stream that analytical processing is most constrained.  As Hadoop’s architects discovered, it’s actually easier for statistical formulas to break big quantities of data into snapshots and smaller chunks (thus the inspiration for “MapReduce,” for which MapR was named), as though the data populated a geographical map being photographed by an overhead camera.  This way, analytics can generate estimates such as percentages and trends.

But leaving the data raw had its own drawbacks, most notably that it was next to impossible to address that data using a query language like SQL.  Google tackled that problem with an internal project called Dremel, named for the hand-held, high-speed drill.  It then offered a form of Dremel, called BigQuery, to outside developers as a potential SQL-based substitute for the MapReduce component.

Relocating the landing zone
In a white paper introducing BigQuery to database developers (PDF), Google explained, “Dremel is designed to finish most queries within seconds or tens of seconds and can even be used by non-programmers, whereas MapReduce takes much longer (at least minutes, and sometimes even hours or days) to finish processing a dataset query.”  The paper further explained that Dremel makes huge data sets programmable using a language that’s already most familiar to database developers.

Apache Drill is the open source implementation of Dremel, and MapR’s distribution of Drill uses ANSI standard SQL.  While an existing Hadoop tool called Hive uses a merely SQL-like language to query subsets of data, Drill’s syntax adheres to the ANSI standard.

As a result, says MapR’s Norris, “You can do a query directly on a CSV file, and directly on HBase through Apache Drill.”  (CSV, or “comma-separated values,” is about as raw a format for data processing as there can possibly be: strings of characters, often set off by quotation marks, separated from one another by commas.)
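
A minimal sketch of such a direct query, assuming a hypothetical file path: under Drill’s default text reader, each CSV row surfaces as an array named columns, addressed by position.

    -- Query a raw CSV file in place, no landing step required.
    -- The path and column positions are hypothetical.
    SELECT columns[0] AS customer_id,
           columns[2] AS region
    FROM dfs.`/data/raw/customers.csv`
    LIMIT 10;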

This opens up a new avenue for data architects: analyzing raw data to determine which files or segments may actually be worth “landing” in a formal data warehouse.  Not only might this make databases faster to implement, it could render the processed data, the part that businesses actually need to use on a continuous basis, smaller.
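
That triage might look something like the following sketch, again with an invented path and column position: a quick aggregate over the raw file shows whether a source is rich enough to justify warehouse space.

    -- Profile a raw source before committing warehouse capacity to it.
    -- The path and column position are hypothetical.
    SELECT columns[1] AS event_type,
           COUNT(*)   AS occurrences
    FROM dfs.`/data/raw/clickstream.csv`
    GROUP BY columns[1]
    ORDER BY occurrences DESC;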

“We’re seeing a couple of different use types emerge,” Norris continues.  “One is at the beginning, to figure out what kinds of data sources I should leverage, what kinds of applications make sense.  Which makes a lot of sense, because before you invest IT time, you want to get a feel for what the information is.  That could precede a more complex operation or an application that’s being developed.  It can also happen at the back end; it could be the result of, say, an advanced process that’s dividing customer prospects into 12 distinct clusters.  Apache Drill could be used to query and find out specific elements within those clusters, or to do reporting to represent the differences in their purchases, or constituents within those clusters.”
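
A hedged sketch of that back-end scenario, with an invented file and field names: a query summarizing each of the clusters directly from the scoring process’s raw output.

    -- Report on clustering output without first landing it in a warehouse.
    -- File path and field names are hypothetical.
    SELECT t.cluster_id,
           COUNT(*)                              AS prospects,
           AVG(CAST(t.purchase_total AS DOUBLE)) AS avg_purchase
    FROM dfs.`/data/scored/prospect_clusters.json` t
    GROUP BY t.cluster_id
    ORDER BY t.cluster_id;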

Drill is an aptly named component (perhaps a first in the Hadoop world), and one whose software may reduce the physical space requirements of data centers.  Data warehouses may no longer need to be warehouse-size.