Cloudera’s purchase of Gazzang last week puts the principal means of end-to-end Hadoop encryption under the stewardship of one company, and Microsoft said it believes this could work to the advantage of its own services.

Encryption in a big data process such as Hadoop is a delicate and complicated set of maneuvers.

Encrypting the process that sent you this Web page would be a simple affair, since it involves only one session between your Web browser and DatacenterDynamics’ Web server.

But data in a Hadoop cluster may be scattered across the planet, so numerous sessions can be involved in what appears on paper to be a single operation.

This is why the business of encrypting big data is such a big deal, arguably more so than the segment of the security software industry that encrypts individual hard drives or storage networks.

Microsoft said the acquisition could draw more eyes to its own offerings, as the company once known as the standard bearer for everything not open source steers its Azure cloud service on a new course away from just Windows.

One of the services Azure provides is called HDInsight, essentially Microsoft’s distribution of Hadoop by way of Windows Server.

Azure enables customers, who no longer need to be Windows users, to deploy Apache Hadoop clusters.

Last week at the Hadoop Summit in San Jose, Microsoft announced it had upgraded HDInsight to support Apache’s version 2.4 release of Hadoop, made official only weeks earlier.

In a company blog post, Oliver Chiu, Microsoft’s product marketing official for Hadoop, promised that HDInsight customers who store their Hadoop blobs in Azure would see performance improvements of as much as two orders of magnitude over previous versions.

Yet one of Hadoop 2.4’s other key enhancements is a feature called encrypted shuffle.

It’s not the best explained feature in the system (new open source features often take years before someone gets around to explaining them well enough) but basically, it’s the use of bi-directional SSL to encrypt all the sessions that take place during the so-called “shuffle phase” of MapReduce, the stage in which the intermediate output of map tasks is sorted and sent across the network to the reduce tasks that will process it.

Like picking up a dropped deck of cards, shuffling gives this data some modicum of organization, even though it doesn’t really put it all in one place.
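To make the shuffle a little more concrete, here is a minimal Python sketch of what it does to the data. It is an illustration under stated assumptions, not Hadoop’s own code: the function names `partition_for` and `shuffle` are hypothetical, and a CRC32 checksum stands in for the hash that Hadoop’s default partitioner applies to each key. The point to notice is that every map task may send something to every reducer, and each of those transfers is a separate network session that encrypted shuffle would have to protect.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_reducers: int) -> int:
    # Deterministic stand-in for Hadoop's hash partitioner: the same key
    # always maps to the same reducer, whichever map task emitted it.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Route (key, value) pairs from every map task to their reducer."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for task_output in map_outputs:          # one list per map task
        for key, value in task_output:
            target = partition_for(key, num_reducers)
            reducers[target][key].append(value)
    # Each reducer receives its own keys in sorted order.
    return [dict(sorted(r.items())) for r in reducers]


# Two map tasks, each of which counted words in its own slice of the input.
map_outputs = [
    [("hadoop", 1), ("azure", 1)],
    [("hadoop", 1), ("ssl", 1)],
]
result = shuffle(map_outputs, num_reducers=2)
for reducer_id, groups in enumerate(result):
    print(reducer_id, groups)
```

Both occurrences of `hadoop` end up at the same reducer even though they came from different map tasks, which is the “modicum of organization” the card analogy describes.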

Not all Hadoop communications are encrypted end-to-end, though there are pertinent arguments as to why they should be.
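For readers wondering what switching on encrypted shuffle looks like in practice, the sketch below generates the kind of configuration fragments involved. The property names are those listed in Apache Hadoop’s Encrypted Shuffle documentation for `mapred-site.xml` and `core-site.xml`; the helper `hadoop_config` and the values chosen are assumptions for illustration, and the right settings for a given cluster (client certificates in particular) should be checked against the documentation for the Hadoop version in use.

```python
import xml.etree.ElementTree as ET


def hadoop_config(properties):
    """Render a dict of properties as a Hadoop-style <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# mapred-site.xml: turns on SSL for the shuffle transfers themselves.
mapred_site = hadoop_config({
    "mapreduce.shuffle.ssl.enabled": "true",
})

# core-site.xml: cluster-wide SSL behaviour (illustrative values only).
core_site = hadoop_config({
    # Asking for client certificates is what makes the SSL bi-directional.
    "hadoop.ssl.require.client.cert": "true",
    "hadoop.ssl.hostname.verifier": "DEFAULT",
    # Files that hold the keystore and truststore settings.
    "hadoop.ssl.server.conf": "ssl-server.xml",
    "hadoop.ssl.client.conf": "ssl-client.xml",
})

print(mapred_site)
print(core_site)
```

Note that these settings cover only the shuffle traffic; they do nothing for data sitting on disk.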

DatacenterDynamics asked a Microsoft spokesperson whether Cloudera bringing Gazzang under its wing would impact how Azure utilizes encryption to secure sessions.

It’s an important question, especially since Microsoft’s key Hadoop partner is Hortonworks, Cloudera’s biggest competitor in this space.

“Microsoft Azure does not directly use third-party security technologies,” the spokesperson responded, “but implements widely accepted and industry-standard mechanisms for encryption.

“Various third-party security technology solutions work on Microsoft Azure, including solutions from Trend Micro and Barracuda Networks.”

The spokesperson went on to say that the encryption algorithms Azure uses are FIPS certified by the US National Institute of Standards and Technology. That’s necessary for database systems to meet HIPAA regulatory standards.

But the concepts of database encryption as regulatory bodies came to understand them are worlds different from the problems of encryption in Hadoop.

Without end-to-end encryption, ‘data at rest’ in Hadoop (such as it is) is essentially in the clear.

Intel sought to resolve this problem last year through its open source Project Rhino, but that project was suspended last month after Intel decided to fund Cloudera instead. Just days later, Cloudera purchased Gazzang.

One other company that produces an encrypted database that runs on Hadoop is RainStor.

In a July 2013 company blog post, RainStor architect Mark Cusack wrote: “Because of the lack of encryption of data at rest in core Hadoop, HDFS should be considered completely untrusted as a storage platform for sensitive information. I’d liken it to storing sensitive data in a public cloud. Such data should be encrypted and the encryption keys must be kept out of HDFS.”

So while Microsoft reminds us that customers can load new data blobs into encrypted HDInsight databases by literally mailing or shipping it the hard disks containing those blobs (BitLocker being the preferred method of encrypting those disks in transit), the broader problem of how an Azure HDInsight customer implements secure key management may, for now, remain unresolved.