Intel – the new gorilla of Hadoop distributions

Chipmaker asks: if you're going to build a Hadoop cluster on our chips, why not buy our version of Hadoop?

23 May 2013 by Yevgeniy Sverdlik - DatacenterDynamics

Image: Intel Xeon E5 die

Intel sees the future and in the future it sees lots of servers strung together into clusters, parallel-processing massive amounts of unstructured data, taking instructions from the open-source Apache Hadoop framework. It wants to see those clustered commodity machines of the future carry its processors.

 

Since Hadoop has become the de facto platform for the rapidly growing space of big data analytics, what better way to take the bull by the horns than to make your own distribution of the open-source framework? This is what the chipmaker did in February, announcing the release of the Intel Distribution for Apache Hadoop and integrating it ever so tightly with its Xeon processors.

 

The basic message is something like this: You can run it on any hardware, but it works best with Intel-based hardware. You can run any other Hadoop distribution, but if you run Intel's, you get the enterprise features that Intel knows so much about. Additionally, you get backing for your Hadoop system from a provider with Intel's reputation.

 

Vin Sharma, open-source software strategist at Intel, says the need for an enterprise-focused version of Hadoop is acute, since the framework evolved primarily as a tool for web-scale analytics by large service providers and not for traditional enterprise users.

 

“We've noticed that enterprises have a specific set of requirements for reliability, scalability, security and manageability,” Sharma says. Intel saw this as an opportunity and decided to invest in Hadoop.

 

Enterprises, by their nature, are also very risk averse and prefer an established vendor over startups. “Enterprises feel like they need a single trusted vendor they can go to for support and services,” Sharma says. Intel is also able to provide those services on a global scale.

 

Enterprise features

Technology-wise, Intel's version of Hadoop makes three overall promises to enterprises: security, performance and simplified management.

 

The company has put a lot of energy into enhancing the security features of Hadoop. The biggest effort on this front is silicon-based encryption. A big limitation of Hadoop has been the inability to perform analytics on encrypted data, Sharma says. At enterprises that are required to keep data encrypted at all times, that data never makes it to a Hadoop cluster.

 

Intel has modified Hadoop to make analytics on encrypted data possible and has contributed this functionality back to the open-source project.

 

The second problem with encryption is that it requires a tremendous amount of computational horsepower. This is where modifications specific to Intel chips come in. The Intel Distribution supports the company's AES New Instructions (AES-NI) in Xeon processors, which makes analytics on encrypted data run much faster, Sharma says.
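To make the hardware dependency concrete, here is a minimal Java sketch of ordinary JCE AES encryption – the kind of work the Intel Distribution pushes down to AES-NI. On a Xeon with AES-NI, a HotSpot JVM started with -XX:+UseAES -XX:+UseAESIntrinsics maps these same library calls onto the hardware instructions. This illustrates the mechanism only; it is not Intel's actual code path, and the class name and sample data are made up.

    import javax.crypto.Cipher;
    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;
    import javax.crypto.spec.IvParameterSpec;
    import java.security.SecureRandom;

    // Minimal sketch: plain JCE AES encryption of a data block.
    // On AES-NI hardware the JVM can accelerate these calls transparently,
    // which is the kind of speed-up described above.
    public class AesNiSketch {
        public static void main(String[] args) throws Exception {
            KeyGenerator kg = KeyGenerator.getInstance("AES");
            kg.init(128);
            SecretKey key = kg.generateKey();

            byte[] iv = new byte[16];
            new SecureRandom().nextBytes(iv);

            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));

            byte[] block = "a record headed for HDFS".getBytes("UTF-8");
            byte[] encrypted = cipher.doFinal(block);
            System.out.println("Encrypted " + block.length + " bytes into "
                    + encrypted.length + " bytes");
        }
    }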

 

Intel has also tweaked Hadoop to take advantage of components of the Xeon platform optimized for high-performance I/O and storage, namely solid-state drives and 10 Gigabit Ethernet. “When you do take advantage of the hardware, the performance of the software environment is so much better,” Sharma says.
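As a rough illustration of what taking advantage of the hardware looks like at the software level, the sketch below sets a few stock Hadoop 2.x I/O parameters programmatically – larger blocks and buffers so that fast SSDs and a 10GbE fabric stay busy. The property names are standard Hadoop ones (they vary by version), but the values are illustrative assumptions, not Intel's published tuning.

    import org.apache.hadoop.conf.Configuration;

    // Sketch of I/O-oriented tuning on fast storage and networking:
    // bigger HDFS blocks and stream/sort buffers. Values are illustrative.
    public class IoTuningSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.setLong("dfs.blocksize", 256L * 1024 * 1024);   // 256 MB HDFS blocks
            conf.setInt("io.file.buffer.size", 128 * 1024);      // 128 KB stream buffers
            conf.setInt("mapreduce.task.io.sort.mb", 512);       // larger map-side sort buffer
            System.out.println("dfs.blocksize = " + conf.get("dfs.blocksize"));
        }
    }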

 

Finally, the Intel Distribution streamlines and automates hardware management. These capabilities fall under the umbrella name Intel Manager for Hadoop. Part of the manager is the Active Tuner software, which automatically configures the hardware in the Hadoop cluster for the workload at hand – something a performance expert would otherwise do by hand in a typical Hadoop deployment. Intel's algorithm uses about a dozen configuration parameters to adjust the cluster based on the type of workload it has to process.
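Intel has not published Active Tuner's internals, but the general idea – derive a handful of configuration parameters from the workload type and apply them before the job runs – can be sketched as follows. The workload categories, parameter choices and values here are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import java.util.HashMap;
    import java.util.Map;

    // Purely illustrative sketch of workload-based auto-tuning:
    // map a workload type to a small parameter profile and apply it.
    // This is not Active Tuner's actual algorithm.
    public class AutoTunerSketch {

        static Map<String, String> profileFor(String workloadType) {
            Map<String, String> p = new HashMap<>();
            if ("io-heavy".equals(workloadType)) {
                p.put("mapreduce.task.io.sort.mb", "512");
                p.put("mapreduce.job.reduces", "64");
            } else { // assume a CPU-bound workload
                p.put("mapreduce.task.io.sort.mb", "128");
                p.put("mapreduce.job.reduces", "16");
            }
            return p;
        }

        public static void main(String[] args) {
            Configuration conf = new Configuration();
            for (Map.Entry<String, String> e : profileFor("io-heavy").entrySet()) {
                conf.set(e.getKey(), e.getValue());
            }
            System.out.println("reduces = " + conf.get("mapreduce.job.reduces"));
        }
    }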

 

While Intel optimized the distribution for all processors in the Xeon line, the benefits are clearest when it is deployed on Xeon E5 servers. There are use cases, however, where a combination of E7 and E5 machines works best – when the NameNode in the Hadoop cluster needs high resiliency. The NameNode is the server that keeps track of which file lives where on the cluster. The E7 has advanced high-availability features that some users may consider necessary for a NameNode server, while the rest of the cluster consists of E5 worker nodes.
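The NameNode's special role is easy to see from the HDFS client API: before any data is read, the client asks the NameNode which DataNodes hold each block of a file. The short sketch below uses the standard Hadoop FileSystem API; the file path is a placeholder, and it assumes HDFS is the configured default filesystem.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.util.Arrays;

    // Sketch: every block location printed here is answered from the
    // NameNode's in-memory metadata about the cluster.
    public class BlockLocationSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/example.log");  // hypothetical file
            FileStatus status = fs.getFileStatus(file);

            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset " + block.getOffset() + " -> "
                        + Arrays.toString(block.getHosts()));
            }
        }
    }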

 

The wild Hadoop west

Intel's go-to-market strategy for the distribution includes both direct software sales and sales through other vendors as part of their solutions. SAP, for example, is working to incorporate the Intel software into its HANA analytics platform, Sharma says. This approach, he says, is a more effective way to reach enterprise customers. “At the end of the day, enterprises that want a complete enterprise solution … will gravitate towards that,” he says.

 

The Intel Distribution will be contending with incumbents that have already built business models around Hadoop: Cloudera, Hortonworks and MapR. There is also new competition on this front from an established player of Intel's own caliber. Coincidentally or not, storage giant EMC announced its own Hadoop distribution one day before Intel dropped its announcement.

 

EMC's distro is called Pivotal HD and features native integration with the vendor's Greenplum parallel-processing database.

 

Intel's competitors in the chip market have also made substantial plays in the Hadoop market. AMD announced in March that its SeaMicro SM15000 server has been certified for Cloudera's distribution of the analytics framework. Continuing the product roadmap SeaMicro had before AMD bought it, however, the microservers come with either AMD or Intel Xeon processors.

 

There are also vendors out there selling Hadoop-enabled ARM clusters, so competition is wide and varied. Still, who knows Intel's hardware better than Intel? And since Intel hardware is already inside the majority of the world's data centers, the possibility that the future Intel wants will be the future Intel gets is very real.

 

A version of this article appeared in the 29th edition of the DatacenterDynamics FOCUS magazine, out now.
