Why IBM loves Apache Spark

In June, IBM announced a number of initiatives aimed at bringing the power of Apache Spark - an open source distributed computing framework - to the masses.

These include a donation of code, an educational drive and the opening of a dedicated Spark Technology Center in the heart of Silicon Valley. But why is IBM throwing so many resources at an open source project?

“This is the first time so many capabilities have come together in a single platform,” Anjul Bhambhri, vice president of Big Data and Analytics business at IBM, told DatacenterDynamics.

“It allows you to deal with structured, semi-structured and unstructured data, build models, score those models, run machine learning algorithms, and really be able to build very rich analytic applications. And the speed that we are seeing is just phenomenal.”

We met Bhambhri at the Apache: Big Data conference in Budapest, where she told the audience that Spark was already being used to cure diseases, map the universe and get rid of traffic jams, among other things.

Operating system for Big Data

Spark is an open source cluster computing engine that relies on processing data in-memory for speed. It was born at the AMPLab of the University of California, Berkeley in 2009 as a PhD thesis by computer scientist Matei Zaharia. Zaharia also co-created the Apache Mesos cluster manager (commercialized by Mesosphere), and played an important part in the early development of Apache Hadoop.

Spark is a relatively recent addition to the Apache Software Foundation (ASF) roster. The code base was donated to the ASF in 2013, and in just two years, Spark has emerged as the most active top-level project, with more than 1,400 patches committed to code between July and September.

Inside IBM, Spark is used in the Watson Health Cloud, BigInsights and InfoSphere DataStage products. At the end of the month, IBM is set to introduce a predictive analytics capability that leverages Spark.

Bhambhri described the framework as a single “toolbox” for all your analytics needs: “It’s improving developer productivity. Data engineers, data scientists, application developers are able to collaborate on one platform. Otherwise you would need six or seven different products to do these things.”

The machine learning component within Spark is particularly interesting, and this is the area where IBM is making a serious contribution, by open-sourcing its SystemML technology.

“Running machine learning algorithms on Big Data is possible now. This means the machine learning algorithms are going to get smarter and smarter, because they are not learning from tiny bits of data, but they are really learning from all the data that’s available,” Bhambhri told DCD.

“You get excited when you see the kinds of problems people are solving.”

Teaching a million developers

And then there’s IBM’s educational campaign, which promises to teach a million developers how to work with Spark. Bhambhri is quick to dispel any fears that this is just a stealth marketing tactic aimed at selling related commercial products.

“We are teaching them [developers] about what’s available in open source. How to use those capabilities. How to tune the Spark stack, how to tune their applications… And then, if there are other capabilities that they need, so that their applications are really solving problems that they could not solve before, then they can certainly leverage what IBM has to offer. But it’s not just for that reason. A lot of our courses are based only on open source.”

In addition, IBM is hiring close to 100 developers to work exclusively on open source projects – quite a commitment for a company that earns its bread selling proprietary code. These brave men and women will harden Spark, fix bugs and clean up documentation – the latter considered something of a chore by most community members.

Bhambhri hopes that educating more people on the latest tools will help solve some of the serious challenges facing humanity as a whole. Her favorite use cases for Spark are genomics and personalized medicine. “This is just the start,” she says, with a glint in her eye.

More information about IBM and Spark, as well as educational materials, is available at spark.tc.
