While the term ‘big data’ may feel old to us now, the concept remains relatively new. Hadoop helped popularize and simplify this analytical use of data in 2010, and since then the technology and techniques involved have developed at breakneck speed.
The Hadoop project was massively powerful and scalable, offering the ability to safely store and manipulate large amounts of data on commodity hardware. As such, a large community formed around Hadoop to develop it further.
On-premise hardware has since dwindled in popularity, and commodity hardware has similarly fallen by the wayside. Cloud compute and storage are instead purchasable on demand, and analytics is a service to be bought by the hour.
So, what’s happened to Hadoop? Why have so many companies abandoned their on-premise Hadoop installation in favor of the cloud? And does Hadoop have a place in the cloud?
The pre-cloud beginning
Hadoop’s origins can be traced to the Apache Nutch project – an open-source web crawler developed in the early 2000s under the Apache Software Foundation.
The project’s web crawler, built to index the web, was struggling to parallelize: Nutch worked well on one machine, but handling millions of web pages – “web scale” – seemed an overwhelming task.
This was set to change with the release of Google’s 2004 paper, titled “MapReduce: Simplified Data Processing on Large Clusters.” Detailing how the company indexed the rapidly growing volume of content on the web by spreading the workload across large clusters of commodity servers, the paper provided the perfect solution to Nutch’s problems.
By July 2005, Nutch’s core team had integrated MapReduce into Nutch. Shortly after, the new distributed filesystem and MapReduce implementation were spun out into their own project, called Hadoop – famously named after a toy elephant belonging to the project lead’s son.
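The programming model that paper described is easiest to see in code. Below is a lightly commented sketch of the classic Hadoop word-count job: the map step emits a count of one for every word it sees, and the reduce step sums those counts per word, with the framework handling the distribution across the cluster. The class names and the command-line input/output paths are placeholders for illustration.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for each line of input, emit (word, 1) for every word it contains.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: the framework groups emitted values by word; sum them for the total count.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate on each node to cut shuffle traffic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The simplicity is deceptive: the same few dozen lines run unchanged whether the input is a single file on one machine or petabytes spread across thousands of commodity servers.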
When Yahoo! used Hadoop to replace its search backend in 2006, the project accelerated rapidly. Following adoption by Facebook, Twitter and LinkedIn, Hadoop became the de facto way to work with web-scale data.
Hadoop’s technology was revolutionary at the time. Storing large amounts of structured data had previously been difficult and expensive, but Hadoop reduced the burden of data storage. Organizations that had formerly discarded all but their most valuable data now found it cost-effective to store large – or “big” – amounts of data.
Not the solution, but a framework
Many businesses have set up Hadoop clusters in the hope of gaining business insights or new capabilities from their data. However, upon trying to execute a business intelligence or analytics idea, many have been left disappointed.
More often than not, businesses installed a Hadoop cluster before defining what their use case for it would be. Having misunderstood Hadoop’s capabilities, they were disappointed to find it too slow for interactive queries.
Rather than being a big data solution, Hadoop is more of a framework. Its broad ecosystem of complementary open-source projects rendered it too complicated for many businesses, requiring a level of configuration and programming knowledge that only a dedicated team could supply.
But even with a dedicated internal team, Hadoop sometimes needed something extra.
For instance, King Digital Entertainment, developer of the Candy Crush series, couldn’t fully leverage Hadoop, finding it too slow for the interactive BI queries its internal data science team demanded. The company needed an accelerator on top of its multi-petabyte Hadoop cluster before data scientists could query the data interactively.
Cloud-driven evolution
The changing world of data warehousing has meant that Hadoop has had to evolve. When Hadoop was created in early 2006, the public cloud didn’t yet exist – AWS wouldn’t launch until a few months later. The IT landscape in which Hadoop had its formative years and experienced its peak popularity has changed immeasurably.
Consequently, the way Hadoop is used has also changed. Most public cloud infrastructure providers now offer and actively maintain a managed Hadoop platform – examples include AWS Elastic MapReduce, Azure’s HDInsight and Google Cloud Platform’s Dataproc. These cloud-based Hadoop platforms are most commonly used today for batch processing, machine learning and ETL jobs.
Moving to the cloud means that Hadoop is ready to be used immediately and on-demand, with the complicated set-up already taken care of. It’s clear that Hadoop has benefited from its move to the cloud, but it’s also no longer the only option for cheap, secure and robust data storage. With increased competition, Hadoop is no longer at the centre of the data universe, instead catering for particular workloads.
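As a rough illustration of that on-demand ease, the sketch below launches a short-lived Hadoop cluster and runs a single jar step using the AWS SDK for Java (the v1.x Elastic MapReduce client). The bucket paths, instance types and release label are assumptions chosen for the example, and the default EMR IAM roles are assumed to already exist in the account.

```java
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.*;

public class LaunchHadoopCluster {
  public static void main(String[] args) {
    // Client picks up credentials and region from the default provider chain.
    AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

    // A single step: run a pre-built word-count jar against data in S3 (placeholder paths).
    StepConfig wordCount = new StepConfig()
        .withName("word-count")
        .withActionOnFailure("TERMINATE_CLUSTER")
        .withHadoopJarStep(new HadoopJarStepConfig()
            .withJar("s3://my-bucket/jars/wordcount.jar")
            .withArgs("s3://my-bucket/input", "s3://my-bucket/output"));

    RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("on-demand-hadoop")
        .withReleaseLabel("emr-6.15.0")                  // assumed EMR release
        .withApplications(new Application().withName("Hadoop"))
        .withSteps(wordCount)
        .withLogUri("s3://my-bucket/logs/")              // placeholder log bucket
        .withServiceRole("EMR_DefaultRole")              // assumed default roles
        .withJobFlowRole("EMR_EC2_DefaultRole")
        .withInstances(new JobFlowInstancesConfig()
            .withInstanceCount(3)
            .withMasterInstanceType("m5.xlarge")
            .withSlaveInstanceType("m5.xlarge")
            .withKeepJobFlowAliveWhenNoSteps(false));    // shut down when the step finishes

    RunJobFlowResult result = emr.runJobFlow(request);
    System.out.println("Started cluster: " + result.getJobFlowId());
  }
}
```

Because the cluster terminates itself when the step completes, you pay only for the minutes the job actually ran – a sharp contrast with a permanently provisioned on-premise cluster.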
The future for Hadoop
We’ve seen that demand for on-premise solutions remains high, and it is unlikely to diminish any time soon. Hadoop is still a great on-premise option: there’s no need to change what’s already working well, and for certain organizations it continues to deliver.
However, the majority of businesses are looking to run their data warehouse on public cloud services, and the managed offerings described above are growing to meet that demand. For those looking to run jobs at scale, Hadoop remains a great option, and in the cloud it enjoys a level of ease it has never had before.