In recent years, Apache Spark – an open-source processing engine for huge data sets that distributes processing tasks across multiple dedicated hardware units – has skyrocketed in popularity amongst developers and data scientists, fast becoming the industry standard for big data analytic queries.
Compared to other batch processing frameworks, Spark supports multiple programming languages and can process much larger quantities of data at faster speeds, allowing it to scale alongside a company’s growth. In fact, virtually all major cloud providers, including AWS, Microsoft Azure, and GCP, have chosen to offer variants of Spark as a SaaS offering within their own cloud platforms, and managed services such as Snowflake, Athena, and Redshift have also added connectors to Spark.
Though Spark’s performance capabilities are remarkable, the amount of data that enterprises around the world analyze keeps growing, so Spark’s query processing is struggling to keep pace. Unfortunately, the improvements yielded by various software solutions and hardware tuning have had limited benefit and won’t be scalable in the long term as data continues its robust growth trajectory.
As crucial as data has become to business intelligence, scientific discovery, and innovation, the rate at which it is being generated is outpacing processing speeds. Resolving this issue remains a huge challenge for software and hardware engineers, despite the considerable resources and manpower tech giants have devoted, and continue to devote, to it.
It wasn’t only tech giants that got involved. The data analytics community also developed a number of open-source frameworks – most notably Hadoop in 2006, Presto in 2013, and Spark in 2014. This allowed private and public enterprises worldwide to help push forward the performance and scale of data analytics. Yet, fast forward to 2022, and the problem persists.
Last year, Intel and Kyligence introduced Project Gluten, the successor to Intel’s native SQL engine Gazelle. Gluten is an open-source plugin that moderately accelerates Spark SQL queries by offloading workloads to other software execution engines. Later that year, Databricks followed suit, releasing its own next-generation C++ engine, Photon, which speeds up processing twofold compared to its prior query engine. This past March, Meta threw its hat in the ring, releasing Velox, a similar C++ engine designed to upgrade its analytic processing, which is also being integrated into Spark as part of Project Gluten.
But optimizing software comes with a tradeoff. The more a software framework is optimized for better performance on niche workloads, the less efficiently it handles other workloads, which limits the generality of each framework.
Some innovators have tried approaching the issue from a hardware perspective. Intel, for example, offers guides on data compression to speed up Spark on its Xeon processors; likewise, Nvidia implemented GPU acceleration for Spark.
Beyond the processing units themselves, tech companies have attempted to improve the performance of analytical workloads via other areas of computing, including memory, storage, and networks. Such tactics have involved swapping spinning hard drives for solid-state drives, increasing DIMM (dual in-line memory module) population to maximize memory bandwidth, and establishing faster networks.
As methodical and laborious as these efforts have been, the results don’t seem to be keeping up with the growth of data. Although Databricks boasted a 3X performance boost from 2016 to 2021 – roughly a 25 percent increase YoY – we are once again reaching the limits of query processing performance, as the effect of software optimizations alone plateaus.
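For readers who want to check the year-over-year figure, here is a minimal sketch of the arithmetic. It assumes the 3X gain compounds evenly over the five yearly intervals between 2016 and 2021; the function name `annual_growth_rate` is purely illustrative and not taken from any Databricks material.

```python
def annual_growth_rate(total_factor: float, years: int) -> float:
    """Return the compound annual growth rate implied by an overall factor.

    Assumption: growth compounds evenly, so total_factor == (1 + rate) ** years.
    """
    if total_factor <= 0 or years <= 0:
        raise ValueError("total_factor and years must be positive")
    return total_factor ** (1.0 / years) - 1.0


# A 3X boost spread over 2016 -> 2021 (five yearly intervals, by assumption).
rate = annual_growth_rate(3.0, 5)
print(f"Implied year-over-year gain: {rate:.1%}")
```

Under these assumptions the implied rate comes out just under 25 percent, which is consistent with the rounded figure quoted above.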
Take the custom approach
Spark queries are essential for development and innovation across industries, including pharmaceuticals, telecommunications, and many others. If we do not find a way to process and analyze big data at faster rates, these vital industries will miss out on key innovations.
Rather than spending disproportionate resources on incremental data processing improvements, it may be time to focus on improving the computational performance of data analytics by investing in custom-made hardware.
Whatever the solution, it’s time to rethink how to achieve processing capacities that keep pace with, or, if we dare, even outpace, data’s immense growth rate.