Big data projects are now moving from the experimental stage to providing real returns on investment, with tools such as Hadoop and Cassandra forming an integral part of organizations' enterprise-wide analytics platforms. However, big data is a big investment, both in terms of money and time.
The faster companies can glean insights from their data to support their business decisions, the more valuable those insights are. With this in mind, how well a company's big data tools perform and how quickly they can deliver information are critical.
Companies need to take a best-practice approach to big data performance to ensure they eliminate the risks and costs associated with poor performance, availability, and scalability.
As its name suggests, big data is defined in part by the sheer size of the datasets being created in today's enterprise, as well as the velocity at which that data is created. For organizations looking to better understand their customers, reduce operational costs, or gain a competitive edge through better-informed decision making, this influx of data can provide the answers; the challenge is in how to access and interpret it. Until recently this was impossible, but the new wave of big data solutions is changing that, allowing petabytes of information to be analysed in hours instead of months.
This is genuinely transformative technology that can change the way enterprise organizations operate. However, big data technologies also bring with them new performance risks and challenges. These must be addressed and managed; otherwise the benefits of being able to churn through more data, more quickly, won't materialize, and it won't be long before end-users start to complain.
Predicting performance disasters with Hadoop and Cassandra
These new and unique performance challenges revolve around NoSQL databases such as Cassandra, HBase and MongoDB, and large-scale processing environments such as Hadoop. Most large organizations create gigabytes of data on a minute-by-minute basis and as such are looking to Hadoop MapReduce to help automate analysis for complex queries on these extremely large datasets. This scale of processing and analysis was previously impossible using traditional analytics tools, which are unable to cope with the volume of data.
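To make this concrete, the sketch below shows how an analysis job is typically expressed in Hadoop's Java MapReduce API, using the canonical word-count example: mappers emit a (word, 1) pair for every token, and reducers sum the counts per word. The class names and the input/output paths passed on the command line are illustrative rather than drawn from any particular analytics pipeline.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: for each input line, emit (word, 1) for every token.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer: sum all counts emitted for the same word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        // Input and output paths are supplied as command-line arguments.
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Even a job this simple is scheduled across many nodes, which is where the operational complexity described below comes from.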
However, Hadoop doesn't run in isolation: MapReduce jobs always run within a Hadoop cluster. For these jobs to run efficiently and smoothly, they depend on the availability of an underlying infrastructure of servers and virtual machines. The data structure layer on top of the raw data, along with the data and information flowing through the system, also affects both the distribution and the performance of MapReduce jobs.
Any performance problems within the Hadoop environment will slow down the analysis tool, which could put Service Level Agreements (SLAs) at risk should results be delayed. They can also put increased pressure on an organization's hardware, driving up capital and operational costs as the tool runs inefficiently and demands additional computing power.
Essentially, Hadoop and the MapReduce jobs it runs have a number of moving parts. Their distributed nature adds a layer of complexity that in many cases leaves IT administrators blind to what's happening inside the application. Given the sheer volume of data that must be analysed to detect problems, trying to locate and remediate performance bottlenecks and hot spots manually is impossible; automating application performance management is the only practical way to combat performance issues in these environments.
Meeting your SLAs
An inability to effectively manage the performance of big data applications will in many cases mean a failure to meet SLAs, which will inevitably lead to financial loss. Simply throwing more hardware at the problem might provide a short-term fix, but the ongoing associated costs make this a very inefficient solution. As with any enterprise application, it's important to be able to see a high-level overview of the systems and how they are operating so that bottlenecks and issues can be flagged early on. Yet it is equally important to have the tools available to drill down into the application code so that these problems can be resolved efficiently and accurately, ensuring that administrators do not have to disrupt other applications by churning through irrelevant log files.
A similar approach needs to be taken with NoSQL databases. Cassandra, for example, scales horizontally and allows very low-latency requests, so it is typically used for applications where real-time insights are the name of the game. However, for all the speed Cassandra provides, that speed depends on the applications built on top of it, meaning Cassandra is only as fast as its surrounding parts.
This means that to benefit from the speed of Cassandra, companies need to take into account the performance of all parts across the application delivery chain. Having end-to-end visibility across the entire service delivery chain and all transaction processes is crucial to spotting and addressing problems in the NoSQL environment.
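As a simple illustration of what client-side visibility looks like, the sketch below times a single query using the DataStax Java driver; the contact point, datacenter, keyspace and table names are hypothetical placeholders. The latency measured this way covers the whole round trip, including application code, driver, network and coordinator node, rather than Cassandra's internal read path alone.

    import java.net.InetSocketAddress;

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.ResultSet;

    public class LatencyProbe {
      public static void main(String[] args) {
        // Contact point, datacenter, keyspace and table are illustrative placeholders.
        try (CqlSession session = CqlSession.builder()
            .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
            .withLocalDatacenter("datacenter1")
            .build()) {

          long start = System.nanoTime();
          ResultSet rs = session.execute(
              "SELECT * FROM demo_keyspace.events WHERE event_id = 42");
          long elapsedMicros = (System.nanoTime() - start) / 1_000;

          // The measured time covers the full round trip (application code, driver,
          // network, coordinator node), not just Cassandra's internal read path.
          System.out.printf("Rows: %d, client-observed latency: %d us%n",
              rs.all().size(), elapsedMicros);
        }
      }
    }

A timing like this captures a single transaction from the application's point of view; an application performance management tool applies the same principle continuously, across every tier in the delivery chain.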
Maximize ROI from your big data
In short, for big data solutions to perform and deliver on the promises made by vendors, a new approach to application performance management is needed: one that goes beyond log-file analysis and point tools.
Companies must not get caught in the trap of thinking their traditional approaches to managing application performance will work. Instead, they should seek out new approaches that can cope specifically with the architecture of dynamic, elastic big data environments.
With this new approach, enterprises can make highly optimized big data implementations much easier to achieve, leaving them free to maximize their ROI through the interpretation and delivery of valuable big data insights.
Michael Kopp is a technology strategist at Compuware