Data lakes are expected to reach an annual market volume of $20.1 billion by 2025, according to Research and Markets - despite the term being just over a decade old. The concept of the data lake is attributed to Pentaho founder James Dixon. Today, a data lake is usually defined as a system or repository that holds data in its natural (i.e. raw) format - typically data in multiple formats (including unstructured file and object content) drawn from many sources, which needs to be analysed for a business purpose.
The term data lake increasingly appears in connection with the field of big data and the ability to gain knowledge from large volumes of data with the help of analytics tools.
Semi-structured storage
Since data lakes aggregate data from various sources - such as business data from ERP systems, customer data from eCommerce databases, time series data, event streams and files from document repositories, to name just a few - they can quickly reach capacities in the petabyte range and beyond. That volume puts data lakes beyond the reach of traditional database technologies, such as the relational database management system (RDBMS), which were originally designed to handle structured data. This is one of the reasons why new storage solutions such as the Hadoop distributed file system (HDFS) have emerged as a more flexible, scalable way to manage structured and unstructured data, as well as the "semi-structured" data that falls between the two.
HDFS is widely used as a data lake storage solution, especially in combination with tools from the Hadoop ecosystem such as MapReduce, Spark, Hive or HBase. While Hadoop and HDFS are widely adopted, a number of more recently developed analytics tools (including Splunk, Vertica and Elastic) are now available on the market for analyzing the large volumes of data held in data lakes.
The aim of data analytics is to find patterns that provide relevant and beneficial insight for the organization. Take an example from eCommerce, where big data analytics can identify variances in the sales success of certain products at different times of the year (see the sketch below). For this type of application, HDFS has its strengths and weaknesses, like any other technology. A major limitation is that compute and storage resources are tightly coupled as the system scales, because the file system is hosted on the same machines as the application. So as computing capacity grows, storage grows with it - and vice versa. This can be expensive: a computationally intensive application is forced to add storage it does not need, while a capacity-driven expansion brings along compute that sits idle.
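To make the eCommerce example concrete, here is a minimal PySpark sketch of such a seasonal-sales analysis running against data held on HDFS. The HDFS path and the column names (product_id, sale_date, amount) are hypothetical placeholders, not taken from any particular deployment.

```python
# A minimal sketch of the seasonal-sales analysis described above.
# Assumes a hypothetical HDFS path and column names.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("seasonal-sales").getOrCreate()

# Read raw order records straight from the data lake on HDFS.
orders = spark.read.parquet("hdfs://namenode:8020/datalake/orders")

# Aggregate revenue per product and calendar month, then measure how much
# each product's monthly revenue varies across the year.
seasonal = (
    orders
    .withColumn("month", F.month("sale_date"))
    .groupBy("product_id", "month")
    .agg(F.sum("amount").alias("monthly_revenue"))
    .groupBy("product_id")
    .agg(F.stddev("monthly_revenue").alias("seasonal_variation"))
    .orderBy(F.desc("seasonal_variation"))
)

# Products with the strongest seasonal swings appear first.
seasonal.show(10)
```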
Some commercial vendors have optimized the original open-source implementation of HDFS, but ultimately new data storage solutions have emerged that fundamentally improve scalability and flexibility.
Multiple sources
In order to fully analyze and act on the wealth of information in these massive data stores, organizations depend on both the analytics tools and the storage repository in which the data is held. The latter is arguably the most important component. The repository must ingest data from multiple sources at the required performance and be able to grow in both capacity and throughput so that data remains widely available to applications, tools and users. As mentioned earlier, databases and file systems (including HDFS) have played a role in data warehousing and data lake implementations. Object stores for on-premises deployments and cloud object storage services are also used as data lake repositories.
Object storage offers fundamental advantages for data lakes. First of all, the handling of data in an object store is highly flexible. In particular, there is no need to define a "schema" for the data to be stored, as there is in an RDBMS, where the structure of each table and the relationships between tables for complex queries must be defined in advance. Object storage systems can store files of any type without this predefinition and with practically no limit on the volume of data.
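The following sketch illustrates this schema-free ingestion using boto3 against an S3-compatible object store. The endpoint, credentials and bucket name are hypothetical placeholders.

```python
# A minimal sketch of schema-free ingestion into an S3-compatible
# object store. Endpoint, credentials and bucket are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.com",  # on-prem or cloud S3-compatible endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Objects of completely different formats land in the same bucket without
# any table definition or schema declaration up front.
s3.upload_file("orders_2024.csv", "datalake", "raw/erp/orders_2024.csv")
s3.upload_file("clickstream.json", "datalake", "raw/web/clickstream.json")
s3.upload_file("contract_scan.pdf", "datalake", "raw/documents/contract_scan.pdf")
```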
When it comes to access, more and more analytics applications make use of the Amazon S3 API (Splunk SmartStore and Vertica's Eon mode, for example). Hadoop ecosystem tools such as Apache Spark can also reach object storage through a Hadoop Compatible File System (HCFS) - in practice via the S3A connector, which speaks the S3 protocol directly. Over time, the number of tools that object storage-based data lake repositories can serve will continue to grow.
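As a rough illustration, the same Spark job from the earlier sketch could read from an object store instead of HDFS via the S3A connector. This assumes the hadoop-aws package is on the classpath; the endpoint, credentials and bucket are again hypothetical placeholders.

```python
# A sketch of reading the data lake through the S3A (HCFS) connector
# from Spark. Endpoint, credentials and bucket are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-access")
    .config("spark.hadoop.fs.s3a.endpoint", "https://objectstore.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# The object store is addressed like a file system via the s3a:// scheme.
clicks = spark.read.json("s3a://datalake/raw/web/clickstream.json")
clicks.printSchema()
```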
In addition, many modern object storage systems support independent scale-out of capacity and performance, eliminating the rigid, coupled model described above for HDFS. Many analytics tool vendors have embraced this model in their offerings to gain the same efficiency advantage - Splunk SmartStore and Vertica's Eon mode, for example, both support S3 object storage. Take large MapReduce workloads: users can scale the compute tier that hosts the MapReduce application and, independently of it, grow the object storage tier in capacity and throughput. For data lakes, this ability to scale compute and storage separately is a critical benefit in large analytics projects.
This eliminates the need to scale both in lockstep, which promises clear cost advantages: the right compute performance for data analysis is available on demand, and the overall cost of a data lake solution can be significantly reduced, as the rough comparison below illustrates.
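A back-of-the-envelope sketch of that cost argument, with entirely hypothetical unit prices and growth figures, might look like this.

```python
# Hypothetical comparison of coupled vs. decoupled capacity expansion.
compute_node_cost = 8000   # assumed cost per compute/storage node
storage_tb_cost = 30       # assumed cost per TB of object storage

current_compute_nodes = 20
capacity_growth_tb = 500   # storage need doubles; compute need does not

# Coupled model (HDFS-style): adding capacity means adding whole nodes,
# compute included, even if the extra CPUs sit idle.
coupled_cost = current_compute_nodes * compute_node_cost

# Decoupled model: only the object storage tier grows.
decoupled_cost = capacity_growth_tb * storage_tb_cost

print(f"Coupled expansion:   ${coupled_cost:,}")
print(f"Decoupled expansion: ${decoupled_cost:,}")
```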