When it comes to data resiliency, the classic SAN storage solutions built on RAID (Redundant Array of Independent Disks) had a lot going for them, and in some cases still do. High familiarity among the professionals who deploy them is one significant advantage. In addition, RAID’s support for striping, mirroring and parity has long made it the traditional go-to option for data resiliency.

However, the demands of the new data economy are placing increasing strain on this approach. Many data centers and enterprises are finding it too costly, recovery times are taking too long and, combined with RAID’s vulnerability during the rebuild process, the time has come to consider alternatives.

In a departure from hardware-based models, erasure coding (EC) is an option that is rapidly gaining ground. EC is algorithm-based and therefore not tied to any particular hardware: it requires no specialized hardware controller and provides better resiliency. Even better, it confers protection during the recovery process as well. Depending on the degree of resiliency, complete recovery can be achieved when only half of the data elements (any of them) are available, and in this regard it has a big advantage over RAID. Furthermore, compared with mirroring, EC also consumes less storage.


How EC works…

EC breaks data into fragments and encodes them with redundant pieces of information, then distributes those encoded fragments across many different locations. Even if a fragment becomes unreadable on one node, the original data can still be reconstructed from the fragments stored elsewhere.
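
To make that cycle concrete, here is a minimal Python sketch (the fragment count and the single XOR parity are illustrative choices only; real deployments rely on Reed-Solomon-style codes that tolerate multiple simultaneous losses):

    from functools import reduce

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def encode(data: bytes, k: int) -> list:
        """Split data into k equal fragments and append one XOR parity fragment."""
        size = -(-len(data) // k)                   # ceiling division
        padded = data.ljust(k * size, b"\0")        # pad so fragments are equal-sized
        fragments = [padded[i * size:(i + 1) * size] for i in range(k)]
        fragments.append(reduce(xor, fragments))    # parity of all data fragments
        return fragments                            # distribute these across nodes

    def reconstruct(fragments: list) -> list:
        """Rebuild at most one missing fragment (marked None) from the survivors."""
        missing = [i for i, f in enumerate(fragments) if f is None]
        if len(missing) > 1:
            raise ValueError("a single parity fragment can repair only one loss")
        if missing:
            survivors = [f for f in fragments if f is not None]
            fragments[missing[0]] = reduce(xor, survivors)
        return fragments

    # Lose any one fragment and recover it from the pieces stored elsewhere.
    original = b"erasure coding distributes redundancy, not copies"
    shards = encode(original, k=4)
    shards[2] = None                                # simulate an unreadable node
    repaired = reconstruct(shards)
    assert b"".join(repaired[:4]).rstrip(b"\0") == original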

As with most solutions, there are trade-offs that enterprise, data center and storage professionals will need to consider carefully. Firstly, EC is CPU-intensive and can introduce latency. It is worth noting, however, that latency problems are not a given: how much latency a deployment sees depends on where it strikes the balance between storage efficiency and fault tolerance.

The other major trade-off with traditional EC is the need to balance those two metrics, storage efficiency and fault tolerance, which pull in opposite directions. Storage efficiency indicates how much additional storage is required to achieve a given level of resiliency, while fault tolerance indicates how many element failures the system can recover from.
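
As a rough illustration of that relationship (k and m here are just the usual shorthand for data and parity fragments, not parameters from any specific product), a scheme that stores k data fragments plus m parity fragments survives any m losses at a storage cost of (k + m) / k:

    # Storage overhead vs. fault tolerance for a few illustrative (k, m) layouts:
    # k data fragments plus m parity fragments survive any m fragment losses.
    layouts = [(4, 2), (6, 3), (10, 4), (1, 2)]     # (1, 2) behaves like triple mirroring

    for k, m in layouts:
        overhead = (k + m) / k                      # bytes stored per byte of user data
        print(f"k={k:2d} m={m}: tolerates {m} failures at {overhead:.2f}x storage")

On these example numbers, a 10+4 layout costs 1.4x storage and survives four failures, while triple mirroring costs 3x to survive two, which is the gap behind the earlier point that EC consumes less storage than mirroring.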

With EC, the more distributed the data – and generally that means the more geographically dispersed – the longer it takes to recall it from the different data center locations and systems. In that scenario, latency is a given.

When things go wrong: node failure

While node failures and otherwise degraded reads have dogged data centers for some time, the new breed of hyperscale data centers exacerbates the data resiliency challenge. Not all erasure code algorithms are created equal: the best solutions are designed for low repair bandwidth and low repair degree, meaning they rebuild a lost fragment by reading less data from fewer surviving nodes.

That is because modern EC has evolved to address these new data demands. It now includes local regeneration codes, codes with availability, codes with sequential recovery, coupled layer MSR codes, selectable recovery codes and other highly customized constructions.
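
To see why repair degree matters, here is a toy Python sketch in the spirit of codes with locality (the group sizes and single local parities are illustrative, not a specific published construction): data fragments are grouped, each group gets its own parity, and a single failure is repaired from its group alone rather than from every surviving fragment.

    from functools import reduce

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    # Six data fragments split into two local groups of three, one parity each.
    data = [bytes([i]) * 8 for i in range(6)]
    groups = [data[0:3], data[3:6]]
    local_parity = [reduce(xor, g) for g in groups]     # real LRCs add global parities too

    # Repairing one lost fragment touches only its local group: three reads
    # instead of the six or more a flat code over all fragments would need.
    lost = 4                                            # fragment index, lives in group 1
    group = 1
    survivors = [f for i, f in enumerate(groups[group]) if i != lost - 3]
    rebuilt = reduce(xor, survivors + [local_parity[group]])
    assert rebuilt == data[lost]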

Optimizing EC

As previously described, although erasure coding has serious advantages, the reality is that it is a compute-intensive undertaking. This is precisely why academia and industry have research projects well underway examining ways to optimize and off-load various aspects of EC. Several promising off-loading approaches are emerging:

  1. Hardware innovation: It’s not all down to the algorithm! As hardware evolves, compute resources such as GPUs and FPGAs will handle the coding work more efficiently.
  2. Parallelization of EC algorithms: Many modern resiliency codes are vector codes, meaning their encode and decode work can be split into independent operations that run at the same time. These vector formulations make it possible to exploit GPU cores and high-speed on-core memory (such as texture memory) to achieve parallelism (see the sketch after this list).
  3. Fabric acceleration: Next-generation host channel adapters (HCAs) offer on-board calculation engines and make full use of features such as RDMA and verbs. Encode and transfer operations are handled in the HCA itself, and with RDMA this promises further acceleration for storage clusters.
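
As a loose illustration of the parallelization idea in point 2 (this uses NumPy data parallelism on the CPU as a stand-in for GPU kernels, and a single XOR parity rather than a full vector code), the key point is that encode and repair become bulk operations over whole fragments, exactly the pattern that maps well onto many GPU cores:

    import numpy as np

    # Vectorized stand-in for a parallel EC encode: one bulk XOR across all
    # fragments instead of a Python-level loop over every byte.
    k, fragment_size = 8, 1 << 20                   # 8 data fragments of 1 MiB each
    rng = np.random.default_rng(0)
    fragments = rng.integers(0, 256, size=(k, fragment_size), dtype=np.uint8)

    # Parity computed as a single data-parallel reduction over the fragment axis.
    parity = np.bitwise_xor.reduce(fragments, axis=0)

    # Repairing a lost fragment is the same bulk operation over the survivors.
    lost = 3
    survivors = np.delete(fragments, lost, axis=0)
    rebuilt = np.bitwise_xor.reduce(np.vstack([survivors, parity[None, :]]), axis=0)
    assert np.array_equal(rebuilt, fragments[lost])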

What’s ahead for EC?

The future is bright for erasure coding! The rate of innovation in data resiliency, compression and de-duplication is astounding. Commercial opportunities are already opening up across a wide range of new use cases, thanks to the extremely low latencies of NVMe technologies, tighter integration of storage with application characteristics, and newer virtualization options.

Data center and storage professionals should familiarize themselves with erasure coding. It provides better resiliency and better data protection during recovery, and it requires less storage than traditional RAID solutions.

Dinesh Kumar Bhaskaran is director of technology and innovation at Aricent