“One of the things we realize is that Hadoop is sort of, kind of eating the world,” admitted Pivotal vice president Nick Cayou, at one point during his company’s joint rollout with EMC of its updated Isilon storage platform.  “We needed to adapt our strategy to accommodate these new adoption models.”

It’s a fair statement, and a thoughtful strategy. Hadoop blew the storage market wide open by enabling applications, for the first time, to access very-large-scale databases too big for any single volume to contain. Assembling the logic to segment such databases into slices within a server’s native operating system is simply too difficult. Hadoop effectively introduced a new operating environment, dedicated to accessing unstructured, uncleansed data even before it is filtered into formal data warehouses.

It also enabled businesses to combine ordinary storage devices, and the arrays containing them, with public cloud storage capacity. EMC, whose business is storage hardware, had to reinvent its value proposition: incorporating cloud-style self-service provisioning into its top-of-the-line storage platforms, while also enabling the public cloud standby capacity that organizations were now demanding.

“These files in these [Hadoop] data sets are eclipsing the size of traditional enterprise data warehouse data sets by a factor of at least 10,” said Cayou.

The Yellow Elephant in the room
For the new version of its Isilon platform, EMC incorporated the Hadoop Distributed File System (HDFS). But to avoid being completely absorbed by the big yellow elephant in the room, EMC and Pivotal paired HDFS with Isilon’s OneFS operating system, along with reconfigured data query functions from Pivotal’s Greenplum appliance. This way, the two companies can portray Isilon as more efficient than a bare-metal Hadoop cluster.

In an effort to paint Hadoop as inherently inefficient, Cayou repeatedly pointed out how its HDFS file system stores three copies of each data block for redundancy. “In terms of an architecture, how efficient is this? I’m staging my data in a NAS device, and I’m spending money on that, if I’m an enterprise — which may be good for a storage vendor.  But obviously... there are more efficient ways to do that. Then I’ve got, at a minimum, three copies of data spread throughout my commodity computing cluster. Beyond that, I’ve got this issue of trying to make the data accessible to my end users.”
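The arithmetic behind Cayou’s complaint is simple: HDFS’s default replication factor (its `dfs.replication` setting) is 3, so every block lands on three different DataNodes. A minimal sketch of the raw-capacity cost, in Python:

```python
def raw_capacity_needed(logical_tb: float, replication_factor: int = 3) -> float:
    """Raw disk capacity required to hold `logical_tb` of data in HDFS.

    HDFS replicates each block `dfs.replication` times (3 by default),
    so a data set consumes a multiple of its logical size in raw disk.
    """
    return logical_tb * replication_factor

# A 100 TB data set consumes 300 TB of raw disk at the default factor.
print(raw_capacity_needed(100))      # 300.0
print(raw_capacity_needed(100, 2))   # 200.0 with reduced redundancy
```

That 3x multiplier, on top of any staging copies held in NAS, is the inefficiency Cayou is pointing at.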

The limited, but still measurable, extent to which EMC and its divisions are willing to embrace cloud dynamics is symbolized by its up-and-coming storage metaphor, the data lake, to which DatacenterDynamics introduced you a few days earlier. Isilon is EMC’s existing “scale-out” storage platform, now incorporating services from Pivotal, including the integration of cloud-based capacity.

By comparison, VMAX3 is the latest version of EMC’s high-performance storage platform. But to improve VMAX3’s capacity as well as its speed, EMC is rapidly incorporating more dynamic SSD-based storage tiers as higher-speed caches.  And it is moving steadily toward a time when VMAX3 can enter into the data lake as well, as Pivotal CEO Paul Maritz clearly implied in his speech to attendees of EMC’s rollout event Tuesday.

EMC technical evangelist Vince Weston extended the pooling metaphor to the multi-core CPUs used in VMAX3. “In the heart of the director is now this dynamic collection of CPUs that are divided into pools,” Weston explained.  “In the past, when we set up CPUs, CPUs were tied to specific ports, and the software managed them as very specific tasks.  We’re now doing pools of CPUs, and we now have tasks spread across the CPU pools so that we can have a lot more flexibility in how we can provide the services.”
The VMAX3 pool
There are three tiers of CPUs in a VMAX3 pool (I know it’s a mixed metaphor, but it wouldn’t be the first), as Weston pointed out. In the center tier is a pair of CPUs running EMC’s new Hypermax OS, which handles all the data services on VMAX3. Below this tier is a pool of processors that drive disk services, and above it is a group of processors that manage the host connections. The split of processors between the upper and lower tiers can be adjusted for a particular VMAX3 storage server, depending upon the role it plays.

“For example, if you have a write-heavy workload that drives a lot of I/O to disk,” he said, “we can move the CPUs between the pools and allow you to have more bandwidth, more activity going on to the drives. Similarly, if it’s a read-hit environment, CPUs can move up to the front end. We have multiple options on how we rearrange this.”
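The pool-shifting Weston describes can be pictured as a toy model. This is purely an illustrative sketch with hypothetical names — EMC has not published Hypermax’s actual interfaces — but it captures the idea of reallocating cores between the host-facing and disk-facing tiers:

```python
# Toy model of VMAX3-style CPU pools; all names are illustrative
# assumptions, not EMC's real Hypermax OS interfaces.
from dataclasses import dataclass

@dataclass
class CpuPools:
    front_end: int  # cores serving host connections (upper tier)
    back_end: int   # cores driving disk I/O (lower tier)

    def rebalance(self, workload: str) -> None:
        """Shift a core toward the busier tier, as Weston describes."""
        if workload == "write-heavy" and self.front_end > 1:
            self.front_end -= 1
            self.back_end += 1   # more bandwidth to the drives
        elif workload == "read-hit" and self.back_end > 1:
            self.back_end -= 1
            self.front_end += 1  # more cores for host reads

pools = CpuPools(front_end=8, back_end=8)
pools.rebalance("write-heavy")
print(pools.front_end, pools.back_end)  # 7 9
```

The design point is that the mapping of cores to tasks is no longer fixed at setup time, unlike the earlier port-bound arrangement Weston contrasts it with.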

On-premise storage hardware may not be “the cloud,” but EMC is nonetheless attaching a cloud-like provisioning and cost structure to VMAX3. Like the color-coded sponsorship levels of a technology conference, VMAX3 will be configurable in service levels. Exactly how many levels there will be, and what their final names are, remains unclear; Weston spoke of one group of colors, the slide behind him showed another, and at least one other slide EMC is circulating shows another still. But you get the basic idea: some variation on “bronze / silver / gold / platinum.”

Through the use of a “sizing tool,” EMC will evaluate the dynamics of the organization’s current storage workload.  That tool will issue recommendations that will fit within the color-coded structure.

“The really nice thing is, you already told us up front what the workload is going to be,” said Weston.  “We’ve designed the system with a certain workload in mind, so why should we not just go ahead and configure that for you?  So out of the ordering process, we will take the drives [and] the RAID types that you’ve selected, put those together pre-configured in the array so that you wind up with a system that arrives ready-to-use.  The service levels are built in, the drives are already in the RAID groups, there’s no work to do about any of that...  So when the array arrives, you can just jump in and start using it.”
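How a sizing tool might translate a measured workload into one of those color-coded levels can be sketched roughly as follows. The thresholds and tier names here are assumptions for illustration only — as noted above, EMC had not finalized the tiers at the time of writing:

```python
# Hypothetical sketch of a sizing tool's recommendation step; the
# response-time thresholds and tier names are assumed, not EMC's.
def recommend_service_level(avg_response_ms: float) -> str:
    """Map a measured average response time to a service tier."""
    if avg_response_ms <= 1:
        return "platinum"
    if avg_response_ms <= 5:
        return "gold"
    if avg_response_ms <= 10:
        return "silver"
    return "bronze"

print(recommend_service_level(0.8))  # platinum
print(recommend_service_level(12))   # bronze
```

Whatever the real rules turn out to be, the point Weston makes is that the recommendation feeds straight into the ordering process, so the array ships pre-configured to match it.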

The clear distinctions between the VMAX3 and Isilon platforms
Isilon includes scale-out storage (SSD/HDD/cloud) and the services and structures that enable it, such as Hadoop’s HDFS.

VMAX3 enables services such as: optimized tiering, which moves “hotter” data into SSD in increments smaller than entire blocks for faster access; up to 256 snapshots of any given source, with user-defined names, stored as versions rather than as separate volumes; up to 1,024 link targets per data source for policy-based access by multiple users; and direct backups into Data Domain over an all-Fibre Channel fabric, without the use of a separate backup server.

Clearly it’s easier for EMC to add this latest set of upgrades to its long-standing technology than to the new world of Isilon. But there is one big “data lake” metaphor, and both Isilon and VMAX3 are wading into it to some extent. If EMC is going to get its feet wet, it can’t avoid finding itself neck-deep in big data-inspired architectures.