Archived Content

The following content is from an older version of this website, and may not display correctly.

Supercomputers are where the latest and greatest in information technology gets to shine. They are often where bleeding-edge engineering is applied to some of humanity's biggest problems and questions, from the origins of the universe to climate change and genetics.


At launch, a supercomputer is typically built from the best processor and server technology available at the time. When designing such a cluster of thousands of computing monsters strung together, engineering an adequate storage system is a big challenge.


Earlier this year, the Texas Advanced Computing Center (TACC) at the University of Texas at Austin brought online Stampede, its latest supercomputer. We caught up with Tommy Minyard, director of advanced computing systems at TACC, to learn about the design of the storage system behind one of the world's most powerful supercomputers.


Stampede's dedicated storage cluster has a capacity of 14 petabytes. It consists of Dell PowerEdge C8000-series servers, each packing 64 three-terabyte drives.


A Lustre cluster

The servers work together as a Lustre file system. Lustre is open-source software, popular in supercomputing circles, that strings a bunch of commodity servers into a highly scalable storage cluster.
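
To illustrate only the basic idea (this is a toy sketch, not Lustre code or its API), the snippet below stripes a file's data across a set of storage servers in round-robin fashion, which is roughly how a parallel file system lets capacity and aggregate bandwidth grow as servers are added. The server names and stripe size are invented example values.

```python
# Toy illustration of striping data across storage servers (not Lustre code).
# Server names and the 1 MiB stripe size are made-up example values.

STRIPE_SIZE = 1 << 20  # 1 MiB per stripe
SERVERS = ["oss01", "oss02", "oss03", "oss04"]  # hypothetical storage servers

def stripe(data: bytes, servers=SERVERS, stripe_size=STRIPE_SIZE):
    """Split data into fixed-size stripes and assign them round-robin to servers."""
    layout = []
    for offset in range(0, len(data), stripe_size):
        chunk = data[offset:offset + stripe_size]
        server = servers[(offset // stripe_size) % len(servers)]
        layout.append((server, chunk))
    return layout

if __name__ == "__main__":
    demo = bytes(5 * STRIPE_SIZE + 100)  # ~5 MiB of zeros
    for server, chunk in stripe(demo):
        print(f"{server}: {len(chunk)} bytes")
```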


The cluster currently consists of 76 storage servers, divided among seven file systems, the largest of which uses 58 of the servers. These can sustain data-transfer speeds of 150 gigabytes per second, Minyard says.
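
As a back-of-the-envelope check using only the figures quoted above (and assuming the 150GB/s figure refers to the 58-server file system), the numbers hang together:

```python
# Back-of-the-envelope check of the figures quoted in the article.
drives_per_server = 64
drive_tb = 3                      # terabytes per drive
servers_total = 76
servers_largest_fs = 58
aggregate_gb_per_s = 150          # quoted sustained transfer rate

raw_capacity_pb = servers_total * drives_per_server * drive_tb / 1000
per_server_gb_per_s = aggregate_gb_per_s / servers_largest_fs

print(f"Raw capacity: ~{raw_capacity_pb:.1f} PB")                # ~14.6 PB, i.e. the quoted 14PB
print(f"Per-server bandwidth: ~{per_server_gb_per_s:.1f} GB/s")  # ~2.6 GB/s per server
```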


TACC has been using the Lustre file system with its other supercomputers for many years. As academics, its engineers fully support the open-source project and contribute bug fixes and tools to the Lustre community.


The other most popular parallel file system in the high-performance-computing (HPC) world is IBM's GPFS, which is similar to Lustre in concept, Minyard says. But it is an IBM product and requires a license.


Another popular option is Panasas, but it comes only as a combination of hardware and software, all of it proprietary.


High availability

Storage servers in the cluster are configured to pick up each other's workload in case of a failure. If a server is lost, the other server in its failover pair takes over management of its 64 drives.


Failover mode means a decrease in performance, since the failover server is handling twice the amount of disk. But Stampede's availability requirements dictate that every node in the supercomputer has access to the file system to run programs.


Once the failed server is repaired and returns to production, the failover machine goes back to operating normally.
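
A minimal sketch of that failover-pair behaviour is below, with invented class and function names; real high-availability failover involves heartbeats, fencing and shared storage paths, but the hand-off of drive ownership described above is the essence.

```python
# Minimal sketch of a failover pair: each server normally manages its own
# 64 drives; if its partner dies, it temporarily manages both sets.
# Class and function names are invented for illustration.

class StorageServer:
    def __init__(self, name, drives):
        self.name = name
        self.own_drives = list(drives)   # the server's own 64 drives
        self.taken_over = []             # partner's drives while covering a failure
        self.alive = True

    def managed_drives(self):
        return self.own_drives + self.taken_over

def fail(server, partner):
    """Partner takes over the failed server's drives (degraded performance)."""
    server.alive = False
    partner.taken_over = list(server.own_drives)

def repair(server, partner):
    """Failed server returns to production; partner resumes normal operation."""
    server.alive = True
    partner.taken_over = []

a = StorageServer("oss01", [f"disk{i:02d}" for i in range(64)])
b = StorageServer("oss02", [f"disk{i:02d}" for i in range(64, 128)])

fail(a, b)
print(len(b.managed_drives()))   # 128: b is temporarily handling twice the disk
repair(a, b)
print(len(b.managed_drives()))   # 64: back to normal once a is repaired
```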


Commodity servers

There are several reasons TACC went with Dell as Stampede's storage vendor. One reason is that Dell also provided the system's compute nodes, and TACC could leverage bulk pricing.


The reason the center chose not to go with one of the big storage vendors, such as EMC or NetApp, is that scaling Lustre to high bandwidth requires a lot of servers. The more standardized file servers those vendors offer are not really well suited to that kind of scalability, Minyard says.


Finally, Stampede's compute nodes are interconnected by an InfiniBand fabric, into which TACC simply plugged the Dell storage cluster. This way the center avoided having to build a separate 10-Gigabit-Ethernet network for storage connectivity.


“That's one of the ways we were able to achieve such high bandwidth rates to each one of the servers,” Minyard says about the InfiniBand fabric. It provides an “easy way to let all the storage servers talk to each other with very high speeds and low latency.”


Design priorities

Performance was the number-one priority in designing Stampede's storage cluster, he says. “With 14PB of raw storage capacity, you need to be able to read and write to it really quickly.”


The second priority was the ability to support thousands and thousands of clients writing to it at the same time. More than 6,400 of Stampede's compute nodes may be accessing the storage cluster at any time, either reading from or writing to it, and the cluster needs to be able to handle that.


A standard NFS file system cannot support this kind of activity, Minyard says. It will either fall over or be impossibly slow.


After the critical performance considerations, the third design priority was scalability, which Lustre addresses aptly, and the fourth was openness. An open-source system allows TACC engineers to build and maintain their own versions of the file system and apply bug fixes quickly.


The fifth design requirement was to make sure that from the end user's perspective, the system behaved just like the file system on their own desktop. Minyard's team did that by making sure it complies with POSIX, the set of IEEE standards for operating-system compatibility.
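
In practice that means ordinary file-handling code needs no changes to run against the parallel file system. The snippet below uses only standard POSIX-style calls; the SCRATCH environment variable is used here merely as a stand-in for a Lustre scratch mount and falls back to the current directory, so nothing in it is Stampede-specific.

```python
# Ordinary POSIX-style file I/O: identical whether base_dir sits on a local
# disk or on a POSIX-compliant parallel file system such as a Lustre mount.
# SCRATCH is a stand-in for a scratch mount point; it falls back to the
# current directory so the example runs anywhere.
import os

base_dir = os.environ.get("SCRATCH", ".")
path = os.path.join(base_dir, "results.txt")

with open(path, "w") as f:          # standard open/write/close
    f.write("simulation output\n")

print(os.stat(path).st_size)        # standard metadata lookup works the same way
```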


As of early April, when we talked with Minyard, several hundred projects had already run on Stampede, and about two petabytes of data had been written to the storage cluster. More than 1,000 users had accessed the system, and it was only the beginning.


At this rate, those 14 petabytes of capacity will fill up rather quickly, as Stampede continues to advance computational science, which has become a crucial tool in the way we try to understand ourselves and the world around us.
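
As a rough projection, assuming the two petabytes accumulated over roughly the three months between Stampede entering production and the early-April conversation (the elapsed time is an assumption read from the article's timeline, not a stated figure):

```python
# Rough projection from the figures in the article. The ~3-month window is an
# assumption based on the article's timeline; real write rates vary widely.
capacity_pb = 14.0
written_pb = 2.0
months_elapsed = 3.0                     # assumed elapsed time

rate_pb_per_month = written_pb / months_elapsed
months_to_fill = capacity_pb / rate_pb_per_month
print(f"~{rate_pb_per_month:.2f} PB/month -> full in ~{months_to_fill:.0f} months")
# ~0.67 PB/month -> full in ~21 months
```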


A version of this article appeared in the 29th edition of the DatacenterDynamics FOCUS magazine, out now.