High-performance computing (HPC) is the powerhouse of innovation, insight and commercial competitiveness, and an indispensable resource in the digital age. For example, the sophisticated computer models run on HPC systems have dramatically improved our ability to measure and analyze the weather in recent decades, helping to improve forecasting and to simulate the impact of climate change and other devastating environmental events such as hurricanes.

The same technology can shave milliseconds off financial trading times, improve compliance, risk detection and analysis, and even accelerate prototype design, from shampoo to race cars, in a range of manufacturing, engineering and industrial environments.

Well suited to tasks that are computationally, numerically and data intensive, HPC once belonged to the rarefied realm of big corporations, government bodies and research institutions. But cloud computing has been a catalyst for change. It has radically democratized access to supercomputing capabilities, putting an abundance of compute within reach of smaller organizations and driving a ‘cloud first’ mentality. Of course, this has gone hand in glove with wider market digitalization, with more and more business applications moved out of on-premises data centers in a bid to increase agility and cut costs.

The landscape today

Summit supercomputer – Oak Ridge National Laboratory

Today, the hyperscale cloud vendors have taken up huge swathes of the HPC market, offering elastic, almost limitless compute scalability. In the past, it typically took two or more years to refresh the technology in a standard supercomputing data center. The process involved a review of available technology followed by a trial or proof-of-concept phase, with the RFP requirements widely published. Unsurprisingly, and reflective of where the expertise once was, facilities were filled with heavy metal from the usual suspects: Cray, IBM, HPE, NEC and so on. (They are now also full of high-end servers from the same names, it must be said.)

But the idea that hyperscale cloud builders can simply fuse fast networks with a range of GPUs and some sophisticated middleware to manage simulation and modelling workloads, and then call this true HPC, is misguided. The model is a poor fit for HPC. These applications are computationally complex, dense and demanding. While for some, HPC is about getting the most compute for the least cost, its successful delivery and the optimal running of HPC applications rely heavily on performance and speed.

The big cloud providers have (somewhat understandably) responded to the demand for HPC clusters by deploying their own servers in volume. By using hardware that sits slightly behind the bleeding edge and deploying it at scale, they can compensate for the performance gap, and server CPUs can be augmented with commercial GPUs for larger-scale HPC applications. But this isn’t optimal. Simply relying on “lots of compute” doesn’t make a great HPC environment, one in which applications are deployed in optimal conditions and run as efficiently as possible. To achieve that, you need a tailored cloud environment with an application-first approach that offers genuine HPC.

HPCNow! recently ran an OpenFOAM stress test of the same HPC configuration on Amazon Web Services, Microsoft’s Azure cloud and a private, bare metal HPC cloud to better understand the performance impact (full disclosure: our hpcDIRECT solution was included in the test). The metric was wall time (the actual elapsed time from the start of a program to its end) for a simulation of airflow around a motorcycle. To reflect a reasonable, medium-sized HPC workload, the mesh was scaled up from 200,000 to 41.6 million elements.
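To give a sense of how such a measurement can be taken, the sketch below times a parallel OpenFOAM solver run and records its wall time. It is a minimal illustration only: the case directory, solver, core counts and run count are assumptions made for the example, not the configuration HPCNow! used.

```python
import subprocess
import time

# Illustrative settings only: the actual benchmark used an airflow-around-a-
# motorcycle case scaled from 200,000 to 41.6 million elements, with its own
# choice of solver, core counts and run count.
CASE_DIR = "motorBike"            # hypothetical OpenFOAM case directory
CORE_COUNTS = [16, 32, 64, 128]   # hypothetical
RUNS_PER_CONFIG = 5

def run_solver(case_dir: str, cores: int) -> float:
    """Run the solver in parallel and return the wall time in seconds.

    Assumes the case has already been decomposed for this core count
    (e.g. with decomposePar) and that OpenFOAM and MPI are on the PATH.
    """
    start = time.perf_counter()
    subprocess.run(
        ["mpirun", "-np", str(cores), "simpleFoam", "-parallel"],
        cwd=case_dir,
        check=True,
    )
    return time.perf_counter() - start

if __name__ == "__main__":
    for cores in CORE_COUNTS:
        times = [run_solver(CASE_DIR, cores) for _ in range(RUNS_PER_CONFIG)]
        print(f"{cores} cores: wall times (s) = {[round(t, 1) for t in times]}")
```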

Five runs were attempted for each of the HPC cloud configurations. The reproducibility of the scaling results was reasonably good on AWS when a small number of cores was used, but as the core counts grew, so did the variability in runtime, which hurt the final wall time. This didn’t happen on bare metal, which was a whopping 30 percent faster across the board. There were also some notable issues scaling up the OpenFOAM simulation on the Microsoft Azure cloud, where the runtime at 256 cores maxed out at 1,572 on the second run.
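One straightforward way to quantify that run-to-run variability and its effect on scaling is to look at the spread of the five wall times at each core count, for example the coefficient of variation alongside speedup and parallel efficiency. The sketch below uses made-up numbers purely to illustrate the calculation; it does not reproduce the figures from the test.

```python
from statistics import mean, stdev

# Hypothetical wall times in seconds (five runs per core count); these are
# made-up values and not the figures from the benchmark.
wall_times = {
    32:  [412.0, 415.3, 410.8, 418.1, 413.6],
    64:  [221.4, 230.9, 224.7, 241.2, 226.3],
    128: [131.8, 155.2, 128.9, 149.7, 160.4],
    256: [90.5, 137.6, 96.2, 121.9, 104.3],
}

baseline_cores = min(wall_times)
baseline_time = mean(wall_times[baseline_cores])

for cores, times in sorted(wall_times.items()):
    avg = mean(times)
    cov = stdev(times) / avg * 100               # run-to-run variability, %
    speedup = baseline_time / avg                # relative to smallest config
    efficiency = speedup / (cores / baseline_cores) * 100
    print(f"{cores:>4} cores: mean {avg:7.1f} s  CoV {cov:4.1f}%  "
          f"speedup {speedup:4.2f}x  efficiency {efficiency:5.1f}%")
```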

Public hyperscale clouds are fantastic compute resources for a broad range of enterprise, office and cloud-supported applications and workloads, offering vast scalability, flexible access points and pricing plans to suit any deployment and timescale. But they rely on virtualized servers that are often disparately located across borders and far from storage. When thinking through location, there is also a strategic decision to be made about the best place geographically to host applications: some locations, for example, are powered by renewable energy, which can have a dramatic impact on an organisation’s bottom line and its environmental footprint.

And for more demanding HPC users, especially those keen to embrace customized machine- and deep-learning applications in the near term, or AI startups transitioning from prototype to production, a rethink is needed. Unfortunately, custom configuration of machines to suit specific applications runs counter to the very principle of the hyperscalers: public clouds need a high degree of homogeneity to make it possible to operate infrastructure at scale. HPC users running bespoke or highly customized applications that need precise configuration, or more support time from an HPC engineer to optimize their deployment, won’t find that in a hyperscale cloud. For these specialist applications you need a tailored ‘cut to fit’ service.

Hyperion reports that ten percent of HPC is now performed in the cloud, and the only way from here is up. As businesses become more reliant on HPC outputs, they must look for a truly optimized environment where an HPC cluster can be deployed in a repeatable fashion and where power and cost are sustainable and won’t break the bank. Once upon a time, “optimized” meant putting clusters in one place with a job scheduler and that was that. Today, each replicated deployment must be documented and automated as it changes over time in order to maintain performance integrity (see the sketch below). The two-sided matter of power and affordability is a core element of the optimization question too. The good news is that it can be done.
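As a simple illustration of what “documented and automated” can look like, the sketch below records a deployment’s configuration and fingerprints it, so each replicated cluster can be checked against the documented baseline as it drifts over time. All fields and values are hypothetical; a real spec would be generated by the provisioning tooling rather than written by hand.

```python
import hashlib
import json

# Hypothetical deployment record; in practice this would be generated by the
# provisioning tooling, not written by hand.
cluster_spec = {
    "nodes": 64,
    "cpu": "AMD EPYC 7742",
    "interconnect": "100GbE RoCE",
    "scheduler": "Slurm 23.02",
    "mpi": "OpenMPI 4.1.5",
    "image": "hpc-base-2024.06",
}

def fingerprint(spec: dict) -> str:
    """Stable hash of a deployment spec, used to compare replicated clusters."""
    canonical = json.dumps(spec, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

documented_baseline = fingerprint(cluster_spec)

# On each redeployment, regenerate the spec and compare fingerprints: any
# drift (a new image, a different MPI build, etc.) is flagged before
# production jobs are scheduled.
redeployed_spec = dict(cluster_spec, image="hpc-base-2024.09")
if fingerprint(redeployed_spec) != documented_baseline:
    print("Configuration drift detected; re-validate performance before use.")
```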

Ultimately, the potential for running sophisticated HPC applications in the cloud is enormous, but the fundamental challenges of performance, speed and cost must be faced up to and addressed if we are to see the true benefits of doing so.