There’s some debate over what counts as the first supercomputer, but we may soon see the last one.

Supercomputers are unique facilities, providing exceptional computing power. So on that basis, the world’s first programmable computers in the 1940s could be described as supercomputers: they weren’t just exceptional, they were unique.

By today’s standards, the performance of 1945’s ENIAC was less than “super.” The 1,500 sq ft machine’s 40 nine-foot cabinets, housed at the University of Pennsylvania, held nearly 18,000 vacuum tubes and 1,500 relays, as well as tens of thousands of resistors, capacitors, and inductors.

It was capable of 5,000 calculations a second, and its hefty 160kW power consumption reportedly even caused blackouts in Philadelphia.

Then there’s Control Data’s CDC 6600, seen by many as the first supercomputer. Unlike ENIAC, it had other systems to compete against when it launched in 1964 - and it delivered triple the performance of the previous record holder, the IBM 7030 Stretch.

In the decades since, performance has risen by orders of magnitude from the CDC machine’s now-puny three megaflops.

For its first decades, the field was led by Seymour Cray, who left Control Data after building the CDC 6600 to form Cray Research - which is now the supercomputer division of HPE.

Supercomputers have consumed huge sums of money and years of research - and, despite efforts to maximize energy efficiency, the energy demands of high-performance computing (HPC) have kept growing.

This year, the HPC industry officially hit a major milestone: Frontier broke the exascale barrier, meaning a system capable of at least a billion billion (10¹⁸) floating point operations per second - a target China is believed to have secretly hit a year earlier.

That performance is more than 300 billion times (over 3×10¹¹) that of the CDC 6600.
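As a quick back-of-the-envelope check on that multiple, the sketch below assumes Frontier's roughly 1.1 exaflop Linpack score from the published Top500 results, alongside the CDC 6600's three megaflops:

```python
# Back-of-the-envelope comparison of Frontier and the CDC 6600.
# Frontier's ~1.1 exaflop Linpack score is an assumption taken from published Top500 figures.
frontier_flops = 1.1e18   # ~1.1 exaflops (10^18 floating point operations per second)
cdc6600_flops = 3e6       # ~3 megaflops for the CDC 6600

ratio = frontier_flops / cdc6600_flops
print(f"Frontier is roughly {ratio:.1e} times faster than the CDC 6600")
# prints roughly 3.7e+11 - several hundred billion times
```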

The Frontier system, at the Oak Ridge Leadership Computing Facility in Tennessee, cost $600 million and uses 30MW of power, more than many data centers.

While it represents the pinnacle of computing achievement, it’s not clear whether it represents the future.


Last of their kind

"Leadership HPC appears to be engaging in unsustainable brinkmanship while midrange HPC is having its value completely undercut by cloud vendors," Glenn K. Lockwood, storage architect at the National Energy Research Scientific Computing Center (NERSC), said in a blog post announcing his resignation.

"At the current trajectory, the cost of building a new data center and extensive power and cooling infrastructure for every new leadership supercomputer is going to become prohibitive very soon. My guess is that all the 50-60MW data centers being built for the exascale supercomputers will be the last of their kind, and that there will be no public appetite to keep doubling down."

He left to join Microsoft.

That destination, and the timing of his departure, may well be significant for the HPC sector.

While supercomputers have become bigger, faster, and more powerful, they are also much more in demand.

No longer limited to governments, research universities, and the most well-heeled corporations, high-performance computing (HPC) is becoming a powerful tool for commercial firms, and for anyone else who can afford it.

But while everyone wants HPC, not everyone can afford the prohibitive IT hardware, construction, and energy bills of dedicated supercomputers. They are turning to HPC in the cloud.

Cloud HPC emerges

In many ways, HPC has never been as big as it is today. But that’s only if you broaden the scope beyond the standalone facilities that stretch from the CDC 6600 to Frontier.

The fact is that you no longer need a dedicated HPC facility to run these kinds of applications: cloud providers now offer HPC services for rent, allowing temporary HPC clusters to spin up when needed.

Those providers, as we shall see, include Glenn Lockwood’s new employer, Microsoft, as well as the other cloud giants Amazon Web Services (AWS) and Google.

Last year, YellowDog created a huge distributed supercomputer on AWS, pulling together 3.2m vCPUs (virtual CPUs) for seven hours to analyze and screen 337 potential medical compounds for OMass Therapeutics.

It was a significant moment, because the effort won the temporary machine the 136th spot in the Top500, a listing of the world's fastest supercomputers. It managed a performance of 1.93 petaflops (1.93×10¹⁵ flops), roughly 1/500th of Frontier’s hard-won exaflop.
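For a rough sense of what those numbers mean, the sketch below compares the two systems (Frontier's roughly 1.1 exaflop Linpack score is again an assumption based on the published Top500 results):

```python
# Putting the YellowDog run in context, using the figures quoted above.
yellowdog_flops = 1.93e15   # 1.93 petaflops sustained on AWS
frontier_flops = 1.1e18     # ~1.1 exaflops (assumed Linpack figure for Frontier)

print(f"The YellowDog cluster hit ~1/{frontier_flops / yellowdog_flops:.0f}th of Frontier's performance")
# prints ~1/570th - in the same ballpark as the article's "roughly 1/500th"

vcpu_hours = 3.2e6 * 7      # 3.2 million vCPUs over a seven-hour run
print(f"Approximate scale of the run: {vcpu_hours:,.0f} vCPU-hours")
# prints 22,400,000 vCPU-hours
```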

Instead of sending a workload to a supercomputing center, to be popped on a waitlist for its turn, YellowDog and OMass had opted for cloud HPC, where the capacity appears to be ready and waiting on demand - as long as you can pay.

Larger and more traditional workloads are also moving towards cloud supercomputers. One of the most significant is the UK Met Office, which this year awarded a $1 billion contract for a 60 petaflops supercomputer for meteorological analysis.

This performance could put it in the top ten of the Top500 list, and yet the Met Office’s plan makes use of the cloud. The contract has gone to Microsoft Azure, which partnered with HPE Cray.

But it’s not an ad hoc machine like YellowDog’s effort. This is somewhere between a dedicated supercomputer and a cloud offering.

Cloud HPC: the best of both worlds?

The Met Office’s HPC jobs will be run in Microsoft Azure cloud facilities which are not open to access by anyone else, and are combined with extensive on-premises systems from HPE Cray.

“Microsoft is hosting the multiple supercomputers underlying this service in dedicated halls within Microsoft data centers that have been designed and optimized for these supercomputers, rather than generic cloud hosting,” Microsoft told DCD in a statement.

“This includes power, cooling, and networking configurations tuned to the needs of the program, including energy efficiency and operational resilience. Thus, the supercomputers are hosted within a ‘dedicated’ Microsoft supercomputing facility for this project.

“However, that supercomputing facility sits within an overall cloud data center site. This brings the best of both worlds – the cost-optimized nature of a purpose-built supercomputing data center along with the agile opportunities offered by integration with Microsoft Azure cloud capabilities.”

Microsoft makes a strong pitch - and one that has convinced many in the industry, as the movements of significant staff make clear.

When HPE acquired storied supercomputing company Cray for $1.3bn in 2019, a notable number of senior employees left to join Microsoft, including CTO Steve Scott, and exascale pioneer Dr. Daniel Ernst. Others have also left the company for pastures new, including CEO Pete Ungaro and senior software engineer David Greene.

A huge driving force for the Met Office was Microsoft’s potential integration with cloud computing. The Met Office supercomputer is, in essence, an on-prem supercomputer hosted in a Microsoft data center. It holds its own storage capabilities, while also being able to leverage those offered by the cloud.

However, this is a decision born out of necessity, and one that we will see made more and more, according to Spencer Lamb, the COO of Kao Data, a hyperscale provider hosting HPC infrastructure from a campus in Harlow, North London.

“It's how things will move on and it's how things will happen because, ultimately, the Met Office and other organizations of their ilk cannot build a 20MW data center on their existing campus because it physically won't happen.

“They can either go and utilize a colocation facility and go and buy the computing infrastructure and do it in that fashion. Or, they can outsource it to someone like Microsoft.”

[Image: Cambridge-1 – Nvidia]

The field of HPC has become so collaborative that there are fears UK research could fall behind now that the country has left the European Union, which runs extensive shared supercomputing initiatives.

Without EU partnership, the UK at least needs to coordinate its own efforts, according to the Government Office for Science, which released a review of large-scale computing, ‘Large-scale computing: the case for greater UK coordination.’

The report called for a single unified national roadmap and policy direction for the UK’s supercomputing capabilities, in order to further the country’s research ambitions and reach the goal of a 20MW exascale supercomputer for the nation in the 2020s.

While a noble goal, there are questions over the practicality of building these facilities. As stated in the report: “A single exascale system in the 40MW range would consume roughly 0.1 percent of the UK’s current electricity supply, the equivalent of the domestic consumption of c. 94,000 homes.” Even at the goal of 20MW, the impact is significant.
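The report’s homes figure is easy to reproduce. The sketch below assumes a typical UK household uses around 3,700kWh of electricity a year - our assumption, since the report does not state the conversion factor it used:

```python
# Reproducing the report's homes-equivalent figure for a 40MW exascale system.
system_power_kw = 40 * 1_000                   # 40MW expressed in kW
hours_per_year = 24 * 365
annual_kwh = system_power_kw * hours_per_year  # ~350 million kWh per year, running flat out

avg_home_kwh_per_year = 3_700                  # typical UK domestic electricity use (assumption)
homes_equivalent = annual_kwh / avg_home_kwh_per_year
print(f"Equivalent to ~{homes_equivalent:,.0f} homes")
# prints ~94,703 homes - matching the report's "c. 94,000"
```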

That power has to come from somewhere, and the energy cost of data centers is in danger of becoming a political issue. Ireland, Singapore, and Amsterdam have imposed de facto moratoriums, followed by tight regulations, and grids are even struggling to meet demand in the world’s largest data center hub in Northern Virginia.

The Greater London Authority (GLA) has warned that data center projects in West London have taken up so much electrical power capacity that future large housebuilding projects may be unable to get grid connections.

If HPC can be hosted in data centers that already have the capacity, not to mention the technology and cooling equipment needed, the supercomputing problem could become much simpler.

Another way for HPC - colocation?

Cloud-based HPC is one option, but there’s another alternative: colocation, where the customer owns the hardware, but puts it in a shared space.

Spinning up HPC on demand in the cloud can be a simple option, but its costs can become large and uncontrollable, warns Kao Data.

In a white paper, the North London provider compares the cost of HPC in the cloud versus the cost of buying the hardware yourself and hosting it in a colocation facility - and reckons the cloud could cost 20 times as much.

“For the colocation facility, the cost of a [Nvidia] DGX-1 machine and its storage plus switching is on the order of $238,372. If you round that up and depreciate it using a straight-line method over two years, that’s $10,000 a month. Then, add in 10 kilowatts of power and colocation rent, and that is another $2,000 a month or so.

“On AWS, a DGX-1 equivalent instance, the p3dn.24xlarge, costs $273,470 per year on-demand and $160,308 on a one-year reserved instance contract. Comparably, Microsoft Azure charges about 30 percent less for an equivalent instance, but AWS is the touchstone in the public cloud. Add in AWS storage services to drive the AI workloads, and it is around a cool $1 million to rent this capacity for two years.”
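Putting those quoted figures side by side gives a feel for the gap. This is a simplified sketch using only the numbers in the quote above, not a reproduction of the white paper’s full cost model:

```python
# Simplified recreation of the colocation-versus-cloud comparison from the quoted figures;
# storage, support, and networking costs are not fully modeled here.
dgx1_capex = 238_372                                # DGX-1 plus storage and switching (colocation)
monthly_depreciation = round(dgx1_capex / 24, -3)   # straight-line over two years, rounded to ~$10,000
colo_monthly = monthly_depreciation + 2_000         # plus ~$2,000/month for 10kW of power and rent
print(f"Colocation over two years: ~${colo_monthly * 24:,.0f}")
# prints ~$288,000

aws_on_demand_per_year = 273_470                    # p3dn.24xlarge on-demand, per year
print(f"AWS on-demand compute over two years: ~${aws_on_demand_per_year * 2:,.0f}")
# prints ~$546,940 - before adding the AWS storage that, per Kao Data, takes the total to around $1m
```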

So did the Met Office get burnt? Probably not: its deal was awarded to Microsoft after a lengthy public procurement tender (one which Atos challenged in court), and it is a long-term arrangement with better financial terms than renting instances by the hour.

Kao’s Lamb hopes to offer a space for those who still want their own HPC infrastructure, without the hassle of building a warehouse and finding power and cooling. “We've set ourselves out to be somewhere where they can put these systems and rely upon them being looked after in the way that's needed for them to be looked after,” he said.

“They can then go in and do their research, rather than trying to build data centers within their own campus, which ultimately is something they are not very good at doing because they're not experts in that field.

“As these systems grow in size and scale, to be able to build a data center to house a very power-hungry supercomputer becomes increasingly challenging. They can buy a supercomputer within a period of months, but it will take probably two to three years to build a data center around it.”

Kao’s Harlow campus provides 8.8MW of IT load in a single building, and there will be four buildings on the campus once it's fully complete.

The HPC field has always pushed the boundaries of technology, so Kao is promising more advanced options than standard colocation offerings, including liquid cooling, which has become de rigueur in the higher rankings of the Top500.

“Because of the high-power nature of the systems, what we are working through at the moment is bringing a water coolant to the chip. So there's a combination of traditional air-cooled, as well as bringing direct cooling to the technology as well. That hybrid approach is something that we very much see as the future and it's necessary for an organization like us with the ambitions we have.”

The company scored an early win with Nvidia, which wanted a supercomputer in the UK - nominally to help with healthcare research, but also as part of its failed lobbying effort to win approval from the government to allow it to acquire Cambridge-based Arm.

Cambridge-1 was the UK’s fastest supercomputer when it launched in 2021, but it has since been surpassed by the Archer2 national system hosted at the University of Edinburgh.

Global comparison

However, it pays not to read too much into the microcosm that is the UK, where a few petaflops are counted as a big deal.

To get a more global view, we spoke to Jerry Blair, co-founder and SVP of strategic sales at US data center provider DataBank, and Jeremy Pease, the company's SVP of managed services.

“We are seeing higher density cabinets,” says Blair. “It's taken a long time for the average to start going up above 5-6kW per cabinet, but over the last year or two the chipsets have gotten to a price point where they can put so many chips in a cabinet that it now requires more power.

“We're seeing a lot more requests for over 10kW and up to 20kW of capacity to be delivered to a cabinet, and even higher than that. In several cases that we're working on, up to 50kW in a cabinet.”

As densities start reaching this level, data centers have to be specifically designed to manage the cooling requirements. DataBank has turned to water-cooled back cabinet doors which take cold water right to the CDU (cooling distribution unit) in the cabinet.

But what is perhaps most important for DataBank has been the realization that many more customers have HPC needs than ever before.

“We're seeing a lot more GPU use at a high density that I would term HPC or supercomputing. We're seeing it from the universities, and we're seeing it from a lot of standard enterprise clients actually,” continues Blair.

“They’re not putting everything in at that density, but if they have 100 cabinets that are 5kW to 10kW cabinets, they may have five cabinets that are 25kW to 50kW cabinets, which are more GPU-based, for particular projects that they're working on.”

It is because of this, as well as supply chain issues, that DataBank is seeing the need for a different approach to providing HPC services to clients, and is introducing a bare metal product.

“We're in this dynamic where gear is hard to get, and networking gear is one of the hardest things to get and you can have nine to 12-month lead times just to be able to get the networking gear that can run all that equipment,” explains Pease.

“So that's why we're launching our bare metal products, which are meant to have GPU capabilities, where we actually have this stuff stocked and ready to go, we have the networking gear and equipment in place and core facilities where we can manage it.

“With the gear that we have, we can get the high-end chipsets, like the GPU chipsets that can manage as high as they want to go. If they want to go 50kW per rack, we've got the chipsets that can enable that, we've got the processors, we’ve got the cores, we've got the RAM,” says Pease.

“Unless they're talking something super high-end with a very special configuration, we should be able to manage that within the chipsets that we've got on the GPU side.”

It is those ‘super high-end’ projects which are the problem. With the variety of options now available, it really doesn’t make sense for many to turn to dedicated supercomputing centers to run their HPC workload, but when it comes to those specific use cases - like the Met Office - building these facilities at scale becomes a real issue.

HPC's uncontrolled power demands

Whether housed in purpose-built facilities, in colocation buildings, or in the rarefied atmosphere of the cloud, all these petaflops run on hardware that demands power and cooling.

Wherever it is located, HPC will need a close consideration of the power it uses, and the cost to the planet (and its owner’s pocket).

As Microsoft’s Met Office announcement put it: “There is also a prudent need to minimize those costs where practical, both in terms of money and – perhaps more critically – in terms of environmental sustainability.

“For this reason, the Met Office and Microsoft, who each have long-standing commitments to environmental responsibility, have worked to ensure this supercomputing service is as environmentally sustainable as possible.”

Microsoft and the Met Office appear to be relying on renewable energy PPAs (power purchase agreements), where the IT consumer pays for energy generation in bulk.

But even as the demand for bigger, more powerful supercomputers continues to grow, there are ways to rein in power use and address sustainability.

When asked about this issue, Bill Magro, chief HPC technologist at Google, told DCD that the cloud was the logical solution for greener HPC.

“The demand for HPC compute seems insatiable, and the power consumption associated with that demand continues to rise. At the same time, the HPC industry has embraced parallelism, through multi-core CPUs, GPUs, and TPUs. This parallelism enables ever higher efficiency, measured in performance/watt,” he says.

“One of the best ways to minimize the environmental footprint of compute is through highly-efficient, highly-utilized data centers, powered by clean energy,” he added, launching into a pitch for Google’s renewable energy PPAs and power matching.

When asked if there is an upper limit to what we can feasibly power, Magro, like everyone we asked, had little to offer.

[Image: Aurora – US Department of Energy]

To a certain extent, we can hope for the “law of accelerating returns” (Ray Kurzweil’s term for the way some technologies seem to improve exponentially). Perhaps as the power and capabilities of supercomputers continue to grow, our ability to make them more efficient and produce renewable energy will keep pace.

Until then, these facilities will be limited by what Magro dubbed ‘the available power envelope.’

Goodbye to all on-premises HPC?

It is too early to declare the end of the standalone supercomputer, just as the enterprise data center has outlived most predictions. But it is no longer the obvious choice for enterprises and researchers needing access to HPC resources - the cloud offers easy access, with potential sustainability benefits.

For other deployments, colocation and bare metal can fill the need - as long as the facility can meet the increasing power and cooling demands.

That leaves the ‘leadership’ systems, like Frontier, which capture the headlines and are at the forefront of what’s possible in the industry.

"You can stick a full Cray EX system, identical to what you might find at NERSC or OLCF, inside Azure nowadays and avoid that whole burdensome mess of building out a 50MW data center," Lockwood said in his resignation post.

Why, he asked, should the Department of Energy spend billions on the next wave of ginormous supercomputers? Government agencies have already begun to shift traditional workloads to the cloud, cutting down on sprawling data center portfolios that were deemed inefficient and expensive.

“That all said,” he admitted, “the DOE has pulled off stranger things in the past, and it still has a bunch of talented people to make the best of whatever the future holds.”