Since man first looked to the skies, we have tried to comprehend the cosmos.

But peering outwards is just one way to help understand the universe. Another answer lies within, in the highly detailed simulations now possible thanks to profound microprocessor advances and decades of investment in high-performance computing (HPC).

In the north of England, one such supercomputer hopes to do its part to present a history of the universe in unprecedented detail, providing new insights into how we came to be.

This feature appeared in the latest issue of the DCD Magazine. Read it for free today.

Earlier this year, as the UK suffered an unprecedented heatwave caused by climate change, DCD toured Durham University’s data center at the Institute for Computational Cosmology (ICC), and learned about its most powerful system, the Cosma-8.

Cosma-8 is part of the UK government’s Distributed Research using Advanced Computing (DiRAC) program, which is formed of five supercomputers around the country, each with its own distinguishing feature.

Issue 46 - The Last Data Center

Long-term data storage enters a new epoch

07 Nov 2022

In the case of the Durham system, that differentiating factor is its breathtaking amount of random-access memory (RAM).

"For the full system, we have 360 nodes, 46,000 cores, and very importantly for us a terabyte of RAM per node - that's a lot of RAM," Dr. Alastair Basden, head of the Cosma service, said.

Two nodes in the system go even further, cramming in 4TB of RAM per node. "These are for workloads that don't scale as well across multiple nodes. So things like accessing large data sets and code which aren’t very well parallelized," Dr. Basden said.

This huge amount of RAM allows for specific scientific problems to be addressed that would otherwise not be possible on conventional supercomputers.

But more on that later. First, a quick rundown of the system's other specs: each node has two 280W AMD Epyc 7H12 processors, each with 64 cores and a 2.6GHz base clock frequency, installed in Dell Cloud Service C-series chassis with a 2U form factor. It also has six petabytes of Lustre storage, hosted across 10 servers, each with its own two CPUs and 1TB of RAM.
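As a rough sanity check, the quoted core count can be reproduced from the node and processor figures. The sketch below uses only numbers from the article and assumes 64 cores per processor; the variable names are illustrative, and the RAM total deliberately ignores the two 4TB nodes.

```python
# Figures quoted in the article; names are illustrative only.
NODES = 360
SOCKETS_PER_NODE = 2      # dual AMD Epyc 7H12 processors per node
CORES_PER_SOCKET = 64     # assumption: 64 cores per processor
RAM_PER_NODE_TB = 1       # assumption: ignores the two 4TB nodes

cores_per_node = SOCKETS_PER_NODE * CORES_PER_SOCKET
total_cores = NODES * cores_per_node
total_ram_tb = NODES * RAM_PER_NODE_TB
# Decimal units (1 TB = 1000 GB) are assumed here.
ram_per_core_gb = RAM_PER_NODE_TB * 1000 / cores_per_node

print(f"total cores: {total_cores:,}")
print(f"total RAM: {total_ram_tb} TB")
print(f"RAM per core: {ram_per_core_gb:.1f} GB")
```

The result, 46,080 cores, is consistent with the "46,000 cores" Dr. Basden quotes, and works out to a little under 8GB of RAM per core.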

The supercomputer uses direct-to-chip cooling, and a CoolIT CDU.

You may notice a distinct lack of GPUs, despite their usefulness in a number of other simulation-based systems.

"Basically the codes that we're doing don't match well to GPUs. There are efforts that are going on to port these codes to GPU, but the uplift you can get in performance is a small factor rather than large," Dr. Basden said.

However, the data center is home to a two-node cluster, funded as part of the UK's Excalibur exascale efforts, that has six AMD MI100 GPUs in it. "MI200 GPUs should follow shortly," the researcher added.

Cosma-8, however, has no plans for GPUs, instead aiming to push CPUs and RAM to their limits, fully connected by a PCIe-4 fabric. “Although our system isn't as big as many of the larger systems, because we have this higher RAM per node, we can actually do certain workloads better,” Dr. Basden said.

One example is the MillenniumTNG-XXL simulation, which aims to encapsulate the large-scale structure of the universe across 10 billion light years. “It's basically the largest simulation of its type that can be done anywhere in the world,” Dr. Basden said.

“So this is 10,240³ dark matter particles, this is trillion particle regimes - a large step up from anything simulated previously,” he said. “You can begin to see within the simulations it actually building spiral galaxies and things like that, all from the physics that we put in.”
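To put "trillion particle regimes" into numbers, the short sketch below cubes the quoted figure and estimates a bare-minimum memory footprint. The 32 bytes per particle is an assumption made purely for illustration (three single-precision position components, three velocity components, and an 8-byte ID); it does not come from the article, and real simulation codes hold considerably more state per particle.

```python
PARTICLES_PER_SIDE = 10_240
total_particles = PARTICLES_PER_SIDE ** 3

# Hypothetical storage cost, not from the article:
# 3 x 4-byte positions + 3 x 4-byte velocities + 8-byte ID = 32 bytes.
BYTES_PER_PARTICLE = 32
footprint_tb = total_particles * BYTES_PER_PARTICLE / 1e12

print(f"{total_particles:,} particles")
print(f"about {footprint_tb:.1f} TB just to hold positions, velocities and IDs")
```

Even under this optimistic assumption, the raw particle data alone runs to tens of terabytes, which gives a sense of why memory per node matters for this kind of run.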

The simulation takes data from telescopes, satellites, and the Dark Energy Spectroscopic Instrument (DESI) to see “how well we can match what we get in our simulator to what is actually seen in the sky,” Dr. Basden explained. “That then tells us more about dark matter.”

The MillenniumTNG-XXL simulation began in July last year, taking up a huge amount of computing resources. “We dropped about 60 million CPU hours on that,” Dr. Basden said.
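For a sense of scale, 60 million CPU hours can be converted into wall-clock time on the full machine. This is a back-of-the-envelope sketch that assumes one CPU hour means one core busy for one hour, and that all 46,000 quoted cores were dedicated to the run, which may not have been the case.

```python
CPU_HOURS = 60_000_000   # "about 60 million CPU hours"
TOTAL_CORES = 46_000     # core count quoted for the full system

# Assumption: 1 CPU hour = 1 core for 1 hour, with every core on this job.
wall_clock_hours = CPU_HOURS / TOTAL_CORES
wall_clock_days = wall_clock_hours / 24

print(f"roughly {wall_clock_days:.0f} days of the entire machine")
```

Under those assumptions, the simulation is equivalent to occupying the whole of Cosma-8 for around eight weeks.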

“A large amount of memory per node is absolutely essential. HPC codes don't always scale efficiently, so the more nodes you use the more your scale goes down. Your simulation would take longer and longer to run until you reach a point of no return. So it wouldn't have been possible without a machine designed specifically for this."
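The effect Dr. Basden describes can be illustrated with a deliberately simple toy model, in which the useful work divides evenly across nodes while a communication cost grows with every node added. The function, its parameters, and the numbers are all hypothetical and chosen only to show the shape of the curve; they do not describe how the codes running on Cosma-8 actually perform.

```python
def runtime(nodes: int, work: float = 1000.0, overhead_per_node: float = 0.1) -> float:
    """Toy model: evenly divided work plus a cost that grows with node count."""
    return work / nodes + overhead_per_node * nodes

# Search a range of node counts for the one with the shortest runtime.
best_nodes = min(range(1, 1001), key=runtime)

for n in (1, 10, 100, 1000):
    print(f"{n:>4} nodes -> {runtime(n):7.1f} time units")
print(f"fastest at {best_nodes} nodes")
```

In this model, runtime falls until about 100 nodes and then climbs again as overhead dominates. More memory per node lets the same problem fit on fewer nodes, keeping a run on the favorable side of that curve.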

Dr. Azadeh Fattahi is one of the researchers trying to take advantage of the machine’s unique talents, seeking to understand the importance of dark matter in the formation and evolution of the universe.

"There's actually more dark matter than normal things in the universe," the assistant professor and UKRI Future Leaders Fellow in Durham's department of physics said.

“The normal matter - which is what galaxies are made out of, along with the Solar System, planets, us, everything in the universe that we can observe, basically - includes only a small portion of the matter and energy in the universe.”

Visible matter makes up just 0.5 percent of the universe, with dark matter at 30.1 percent. The final 69.4 percent is dark energy.

To understand how these forces interact requires enormous computing power. “Earlier efforts only looked at dark matter distribution and ignored the more complex systems,” Dr. Fattahi said.

“But we want to include more complex phenomena in the models that we're using,” she explained. “Now on Cosma-8, we can basically run a full hydrodynamical simulation, which means we include all the complex procedures like gas pools, stars forming and exploding into a supernova, as well as supermassive black holes.”

One of the flagship projects on Cosma-8 is the ‘Full-hydro Large-scale structure simulations with All-sky Mapping for the Interpretation of Next Generation Observations’ study, or, as it is more commonly known, the FLAMINGO simulation.

“So FLAMINGO is at the cutting edge,” Dr. Fattahi said. “MillenniumTNG-XXL is a slightly bigger volume, but doesn't have hydrodynamics. Compared to anything that has been done with hydrodynamics it is the biggest in the world.”

FLAMINGO’s simulated universe spans about 8 billion light years, featuring 5,000³ elements of dark matter and 5,000³ of gas. “This is the largest number of resolution elements that have been run on a hybrid simulation anywhere in the world,” she said. It took most of Cosma-8 working for 38 straight days to finish.
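The FLAMINGO figures can be worked through the same way. The sketch below counts the resolution elements and puts an upper bound on the core-hours used, assuming the whole of the 46,000-core machine ran for the full 38 days. Since the run took "most of" Cosma-8, the true figure is lower.

```python
ELEMENTS_PER_SIDE = 5_000
dark_matter_elements = ELEMENTS_PER_SIDE ** 3
gas_elements = ELEMENTS_PER_SIDE ** 3
total_elements = dark_matter_elements + gas_elements

RUN_DAYS = 38
TOTAL_CORES = 46_000  # quoted core count; assumes the full machine was used
max_core_hours = RUN_DAYS * 24 * TOTAL_CORES

print(f"{total_elements:,} resolution elements")
print(f"at most {max_core_hours:,} core-hours")
```

That is 250 billion resolution elements, and a ceiling of roughly 42 million core-hours for the run.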

Dr. Fattahi's team uses these giant models to then zoom in to work at a comparatively smaller scale, operating at ‘simply’ the galactic level. By choosing a smaller chunk of space, she can focus the computational power while keeping the rest of the universe at a lower resolution.

“I study the low mass, the very small dark matter clumps, and dark matter halos,” she said. “It turns out that small galaxies have a lot of dark matter, they are the most dark matter dense dark galaxies in the universe. The question that drives my research is what we can learn from the small-scale structures about the nature of dark matter, which is a fundamental question in physics.”

Even in these smaller simulations, the scale is still immense. Astronomers use solar mass as a unit of measurement, with one solar mass equal to that of our sun. “The target resolution is about 10⁴ solar mass,” Dr. Fattahi said. “FLAMINGO has a resolution of 10⁸.”

Again, this simulation would not have worked without the high RAM, Dr. Fattahi argued. “If we go over too many connections, the lines become quite slow, so we have to fit these simulations in as small a number of nodes as possible,” she said. “The 1TB per node allowed us to fit it into a couple of nodes, and then we could run many of them in parallel. That’s where the power of Cosma-8 lies.”

The hope is to make it more powerful if and when more money comes in. The exact roadmap to that new funding is not known - when we visited the facility the UK government was in turmoil, and as this article goes to print it is in a different turmoil - but Dr. Basden is confident that it is on its way.

Cosma-8 was funded under DiRAC II, with the scientific community building a case for III. "We put it to the government," Dr. Basden said. "They said 'fantastic, but there's no funding.'

"Year after year, we're waiting for this money. Finally, at the end of 2020 they said ‘you can have some of the DiRAC III funding, but not all of it.’ We're still waiting for the rest, hopefully it will come this year, maybe next.”

When the money comes, and how much they get, will define phase two of the system. Lustre storage will likely be doubled, and it will probably use AMD Milan processors.

"Depending on timescales, we might get some [AMD] Genoa CPUs, where we think we could go up to 6TB of RAM per node," Dr. Basden said. "And we have use cases for that."

When that happens, the data center will be set for a reshuffle. The data hall currently holds Cosma-6, -7, and -8 (with -5 in an adjacent room), but is at capacity.

"Cosma-6 will be retired. Its hardware dates from 2012 and it came to us secondhand in 2016."

Each system draws around 200kW for compute on a standard day, with around 10 percent more for the cooling demands.

"Sometimes there can be heavy workloads and it reaches about 900kW; our total feed to the room is 1MW. So we're getting close to where we wouldn't want to have much more kit without retiring stuff. Yesterday we saw a 90 percent load."

That day, the hottest on record in the country, taxed data centers across the UK. It brought down Google and Oracle facilities, but the Cosma supercomputers ticked on unperturbed, Dr. Basden said proudly.

"We survived the hottest temperature," he said. "Most of the time we use free air cooling, but days like that we use an active chiller. That means that most of the year we have a PUE (power usage effectiveness) which is about 1.1, which is pretty good, and then it can get up to around 1.4."
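PUE is the ratio of total facility power to the power drawn by the IT equipment itself, so the two values quoted translate directly into overhead. The sketch below applies them to the roughly 200kW of compute cited per system; the function name is illustrative, and treating that 200kW as the IT load is an assumption.

```python
def facility_power_kw(it_load_kw: float, pue: float) -> float:
    """Total facility power implied by an IT load and a PUE value."""
    return it_load_kw * pue

IT_LOAD_KW = 200.0  # assumption: "around 200kW for compute" is the IT load

for pue in (1.1, 1.4):
    total = facility_power_kw(IT_LOAD_KW, pue)
    overhead = total - IT_LOAD_KW
    print(f"PUE {pue}: total {total:.0f} kW, overhead {overhead:.0f} kW")
```

For a 200kW system, moving from free air cooling to the active chiller raises the non-compute overhead from about 20kW to about 80kW.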

It has not always been easy, however, he admitted. "For the last year or so, the generator wasn't kicking in, so if the grid had gone down we only had an hour on the UPS. Fortunately, that didn't happen."

The generator is now fixed, but there still exists another risk: "The chillers are not on UPS, so if the power dies the UPS will take over compute and the chillers will have to be brought back to life by the generator," he said.

"That doesn't always happen. Once I was sitting in the data hall and the chillers went down when we were testing stuff," he recalled. "I was like a frog in boiling water. I was just sitting there getting a bit warmer and warmer. It got quite warm in here, and I was like ‘guys what's happening?’ A circuit breaker had tripped."

Part of that lack of perfect redundancy is down to requirements: unlike a commercial provider that cannot go down, research supercomputers can be more lenient with downtime (Cosma, for example, has three multi-day maintenance outages a year), so universities are better off spending the money on more compute than on more redundancy.

Another issue is the location: The data halls are within the Institute for Computational Cosmology, a larger building built for students and researchers.

In the future, as they plan to move into the next class of computing, the exascale, they will have to look elsewhere. “We are going to need to build a new data center for that,” he said. “I think we are looking at a 10-15MW facility, which by the time we get to exascale is achievable.”

The only official exascale system currently is the US Frontier supercomputer, which has a peak power consumption of 40MW (but is usually closer to 20MW). However, by the time the UK government funds such a system, advancements will likely have brought that power load down.

By then, we might also have a better understanding of how the universe works, with scientists around the world now turning to the simulations built at Cosma-8 to help unpick the complexities of our cosmos.

Simulating the flamingo universe, and other challenges at trillion-particle scales
