In the past few years, supercomputing has gone fully mainstream, presenting new opportunities for data-driven businesses across numerous sectors. This does not mean, however, that supercomputing giants have sold out – in fact, far from it.
Today the world’s most powerful and cutting-edge computing machines are still involved in hard science. From experiments in physics at the Large Hadron Collider to rapid advances in computational biology that have fast-tracked the development of multiple Covid-19 vaccines, each new discovery requires more power and ever more advanced hardware to deliver transformative outcomes.
However, lately a new class of supercomputer user has emerged - those that acknowledge supercomputing as a valuable tool for business. The chief driver for this is artificial intelligence (AI), which is becoming instrumental within business and digital transformation.
There are plenty of industries looking to take advantage of the new generation of AI and machine learning applications. They range from financial institutions that need to support quantitative data analysis to logistics companies that hope to automate supply chains with the help of IoT and computer vision.
Analyst firm Omdia estimates there are at least 200 unique business cases for machine learning, many of which have nothing in common besides requirements for specific software and specialized hardware. If you’re buying AI functionality off the shelf, then that is rarely a problem. However, if you’re trying to build something new and original with AI, the infrastructure requirements can be daunting.
Nor is the challenge limited to the compute required to train a new machine-learning model. Without constant re-training and new data, a model can quickly lose accuracy, and the resulting product or service soon loses its usefulness. Machine learning for business is, therefore, a long game. Hence the need for a new class of AI-focused supercomputers, and data centres that can support them.
Not just any supercomputer
Supercomputers are essentially lots of powerful computers and pieces of infrastructure equipment networked together, which collectively focus on finding the solution to a complex task. To make them more practical and quicker to deploy, in recent years the elements have been designed to ‘click together’ in standardized, scalable units that include GPU-intensive infrastructure, power and cooling, all linked with fibre.
Years ago, it might have taken two to three years to design, build and deploy a large-scale supercomputer - and such timeframes used to be acceptable. However, neither business, academia nor life science research can live with multi-year wait times any longer. Hardware providers have taken notice, and today the same supercomputing infrastructure can be deployed in as little as 20 weeks.
What’s interesting is that building a supercomputer in 20 weeks is not a marketing boast. In fact it’s fast become a reality through Nvidia’s Cambridge-1, which is the UK’s most powerful supercomputer. This machine is dedicated to a very pertinent field of research – healthcare – with GlaxoSmithKline, AstraZeneca, Guy’s and St Thomas’ NHS Foundation Trust, King’s College London, and Oxford Nanopore among its founding partners.
Indeed, Nvidia has been spearheading the new wave of enterprise-friendly high performance computing (HPC) infrastructure with its SuperPOD: a reference architecture for DGX A100 System servers. Each DGX A100 is essentially a pre-integrated supercomputing building block that occupies six rack units. Add some networking switches and storage, and you’ve got yourself a small and very capable supercomputer. Add a few more nodes, and your supercomputer will be a lot faster. Add 20 nodes, and you’ll get yourself a whole Scalable Unit (SU), but add 80 nodes or more, and you have something the size of Cambridge-1.
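The building-block arithmetic above can be sketched in a few lines of Python. This is only an illustration: the two constants come straight from the figures quoted in this article (20 nodes per Scalable Unit, six rack units per DGX A100), while the function name `size_cluster` and its output fields are hypothetical. The rack-unit total covers the DGX servers alone and assumes nothing about the space needed for switches, storage, power or cooling.

```python
import math

# Figures taken from the article text, not from a vendor datasheet.
NODES_PER_SU = 20        # 20 DGX A100 nodes make one Scalable Unit (SU)
RACK_UNITS_PER_NODE = 6  # each DGX A100 occupies six rack units


def size_cluster(nodes: int) -> dict:
    """Estimate SU count and server rack units for a given node count.

    Hypothetical helper: counts only the DGX servers themselves and
    ignores networking switches, storage, power and cooling equipment.
    """
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    return {
        "nodes": nodes,
        # A partly filled Scalable Unit is rounded up to a whole one.
        "scalable_units": math.ceil(nodes / NODES_PER_SU),
        "server_rack_units": nodes * RACK_UNITS_PER_NODE,
    }


if __name__ == "__main__":
    # One SU, then a system roughly the size of Cambridge-1.
    for node_count in (20, 80):
        print(size_cluster(node_count))
```

On these assumptions, 20 nodes come out as one Scalable Unit and 80 nodes as four, which is why scaling up is largely a matter of repeating the same unit.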
Click and play – Supercomputing Lego
The Lego analogy is something that’s long been discussed in regard to supercomputing and data centres. In a recent interview about the evolution of Nvidia’s SuperPOD, the subject was again illuminated by Marc Hamilton, vice president of Solutions Architecture and Engineering at Nvidia.
“Along the way, we realized that not everyone needs, or can afford, to start off with 560 DGX systems. And so, we wanted to have a reasonable building block,” Hamilton explained. “In HPC parlance, a building block is often called a scalable unit. It’s a cookie-cutter design, where you want these Lego blocks that are a collection of servers, which can be connected together and can scale up.”
According to Hamilton, just 20 servers are enough to create an HPC system that will get you on the list of the world’s 500 fastest supercomputers. Yet what’s interesting is that data centre manufacturers have been embracing a similar Lego-style approach through modular systems for some time. One might argue that Nvidia has done what it does best: taken an existing idea and transformed it into yet another industry-leading solution that offers greater performance capabilities than its peers.
Interestingly, this Lego-style approach is not limited to hardware, but is now observed at all levels of the infrastructure stack. Today it is being adopted at the data centre level via OCP-Ready systems and OCP Accepted hardware; at supercomputing level via Nvidia’s DGX SuperPOD; on desktops via the DGX Station A100; and at the application level through Linux-based application containers.
In recent years, containerized applications, managed through open-source tools like Kubernetes, have also taken the world by storm. These applications follow the same modular approach, with full-feature services built out of scalable, entirely self-contained, interchangeable ‘microservices’, which are yet another type of building block to consider.
Building block standards
The Lego concept is not solely focused on the servers, but also on the infrastructure environment in which they’re hosted. There is, in fact, a whole movement towards standardization, providing uniform building blocks across the infrastructure layer that drive performance, reduce complexity, and increase speed of installation. The record-breaking deployment of Cambridge-1, for example, wouldn’t have been possible without greater standardization across both the operating environment and infrastructure.
Standards remain an essential part of the data centre industry, but few other initiatives have been as radical as the Open Compute Project (OCP). Data centres that carry the OCP-Ready certification have been fine-tuned to deliver technical excellence for industrial-scale intensive computing environments, and are ready for anything from typical enterprise workloads to HPC-ready equipment, with power densities reaching up to 100kW.
Today supercomputing has it all: standards are in place, and frameworks and libraries are now enterprise-ready and supported by a broad ecosystem of hardware and software developers. The key to transformative supercomputing, however, remains designing systems that are resilient, powerful and quick to deploy, along with the user applications and data that they host.
Supercomputing power doesn’t have to be complicated to deploy, and the faster we can build and install it, the sooner entire industries can benefit from the immense power and resolutions it can provide.