Scientific progress is inherently unpredictable, tied to sudden bursts of inspiration, unlikely collaborations, and random accidents. But as our tools have improved, the ability to create, invent and innovate has improved with them.

The birth of the computing age gave scientists access to their greatest tool yet, with its most powerful variant, the supercomputer, helping to unlock myriad mysteries and change the face of the modern world.

"High performance computers are paramount towards us driving the discovery of science," Paul Dabbar, Under Secretary for Science at the US Department of Energy, told DCD. Key to recent discovery and research efforts has been the ability to run vast simulations, complex models of aspects of the real world, to test theories and experiment upon them

This feature appeared in the April issue of DCD Magazine.

Beyond simulation

Frederick H. Streitz – Sebastian Moss

“For the last several decades, the work that we've been doing inside Lawrence Livermore National Lab has been exploiting the relationship between simulation and experiments to build what we call predictive codes,” Frederick H. Streitz, LLNL’s chief computational scientist and director of its High Performance Computing Innovation Center, said.

“We think we know how to do research in the physics space, that is, we write down the equations, solve them, and work closely with experimentalists to validate the data that goes into the equations, and eventually build a framework that allows us to run a simulation that allows us to believe the result. That’s actually fundamental to us - to run a simulation that we believe.”

Now, however, a new tool has reached maturity, one that may yet further broaden the horizons of scientific discovery.

“So on top of experiments and simulation, we're adding a third component to the way we look at our life, and that is with machine learning and data analytics,” Streitz told DCD.

“Why is that important? It's because if you look at what we do with experiments, it is to query nature to ask it what it is doing. With a simulation, we query our understanding, we query a model of nature, and ask what that’s doing. And those don't often agree, so we have to go back and forth.”

But with machine learning, Streitz explained, it “is actually a completely different way of looking at your reality. It's neither querying nature, nor is it querying your model, it's actually just querying the data - which could have come from experiments or simulation - it’s independent of the other two. It’s really an independent view into reality.”

That, he added, “is actually a profound impact on how you approach science - it approaches predictability in places where you didn't have exact predictability.”

The desire for researchers to be able to use these tools, Streitz told DCD, is “driving changes in computing architecture,” while equally changes to these architectures are “driving this work. I would say it's a little bit of both.”

It’s a view shared by many in the high performance computing community, including the CEO of GPU maker Nvidia. “The HPC industry is fundamentally changing,” Jensen Huang said. “It started out in scientific computing, and its purpose in life was to simulate from first principle laws of physics - Maxwell's equations, Schrödinger's equations, Einstein's equations, Newton's equations - to derive knowledge, to predict outcomes.

“The future is going to continue to do that,” he said. “But we have a new tool, machine learning. And machine learning has two different ways of approaching this, one of them requires domain experts to engineer the features, another one uses convolutional neural network layers at its lowest level, inferring learning on what the critical features are, by itself.”

It has begun

Jensen Huang in front of a simulated supernova – Sebastian Moss

Already, the top supercomputers are designed with this in mind - the current reigning US champions, Summit and Sierra, are packed with Nvidia Volta GPUs to handle intense machine learning workloads. “The original Kepler GPU architecture [introduced in 2012] was designed for HPC and not AI - that was what was originally used to do the first AI work,” Ian Buck, Nvidia’s VP of accelerated computing and head of data centers, told DCD.

“We have had to innovate on the underlying architecture of the hardware platforms and software to improve both HPC and AI,” he said. That has benefited the wider computing community, as have the other innovations in the pre-exascale supercomputers.

“The good news is, these instruments are not one off, bespoke things, they're things that can be replicated or purchased or built at smaller scales and still be extremely productive to research science institutions, and the industry.”

Even now, scientists are taking advantage of the convergence of AI and HPC, with Streitz among them. His team, in collaboration with the National Institutes of Health, is trying to tackle one of the cruelest, most intractable problems faced by our species - cancer.

There are several projects underway to cure, understand, or otherwise ameliorate the symptoms of different cancers - three of them within the DOE specifically use machine learning, alongside a broader machine learning cancer research program known as CANDLE (CANcer Distributed Learning Environment).

"In this case, the DOE and [NIH's] National Cancer Institute are looking at the behavior of Ras proteins on a lipid membrane - the Ras oncogenic gene is responsible for almost half of colorectal cancer, a third of lung cancers.”

Found on your cell membranes, the Ras protein is what “begins a signalling cascade that eventually tells some cell in your body to divide,” Streitz said. “So when you're going to grow a new skin cell, or hair is going to grow, this protein takes a signal and says, ‘Okay, go ahead and grow another cell.’”

In normal life, that activity is triggered, and the signal is sent just once. But when there’s a genetic mutation, the signal gets stuck. “And now it says, grow, grow, grow, grow, again, just keep growing. And these are the very, very fast growing cancers like pancreatic cancer, for which there's currently no cure, but it's fundamentally a failure in your growth mechanism.”

This is something scientists have known for nearly 30 years. “However, despite an enormous amount of time and effort and money that has been spent to try to develop a therapeutic cure for that, there's no known way to stop this,” Streitz said.

The mutation is a subtle one, with all existing ways of stopping it also stopping other proteins from doing their necessary functions. "The good news is that you cure the cancer, the bad news, you actually kill the patient."

Hitting supercomputers' limits

Lab experiments have yielded some insights, but the process is limited. Simulation has also proved useful, but - even with the vast power of Summit, Sierra and the systems to come - we simply do not have the computing power necessary to simulate everything at the molecular scale.

"Well that's what we're going to be using machine learning for: To train a reduced order model, and then jump to a finer scale simulation when required. But we want to do that automatically, because we want to do this thousands and thousands and thousands of times."

This was the first full-scale workload run on Sierra when it was unveiled last year - running on the whole machine, across more than 8,000 IBM Power9 processors and more than 17,000 Volta GPUs.

The team simulates a large area at a coarser scale, and then uses machine learning to hunt for anomalies or interesting developments, splitting the simulated area into patches. “I can take every patch in the simulation, there could be a million of them. And I could literally put them in rank order from most interesting to least interesting.”

Then they take the top hundred or so most interesting patches and spawn a fine-scale simulation of each. Then they do it again and again - on Sierra, they ran 14,000 simulations simultaneously, gathering statistics on what is happening at the finer scale.
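In rough outline, the loop Streitz describes might look like the Python sketch below. It is a hedged illustration, not LLNL's actual workflow code: the scoring model, the patch format and the run_fine_scale function are hypothetical stand-ins for the machinery the lab runs at scale.

```python
import numpy as np

def select_interesting_patches(patches, score_model, top_k=100):
    """Rank every coarse-scale patch by an ML 'interestingness' score, keep the top_k."""
    scores = np.array([score_model.predict(p.reshape(1, -1))[0] for p in patches])
    ranking = np.argsort(scores)[::-1]        # most interesting first
    return ranking[:top_k]

def multiscale_step(patches, score_model, run_fine_scale, top_k=100):
    """One iteration of the workflow: pick patches, spawn fine-scale runs, gather results."""
    chosen = select_interesting_patches(patches, score_model, top_k)
    # On a machine like Sierra these fine-scale simulations run in parallel by the thousands
    fine_results = [run_fine_scale(patches[i]) for i in chosen]
    return chosen, fine_results
```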

Already, this has led to discoveries that “would not have been obvious except for doing simulations at the scale that we were able to do,” Streitz said, adding that he expects to learn much more.

Similar approaches are being used elsewhere, Intel’s senior director of software ecosystem development, Joe Curley, said: “The largest computers in the world today can only run climate models to about a 400km view. But what you really want to know is what happens as you start getting in closer, what does the world look like as you start to zoom in on it?

“Today, we can't build a computer big enough to do that, at that level,” he said. But again, researchers “can take the data that comes from the simulation and, in real time, we can then go back and try to do machine learning on that data and zoom in and get an actual view of what it looks like at 25km. So we have a hybrid model that combines numerical simulation methods with deep learning to get a little bit of greater insight out of the same type of machine.”
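As a rough, hedged sketch of that hybrid idea - not Intel's or any lab's production climate code - the learned step can be framed as a super-resolution problem: a small network maps a coarse simulated field toward a finer grid. The architecture, grid sizes and 16x factor (roughly 400km down to 25km) below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Downscaler(nn.Module):
    """Toy model that refines a coarse climate field onto a finer grid."""
    def __init__(self, upscale=16):   # ~400km cells -> ~25km cells
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Upsample(scale_factor=upscale, mode="bilinear", align_corners=False),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),   # learn corrections on the fine grid
        )

    def forward(self, coarse_field):
        return self.net(coarse_field)

coarse = torch.randn(1, 1, 45, 90)       # one global field on a coarse grid
fine_estimate = Downscaler()(coarse)     # shape (1, 1, 720, 1440) after 16x upsampling
```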

This has helped guide the design of the supercomputers of tomorrow - including Aurora, America’s first exascale supercomputer, set for 2021.

“The three things that we are very, very excited about is that Aurora will accelerate the convergence of traditional HPC, data analytics and AI,” Rajeeb Hazra, corporate VP and GM of the Enterprise and Government Group at Intel, the lead contractor on the $500m system, said.

“We think of simulation data and machine learning as the targets for such a system,” Rick Stevens, associate laboratory director for computing, environment and life sciences at Argonne National Laboratory, told DCD.

“This platform is designed to tackle the largest AI training and inference problems that we know about. And, as part of the Exascale Computing Project, there's a new effort around exascale machine learning and that activity is feeding into the requirements for Aurora.”

Exascale meets machine learning

Department of Energy – Sebastian Moss

That effort is ExaLearn, led by Francis J. Alexander, deputy director of the Computational Science Initiative at Brookhaven National Laboratory.

"We're looking at both machine learning algorithms that themselves require exascale resources, and/or where the generation of the data needed to train the learning algorithm is exascale," Alexander told DCD. In addition to Brookhaven, the team brings together experts from Argonne, LLNL, Lawrence Berkeley, Los Alamos, Oak Ridge, Pacific Northwest and Sandia in a formidable co-design partnership.

LLNL’s informatics group leader, and project lead for the Livermore Big Artificial Neural Network (LBANN) open-source deep learning toolkit, Brian Van Essen, added: “We're focusing on a class of machine learning problems that are relevant to the Department of Energy's needs… we have a number of particular types of machine learning methods that we're developing that I think are not being focused on in industry.

“Using machine learning, for example, for the development of surrogate models to simplify computation, using machine learning to develop controllers for experiments very relevant to the Department of Energy.”
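The surrogate-model idea can be sketched in a few lines of Python: run the expensive code a limited number of times, fit a cheap learned approximation to the resulting input-output pairs, and query that instead. The toy "simulation" and the choice of regressor below are assumptions for illustration, not any DOE code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def expensive_simulation(params):
    """Placeholder for a costly physics code - here just a cheap nonlinear toy function."""
    return np.sin(params[0]) * np.exp(-params[1] ** 2)

# Run the 'real' simulation a limited number of times to build a training set
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(500, 2))
y = np.array([expensive_simulation(p) for p in X])

# The surrogate learns the input-output map and can then answer many queries cheaply
surrogate = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print(surrogate.predict([[0.5, 0.1]]))
```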

The experiments Van Essen mentions include hugely ambitious research efforts in manufacturing, healthcare and energy. Some of the most data-intensive tests are held at the National Ignition Facility, a large inertial confinement fusion research device at LLNL that uses powerful lasers to heat and compress a small amount of hydrogen fuel, with the goal of inducing nuclear fusion reactions in support of nuclear weapons research.

“So it's not like - and I'm not saying it's not a challenging problem - but it's not like recommending the next movie you should see, some of these things have very serious consequences,” Alexander said. “So if you're wrong, that's an issue.”

Van Essen concurred, adding that the machine learning demands of their systems also require far more computing power: “If you're a Google or an Amazon or Netflix you can train good models that you then use for inference, billions of times. Facebook doesn’t have to develop a new model for every user to classify the images that they're uploading - they use a well-trained model and they deploy it.”

Despite the enormous amount of time and money Silicon Valley giants pump into AI, and their reputation for embracing the technology, they mainly exist in an inference-dominated environment - simply using models that have already been trained.

“We're continuously developing new models,” Van Essen said. “We're primarily in a training dominated regime for machine learning… we are typically developing these models in a world where we have a massive amount of data, but a paucity of labels, and an inability to label the datasets at scale because it typically requires a domain expert to be able to interpret what you're looking at.”

Working closely with experimenters and subject experts, ExaLearn is “looking at combinations of unsupervised and semi-supervised and self-supervised learning techniques - we're pushing really hard on generative models as well,” Van Essen said.

Take inertial confinement fusion research: “We have a small handful of tens to maybe a hundred experiments. And you want to couple the learning of these models across this whole range of different fidelity models using things like transfer learning. Those are techniques that we're developing in the labs and applying to new problems through ExaLearn. It's really the goal here.”
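In very rough outline, that coupling of fidelities can look like the transfer-learning sketch below: pre-train a model on abundant low-fidelity simulation data, then fine-tune it gently on the handful of real experiments. The network, random data and hyperparameters are placeholders assumed for illustration, not anything used on NIF data.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()

def train(model, inputs, targets, lr, epochs):
    """Simple full-batch training loop."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        opt.step()

# Stage 1: pre-train on plentiful, cheap low-fidelity simulation data
sim_x, sim_y = torch.randn(10000, 8), torch.randn(10000, 1)
train(model, sim_x, sim_y, lr=1e-3, epochs=50)

# Stage 2: fine-tune on the "handful of tens to maybe a hundred" experiments,
# using a smaller learning rate so the pre-trained weights are only nudged
exp_x, exp_y = torch.randn(80, 8), torch.randn(80, 1)
train(model, exp_x, exp_y, lr=1e-4, epochs=200)
```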

From this feature, and the many other ‘AI is the future’ stories in the press, it may be tempting to believe that the technology will eventually solve everything and anything. “Without an understanding of what these algorithms actually do, it's very easy to believe that it's magic. It's easy to think you can get away with just letting the data do everything for you,” Alexander said. “My caution has always been that for really complex systems, that's probably not the case.”

“There's a lot of good engineering work and good scientific exploration that has to go into taking the output of a neural network training algorithm and actually digging through to see what it is telling you and what can you interpret from that,” Van Essen agreed.

Indeed, interpretability and reproducibility remain a concern for machine learning in science, and an area of active research for ExaLearn.

One of the approaches the group is studying is to intentionally not hard-code first principles into the system and have it "learn the physics without having to teach it explicitly," Van Essen said. “Creating generalized learning approaches that, when you test them after the fact, have obeyed the constraints that you already know, is an open problem that we're exploring.”
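One simple way to picture that after-the-fact check, as a hedged sketch rather than ExaLearn's actual tooling: once a model has been trained without any physics built in, audit its predictions against an invariant you already trust, such as conservation of some total quantity.

```python
import numpy as np

def total_quantity(state):
    """A known invariant of the system - e.g. total mass or energy of a field."""
    return float(np.sum(state))

def conservation_violations(model_predict, initial_states, rel_tol=1e-3):
    """Return the relative drift for every prediction that breaks the known constraint."""
    violations = []
    for state in initial_states:
        predicted_next = model_predict(state)
        before, after = total_quantity(state), total_quantity(predicted_next)
        drift = abs(after - before) / max(abs(before), 1e-12)
        if drift > rel_tol:
            violations.append(drift)
    return violations
```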

This gets harder when you consider experiments that are at the cutting edge of what is known, where the reference points one can test the system’s findings against become ever fewer. “If you develop some sort of machine learning-informed surrogate model, how applicable can that be when you get to the edges of the space that you know about?” Los Alamos machine learning research scientist Jamal Mohd-Yusof asked. “Without interpretability that becomes very dangerous.”

Even with the power of exascale systems, and the advantages of machine learning, we’re also pushing up against the edges of what these systems are capable of.

“We can't keep all the data we can generate during exascale simulation necessarily,” Mohd-Yusof said. “So this also may require you to put the machine learning in the loop, live as it were, in the simulation - but you may not have enough data saved.

“So it requires you to design computational experiments in such a way that you can extract the data on the fly.”
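In outline, putting the learning "in the loop" can be as simple as the sketch below: the model is updated from each step of the simulation as it happens, and the raw trajectory is never written to disk. The toy time-stepper and the choice of an incrementally trainable model are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

def advance_simulation(state, step):
    """Placeholder time-stepper standing in for the real large-scale simulation."""
    return state + 0.01 * np.sin(step + np.arange(state.size))

model = SGDRegressor()
state = np.zeros(128)

for step in range(10_000):
    new_state = advance_simulation(state, step)
    # Learn from this transition immediately, then discard it -
    # the full trajectory is never stored
    model.partial_fit(state.reshape(1, -1), [float(new_state.mean())])
    state = new_state
```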

That also raises a deeper question, Van Essen said: “If you can't save all the data and you're training these models, that does actually imply that sometimes these models become the data analytic product output from your simulation.” Instead of being able to learn everything from the simulation's output data, your insights end up buried in the model itself.

If you have “two trained models from two different experimental campaigns or scientific simulation campaigns, how do you merge what they've learned if the model is your summary of it?”

These questions, and so many more, remain unanswered. As with all discovery, it is hard to know when we will have answers to some of these problems - but, for Streitz, the future is already clear.

“This notion of this workflow - using predictive simulation at multiple scales, and using machine learning to actually guide the decisions you're making with your simulations, and then going back and forth - this whole workflow, we believe that's the future of computing,” he said.

“It is certainly the future of scientific computing.”