Nvidia has set new MLPerf performance benchmarking records on its H200 Tensor Core GPU and TensorRT-LLM software.

MLPerf Inference is a benchmarking suite that measures inference performance across deep-learning use cases.

The latest version of the benchmarking suite – MLPerf v4 – adds two new workloads representing generative AI use cases: a large language model (LLM) benchmark based on Meta’s Llama 2 70B, and a text-to-image test based on Stable Diffusion XL.

Nvidia H200 Tensor Core GPU – Nvidia

Nvidia has set performance records on both new workloads, providing the highest performance across all MLPerf Inference workloads in the data center category.

The company’s TensorRT-LLM is an open-source software library developed to double LLM inference speed on its H100 GPUs. In the MLPerf v4 GPT-J test, H100 GPUs running TensorRT-LLM achieved speedups of 2.4x and 2.9x in the offline and server scenarios, respectively, compared with the performance the same GPUs delivered six months earlier in the v3.1 round.
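(In MLPerf terms, the offline scenario measures raw throughput with all samples available up front, while the server scenario issues queries at random intervals under a latency constraint.)

For a sense of what using the library looks like, TensorRT-LLM exposes a high-level Python API. The sketch below is illustrative only: it assumes the tensorrt_llm package’s LLM and SamplingParams entry points are available, and the model name and prompt are placeholders, not Nvidia’s benchmark configuration.

    # Minimal TensorRT-LLM sketch (assumes the high-level LLM API in the
    # tensorrt_llm package; model name and prompt are illustrative).
    from tensorrt_llm import LLM, SamplingParams

    # Compiles (or loads) a TensorRT engine for the model, then generates.
    llm = LLM(model="meta-llama/Llama-2-7b-hf")
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    for output in llm.generate(["MLPerf Inference measures"], params):
        print(output.outputs[0].text)

The speedups Nvidia reports come from engine-level optimizations in the library, such as in-flight batching and quantized kernels, rather than from changes to the models themselves.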

For the MLPerf Llama 2 70B benchmarking test, Nvidia’s TensorRT-LLM running on the company’s H200 GPUs delivered up to 43 percent and 45 percent higher performance than the H100 in the server and offline scenarios, respectively, when the GPUs were configured to a 1,000W TDP.

The new benchmark uses the largest version of Llama 2, which has 70 billion parameters and is more than ten times the size of the 6-billion-parameter GPT-J model used in previous benchmarking rounds.

In the Stable Diffusion XL text-to-image benchmarking test, an eight-GPU Nvidia HGX H200 system, with each GPU configured to a 700W TDP, achieved 13.8 queries/second in the server scenario and 13.7 samples/second in the offline scenario.

When the same test was run on a system containing eight Nvidia L40S GPUs, it delivered 4.9 queries/second and 5 samples/second in the server and offline scenarios, respectively.
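That puts the H200 system at roughly 2.8 times the L40S system’s server-scenario throughput (13.8 ÷ 4.9 ≈ 2.8) and about 2.7 times its offline throughput (13.7 ÷ 5 ≈ 2.7).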

Nvidia said that this was the best performance achieved by any hardware solution during the Stable Diffusion XL test.

Speaking ahead of the results being made public, Dave Salvator, director of product marketing with the Nvidia accelerated computing group, said that inference has become a big part of Nvidia’s data center activities and business, noting that in 2023 it made up about 40 percent of the company’s data center revenue.

However, he noted that it is not just in the two new MLPerf tests that Nvidia has posted record performance outcomes.

“[Nvidia] submits on every workload because not only is it important to be able to deliver great performance on a single workload, it’s important to deliver great performance across as many workloads as you can,” he said. “[In the nine MLPerf tests] we continue to deliver leading results across all of those workloads, in all of those use cases.

“This is something that's really important because what it means is not only is our platform very, very performant, it's also very versatile, and that's something our customers really appreciate and value,” Salvator said.