I recently wrapped up filming the Supermicro Open Storage Summit 2024 (airing from August 13 through August 29) and I had some time to reflect on the changes going on in the storage industry.
With 18 companies participating, Supermicro had all the leading processor and GPU providers (Nvidia, AMD, and Intel), the key storage media companies (Kioxia, Western Digital, Micron, Solidigm, Seagate) and some of the leading software-defined storage companies (WEKA, VAST Data, DDN, Nutanix, Quantum, OSNexus, Cloudian, Graid, and MemVerge) speaking. Here are the key themes that I observed.
First, every speaker in every session discussed AI, either as a critical workload and application or in terms of how it is changing the infrastructure that supports these workloads. This was despite my initial plan for the summit, which was to focus each session on one of the key traditional storage workloads, including HCI and enterprise, HPC, media and entertainment, and others. I had originally thought that we would cover AI workloads in a separate AI storage session. It’s clear now that AI has permeated every workload and is a feature in many applications, no matter which industry is discussed.
Whether it is generative AI used for content creation in media and entertainment or traditional AI used for automatic content classification, tagging, and transcription in a media asset manager, AI is in the initial stages of transforming all enterprise workloads.
Second, Amdahl’s Law was on full display. The law describes how the overall performance gain depends on the fraction of time spent in the improved part of a computer system, and it illustrates how the bottleneck migrates to another area once one subsystem is optimized. Every part of the solution stack (system, processor, network, SSD, HDD, and software) strove not to be the performance bottleneck in a highly connected and complex environment. This spoke to the need for a system-level approach to validation testing and performance optimization, work that Supermicro, as the system solution provider, demonstrated across all these solutions.
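To make the Amdahl’s Law point concrete, here is a minimal Python sketch. The subsystem names and time shares are hypothetical numbers chosen for illustration, not figures from the summit, and the model assumes the phases run one after another with no overlap. It shows that speeding up storage tenfold yields well under a twofold overall gain, and that the bottleneck then moves to the network.

```python
def amdahl_speedup(fraction: float, factor: float) -> float:
    """Overall speedup when `fraction` of the runtime is accelerated by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


def bottleneck(times: dict) -> str:
    """Name of the subsystem that currently takes the most time."""
    return max(times, key=times.get)


def accelerate(times: dict, name: str, factor: float) -> dict:
    """Copy of `times` with one subsystem made `factor` times faster."""
    updated = dict(times)
    updated[name] = updated[name] / factor
    return updated


# Hypothetical share of total job time per subsystem (sums to 1.0).
# Assumption: the phases are serial and do not overlap.
shares = {"storage": 0.5, "network": 0.3, "compute": 0.2}

before = bottleneck(shares)                        # slowest part before tuning
times_after = accelerate(shares, "storage", 10.0)  # make storage 10x faster
after = bottleneck(times_after)                    # slowest part after tuning
speedup = amdahl_speedup(shares["storage"], 10.0)  # overall gain

print(f"bottleneck before: {before}, after: {after}")
print(f"overall speedup: {speedup:.2f}x")
```

With these assumed shares, the overall gain is limited by the half of the runtime that storage does not touch, which is why each layer of the stack has to be tuned and validated together.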
AI storage session
In the AI storage session, Cloudian’s Jon Toor described how object storage is not only used in the traditional data lake as the AI data repository but is now starting to be used directly for AI training when deployed on an all-flash infrastructure.
WEKA’s Shimon Ben-David discussed how small and large files and read and write patterns are often mixed unpredictably when multiple AI training workloads run simultaneously. This mix is handled through a high-performance parallel file system. Micron’s Steve Hanna talked about how these workloads require different classes of SSD and NAND technologies, including write-optimized TLC and read-optimized QLC for large-capacity SSDs.
Nvidia’s Rob Davis described how essential network performance is to overall system performance and how network speeds for both Ethernet and InfiniBand are moving to 800 Gbps. He described the new Nvidia GPU-initiated storage protocol, which fully bypasses the CPU in storage data transfers across the network and replaces the GPUDirect Storage protocol.
Nvidia’s BlueField-3 data processing unit (DPU) on both ends of the storage connection enables offload from the CPU and increased security of the storage stack through user-space isolation.
Supermicro’s William Li shared a proven and deployed reference architecture using all three partners’ technologies in a two-tiered storage implementation which was optimized for high-performance flash workloads as well as high-capacity data lakes.
Media and entertainment storage panel
In the media and entertainment storage panel, AMD’s Paul Blinzer kicked things off with how the “times, they are a-changing,” largely due to the impact of AI on content creation and workflows. AMD is addressing this challenge through a combination of traditional CPU performance improvements and high-performance GPU AI accelerators.
Sherry Lin discussed how Supermicro’s storage and compute platforms are meeting this need through the largest portfolio of server and storage products available from any manufacturer. Praveen Midha from Western Digital spoke about the AI data lifecycle and how different classes of SSDs are required in different phases, from content ingestion through to inference and content creation, to optimize for the read/write and capacity characteristics of each phase.
Finally, Quantum’s Skip Levens tied it together with a discussion of how Quantum’s software-defined storage and workflow products, including its high-performance file system, Myriad, and its capacity-optimized ActiveScale object store, are connected through the CatDV media asset manager. This allows producers that generate content continuously, such as sports teams, to automate the data management aspect of their workflow.
Hyper-converged infrastructure session
In the hyper-converged infrastructure (HCI) session, Nutanix’s Oscar Walhberg described how new hardware platforms from Supermicro, new processor capabilities from Intel, and high-performance, large-capacity SSDs from Western Digital have transformed HCI from what it was five years ago. These advances allow Nutanix to run any application that can be virtualized, including databases and AI inferencing through a recently announced “GPT-in-a-Box” offering.
Intel’s Christine McMonigal described how the HCI landscape was changing through mergers and acquisitions and how environmental sustainability was a key driver to more energy-efficient data center infrastructure provided by HCI solutions.
I described how Supermicro provides a wide range of storage and server platforms, which enables Nutanix and other workloads to be deployed with the exact requirements and SLAs needed, and how Supermicro tests and optimizes all of these technologies.
Cloud service provider and AI session
In the cloud service provider (CSP) and AI session, Supermicro’s Ben Lee started the discussion by talking about how eight-way GPU servers connected in large rack-level clusters are the basis of large-scale AI training. He also explained how liquid cooling using direct-to-chip technology is now essential both to handle the high thermal output of the latest GPUs and to remove heat efficiently from the data center, enabling the lowest possible data center PUE.
VAST Data’s Neeloy Bhattacharyya talked about how the new generation of AI-focused CSPs need to differentiate themselves with data services which will enable high margins, a key CSP metric. Sachin Hindupor highlighted AMD’s contributions to CSP solutions including lower watt/core performance metrics as well as higher core count and more PCIe Gen 5 lanes to enable large-scale multi-tenant CSP environments.
Bill Panos described how Solidigm’s 30.72TB QLC SSDs filled the need for cost-effective high-capacity flash storage while TLC products met the need for more transactional workloads.
Storage technologies sessions
We also had two sessions focused on storage technologies. The scale-up vs scale-out storage architecture session highlighted the two different architectural approaches to storage and described which scenarios were best suited to each.
Paul McLeod from Supermicro described the two different approaches and how Supermicro had the largest portfolio of storage server products for both, including a new high-availability, dual-port storage bridge bay (SBB)-type all-flash storage server optimized for scale-up storage. Steve Umbehocker talked about how OSNexus’s software-defined storage solution, QuantaStor, supports both scale-up and scale-out and how the Grid management platform unifies both types of QuantaStor instances.
Next, Iyer Venkatesan reviewed Intel’s approach to optimizing storage workloads in Intel Xeon processors using specialized CPU offload accelerators, including QuickAssist Technology (QAT), Data Streaming Accelerator (DSA), and Volume Management Device (VMD), which accelerate a variety of workloads from data compression and encryption to NVMe read and write operations and software-based RAID data protection.
In contrast, Graid Technology’s Tom Paquette described Graid’s GPU-based RAID, which is optimized for NVMe SSDs and runs outside of the SSD datapath. Jason Zimmerman reviewed Seagate’s NVMe SSDs and the recently launched Mozaic 3+ disk drives, which are enabling higher-capacity disk storage through the first increase in areal density in many years. Finally, Anders Graham described how Kioxia’s QLC flash products enable 60TB SSDs and how TLC SSDs are used for higher transaction rate data.
High-performance computing session
The high-performance computing (HPC) session emphasized that while HPC has been around for decades, the new AI workloads are also transforming it. As the historic originators of supercomputing systems, HPC experts bring decades of experience in operating large-scale clustered systems that need to overcome software and hardware faults while running long batch jobs.
Traditional simulation workloads for everything from chip design to aerospace and genomics are now incorporating AI and need to co-exist with generative AI workloads. These new workloads are met with new technologies including new GPUs, interconnects, and rack-level integrations. Nvidia’s CJ Newburn started the discussion with a preview of these new technologies including the recently announced Nvidia Blackwell superchip GPU, new high-speed NVLink GPU interconnects, and new storage protocols like GPU Initiated Storage.
Supermicro’s Randy Kreiser followed up with a photo of the first Supermicro NVLink Rack with the not-yet-released Nvidia Blackwell GPUs shipped to Nvidia for testing and development. Balaji Venkateshwaran described how DDN’s storage products met the HPC storage requirements of massive scale and multi-tenancy. Bill Panos from Solidigm concluded with how different types of SSDs and NAND were used to meet each phase of the AI lifecycle.
Compute Express Link session
The final session focused on Compute Express Link, or CXL, which provides a cache-coherent interface for attaching devices to CPUs. Anil Godbole from Intel started by describing Intel’s vision for CXL when it first invented the technology and how current and future Intel Xeon processors will support it. Micron’s Andrey Kudryavtsev showed several examples of using CXL for memory expansion, including an AI training case in which the training epoch was shortened by half because the AI model parameters could be held entirely in memory, beyond the capacity of direct-attached memory.
Finally, Steve Scargall from MemVerge described how the Linux kernel supports CXL and how MemVerge’s Memory Machine software enables use cases like memory pooling and tiering.
You can hear more from these speakers directly at the Supermicro Open Storage Summit. Register at Supermicro Open Storage Summit 2024 to listen for free.