NVIDIA made a number of big announcements at Computex Taipei 2023 today, most notably that its Grace Hopper superchips are now in full production. These chips are the core components of NVIDIA’s new DGX GH200 AI supercomputing platform and MGX systems, which are designed to handle massive generative AI workloads. NVIDIA also announced its new Spectrum-X Ethernet networking platform, optimized for AI servers and supercomputing clusters.
The Grace Hopper superchip is an Arm-based integrated CPU+GPU solution from NVIDIA that combines a 72-core Grace CPU, a Hopper GPU, 96GB of HBM3 and 512GB of LPDDR5X in a single package with a total of 200 billion transistors. This combination provides astounding data bandwidth between the CPU and GPU of up to 1 TB/s, a huge advantage for certain memory-constrained workloads.
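To make the benefit of that coherent CPU-GPU link concrete, here is a minimal CUDA sketch (my own hypothetical example, not NVIDIA's code; the kernel, array size and scaling factor are purely illustrative) of the usual pattern for such memory-constrained workloads: a single managed allocation that can grow beyond the GPU's 96GB of HBM3, with the overflow held in the Grace CPU's LPDDR5X and pages migrated over the CPU-GPU link as the kernel touches them.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: scales a large vector in place.
__global__ void scale(float *data, size_t n, float factor) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    // Hypothetical working set; on a GH200 a managed allocation larger than
    // the 96GB of HBM3 can spill into the Grace CPU's LPDDR5X, with pages
    // migrated over the coherent CPU-GPU link on demand.
    const size_t n = 1ULL << 33;  // ~8.6 billion floats (~32 GB); size is illustrative
    float *data = nullptr;

    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "managed allocation failed\n");
        return 1;
    }

    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;   // first touched by the CPU

    const int block = 256;
    const size_t grid = (n + block - 1) / block;
    scale<<<(unsigned)grid, block>>>(data, n, 2.0f); // GPU uses the same pointer
    cudaDeviceSynchronize();

    printf("data[0] = %.1f\n", data[0]);
    cudaFree(data);
    return 0;
}
```

The same pattern runs on any system that supports CUDA unified memory; the difference on Grace Hopper is that the page migrations and coherent accesses happen at NVLink-C2C speeds rather than over a comparatively narrow PCIe link.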
The DGX GH200 AI supercomputing platform is NVIDIA’s system and reference architecture aimed at the most demanding AI and high-performance computing workloads. The current DGX A100 system can only combine eight A100 GPUs into a single unit, and given the explosive growth of generative AI, NVIDIA’s customers urgently need larger, more powerful systems. The DGX GH200 is designed to deliver maximum throughput and scalability by using NVIDIA’s custom NVLink Switch chips to avoid the limitations of standard cluster interconnects such as InfiniBand and Ethernet.
Details of the DGX GH200 are still scarce, but NVIDIA has confirmed that it uses a new NVLink Switch System containing 36 NVLink switches to tie 256 GH200 Grace Hopper chips and 144TB of shared memory into a single unit, which NVIDIA CEO Jensen Huang described as “a giant GPU.” This is the first time NVIDIA has used an NVLink Switch topology to build an entire supercomputer cluster, and the company says it provides 10x the GPU-to-GPU and 7x the CPU-to-GPU bandwidth of its previous-generation systems. The interconnect is also said to be 5x more power-efficient than competing options and to deliver up to 128 TB/s of bisection bandwidth. The system contains 150 miles (about 240 kilometers) of optical fiber and weighs 40,000 pounds, yet presents itself as a single GPU. The 256 Grace Hopper superchips push the DGX GH200’s “AI performance” to one exaflop (a quintillion floating-point operations per second).
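The exaflop figure is easy to sanity-check with a back-of-envelope calculation (my own arithmetic, not NVIDIA’s published breakdown; it assumes the roughly 4 PFLOPS of sparse FP8 throughput NVIDIA quotes for a single Hopper GPU):

256 GPUs × ~4 PFLOPS (FP8, per Hopper GPU) ≈ 1,000 PFLOPS ≈ 1 exaFLOPS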
NVIDIA will provide the DGX GH200 reference blueprint to its major customers Google, Meta and Microsoft, and will also offer the system as a reference architecture for cloud service providers and hyperscale data centers. NVIDIA itself will deploy a new Helios supercomputer, consisting of four DGX GH200 systems, for its own R&D work. The four systems hold a total of 1,024 Grace Hopper chips and are connected with NVIDIA’s Quantum-2 InfiniBand 400 Gb/s networking.
NVIDIA DGX targets the highest-end systems and HGX targets hyperscale data centers; the new MGX systems fall in between, and all three lines will coexist. NVIDIA’s OEM partners face new challenges in designing servers for AI data centers, which can slow design and deployment. NVIDIA’s new MGX reference architecture is meant to accelerate that process and offers more than 100 reference designs.
The MGX designs are modular, covering NVIDIA’s CPUs, GPUs, DPUs and networking systems, but also including designs based on common x86 and Arm processors. NVIDIA offers both air- and liquid-cooled options to suit a variety of deployment scenarios. ASUS, GIGABYTE, ASRock Rack and Pegatron will all use the MGX reference architecture to develop systems that will arrive later this year and into early next year.
As for the new Spectrum-X networking platform, NVIDIA bills it as a “high-performance Ethernet for AI” networking platform. Spectrum-X pairs NVIDIA’s 51 Tb/s Spectrum-4 400 GbE Ethernet switch with its BlueField-3 DPU, along with software and SDKs that let developers tune the system to the unique needs of AI workloads.
Compared with other Ethernet-based systems, NVIDIA claims Spectrum-X is lossless, which provides better QoS and lower latency. It also features a new adaptive-routing technology that is particularly useful in multi-tenant environments.