
NVIDIA GH200 Superchip Delivers Breakthrough Energy Efficiency and Node Consolidation for Apache Spark

With the rapid growth of generative AI, CIOs and IT leaders are looking for ways to reclaim data center resources to accommodate new AI use cases that promise greater return on investment without impacting current operations. This is leading IT decision makers to reassess past infrastructure decisions and explore strategies to consolidate traditional workloads into fewer, more power-efficient nodes, freeing up data center power and space.

NVIDIA GH200 Grace Hopper Superchip is the first memory-converged CPU-GPU superchip designed from the ground up to meet the challenges of AI, high-performance computing, and data processing. By migrating Apache Spark workloads from CPU nodes to NVIDIA GH200, data centers and enterprises can accelerate query response times by up to 35x. For large Apache Spark clusters of 1,500+ nodes, this speedup translates to up to 22x fewer nodes and annual energy savings of up to 14 GWh.

This post explores the architectural innovations of NVIDIA GH200 for data processing, shares SQL benchmark results for GH200, and provides insights on seamlessly migrating Apache Spark workloads to this new platform.

Tackling legacy bottlenecks in CPU-based Apache Spark systems

Over the last decade, enterprises have grappled with the overwhelming volumes of business, consumer, and IoT data, which are increasingly pivotal for maintaining a competitive edge within industries. To address this challenge, many enterprises have turned to Apache Spark, a multi-language open-source system used for big data distributed processing.

Apache Spark began as a research project at the University of California, Berkeley, with the goal of addressing the limitations of previous big data frameworks. It achieved this by caching data in CPU memory, which significantly accelerated SQL queries. Today, tens of thousands of organizations rely on Apache Spark for diverse data processing tasks spanning a wide array of industries, including financial services, healthcare, manufacturing, and retail.

Despite its ability to alleviate the bottleneck of data access from slower hard disks and cloud-based object storage through memory caching, many Apache Spark data processing workflows still encounter constraints due to hardware limitations inherent in CPU architectures.

Pioneering a new era of converged CPU-GPU superchips

Recent advancements in storage and networking bandwidth, along with the end of Moore's law, have shifted analytics and query bottlenecks to the CPU. Meanwhile, GPUs have emerged as the preferred platform for deep learning workloads thanks to their vast number of processing cores and high-bandwidth memory, which excel at highly parallelized processing. Parallelizing Apache Spark workloads and running them on GPUs delivers order-of-magnitude speedups compared to CPUs.

Running Apache Spark workloads on GPUs previously required transferring data back and forth between the host CPU and GPU, traditionally bound by low-speed 128 GB/s PCIe interfaces. To overcome this challenge, NVIDIA developed NVIDIA Grace Hopper, a new class of superchips that bring together the Arm-based NVIDIA Grace CPU and NVIDIA Hopper GPU architectures using NVLink-C2C interconnect technology. NVLink-C2C delivers up to 900 GB/s of total bandwidth, 7x more than the standard PCIe Gen5 lanes found in traditional x86-based GPU-accelerated systems.
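As a quick arithmetic check on the 7x figure, dividing the quoted NVLink-C2C bandwidth by the bandwidth of a PCIe Gen5 x16 link:

```python
# Back-of-the-envelope check of the bandwidth comparison above.
# 900 GB/s is the quoted total NVLink-C2C bandwidth; 128 GB/s is the
# bidirectional bandwidth of a standard PCIe Gen5 x16 link.
nvlink_c2c_gbps = 900
pcie_gen5_x16_gbps = 128

speedup = nvlink_c2c_gbps / pcie_gen5_x16_gbps
print(f"NVLink-C2C vs. PCIe Gen5 x16: {speedup:.1f}x")  # ~7.0x
```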

Comparison of the legacy PCIe architecture, with separate CPU and GPU memory and a low-bandwidth PCIe connection, and the Grace Hopper architecture, with a single unified virtual memory pool and a fast NVLink-C2C connection.
Figure 1. NVIDIA Grace Hopper architecture overcomes PCIe bottlenecks

With GH200, the CPU and GPU share a single per-process page table, enabling all CPU and GPU threads to access all system-allocated memory, whether it resides in physical CPU or GPU memory. This architecture removes the need to copy memory back and forth between the CPU and GPU.

NVIDIA GH200 sets new highs in NDS performance benchmarks

To measure the performance and cost savings of running Apache Spark on GH200, we used the NVIDIA Decision Support (NDS) benchmark. NDS is derived from the widely used and adopted CPU-only data processing TPC-DS benchmark. NDS consists of the same SQL queries included in TPC-DS with modifications only to data generation and benchmark execution scripts. NDS is not TPC-DS and NDS results are not comparable to official, audited TPC-DS results—only to other NDS results.

Running the 100+ TPC-DS SQL queries with NDS execution scripts on a 10 TB dataset took 6 minutes using 16 GH200 superchips, compared to 42 minutes on an equal number of premium x86 CPU nodes: a 7x end-to-end speedup.

Benchmark comparisons of the NDS-DS benchmark on the SF10 dataset, demonstrating how a 16x GH200 cluster delivers 7x query acceleration versus a premium CPU cluster with an equivalent number of nodes.
Figure 2. NDS-DS benchmark results running Apache Spark with RAPIDS Accelerator on an NVIDIA Grace Hopper 16-node cluster using SF10 versus a 16-node premium CPU cluster

Specifically, queries that have a high number of aggregate and join operations exhibited significantly higher acceleration of up to 36x.

  • Query67, accelerated by 36x, finds top stores for different product categories based on store sales in a specific year. It involves a high number of aggregate and shuffle operations.
  • Query14, accelerated by 10x, calculates the sum of the extended sales price of store transactions for each item in a specific year and month. It involves a high number of shuffle and join operations.
  • Query87, accelerated by 9x, counts how many customers have ordered items on the web, ordered from the catalog, and bought items in a store on the same day. It involves a high number of scan and aggregate operations.
  • Query59, accelerated by 9x, reports the increase of weekly store sales from one year to the next year for each store and day of the week. It involves a high number of aggregate and join operations.
  • Query38, accelerated by 8x, displays the count of customers with purchases from all three channels in a given year. It involves a high number of distinct aggregate and join operations.
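To make the join-plus-aggregate query shape concrete, here is a toy example in standard SQL using Python's built-in sqlite3 module. The schema and data are invented for illustration and only loosely echo the TPC-DS store_sales/item tables; this is not the text of any actual NDS query.

```python
import sqlite3

# Toy tables loosely modeled on the TPC-DS store_sales / item schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (i_item_sk INTEGER, i_category TEXT);
    CREATE TABLE store_sales (ss_item_sk INTEGER, ss_store_sk INTEGER,
                              ss_sales_price REAL);
    INSERT INTO item VALUES (1, 'Books'), (2, 'Music'), (3, 'Books');
    INSERT INTO store_sales VALUES (1, 10, 9.5), (2, 10, 4.0),
                                   (3, 20, 7.25), (1, 20, 2.0);
""")

# A join followed by a grouped aggregate -- the operation mix the
# queries above accelerate best on GPUs.
rows = conn.execute("""
    SELECT i.i_category, ss.ss_store_sk, SUM(ss.ss_sales_price) AS total
    FROM store_sales ss
    JOIN item i ON i.i_item_sk = ss.ss_item_sk
    GROUP BY i.i_category, ss.ss_store_sk
    ORDER BY total DESC
""").fetchall()
print(rows)
```

At scale, each of these steps becomes a distributed stage in Spark: the join and group-by force shuffles of intermediate data, which is exactly where GPU hash joins and hash aggregates pay off.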

Reducing power consumption and cutting energy costs 

As datasets grow in size, GH200 delivers even greater query acceleration and node consolidation benefits. Running the same 100+ queries on the 10x larger SF100 dataset (100 TB) took a total of 40 minutes on the 16-node GH200 cluster.

Benchmark comparisons of running the NDS benchmark at different dataset sizes (3 TB, 10 TB, 30 TB, and 100 TB), demonstrating how GH200 delivers further query acceleration as dataset size increases.
Figure 3. NDS benchmark results running Apache Spark 3.4.1 with RAPIDS Accelerator 24.06 on an NVIDIA Grace Hopper 16-node cluster

Achieving an equivalent 40-minute response time on the 100 TB dataset using premium CPUs would have required a total of 344 CPU nodes. This translates to a 22x reduction in the number of nodes and 12x energy savings. For organizations running a large Apache Spark CPU cluster, which can sometimes exceed 1,500 nodes, the energy savings are significant, reaching up to 14 GWh annually.
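A rough reproduction of the consolidation and energy math above. The ~1.1 kW average per-node power draw is an assumed figure for illustration only, not a published specification:

```python
# Node consolidation implied by the benchmark above.
cpu_nodes, gh200_nodes = 344, 16
consolidation = cpu_nodes / gh200_nodes          # ~21.5x, "up to 22x"

# Annual energy savings for a large Spark CPU cluster, assuming an
# average per-node power draw of ~1.1 kW (an assumption, for scale).
cluster_nodes = 1_500
avg_node_kw = 1.1
hours_per_year = 24 * 365
baseline_gwh = cluster_nodes * avg_node_kw * hours_per_year / 1e6

energy_efficiency = 12                           # 12x from the benchmark
savings_gwh = baseline_gwh * (1 - 1 / energy_efficiency)
print(f"{consolidation:.1f}x fewer nodes, ~{savings_gwh:.1f} GWh/year saved")
```

Under this assumed power draw, the savings land in the low-teens of GWh per year, consistent with the "up to 14 GWh" figure above.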

Two side-by-side images of data center racks. The left image shows 1,500 x86 CPU nodes across multiple rows of racks. The right image shows 72 GH200 nodes in just two racks, delivering equivalent performance to the 1,500 x86 CPU nodes. Text on the right image indicates 6x lower TCO, 22x fewer server nodes, and 12x more energy efficiency using the GH200 nodes.
Figure 4. Comparison across TCO, number of nodes, and energy savings of moving a 1,500 x86 Apache Spark cluster to GH200

Exceptional SQL acceleration and price performance 

HEAVY.AI, a leading GPU-accelerated analytics platform and database provider, benchmarked a single GH200 GPU cloud instance against an 8x NVIDIA A100 PCIe-based cloud instance running HeavyDB and the NDS-H benchmark. 

Two side-by-side images comparing GPU hardware instances used by HEAVY.AI on Vultr cloud for benchmarking. The image on the left features 8 NVIDIA A100 GPUs with specifications: 640 GB VRAM, 2 TB CPU RAM, 112 Intel 8480+ Platinum CPU cores, and a cost of $13.89 per hour. The image on the right displays a single GH200 GPU node with specifications: 96 GB VRAM, 480 GB CPU RAM, 72 Arm CPU cores, and a cost of $4.32 per hour. A note at the bottom of the image indicates the prices quoted can vary and are for reservations of one month or longer.
Figure 5. Hardware used by HEAVY.AI during benchmarking 

HEAVY.AI reported an average 5x speedup using the GH200 instance, translating to 16x cost savings on the SF100 dataset. On the larger SF200 dataset, which does not fit in the memory of a single GH200 GPU and must be offloaded to Grace CPU memory over the low-latency, high-bandwidth NVLink-C2C interconnect, HEAVY.AI reported a 2x speedup and 6x cost savings compared to the x86 PCIe-based 8x NVIDIA A100 instance.
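The reported cost savings follow directly from the speedups and the hourly instance prices in Figure 5: the GH200 instance is both faster and cheaper per hour, so the savings multiply. A quick check:

```python
# Price-performance check using the hourly prices from Figure 5.
a100_cost_hr = 13.89   # 8x A100 PCIe-based instance
gh200_cost_hr = 4.32   # single GH200 instance

for dataset, speedup in [("SF100", 5), ("SF200", 2)]:
    # Cost savings = speedup x (price ratio): faster AND cheaper per hour.
    savings = speedup * a100_cost_hr / gh200_cost_hr
    print(f"{dataset}: {speedup}x speedup -> ~{savings:.0f}x cost savings")
```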

“Our customers make data-driven, time-sensitive decisions that have a high impact on their business,” said Todd Mostak, CTO and co-founder of HEAVY.AI. “We’re excited about the new business insights and cost savings that GH200 will unlock for our customers.”

Bar chart comparing the NVIDIA GH200 Superchip with x86-PCIe based 8xA100 GPUs using the NDS-H benchmark. GH200 demonstrates 5x speedup on the SF100 dataset and 2x speedup on the SF200 datasets.
Figure 6. HeavyDB and NDS-H benchmark results for HEAVY.AI

Get started with your GH200 Apache Spark migration

Enterprises can take advantage of the RAPIDS Accelerator for Apache Spark to seamlessly migrate Apache Spark workloads to NVIDIA GH200. RAPIDS Accelerator for Apache Spark leverages GPUs to accelerate processing by combining the power of the RAPIDS cuDF library with the scale of the Spark distributed computing framework. Enterprises can run existing Apache Spark applications on GPUs with no code changes by launching Spark with the RAPIDS Accelerator for Apache Spark plug-in jar.
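A minimal PySpark sketch of such a launch, shown here as a configuration fragment. The jar path and version are placeholders (assumptions), and the right settings depend on your cluster; consult the RAPIDS Accelerator documentation for the rapids-4-spark jar matching your Spark and CUDA versions.

```python
# Sketch: enable the RAPIDS Accelerator plugin when building a session.
# Jar path/version below are placeholders -- substitute your own.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rapids-migration-check")
    .config("spark.jars", "/opt/sparkRapidsPlugin/rapids-4-spark_2.12-24.06.0.jar")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")   # RAPIDS SQL plugin
    .config("spark.rapids.sql.enabled", "true")
    .config("spark.executor.resource.gpu.amount", "1")       # 1 GPU per executor
    .getOrCreate()
)

# Existing DataFrame/SQL code runs unchanged; supported operators are
# transparently scheduled on the GPU.
spark.sql("SELECT 1").show()
```

The same configuration keys can be passed to spark-submit as --conf flags, so existing jobs pick up the plugin without application changes.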

Today, GH200 powers nine supercomputers around the world, is offered by a wide array of system makers, and can be accessed on demand at cloud providers such as Vultr, Lambda, and CoreWeave. You can also test GH200 through NVIDIA LaunchPad. To learn more about Apache Spark acceleration on GH200, check out the GTC 2024 session Accelerate ETL and Machine Learning in Apache Spark on demand. 
