-
Preparing for HPC on RISC-V: Examining Vectorization and Distributed Performance of an Astrophyiscs Application with HPX and Kokkos
Authors:
Patrick Diehl,
Panagiotis Syskakis,
Gregor Daiß,
Steven R. Brandt,
Alireza Kheirkhahan,
Srinivas Yadav Singanaboina,
Dominic Marcello,
Chris Taylor,
John Leidel,
Hartmut Kaiser
Abstract:
In recent years, interest in RISC-V computing architectures has moved from academic to mainstream, especially in the field of High Performance Computing where energy limitations are increasingly a concern. As of this year, the first single board RISC-V CPUs implementing the finalized ratified vector specification are being released. The RISC-V vector specification follows in the tradition of vecto…
▽ More
In recent years, interest in RISC-V computing architectures has moved from academic to mainstream, especially in the field of High Performance Computing where energy limitations are increasingly a concern. As of this year, the first single board RISC-V CPUs implementing the finalized ratified vector specification are being released. The RISC-V vector specification follows in the tradition of vector processors found in the CDC STAR-100, the Cray-1, the Convex C-Series, and the NEC SX machines and accelerators. The family of vector processors offers support for variable-length array processing as opposed to the fixed-length processing functionality offered by SIMD. Vector processors offer opportunities to perform vector-chaining which allows temporary results to be used without the need to resolve memory references.
In this work, we use the Octo-Tiger multi-physics, multi-scale, 3D adaptive mesh refinement astrophysics application to study these early RISC-V chips with vector machine support. We report on our experience in porting this modern C++ code (which is built upon several open-source libraries such as HPX and Kokkos) to RISC-V. In addition, we show the impact of the RISC-V Vector extension on a RISC-V single board computer by implementing the std::experimental:simd interface and integrating it with our code. We also compare the application's performance, scalability, and power consumption on desktop-grade RISC-V computer to an A64FX system.
△ Less
Submitted 15 August, 2024; v1 submitted 10 May, 2024;
originally announced July 2024.
-
HPX with Spack and Singularity Containers: Evaluating Overheads for HPX/Kokkos using an astrophysics application
Authors:
Patrick Diehl,
Steven R. Brandt,
Gregor Daiß,
Hartmut Kaiser
Abstract:
Cloud computing for high performance computing resources is an emerging topic. This service is of interest to researchers who care about reproducible computing, for software packages with complex installations, and for companies or researchers who need the compute resources only occasionally or do not want to run and maintain a supercomputer on their own. The connection between HPC and containers…
▽ More
Cloud computing for high performance computing resources is an emerging topic. This service is of interest to researchers who care about reproducible computing, for software packages with complex installations, and for companies or researchers who need the compute resources only occasionally or do not want to run and maintain a supercomputer on their own. The connection between HPC and containers is exemplified by the fact that Microsoft Azure's Eagle cloud service machine is number three on the November 23 Top 500 list. For cloud services, the HPC application and dependencies are installed in containers, e.g. Docker, Singularity, or something else, and these containers are executed on the physical hardware. Although containerization leverages the existing Linux kernel and should not impose overheads on the computation, there is the possibility that machine-specific optimizations might be lost, particularly machine-specific installs of commonly used packages. In this paper, we will use an astrophysics application using HPX-Kokkos and measure overheads on homogeneous resources, e.g. Supercomputer Fugaku, using CPUs only and on heterogenous resources, e.g. LSU's hybrid CPU and GPU system. We will report on challenges in compiling, running, and using the containers as well as performance performance differences.
△ Less
Submitted 7 May, 2024; v1 submitted 11 February, 2024;
originally announced May 2024.
-
Evaluating HPX and Kokkos on RISC-V using an Astrophysics Application Octo-Tiger
Authors:
Parick Diehl,
Gregor Daiss,
Steven R. Brandt,
Alireza Kheirkhahan,
Hartmut Kaiser,
Christopher Taylor,
John Leidel
Abstract:
In recent years, computers based on the RISC-V architecture have raised broad interest in the high-performance computing (HPC) community. As the RISC-V community develops the core instruction set architecture (ISA) along with ISA extensions, the HPC community has been actively ensuring HPC applications and environments are supported. In this context, assessing the performance of asynchronous many-…
▽ More
In recent years, computers based on the RISC-V architecture have raised broad interest in the high-performance computing (HPC) community. As the RISC-V community develops the core instruction set architecture (ISA) along with ISA extensions, the HPC community has been actively ensuring HPC applications and environments are supported. In this context, assessing the performance of asynchronous many-task runtime systems (AMT) is essential. In this paper, we describe our experience with porting of a full 3D adaptive mesh-refinement, multi-scale, multi-model, and multi-physics application, Octo-Tiger, that is based on the HPX AMT, and we explore its performance characteristics on different RISC-V systems. Considering the (limited) capabilities of the RISC-V test systems we used, Octo-Tiger already shows promising results and good scaling. We, however, expect that exceptional hardware support based on dedicated ISA extensions (such as single-cycle context switches, extended atomic operations, and direct support for HPX's global address space) would allow for even better performance results.
△ Less
Submitted 17 August, 2023;
originally announced September 2023.
-
Simulating Stellar Merger using HPX/Kokkos on A64FX on Supercomputer Fugaku
Authors:
Patrick Diehl,
Gregor Daiß,
Kevin Huck,
Dominic Marcello,
Sagiv Shiber,
Hartmut Kaiser,
Dirk Pflüger
Abstract:
The increasing availability of machines relying on non-GPU architectures, such as ARM A64FX in high-performance computing, provides a set of interesting challenges to application developers. In addition to requiring code portability across different parallelization schemes, programs targeting these architectures have to be highly adaptable in terms of compute kernel sizes to accommodate different…
▽ More
The increasing availability of machines relying on non-GPU architectures, such as ARM A64FX in high-performance computing, provides a set of interesting challenges to application developers. In addition to requiring code portability across different parallelization schemes, programs targeting these architectures have to be highly adaptable in terms of compute kernel sizes to accommodate different execution characteristics for various heterogeneous workloads. In this paper, we demonstrate an approach to code and performance portability that is based entirely on established standards in the industry. In addition to applying Kokkos as an abstraction over the execution of compute kernels on different heterogeneous execution environments, we show that the use of standard C++ constructs as exposed by the HPX runtime system enables superb portability in terms of code and performance based on the real-world Octo-Tiger astrophysics application. We report our experience with porting Octo-Tiger to the ARM A64FX architecture provided by Stony Brook's Ookami and Riken's Supercomputer Fugaku and compare the resulting performance with that achieved on well established GPU-oriented HPC machines such as ORNL's Summit, NERSC's Perlmutter and CSCS's Piz Daint systems. Octo-Tiger scaled well on Supercomputer Fugaku without any major code changes due to the abstraction levels provided by HPX and Kokkos. Adding vectorization support for ARM's SVE to Octo-Tiger was trivial thanks to using standard C++
△ Less
Submitted 15 March, 2023;
originally announced April 2023.
-
Stellar Mergers with HPX-Kokkos and SYCL: Methods of using an Asynchronous Many-Task Runtime System with SYCL
Authors:
Gregor Daiß,
Patrick Diehl,
Hartmut Kaiser,
Dirk Pflüger
Abstract:
Ranging from NVIDIA GPUs to AMD GPUs and Intel GPUs: Given the heterogeneity of available accelerator cards within current supercomputers, portability is a key aspect for modern HPC applications. In Octo-Tiger, we rely on Kokkos and its various execution spaces for portable compute kernels. In turn, we use HPX to coordinate kernel launches, CPU tasks, and communication. This combination allows us…
▽ More
Ranging from NVIDIA GPUs to AMD GPUs and Intel GPUs: Given the heterogeneity of available accelerator cards within current supercomputers, portability is a key aspect for modern HPC applications. In Octo-Tiger, we rely on Kokkos and its various execution spaces for portable compute kernels. In turn, we use HPX to coordinate kernel launches, CPU tasks, and communication. This combination allows us to have a fine interleaving between portable CPU/GPU computations and communication, enabling scalability on various supercomputers. However, for HPX and Kokkos to work together optimally, we need to be able to treat Kokkos kernels as HPX tasks. Otherwise, instead of integrating asynchronous Kokkos kernel launches into HPX's task graph, we would have to actively wait for them with fence commands, which wastes CPU time better spent otherwise. Using an integration layer called HPX-Kokkos, treating Kokkos kernels as tasks already works for some Kokkos execution spaces (like the CUDA one), but not for others (like the SYCL one). In this work, we started making Octo-Tiger and HPX itself compatible with SYCL. To do so, we introduce numerous software changes, most notably an HPX-SYCL integration. This integration allows us to treat SYCL events as HPX tasks, which in turn allows us to better integrate Kokkos by extending the support of HPX-Kokkos to also fully support Kokkos' SYCL execution space. We show two ways to implement this HPX-SYCL integration and test them using Octo-Tiger and its Kokkos kernels, on both an NVIDIA A100 and an AMD MI100. We find modest, yet noticeable, speedups by enabling this integration, even when just running simple single-node scenarios with Octo-Tiger where communication and CPU utilization are not yet an issue.
△ Less
Submitted 8 May, 2023; v1 submitted 4 March, 2023;
originally announced March 2023.
-
From Merging Frameworks to Merging Stars: Experiences using HPX, Kokkos and SIMD Types
Authors:
Gregor Daiß,
Srinivas Yadav Singanaboina,
Patrick Diehl,
Hartmut Kaiser,
Dirk Pflüger
Abstract:
Octo-Tiger, a large-scale 3D AMR code for the merger of stars, uses a combination of HPX, Kokkos and explicit SIMD types, aiming to achieve performance-portability for a broad range of heterogeneous hardware. However, on A64FX CPUs, we encountered several missing pieces, hindering performance by causing problems with the SIMD vectorization. Therefore, we add std::experimental::simd as an option to…
▽ More
Octo-Tiger, a large-scale 3D AMR code for the merger of stars, uses a combination of HPX, Kokkos and explicit SIMD types, aiming to achieve performance-portability for a broad range of heterogeneous hardware. However, on A64FX CPUs, we encountered several missing pieces, hindering performance by causing problems with the SIMD vectorization. Therefore, we add std::experimental::simd as an option to use in Octo-Tiger's Kokkos kernels alongside Kokkos SIMD, and further add a new SVE (Scalable Vector Extensions) SIMD backend. Additionally, we amend missing SIMD implementations in the Kokkos kernels within Octo-Tiger's hydro solver. We test our changes by running Octo-Tiger on three different CPUs: An A64FX, an Intel Icelake and an AMD EPYC CPU, evaluating SIMD speedup and node-level performance. We get a good SIMD speedup on the A64FX CPU, as well as noticeable speedups on the other two CPU platforms. However, we also experience a scaling issue on the EPYC CPU.
△ Less
Submitted 8 May, 2023; v1 submitted 26 September, 2022;
originally announced October 2022.
-
From Task-Based GPU Work Aggregation to Stellar Mergers: Turning Fine-Grained CPU Tasks into Portable GPU Kernels
Authors:
Gregor Daiß,
Patrick Diehl,
Dominic Marcello,
Alireza Kheirkhahan,
Hartmut Kaiser,
Dirk Pflüger
Abstract:
Meeting both scalability and performance portability requirements is a challenge for any HPC application, especially for adaptively refined ones. In Octo-Tiger, an astrophysics application for the simulation of stellar mergers, we approach this with existing solutions: We employ HPX to obtain fine-grained tasks to easily distribute work and finely overlap communication and computation. For the com…
▽ More
Meeting both scalability and performance portability requirements is a challenge for any HPC application, especially for adaptively refined ones. In Octo-Tiger, an astrophysics application for the simulation of stellar mergers, we approach this with existing solutions: We employ HPX to obtain fine-grained tasks to easily distribute work and finely overlap communication and computation. For the computations themselves, we use Kokkos to turn these tasks into compute kernels capable of running on hardware ranging from a few CPU cores to powerful accelerators. There is a missing link, however: while the fine-grained parallelism exposed by HPX is useful for scalability, it can hinder GPU performance when the tasks become too small to saturate the device, causing low resource utilization. To bridge this gap, we investigate multiple different GPU work aggregation strategies within Octo-Tiger, adding one new strategy, and evaluate the node-level performance impact on recent AMD and NVIDIA GPUs, achieving noticeable speedups.
△ Less
Submitted 4 March, 2023; v1 submitted 26 September, 2022;
originally announced October 2022.
-
Distributed, combined CPU and GPU profiling within HPX using APEX
Authors:
Patrick Diehl,
Gregor Daiss,
Kevin Huck,
Dominic Marcello,
Sagiv Shiber,
Hartmut Kaiser,
Juhan Frank,
Geoffrey C. Clayton,
Dirk Pflueger
Abstract:
Benchmarking and comparing performance of a scientific simulation across hardware platforms is a complex task. When the simulation in question is constructed with an asynchronous, many-task (AMT) runtime offloading work to GPUs, the task becomes even more complex. In this paper, we discuss the use of a uniquely suited performance measurement library, APEX, to capture the performance behavior of a…
▽ More
Benchmarking and comparing performance of a scientific simulation across hardware platforms is a complex task. When the simulation in question is constructed with an asynchronous, many-task (AMT) runtime offloading work to GPUs, the task becomes even more complex. In this paper, we discuss the use of a uniquely suited performance measurement library, APEX, to capture the performance behavior of a simulation built on HPX, a highly scalable, distributed AMT runtime. We examine the performance of the astrophysics simulation carried-out by Octo-Tiger on two different supercomputing architectures. We analyze the results of scaling and measurement overheads. In addition, we look in-depth at two similarly configured executions on the two systems to study how architectural differences affect performance and identify opportunities for optimization. As one such opportunity, we optimize the communication for the hydro solver and investigated its performance impact.
△ Less
Submitted 21 September, 2022;
originally announced October 2022.
-
Octo-Tiger's New Hydro Module and Performance Using HPX+CUDA on ORNL's Summit
Authors:
Patrick Diehl,
Gregor Daiß,
Dominic Marcello,
Kevin Huck,
Sagiv Shiber,
Hartmut Kaiser,
Juhan Frank,
Dirk Pflüger
Abstract:
Octo-Tiger is a code for modeling three-dimensional self-gravitating astrophysical fluids. It was particularly designed for the study of dynamical mass transfer between interacting binary stars. Octo-Tiger is parallelized for distributed systems using the asynchronous many-task runtime system, the C++ standard library for parallelism and concurrency (HPX) and utilizes CUDA for its gravity solver.…
▽ More
Octo-Tiger is a code for modeling three-dimensional self-gravitating astrophysical fluids. It was particularly designed for the study of dynamical mass transfer between interacting binary stars. Octo-Tiger is parallelized for distributed systems using the asynchronous many-task runtime system, the C++ standard library for parallelism and concurrency (HPX) and utilizes CUDA for its gravity solver. Recently, we have remodeled Octo-Tiger's hydro solver to use a three-dimensional reconstruction scheme. In addition, we have ported the hydro solver to GPU using CUDA kernels. We present scaling results for the new hydro kernels on ORNL's Summit machine using a Sedov-Taylor blast wave problem. We also compare Octo-Tiger's new hydro scheme with its old hydro scheme, using a rotating star as a test problem.
△ Less
Submitted 26 July, 2021; v1 submitted 22 July, 2021;
originally announced July 2021.
-
Performance Measurements within Asynchronous Task-based Runtime Systems: A Double White Dwarf Merger as an Application
Authors:
Patrick Diehl,
Dominic Marcello,
Parsa Amini,
Hartmut Kaiser,
Sagiv Shiber,
Geoffrey C. Clayton,
Juhan Frank,
Gregor Daiß,
Dirk Pflüger,
David Eder,
Alice Koniges,
Kevin Huck
Abstract:
Analyzing performance within asynchronous many-task-based runtime systems is challenging because millions of tasks are launched concurrently. Especially for long-term runs the amount of data collected becomes overwhelming. We study HPX and its performance-counter framework and APEX to collect performance data and energy consumption. We added HPX application-specific performance counters to the Oct…
▽ More
Analyzing performance within asynchronous many-task-based runtime systems is challenging because millions of tasks are launched concurrently. Especially for long-term runs the amount of data collected becomes overwhelming. We study HPX and its performance-counter framework and APEX to collect performance data and energy consumption. We added HPX application-specific performance counters to the Octo-Tiger full 3D AMR astrophysics application. This enables the combined visualization of physical and performance data to highlight bottlenecks with respect to different solvers. We examine the overhead introduced by these measurements, which is around 1%, with respect to the overall application runtime. We perform a convergence study for four different levels of refinement and analyze the application's performance with respect to adaptive grid refinement. The measurements' overheads are small, enabling the combined use of performance data and physical properties with the goal of improving the code's performance. All of these measurements were obtained on NERSC's Cori, Louisiana Optical Network Infrastructure's QueenBee2, and Indiana University's Big Red 3.
△ Less
Submitted 9 June, 2021; v1 submitted 30 January, 2021;
originally announced February 2021.
-
From Piz Daint to the Stars: Simulation of Stellar Mergers using High-Level Abstractions
Authors:
Gregor Daiß,
Parsa Amini,
John Biddiscombe,
Patrick Diehl,
Juhan Frank,
Kevin Huck,
Hartmut Kaiser,
Dominic Marcello,
David Pfander,
Dirk Pflüger
Abstract:
We study the simulation of stellar mergers, which requires complex simulations with high computational demands. We have developed Octo-Tiger, a finite volume grid-based hydrodynamics simulation code with Adaptive Mesh Refinement which is unique in conserving both linear and angular momentum to machine precision. To face the challenge of increasingly complex, diverse, and heterogeneous HPC systems,…
▽ More
We study the simulation of stellar mergers, which requires complex simulations with high computational demands. We have developed Octo-Tiger, a finite volume grid-based hydrodynamics simulation code with Adaptive Mesh Refinement which is unique in conserving both linear and angular momentum to machine precision. To face the challenge of increasingly complex, diverse, and heterogeneous HPC systems, Octo-Tiger relies on high-level programming abstractions.
We use HPX with its futurization capabilities to ensure scalability both between nodes and within, and present first results replacing MPI with libfabric achieving up to a 2.8x speedup. We extend Octo-Tiger to heterogeneous GPU-accelerated supercomputers, demonstrating node-level performance and portability. We show scalability up to full system runs on Piz Daint. For the scenario's maximum resolution, the compute-critical parts (hydrodynamics and gravity) achieve 68.1% parallel efficiency at 2048 nodes.
△ Less
Submitted 9 August, 2019; v1 submitted 8 August, 2019;
originally announced August 2019.