As part of a series aimed at sharing best practices in preparing applications for Aurora, we highlight researchers' efforts to optimize codes to run efficiently on graphics processing units.
Developed in tandem with the Exascale Computing Project-supported Whole Device Model Application project—which aims to build a high-fidelity model of magnetically confined fusion plasmas to plan experiments on ITER—XGC is a gyrokinetic particle-in-cell code used to perform large-scale simulations on DOE supercomputers, and optimized for treating edge plasma in particular. The code is the product of a consortium of researchers from academia and U.S. Department of Energy (DOE) laboratories including Argonne National Laboratory, Princeton Plasma Physics Laboratory, and Oak Ridge National Laboratory.
By design, XGC is capable of solving boundary multiscale plasma problems across the magnetic separatrix (that is, the boundary between the magnetically confined and unconfined plasmas) using first-principles-based kinetic equations. The code currently runs on the vector CPU-based Theta machine housed at the Argonne Leadership Computing Facility (ALCF), as well as on the Oak Ridge Leadership Computing Facility’s GPU-accelerated Summit system.
To prepare for the next generation of high-performance computing, exemplified by the ALCF’s forthcoming Polaris and Aurora systems, the code is being re-implemented for exascale using a performance-portable approach. Running at exascale will yield unique computational capabilities, some of which carry the potential for transformational impacts on fusion science. It will make it possible to study, for instance, a larger and more realistic range of dimensionless plasma parameters than has ever been achieved, along with the full spectrum of kinetic micro-instabilities that control the quality of energy confinement in a toroidal plasma. Further, exascale will enable physics modeling that incorporates multiple-charge tungsten ion species: impurities released from the tokamak vessel walls that affect edge-plasma behavior and, by migrating across the magnetic separatrix, fusion performance in the core plasma.
To prepare for the exascale Aurora machine in a way that is portable to other architectures, the team is employing two high-level, non-machine-specific libraries and programming models: Kokkos and Cabana. While the former predates the Exascale Computing Project, both target first-generation exascale computing platforms.
As a best practice for code development, the XGC team capitalizes on these efforts by employing higher-level interfaces and libraries. In so doing, they can directly benefit from the work being performed by library and programming model developers.
Moreover, without making any changes to their code, the team will be able to take advantage of the upcoming SYCL/DPC++ implementation of Kokkos—which is expected to be highly performant and broadly portable across architectures—immediately at the time of its release. Meanwhile, the team is working with an early OpenMP-target implementation of Kokkos.
The team’s application can, from the outset, run on any platform that supports the underlying software. These factors led the team to move XGC’s GPU acceleration away from vendor-specific programming approaches (such as OpenACC and CUDA Fortran) and onto Kokkos and Cabana.
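The "write the kernel once, choose the backend elsewhere" idea behind this change can be sketched in plain C++. The sketch below is illustrative only and does not use the Kokkos API: `SerialBackend`, `BlockedBackend`, this `parallel_for`, and `scale_add` are hypothetical names invented for the example. In Kokkos itself, backends are selected through execution spaces rather than tag types like these.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical backend tags (assumptions for this sketch, not Kokkos types).
struct SerialBackend {};
struct BlockedBackend { std::size_t block = 4; };

// Serial backend: visit every index in order.
template <class Body>
void parallel_for(SerialBackend, std::size_t n, Body body) {
    for (std::size_t i = 0; i < n; ++i) body(i);
}

// Blocked backend: visit indices in fixed-size chunks, standing in for the
// way a device backend would hand chunks of work to hardware threads.
template <class Body>
void parallel_for(BlockedBackend be, std::size_t n, Body body) {
    for (std::size_t start = 0; start < n; start += be.block) {
        const std::size_t end = (start + be.block < n) ? start + be.block : n;
        for (std::size_t i = start; i < end; ++i) body(i);
    }
}

// Application kernel, written once: y = a * x + y.
// Only the backend argument changes between platforms.
template <class Backend>
void scale_add(Backend be, double a,
               const std::vector<double>& x, std::vector<double>& y) {
    parallel_for(be, x.size(), [&](std::size_t i) { y[i] = a * x[i] + y[i]; });
}
```

Calling `scale_add(SerialBackend{}, ...)` and `scale_add(BlockedBackend{}, ...)` on the same inputs gives the same result. That is the property a portability layer has to preserve while it changes how the work is scheduled.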
Once the change was made and the relevant programming layers were integrated into the XGC code, the team achieved comparable or improved performance relative to that of the vendor-specific implementations.
XGC contains two compute-heavy kernels: one for kinetic electron push and one for nonlinear Fokker-Planck collisions. When GPUs are not utilized, these kernels occupy more than 95 percent of production-run compute time.
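To see why these two kernels are the natural targets for GPU offload, Amdahl's law gives a rough bound. The sketch below makes simplifying assumptions that are not from the source: the kernel share is taken as exactly 95 percent, the remaining 5 percent is not accelerated at all, and the function name `amdahl_speedup` is made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Amdahl's law: overall speedup when a fraction f of the runtime
// is accelerated by a factor s and the remainder is left unchanged.
double amdahl_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}
```

Under these assumptions, accelerating the kernels tenfold gives `amdahl_speedup(0.95, 10.0)`, roughly a 6.9x overall speedup, and no kernel speedup, however large, can push the overall figure past 20x. Because the article states that the kernels account for more than 95 percent of the time, the real ceiling would be somewhat higher.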
The electron-push kernel was previously implemented using OpenMP threads for CPUs, vectorization techniques for architectures like Intel Knights Landing (KNL), and CUDA Fortran for NVIDIA GPUs.
It was re-implemented using the Cabana library, a layer on Kokkos for implementing particle codes. With minimal additional effort on the part of the XGC team, the new kernel matched or exceeded the performance of the previous kernel implementations on both Summit and Theta.
The collision kernel, after being ported using Kokkos, has likewise demonstrated comparable or improved performance relative to its OpenACC implementation.
Optimal performance across assorted architectures has been achieved by combining the convenient particle-data storage structures that Cabana provides with the Kokkos implementation on which it relies.
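Cabana's central particle container is an array-of-structs-of-arrays (AoSoA), and the layout idea can be illustrated without the library. The following is a minimal plain C++ sketch, not Cabana's API: `ParticleBlock`, `ParticleAoSoA`, `kLanes`, and the toy `push` (a simple `x += v * dt` update, far simpler than XGC's gyrokinetic electron push) are all assumptions made for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical inner vector length; a real layout would tune this per
// architecture (for example, SIMD width on CPUs or warp size on GPUs).
constexpr std::size_t kLanes = 4;

// One block: each field is a small contiguous array across kLanes particles.
struct ParticleBlock {
    std::array<double, kLanes> x;  // positions
    std::array<double, kLanes> v;  // velocities
};

// Array of blocks: particle i lives in block i / kLanes, lane i % kLanes.
struct ParticleAoSoA {
    std::vector<ParticleBlock> blocks;
    std::size_t count = 0;

    explicit ParticleAoSoA(std::size_t n)
        : blocks((n + kLanes - 1) / kLanes, ParticleBlock{}), count(n) {}

    double& x(std::size_t i) { return blocks[i / kLanes].x[i % kLanes]; }
    double& v(std::size_t i) { return blocks[i / kLanes].v[i % kLanes]; }
};

// Toy push: the inner loop runs over contiguous lanes of one field,
// which is the access pattern that vectorizes or coalesces well.
void push(ParticleAoSoA& p, double dt) {
    for (std::size_t b = 0; b < p.blocks.size(); ++b) {
        ParticleBlock& blk = p.blocks[b];
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            if (b * kLanes + lane >= p.count) break;  // skip padding lanes
            blk.x[lane] += blk.v[lane] * dt;
        }
    }
}
```

The design point is that a single data structure can behave like a struct-of-arrays inside each block, which suits vector units and GPU memory access, while remaining an array of compact blocks overall. Changing the lane count adapts the layout to a different architecture without touching the kernel body.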