As part of a series aimed at sharing best practices in preparing applications for Aurora, we highlight researchers' efforts to optimize codes to run efficiently on graphics processing units.
To ready the ATLAS experiment for the exascale era of computing, developers are preparing the codes that will enable the experiment to run its simulation and data analysis tasks on an array of next-generation architectures. These include the forthcoming Intel-HPE Aurora system to be housed at the Argonne Leadership Computing Facility (ALCF), a U.S. Department of Energy Office of Science User Facility at Argonne National Laboratory.
The ATLAS experiment—located at CERN’s Large Hadron Collider (LHC), the world’s largest particle accelerator—employs an 82-foot-tall, 144-foot-long cylinder outfitted with magnets and other instruments interwoven about a central beampipe to measure and record phenomena relating to the subatomic particles dispersed by proton-proton collisions. The signals generated by the ATLAS instruments yield important insights into the physics of the collision and its effects via computational reconstruction of the events.
An ATLAS physics analysis consists of three steps. First, in event generation, researchers use the physics that they know to model the kinds of particle collisions that take place in the LHC. In the next step, simulation, they generate the subsequent measurements the ATLAS detector would make. Finally, reconstruction algorithms are run on both simulated and real data, the output of which can be compared to see differences between theoretical prediction and measurement.
Such measurements led to the discovery of the Higgs boson, but hundreds of petabytes of data still remain unexamined, and the experiment’s computational needs will grow by an order of magnitude or more over the next decade. These needs will be compounded by the impending completion of the High-Luminosity LHC upgrade project.
Moreover, ATLAS requires an immense amount of simulation for Standard Model and background modeling, as well as for general detector and upgrade studies.
To this end the developers are creating code that can be utilized on a multiplicity of architectures.
FastCaloSim, a code used for fast parametrized calorimeter simulation, has been written using CUDA, SYCL, and Kokkos, and has also been run on the ALCF’s Aurora testbed. It has thus been tested on GPU-based systems from Intel, NVIDIA, and AMD, in addition to non-accelerated setups.
Meanwhile the developers are also implementing a Kokkos version of MadGraph, an event simulator for LHC experiments that performs particle-physics calculations to generate expected LHC-detector particle interactions.
As a framework, MadGraph aims at a complete Standard Model and Beyond Standard Model phenomenology, including such elements as cross-section computations as well as event manipulation and analysis.
The developers began with a CUDA implementation of the MadGraph algorithm. They then ported it to Kokkos, which, they found, was able to run the algorithm on an Intel CPU system using OpenMP as the backend of a threaded parallel setup. Next, with CUDA as the backend, MadGraph was deployed on an NVIDIA GPU-based setup. It was also executed on the Intel GPU testbeds housed at Argonne’s Joint Laboratory for System Evaluation (JLSE).
Kokkos is preferred in the case of MadGraph because it is a third-party C++ programming library that enables developers to write their code in a single framework: when Kokkos-structured code is compiled, it is built for whichever architecture it targets. This benefits high energy physics researchers because they need to write the complex algorithms at the center of their work only once, as opposed to rewriting the algorithms multiple times to create variants compatible with each chip manufacturer’s software of choice.
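The write-once idea can be illustrated with a minimal sketch in plain C++. This is not the Kokkos API and not MadGraph code: the backend tags (SerialBackend, TiledBackend), the parallel_for overloads, and the scale_energies routine are all hypothetical names invented for illustration. The point is only that the kernel body appears once, while the backend selected at compile time decides how the index range is traversed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical backend tags standing in for the execution spaces a
// portability layer would provide. They are NOT Kokkos types.
struct SerialBackend {};
struct TiledBackend { std::size_t tile; };

// Serial dispatch: one plain loop over the index range.
template <typename Kernel>
void parallel_for(SerialBackend, std::size_t n, Kernel kernel) {
    for (std::size_t i = 0; i < n; ++i) kernel(i);
}

// Tiled dispatch: walks the range tile by tile, loosely mimicking how a GPU
// backend splits work into blocks. It still runs sequentially in this sketch.
template <typename Kernel>
void parallel_for(TiledBackend backend, std::size_t n, Kernel kernel) {
    assert(backend.tile > 0);
    for (std::size_t begin = 0; begin < n; begin += backend.tile) {
        const std::size_t end = std::min(begin + backend.tile, n);
        for (std::size_t i = begin; i < end; ++i) kernel(i);
    }
}

// The algorithm is written once, independent of the backend.
template <typename Backend>
std::vector<double> scale_energies(Backend backend,
                                   const std::vector<double>& energies,
                                   double factor) {
    std::vector<double> out(energies.size());
    parallel_for(backend, energies.size(),
                 [&](std::size_t i) { out[i] = factor * energies[i]; });
    return out;
}
```

Calling scale_energies with SerialBackend{} or with TiledBackend{2} on the same input would return the same values; only the traversal strategy changes. A real portability layer makes the analogous choice when it compiles the same kernel for a CPU or a GPU.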
Due to its preponderance of complex calculations, MadGraph demands a large amount of compute time, and it will only demand more in the future, particularly after the high-luminosity upgrade to the LHC is completed mid-decade. After the upgrade, data throughput will increase 10 or 20 times over, but this merely establishes a minimum baseline for the amount of simulation to be generated: achieving optimal performance would necessitate increasing throughput by yet another factor of 10 (that is, 100- or 200-fold simulation growth would be required).
Because one of the developers’ primary goals is to understand the limits of performance portability across various architectures, the performance of each code implementation was compared to the others. The Kokkos version of MadGraph was able to achieve performance comparable to that of the “vanilla” CUDA implementation, with metrics falling within 10 percent of each other.
The main hurdle to implementing a Kokkos version of FastCaloSim was that use of shared libraries requires every symbol in a kernel to be visible to and resolvable in a single compilation unit. This necessitated wrapping all kernels in a single file that pulls in the remaining kernel files via an array of #include directives.
The developers refactored a large number of functions and files so as to minimize code duplication while maximizing the number of identical code paths between the CUDA and Kokkos implementations.
An advantage to using the CUDA backend of Kokkos is its full interoperability with “pure” CUDA, meaning that one can call CUDA functions from Kokkos kernels. This interoperability enabled an incremental porting process while having the added benefit of simplifying validation.
Consonant with their focus on diversity of architecture, the developers utilized a broad range of Kokkos backends for FastCaloSim, targeting various NVIDIA GPUs, several AMD GPUs, and Intel integrated and discrete GPUs, as well as host-parallel pthreads and OpenMP backends.
The developers concluded that FastCaloSim greatly underutilized GPU power. While this was somewhat mitigated by grouping data from multiple events, more complex programs might require significant code refactoring. However, the underutilization suggests that a single GPU could be shared between multiple CPU processes, thereby reducing hardware expenses.
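One way to picture grouping data from multiple events is to pack the hits of many events into a single contiguous buffer so that one pass (or one kernel launch on a GPU) covers all of them. The sketch below is an assumption for illustration, not FastCaloSim's actual data layout: EventBatch, make_batch, and sum_energy_per_event are hypothetical names, and the loop in sum_energy_per_event runs on the host as a stand-in for a device kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flattened batch: hit energies from many events stored
// back to back, with offsets marking where each event begins and ends.
struct EventBatch {
    std::vector<float> energies;       // all hits, event after event
    std::vector<std::size_t> offsets;  // event e spans [offsets[e], offsets[e + 1])
};

// Pack per-event hit lists into one contiguous buffer.
EventBatch make_batch(const std::vector<std::vector<float>>& events) {
    EventBatch batch;
    batch.offsets.push_back(0);
    for (const auto& hits : events) {
        batch.energies.insert(batch.energies.end(), hits.begin(), hits.end());
        batch.offsets.push_back(batch.energies.size());
    }
    return batch;
}

// Stand-in for a single launch over the whole batch: every hit is visited
// in one pass, and the energy is accumulated separately for each event.
std::vector<float> sum_energy_per_event(const EventBatch& batch) {
    assert(!batch.offsets.empty());
    const std::size_t n_events = batch.offsets.size() - 1;
    std::vector<float> totals(n_events, 0.0f);
    for (std::size_t e = 0; e < n_events; ++e) {
        for (std::size_t i = batch.offsets[e]; i < batch.offsets[e + 1]; ++i) {
            totals[e] += batch.energies[i];
        }
    }
    return totals;
}
```

With a layout like this, the amount of work handed to the device per launch grows with the number of events in the batch, which is the general reason batching can raise utilization when individual events are small.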
A comparison of the Kokkos and CUDA variants of FastCaloSim showed that CUDA concepts generally translate well to the portability layer, even where the explicit syntax differs, although certain elements (such as views and buffers) do carry overhead.
Ultimately, after considerable effort and amid a rapidly evolving compiler landscape, FastCaloSim successfully ran on each “flavor” of GPU attempted: Intel iGPUs and Xe-HP GPUs using DPC++, NVIDIA GPUs using SYCL with a CUDA backend, and AMD GPUs using hipSYCL (an implementation of SYCL over NVIDIA CUDA/AMD HIP).