As part of a new series aimed at sharing best practices in preparing applications for Aurora, we highlight researchers' efforts to optimize codes to run efficiently on graphics processing units (GPUs).
To prepare for next-generation systems like Aurora, developers at the U.S. Department of Energy’s (DOE) Argonne National Laboratory are working to port the SW4 application, a multidisciplinary simulation code for earthquake hazard and risk assessment, to run on GPU-based Intel machines. As part of the EQSIM project supported by DOE's Exascale Computing Project, Brian Homerding of the Argonne Leadership Computing Facility (ALCF) and Houjun Tang of Lawrence Berkeley National Laboratory are leading an effort to use the C++ abstraction library RAJA, whose SYCL backend is currently being written.
RAJA is a software library of C++ abstractions targeting portable, parallel loop execution. These abstractions insulate the application from the details of the backend programming model. Developers port their RAJA application to a new backend by implementing execution policies, which are passed as template parameters and are typically stored in header files. These execution policies include statements that express how loops should be executed and how loop indexes should map to the backend's indexes. This allows the kernel body to remain unchanged while porting to a new backend. To ensure that the abstraction layer does not introduce overhead, the RAJA Performance Suite is used to assess the performance of loop-based HPC kernels implemented in both RAJA and the underlying backend programming model.
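The idea of keeping the kernel body fixed while swapping an execution policy can be illustrated with a minimal, self-contained C++ sketch. This is not RAJA's API: the names `seq_exec`, `tiled_exec`, `forall`, and `axpy` are hypothetical stand-ins chosen for illustration, and both "backends" here run serially on the CPU. The sketch only shows the pattern in which a template parameter decides how the loop is executed and how indexes are generated.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical policy tags (illustrative only, not RAJA types).
struct seq_exec {};                                // plain sequential loop
template <std::size_t TileSize> struct tiled_exec {};  // loop split into fixed-size tiles

// "Backend" 1: visit indexes 0..n-1 in order.
template <typename Body>
void forall_impl(seq_exec, std::size_t n, Body&& body) {
  for (std::size_t i = 0; i < n; ++i) body(i);
}

// "Backend" 2: map (tile, offset) pairs onto the same index space.
template <std::size_t TileSize, typename Body>
void forall_impl(tiled_exec<TileSize>, std::size_t n, Body&& body) {
  static_assert(TileSize > 0, "tile size must be positive");
  for (std::size_t tile = 0; tile < n; tile += TileSize) {
    const std::size_t tile_end = std::min(n, tile + TileSize);
    for (std::size_t i = tile; i < tile_end; ++i) body(i);
  }
}

// Single entry point: the policy is a template parameter, the body is a lambda.
template <typename Policy, typename Body>
void forall(std::size_t n, Body&& body) {
  forall_impl(Policy{}, n, std::forward<Body>(body));
}

// Example kernel (y = a*x + y). The loop body is identical for every policy.
template <typename Policy>
std::vector<double> axpy(double a, const std::vector<double>& x,
                         std::vector<double> y) {
  assert(x.size() == y.size());
  forall<Policy>(x.size(), [&](std::size_t i) { y[i] += a * x[i]; });
  return y;
}
```

In an application, the policy would typically be chosen once, for example with an alias such as `using loop_policy = tiled_exec<64>;` in a header, so that switching backends means editing that alias rather than the kernels.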
Previous releases of SW4 used OpenMP for multithreaded CPU execution. Recent releases use RAJA, with execution policies implemented using OpenMP and CUDA statements to target CPUs and NVIDIA GPUs, respectively. The RAJA SYCL and OpenMP-Target backends will be available for execution on Aurora, and the existing execution policies will be implemented for these backends.
The porting effort was initiated with the SW4lite proxy application, which provided a development vehicle for driving preparation while also allowing the developers to quickly identify issues for rapid resolution. The RAJA-SYCL backend execution policies have been implemented in the SW4lite proxy application for early testing and experimentation.
Enabling RAJA on Intel devices has been accomplished by utilizing oneAPI and several extensions in the DPC++ compiler. Intel’s unnamed kernel lambdas are critical for portability libraries to support general kernel execution. The Unified Shared Memory extension allows abstraction libraries to decouple loop execution from memory management. Intel’s Extended Atomics and Global ID access have enabled support for the RAJA reduction object.
The developers have also made important use of many features of the SYCL programming model. Principal among these is the use of SYCL nd_ranges to support fine-grained control over loop execution. The nd_ranges provide the flexibility a library needs to handle both simple and complex loop executions. Through nd_ranges, the RAJA-SYCL backend can launch simple one-dimensional SYCL kernels or complex three-dimensional kernels with explicit work-group sizes.
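To make the notion of an explicit work-group size more concrete, the following is a plain C++ emulation of a three-dimensional launch, not SYCL code and not taken from the RAJA-SYCL backend. In SYCL, the global range of an nd_range must be evenly divisible by the work-group (local) range in each dimension, so a common approach is to round the global range up and have each work-item check its index against the true loop bounds. Whether RAJA handles this the same way is an assumption here; `Range3`, `round_up`, and `launch_3d` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Range3 = std::array<std::size_t, 3>;

// Round n up to the next multiple of the work-group size wg.
std::size_t round_up(std::size_t n, std::size_t wg) {
  assert(wg > 0);
  return ((n + wg - 1) / wg) * wg;
}

// Serial emulation of a 3D launch with explicit work-group sizes.
// Each (group, local) pair is combined into a global index, mirroring what
// a SYCL work-item would obtain from its global ID. Work-items that fall in
// the padded region skip the body. Returns the number of such idle items.
template <typename Body>
std::size_t launch_3d(const Range3& extent, const Range3& wg, Body body) {
  const Range3 global{round_up(extent[0], wg[0]),
                      round_up(extent[1], wg[1]),
                      round_up(extent[2], wg[2])};
  std::size_t idle = 0;
  for (std::size_t g0 = 0; g0 < global[0] / wg[0]; ++g0)
    for (std::size_t g1 = 0; g1 < global[1] / wg[1]; ++g1)
      for (std::size_t g2 = 0; g2 < global[2] / wg[2]; ++g2)
        for (std::size_t l0 = 0; l0 < wg[0]; ++l0)
          for (std::size_t l1 = 0; l1 < wg[1]; ++l1)
            for (std::size_t l2 = 0; l2 < wg[2]; ++l2) {
              const std::size_t k = g0 * wg[0] + l0;
              const std::size_t j = g1 * wg[1] + l1;
              const std::size_t i = g2 * wg[2] + l2;
              if (k < extent[0] && j < extent[1] && i < extent[2]) {
                body(k, j, i);  // inside the real loop bounds
              } else {
                ++idle;         // padding introduced by rounding up
              }
            }
  return idle;
}
```

A one-dimensional launch is the special case in which two of the three extents and work-group sizes are 1. The returned idle count gives a rough sense of how much padding a given work-group shape introduces for a given loop extent.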
The ALCF is a DOE Office of Science User Facility.