Performance and portability of computational applications are major considerations in HPC when migrating them from current system platforms to upcoming exascale architectures, which incorporate new processor, memory, and interconnect technologies for both CPUs and accelerators. This talk focuses on experiences and challenges in porting and deploying GPU-accelerated applications in the molecular sciences at scale on leadership computing resources. Performance portability of state-of-the-art HPC applications can be tested with mini-applications, or mini-apps. Here, I will discuss the creation of the MiniMDock mini-app and its use in testing two alternative approaches to performance portability (and productivity) for highly optimized codes: direct translation between different vendor-specific programming models, and moving from low-level, architecture-specific code to higher-level abstractions, such as middleware like Kokkos or directive-based offloading like OpenMP target offload.
Porting low-level, architecture-specific (for instance, warp-level) CUDA optimizations to a different architecture is challenging by the very nature of such optimizations. It requires maintaining a separate version for each architecture, reducing productivity, but can often provide close to 2x speedups over higher-level versions, which are easier to maintain and port. The speedup for OpenMP target offload can be further enhanced by compiler-level optimizations. Next, I will discuss the execution methods used to deploy these applications at scale, and how highly variable workloads and load-balancing problems are handled using novel OpenMP dynamic task-to-GPU scheduling strategies that efficiently distribute tasks across multiple GPUs.