Optimizing OpenMC performance for exascale

As part of a series aimed at sharing best practices in preparing applications for Aurora, we highlight researchers' efforts to optimize codes to run efficiently on GPU-based systems.

John Tramm, a computational scientist at the U.S. Department of Energy’s (DOE) Argonne National Laboratory, is part of a team working to bring the Monte Carlo neutron and photon transport code OpenMC to the next generation of GPU-driven high-performance computing (HPC) systems. Central to the team’s efforts has been preparing for Aurora, the exascale system being deployed at the Argonne Leadership Computing Facility (ALCF), a DOE Office of Science user facility.

OpenMC, originally written for CPU-based HPC systems and capable of using both distributed-memory (MPI) and shared-memory (OpenMP) parallelism, simulates the stochastic motion of neutral particles through a model that, as a representation of a real-world experimental setup, can range in complexity from a simple slab of radiation-shielding material to a full-scale nuclear reactor.
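
In broad strokes, a Monte Carlo transport code of this kind distributes particle histories across MPI ranks and OpenMP threads, tracks each particle through sampled collision events, and reduces the resulting tallies. The following is a minimal, self-contained sketch of that pattern for a one-dimensional shielding slab; it is illustrative only, and none of the names, constants, or physics simplifications are taken from the OpenMC codebase.

    // Minimal sketch of hybrid MPI + OpenMP Monte Carlo particle transport for a
    // one-dimensional shielding slab. Illustrative only -- names, constants, and
    // physics simplifications are hypothetical and not taken from OpenMC.
    #include <mpi.h>
    #include <omp.h>
    #include <cmath>
    #include <cstdio>
    #include <random>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        const long n_particles = 1'000'000;         // total particle histories
        const long n_local = n_particles / nranks;  // histories per MPI rank
        const double sigma_t = 1.0;                 // total macroscopic cross section (1/cm)
        const double thickness = 5.0;               // slab thickness (cm)

        long local_leakage = 0;  // particles escaping through the far face

        // Shared-memory parallelism over independent particle histories
        #pragma omp parallel reduction(+ : local_leakage)
        {
            std::mt19937_64 rng(42 + 1000 * rank + omp_get_thread_num());
            std::uniform_real_distribution<double> uniform(0.0, 1.0);

            #pragma omp for
            for (long p = 0; p < n_local; ++p) {
                double x = 0.0;   // position in the slab
                double mu = 1.0;  // direction cosine (starts moving forward)
                while (true) {
                    // Sample distance to the next collision (exponential distribution)
                    double d = -std::log(uniform(rng)) / sigma_t;
                    x += d * mu;
                    if (x > thickness) { ++local_leakage; break; }  // leaked out far face
                    if (x < 0.0) break;                             // escaped back out near face
                    if (uniform(rng) < 0.3) break;                  // absorbed at collision
                    mu = 2.0 * uniform(rng) - 1.0;                  // isotropic scatter
                }
            }
        }

        // Distributed-memory reduction of tallies across MPI ranks
        long total_leakage = 0;
        MPI_Reduce(&local_leakage, &total_leakage, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            std::printf("leakage fraction: %f\n",
                        (double)total_leakage / (double)(n_local * nranks));

        MPI_Finalize();
        return 0;
    }

In the real code, the physics, geometry, and tally data structures are far richer; the point here is only the layering of MPI ranks over OpenMP threads over independent particle histories.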

Indeed, the ExaSMR project—supported by the Exascale Computing Project (ECP)—aims to use OpenMC to model the entire core of a nuclear reactor. (Another ExaSMR code, the computational fluid dynamics (CFD) application NekRS, was previously featured in this series.)

Best practices

  • Open source development fosters co-design and greatly accelerates development speed, for both compilers and scientific applications.
  • Begin development with a mini-app to evaluate programming options.

 

Lessons learned

  • While OpenMP offloading compilers had varying levels of maturity at the outset of ECP, close collaboration between application and compiler developers resulted in the availability of mature and performant OpenMP compilers for all major GPU architectures.
  • When porting an existing application to GPU, it is important to empirically revalidate all pre-existing optimizations in the CPU codebase to verify whether they help or hinder GPU performance.

GPU programming models

As the GPU landscape broadened to include Intel and AMD alongside NVIDIA, it became advantageous for the ExaSMR team to move away from proprietary programming models in bringing OpenMC to GPU systems. (Aurora itself will feature Intel GPUs.)

“To this end, a number of ‘performance portability’ models started to appear that were non-proprietary and could theoretically offer portability between different vendor GPUs,” Tramm said.

To determine which programming model to use for their GPU port, the OpenMC team—including, in addition to Tramm, Argonne researchers Paul Romano, Amanda Lund, and Patrick Shriwise—began by coding a mini-app (called XSBench) in as many different GPU programming models as possible to compare and evaluate the various compilers and technology stacks available.

OpenMP target offloading was found to deliver key-kernel performance comparable to that achieved with NVIDIA’s proprietary CUDA, so throughout the following year the team worked to port OpenMC to GPU using this model.

Furthermore, the fact that the primary GPU vendors had committed to providing comprehensive support for OpenMP target offloading weighed in its favor.
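
To give a flavor of what OpenMP target offloading looks like in practice, the sketch below offloads an XSBench-style loop of independent cross-section lookups to a GPU. It is a minimal illustration using assumed names and data layouts, not code drawn from OpenMC or XSBench, and the same loop falls back to CPU execution under any OpenMP compiler without offload support.

    // Illustrative OpenMP target-offload kernel in the spirit of an XSBench-style
    // cross-section lookup loop. Names and data layout are hypothetical and not
    // taken from OpenMC or XSBench.
    #include <cstdio>
    #include <vector>

    int main() {
        const int n_lookups = 1 << 20;  // independent lookups (one per "particle")
        const int n_grid = 4096;        // points in a flattened energy grid
        const int n_xs = 5;             // reaction channels stored per grid point

        std::vector<double> xs_table(n_grid * n_xs, 1.0);  // flattened nuclear data
        std::vector<double> results(n_lookups, 0.0);

        const double* xs = xs_table.data();
        double* out = results.data();

        // Offload the loop to the device: every iteration is independent, so it
        // maps naturally onto the GPU's thread hierarchy.
        #pragma omp target teams distribute parallel for \
            map(to: xs[0:n_grid * n_xs]) map(from: out[0:n_lookups])
        for (int i = 0; i < n_lookups; ++i) {
            // Stand-in for sampling a particle energy and searching the grid
            int idx = (i * 2654435761u) % n_grid;
            double total = 0.0;
            for (int c = 0; c < n_xs; ++c)
                total += xs[idx * n_xs + c];  // accumulate the channel cross sections
            out[i] = total;
        }

        std::printf("first lookup result = %f\n", out[0]);
        return 0;
    }

Because each iteration is independent, the target teams distribute parallel for construct spreads the loop across the device's teams and threads, which is what allows a single directive-annotated source to target GPUs from different vendors.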

Engineering performant code

Tramm noted that achieving strong performance is a separate issue from securing multi-vendor support for OpenMP offloading.

“Namely, it doesn’t do us any good if we spend several years porting a massive application into a GPU-oriented programming model and generating new GPU-oriented algorithms and optimizations if the end result is a code that runs slower on GPU than if we had done nothing and just left it as a CPU-only code,” he said, going so far as to label such a scenario as an existential risk for GPU application teams.

“The trouble with porting a code can come in multiple forms. First of all, the numerical method of a code may be really challenging to get running efficiently on a GPU architecture, so a lot of research and optimization work may need to be done, and there are no guarantees that you’ll be able to come up with something that works,” he said. “On top of that, the ‘performance portability’ programming model a development team chooses may have a fatal flaw that hamstrings a key kernel and tanks the performance of the application, and there may be no recourse in getting such a problem fixed.”

He saw two primary criteria for achieving performance portability via OpenMP: first, whether the performance improvements gained from running on GPUs instead of CPUs make the additional code complexity worthwhile; and second, whether performance is consistently strong and efficient irrespective of which vendor’s GPUs are being used (that is, whether the code is vendor-agnostic).

Debugging and optimization through collaboration

The central obstacles the OpenMC team faced in porting their code were compiler bugs and the algorithmic optimization challenges posed by GPUs.

Initial performance of OpenMC on GPU systems was poor, but collaboration with colleagues was key to solving the problems holding the code back. The team worked to fix compiler bugs by forming close relationships with the Intel and LLVM OpenMP compiler teams, while ALCF staff reported issues and established automated testing routines to identify compiler regressions whenever a new compiler version was released.

“This co-design process was invaluable for moving both our application and the OpenMP compiler technology stack forward rapidly, and was key to having our code ready for deployment,” Tramm said. “In some instances, I would report an issue I was seeing with the LLVM compiler to the LLVM OpenMP offloading lead, and five minutes later he would have a patch for the compiler ready for immediate testing. Intel engineers were very proactive in isolating compiler issues we were seeing and getting fixes back to us quickly.

“We also had crucial contributions from other ExaSMR participants whose ideas greatly impacted our OpenMC optimization campaign. ALCF staff helped us engineer an on-device sorting technique for use with Intel GPUs that helped with performance a lot. So it was definitely a group effort, and we greatly benefitted from the huge amount of expertise available to us both at Argonne and also from our broader ECP connections.”

By the middle of 2021, Tramm’s team had a reliable OpenMC port running on NVIDIA GPUs via the open-source LLVM compiler. This enabled them to concentrate their development efforts on the algorithmic-optimization aspects of the OpenMC code.

“We were able to implement a number of really exciting optimizations,” Tramm said. “Some of them came from other Monte Carlo application teams that had already published some of their experiences running on GPU, so we had some hints already as to a few useful changes that we’d want to put into OpenMC to improve GPU performance. Once these were in, though, we started playing around with our code and developing our own new methods for optimizing on GPU.

“In this period of our development, it seemed like every week or so we would come up with a new idea, add it into the code, and start testing it out—not infrequently seeing 20 percent performance gains stacking up from each change we implemented. Some of the optimizations we developed were useless, while others (like removing memory-intensive particle data caches and sorting particles to improve memory locality) ended up netting massive, multifold speedups.”
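
To illustrate the flavor of one such change, the sketch below sorts a batch of particles by material and energy before a lookup-heavy kernel so that adjacent threads touch adjacent data. It is a simplified, hypothetical example of the general particle-sorting idea, not OpenMC’s actual implementation, which performs this kind of sort on-device.

    // Illustrative sketch of sorting particles to improve memory locality before a
    // lookup-heavy kernel. Hypothetical types and names -- not the OpenMC
    // implementation, which performs this kind of sort on-device.
    #include <algorithm>
    #include <cstdio>
    #include <random>
    #include <vector>

    struct Particle {
        double energy;   // sort key (eV)
        int material;    // sort key (which nuclear data the particle needs next)
        double x, y, z;  // position, unused in this sketch
    };

    int main() {
        // Build a batch of particles with random energies and materials
        std::mt19937_64 rng(12345);
        std::uniform_real_distribution<double> e_dist(1e-5, 2e7);
        std::uniform_int_distribution<int> m_dist(0, 7);

        std::vector<Particle> particles(1'000'000);
        for (auto& p : particles)
            p = {e_dist(rng), m_dist(rng), 0.0, 0.0, 0.0};

        // Sort by (material, energy) so that adjacent particles -- and therefore
        // adjacent GPU threads in the next kernel -- read adjacent cross-section
        // data rather than scattering accesses across the whole nuclear data set.
        std::sort(particles.begin(), particles.end(),
                  [](const Particle& a, const Particle& b) {
                      return a.material != b.material ? a.material < b.material
                                                      : a.energy < b.energy;
                  });

        std::printf("first particle after sort: material %d, energy %g eV\n",
                    particles.front().material, particles.front().energy);
        return 0;
    }

Grouping particles that need the same nuclear data means neighboring threads issue similar memory requests, recovering the locality that a purely history-ordered batch tends to destroy.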

By early 2022, his team had largely identified and implemented the key optimizations that would be critical for running OpenMC efficiently on GPUs.

“While we were running extremely fast on NVIDIA GPUs at this point, our AMD and Intel performance was still lacking,” he said. “That meant we still hadn’t really yet achieved performance portability. To this end, we again worked very closely with the Intel and LLVM OpenMP compiler teams to identify issues in their compilers for the Intel and AMD architectures, and found that they were able to implement fixes that started getting us great performance across vendors by early 2023.”

GPU performance results and Aurora

The team’s GPU-oriented version of OpenMC has been completed and is already running on a number of GPU-based supercomputers, including Sunspot—the ALCF’s Aurora testbed and development system—and the ALCF’s NVIDIA-based Polaris. While the team’s efforts are focused on honing performance on Aurora, the OpenMP offloading model has delivered strong performance on every machine on which the code has been deployed, irrespective of vendor.

Operational success of the ExaSMR project hinges on a key performance parameter for a depleted small modular reactor problem, measured in terms of particle histories per second per GPU. The ECP-defined goal is to deliver a fiftyfold speedup over what could be accomplished at the time of ECP’s inception with the 20-petaflop systems that then represented the state of the art, which equated to approximately 10 million particles per second at full-machine scale.

“Last year, we were able to achieve a major milestone by becoming the first known Monte Carlo app able to perform 1 million particle histories per second per GPU on a depleted fuel reactor simulation problem with the Intel Ponte Vecchio GPU,” Tramm reported. “This is equivalent to saying that a single six-GPU node of Aurora delivers about as much performance for OpenMC as would approximately 75 high-end dual-socket CPU nodes, or about 3,600 traditional Intel Xeon CPU cores.”

For the ExaSMR project, the team next needs to couple OpenMC with the NekRS CFD code using a harness known as ENRICO, and to carry out extreme-scale multiphysics simulations on massively parallel systems, among them Aurora.

Current full-machine projections for OpenMC on Aurora, based on preliminary simulation runs performed on Sunspot, are in the ballpark of 25 billion particle histories per second—indicating a speedup of some 2,500x over what could be achieved at full-machine scale at the time of the ECP’s inception (the goal, again, had been a fiftyfold speedup). This extraordinary increase is attributable, according to Tramm, both to substantial improvements in GPU architecture and to the novel algorithmic techniques and optimizations implemented in the GPU port of OpenMC to take advantage of those architectural improvements.
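
As a rough consistency check of that figure, dividing the projected full-machine rate by the roughly 10 million particle histories per second achievable at full-machine scale when ECP began (the baseline cited above) recovers the quoted factor:

    \frac{25 \times 10^{9}\ \text{particle histories/s}}{10 \times 10^{6}\ \text{particle histories/s}} = 2500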

==========

The Best Practices for GPU Code Development series highlights researchers' efforts to optimize codes to run efficiently on the ALCF's Aurora supercomputer.

The Argonne Leadership Computing Facility provides supercomputing capabilities to the scientific and engineering community to advance fundamental discovery and understanding in a broad range of disciplines. Supported by the U.S. Department of Energy’s (DOE’s) Office of Science, Advanced Scientific Computing Research (ASCR) program, the ALCF is one of two DOE Leadership Computing Facilities in the nation dedicated to open science.

Argonne National Laboratory seeks solutions to pressing national problems in science and technology. The nation's first national laboratory, Argonne conducts leading-edge basic and applied scientific research in virtually every scientific discipline. Argonne researchers work closely with researchers from hundreds of companies, universities, and federal, state and municipal agencies to help them solve their specific problems, advance America's scientific leadership and prepare the nation for a better future. With employees from more than 60 nations, Argonne is managed by UChicago Argonne, LLC for the U.S. Department of Energy's Office of Science.

The U.S. Department of Energy's Office of Science is the single largest supporter of basic research in the physical sciences in the United States and is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science