Researchers optimize codes for Polaris at recent hackathon


The ALCF-NVIDIA GPU Hackathon hosted a total of 11 teams to help them get their applications running efficiently on high-performance computing machines such as Polaris. (Image: Argonne National Laboratory) 

The ALCF hosted its second GPU Hackathon to help attendees improve application performance on the facility’s high-performance computing resources. 

The powerful computing resources that drive innovation in scientific research evolve rapidly, with new hardware and software technologies emerging constantly.

To help keep the user community apprised of such developments, the Argonne Leadership Computing Facility (ALCF), a U.S. Department of Energy (DOE) Office of Science user facility at DOE’s Argonne National Laboratory, hosts several annual events to train researchers on how to best utilize various software, systems, and machines to further explore the possibilities of science.

This year, the ALCF, in collaboration with NVIDIA and the OpenACC Organization, hosted a multi-day virtual hackathon, the first event with access to Argonne’s new Polaris system, an HPE Apollo Gen10+ machine equipped with NVIDIA® A100 Tensor Core GPUs (graphics processing units) and AMD EPYC processors.

The hackathon is designed to help teams of three to six developers accelerate their codes on ALCF resources using a portable programming model or an AI framework of their choice. Each team is assigned mentors from ALCF and NVIDIA, who utilize their expertise and experience to guide participants on porting their code to GPUs and optimizing its performance.

A total of 11 teams participated this year, researching a vast array of topics including black hole imaging, fusion plasma dynamics, and the development of synthetic genes that will help predict the viral escape of SARS-CoV-2 genomes. With access to Polaris, these teams were able to optimize their codes on the ALCF’s largest GPU-powered system to date.


Argonne's Brian Homerding provides an overview of Polaris hardware at the Hackathon. (Image: Argonne National Laboratory) 

The Polaris software environment is equipped with the HPE Cray programming environment, HPE Performance Cluster Manager (HPCM) system software, and the ability to test programming models, such as OpenMP and SYCL, that will be available on Aurora and the next generation of DOE’s high-performance computing (HPC) systems. Those who utilized Polaris this year also benefited from the NVIDIA HPC Software Development Kit (SDK), a suite of compilers, libraries, and tools that supports GPU acceleration of HPC modeling and simulation applications.

However, the users were not the only ones who benefited from Polaris during the hackathon: by stress testing the system and identifying software issues, they provided information that helped ALCF staff improve the Polaris software environment ahead of its deployment to the broader HPC community. Attendees also asked numerous questions about using the new system, such as how to submit jobs, compile codes, and use performance and debugging tools, which in turn helped ALCF staff improve the support documentation.

“While we strive for having all the hardware and software kinks worked out of a new system before opening it up, there are unfortunately always some issues that new users will experience on a system given the great variety of workloads we support at ALCF,” says Chris Knight, ALCF computational scientist. “Opening Polaris access for this group of application developers spanning simulation, data, and learning workloads during the hackathon, where they could work directly with ALCF staff to resolve issues, greatly improved the initial user experience for the rest of the community when they gained access.”

For the Black Hole Hunter team from the Center for Astrophysics | Harvard & Smithsonian, the hackathon offered an opportunity to advance the development of their GPU-based application for processing data observed by the next-generation Event Horizon Telescope (ngEHT) to reconstruct black hole images. Working with their mentors and ALCF computing resources, the researchers set out to enhance the computational efficiency of GPU kernel functions and improve end-to-end throughput by tuning input/output (I/O).

The team learned how to use the NVIDIA Nsight™ Systems tool to analyze the performance of each component of their application, as well as the importance of careful profiling to isolate and treat performance bottlenecks. They discovered that their application spent much more time on memory copies than on GPU computing, indicating a need to increase the concurrency of the two processes. After removing the redundant time monitor code in their GPU module, the team was able to reduce the time consumed per data block from 5 milliseconds to 1.4 milliseconds.
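As a rough sanity check, the per-block timing figures above imply a sizable throughput gain. The calculation below is illustrative and uses only the numbers reported in this article; the data-block size and any pipeline details are not specified here:

```python
# Illustrative arithmetic based on the per-block times reported above.
ms_before = 5.0   # time per data block before the fix (milliseconds)
ms_after = 1.4    # time per data block after removing the redundant timing code

speedup = ms_before / ms_after
blocks_per_sec_before = 1000.0 / ms_before
blocks_per_sec_after = 1000.0 / ms_after

print(f"speedup: {speedup:.2f}x")  # → speedup: 3.57x
print(f"throughput: {blocks_per_sec_before:.0f} -> {blocks_per_sec_after:.0f} blocks/s")
```

In other words, the fix corresponds to roughly a 3.6× speedup on this stage, raising per-stream throughput from about 200 to about 714 data blocks per second.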

"Our mentors gave us very valuable suggestions and advice on how to optimize the modules," says Wei Yu, a member of the Black Hole Hunter team. "We made excellent connections with our ALCF mentors and the entire hackathon team at Argonne and NVIDIA, as well as the participants."

The team plans to further optimize their application with their hackathon mentors helping out as advisors along the way.

Another team attended the hackathon to continue their work to improve the performance of the Nek5000/NekRS computational fluid dynamics code on various advanced architectures. The team, consisting of researchers from Argonne and the University of Illinois, was able to scale to all of Polaris, demonstrating 80% strong-scaling efficiency at 3 million points per GPU.
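Strong-scaling efficiency compares the speedup actually achieved when a fixed-size problem is spread across more GPUs against ideal linear speedup. A minimal sketch of that calculation, using hypothetical run times chosen to reproduce the 80% figure reported above (the article does not give the underlying timings):

```python
# Strong-scaling efficiency: measured speedup divided by ideal linear speedup.
# The timings below are hypothetical; the article reports only the 80% result.
def strong_scaling_efficiency(t_base, n_base, t_n, n):
    """Efficiency of a run on n workers relative to a baseline on n_base workers."""
    speedup = t_base / t_n   # measured speedup over the baseline
    ideal = n / n_base       # ideal linear speedup from adding workers
    return speedup / ideal

# e.g., a fixed-size problem taking 100 s on 4 GPUs and 15.625 s on 32 GPUs
eff = strong_scaling_efficiency(t_base=100.0, n_base=4, t_n=15.625, n=32)
print(f"{eff:.0%}")  # → 80%
```

An efficiency of 100% would mean perfectly linear scaling; 80% at full-machine scale indicates that most of the added GPUs still translate into reduced time-to-solution.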

At scale, the team made algorithmic improvements that led to a 10% improvement in time-to-solution. The researchers also demonstrated that running NekRS on Polaris reduced the time-to-solution for a Nek5000 benchmark problem for the U.S. Nuclear Regulatory Commission by an order of magnitude compared to the ALCF’s Theta system.

Stay tuned to the ALCF events webpage for details on upcoming facility workshops and training events.


The Argonne Leadership Computing Facility provides supercomputing capabilities to the scientific and engineering community to advance fundamental discovery and understanding in a broad range of disciplines. Supported by the U.S. Department of Energy’s (DOE’s) Office of Science, Advanced Scientific Computing Research (ASCR) program, the ALCF is one of two DOE Leadership Computing Facilities in the nation dedicated to open science.

Argonne National Laboratory seeks solutions to pressing national problems in science and technology. The nation's first national laboratory, Argonne conducts leading-edge basic and applied scientific research in virtually every scientific discipline. Argonne researchers work closely with researchers from hundreds of companies, universities, and federal, state and municipal agencies to help them solve their specific problems, advance America's scientific leadership and prepare the nation for a better future. With employees from more than 60 nations, Argonne is managed by UChicago Argonne, LLC for the U.S. Department of Energy's Office of Science.

The U.S. Department of Energy's Office of Science is the single largest supporter of basic research in the physical sciences in the United States and is working to address some of the most pressing challenges of our time. For more information, visit the Office of Science website.