Argonne’s new Sunspot testbed provides on-ramp for Aurora exascale supercomputer

Sunspot is a two-rack test and development system equipped with 128 nodes of the same technologies that will power Argonne's Aurora exascale supercomputer. (Image by Argonne National Laboratory)

With hardware that is identical to Aurora, Sunspot is enabling researchers to advance efforts to scale and optimize codes for the ALCF's exascale system.

Researchers preparing scientific codes and workloads to run on the Aurora exascale supercomputer at the U.S. Department of Energy’s (DOE) Argonne National Laboratory now have a new resource at their disposal.

Sunspot is powered by the same Intel Xeon CPU Max Series processors and Intel Data Center GPU Max Series processors that are found in Aurora. (Image by Argonne National Laboratory)

Named Sunspot, the new test and development system has the same architecture as Aurora, which is currently under construction at the Argonne Leadership Computing Facility (ALCF), a DOE Office of Science user facility. Aurora, an Intel-Hewlett Packard Enterprise (HPE) system, will comprise more than 10,000 nodes, each equipped with two new Intel Xeon CPU Max Series processors and six Intel Data Center GPU Max Series processors. Sunspot is a two-rack testbed with 128 nodes of the same technologies.
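To put those node counts in perspective, the short Python sketch below tallies processors per system. It assumes that each Sunspot node carries the same two CPUs and six GPUs described for an Aurora node, which the article implies but does not state outright. The names `NodeSpec` and `totals` are hypothetical, introduced only for illustration, and the Aurora figure is a lower bound because the article says only "more than 10,000 nodes."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Per-node processor counts as described for Aurora in the article."""
    cpus: int = 2  # Intel Xeon CPU Max Series processors per node
    gpus: int = 6  # Intel Data Center GPU Max Series processors per node


def totals(nodes: int, spec: NodeSpec = NodeSpec()) -> dict:
    """Return system-wide CPU and GPU counts for a given number of nodes."""
    return {
        "nodes": nodes,
        "cpus": nodes * spec.cpus,
        "gpus": nodes * spec.gpus,
    }


# Assumption: Sunspot nodes use the same per-node layout as Aurora nodes.
sunspot = totals(128)

# Lower bound only: the article says "more than 10,000 nodes".
aurora_floor = totals(10_000)

if __name__ == "__main__":
    print("Sunspot:", sunspot)
    print("Aurora (at least):", aurora_floor)
```

Under that assumption, Sunspot's 128 nodes hold 256 CPUs and 768 GPUs, while Aurora would hold at least 20,000 CPUs and 60,000 GPUs.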

“Sunspot is basically a miniature version of Aurora,” said Susan Coghlan, ALCF project director for Aurora. “It gives teams a platform to optimize code performance on the actual Aurora hardware, including the system’s Intel CPUs (central processing units) and GPUs (graphics processing units), and the HPE Slingshot interconnect that connects all the components together.”

Prior to Sunspot’s arrival, development teams leveraged earlier Aurora testbeds (Iris, Arcticus, and Florentia at Argonne, and Borealis at Intel) and DOE supercomputers, including Argonne’s Polaris, to carry out exascale code development. While those systems continue to be useful tools for Aurora preparations, Sunspot’s identical architecture gives researchers an ideal environment to further optimize application performance for the exascale supercomputer.

“Test and development systems are an important on-ramp for larger production supercomputers,” said Tim Williams, co-manager of the ALCF’s Aurora Early Science Program (ESP). “With our Early Science Program for new supercomputers, the goal is to be ready for science on day one of deploying a new system. Testbeds like Sunspot allow researchers to carry out performance studies and scale up their workloads to run on much larger supercomputers while those systems are still being built.”

Since Sunspot’s launch in December, over 180 researchers from over 20 application development teams from the ESP and DOE’s Exascale Computing Project (ECP) have begun accessing the testbed for scaling and performance optimization research. The Aurora ESP is supporting 15 research teams tasked with preparing key applications for the architecture and scale of the new supercomputer, with a strong emphasis on incorporating data-intensive computing and AI applications. In the process, the ESP teams also help solidify software libraries and infrastructure to pave the way for other researchers to run on the system. The ECP, on the other hand, is a broader effort with a similar end goal. Launched in 2016, the ECP is a massive multi-institutional initiative focused on building a capable exascale computing ecosystem. This includes developing the applications, software, and hardware technologies that will support science on the nation’s first exascale systems.

Williams noted that the ESP and ECP teams’ early runs on the Intel Max Series GPUs have been promising. At the recent HPC Asia 2023 conference, Williams and colleagues (Venkat Vishwanath, ALCF data science team lead and ESP co-manager, and Scott Parker, ALCF performance engineering team lead) presented initial results comparing performance against leading alternative GPUs.

  • As part of the ECP ExaSMR (Exascale Small Modular Reactor) project, researchers achieved 30-70% performance improvements with NekRS, a GPU-oriented thermal-fluids simulation code, across a set of benchmark problems.
  • Another ExaSMR code, OpenMC, which is used for neutron and photon transport simulations, showed a 205% performance advantage on the Intel GPUs.
  • Supported by ESP and ECP projects, the Argonne-developed Hardware/Hybrid Accelerated Cosmology Code (HACC) has seen 2.6x speedups in early runs on the hardware.
  • QMCPACK, a quantum Monte Carlo code used for electronic structure calculations, has shown a 50% improvement in runs thus far. QMCPACK’s exascale development is supported by both ESP and ECP.
  • XGC, a fusion plasma simulation code that is also supported by ESP and ECP, has run 60% faster on an initial test problem.
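The results above mix percentages and multiplicative factors, so a common scale can make them easier to compare. The Python sketch below converts each reported figure into a speedup factor. It rests on one assumption the article does not confirm: that "N% improvement" or "N% faster" means the new rate equals the baseline rate times (1 + N/100). Under a different reading (for example, treating "205%" as 2.05x), the factors would differ. The helper names `percent_to_speedup` and `format_table` are hypothetical.

```python
def percent_to_speedup(percent_gain: float) -> float:
    """Convert a percent improvement into a speedup factor.

    Assumption (not stated in the article): an N% improvement means
    new_rate = baseline_rate * (1 + N / 100).
    """
    return 1.0 + percent_gain / 100.0


# Figures as reported in the article; baselines are not specified there.
reported = {
    "NekRS (low)": percent_to_speedup(30),
    "NekRS (high)": percent_to_speedup(70),
    "OpenMC": percent_to_speedup(205),
    "HACC": 2.6,  # already reported as a multiplicative factor
    "QMCPACK": percent_to_speedup(50),
    "XGC": percent_to_speedup(60),
}


def format_table(factors: dict) -> list:
    """Render each code's speedup factor as a 'name: N.NNx' line."""
    return [f"{name}: {factor:.2f}x" for name, factor in factors.items()]


if __name__ == "__main__":
    for line in format_table(reported):
        print(line)
```

These factors are only as comparable as the underlying benchmarks: each code was measured on its own problem set, and the article does not name the GPUs used as the baseline.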

The ALCF team expects the codes to see further performance improvements as researchers continue to do multi-node scaling and optimization work on Sunspot and other available computing resources. The ALCF is also using the testbed for various Aurora training events, including ESP hackathons and a tutorial at the ECP’s recent 2023 annual meeting.

In addition to helping researchers prepare applications for Aurora, Sunspot is valuable to the ALCF and Intel as they continue work to stand up the exascale system. For example, the team is using Sunspot’s Intel DAOS (Distributed Asynchronous Object Storage) system to test and enhance I/O performance.

“Sunspot is the first time we’re seeing how everything is working together,” Coghlan said. “We learn a lot from these runs. It gives us a chance to iron out some of the kinks before Aurora is ready for users.”

“Some bugs don’t show up until you start running real applications on the hardware; that’s the whole idea behind the Early Science Program,” Williams added. “These early runs help with uncovering and, in some cases, actually diagnosing issues.”

Sunspot is expected to serve a role even after Aurora is powered on. Like the ALCF’s previous test and development systems, Sunspot can be a proving ground for new users to test and optimize code performance before moving to Aurora. ALCF staff can also use it to validate and benchmark new software that is targeted for Aurora.

==========

The Argonne Leadership Computing Facility provides supercomputing capabilities to the scientific and engineering community to advance fundamental discovery and understanding in a broad range of disciplines. Supported by the U.S. Department of Energy’s (DOE’s) Office of Science, Advanced Scientific Computing Research (ASCR) program, the ALCF is one of two DOE Leadership Computing Facilities in the nation dedicated to open science.

Argonne National Laboratory seeks solutions to pressing national problems in science and technology. The nation's first national laboratory, Argonne conducts leading-edge basic and applied scientific research in virtually every scientific discipline. Argonne researchers work closely with researchers from hundreds of companies, universities, and federal, state and municipal agencies to help them solve their specific problems, advance America's scientific leadership and prepare the nation for a better future. With employees from more than 60 nations, Argonne is managed by UChicago Argonne, LLC for the U.S. Department of Energy's Office of Science.

The U.S. Department of Energy's Office of Science is the single largest supporter of basic research in the physical sciences in the United States and is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science
