In the world of high-performance computing (HPC), the convergence of artificial intelligence (AI) and data science with traditional modeling and simulation is changing the way researchers use supercomputers for scientific discovery.
To help scientists find their footing in this ever-changing landscape, the Argonne Leadership Computing Facility (ALCF), a U.S. Department of Energy (DOE) Office of Science User Facility, has placed a premium on training researchers to use emerging AI and machine learning tools and techniques efficiently on its supercomputing resources.
Through a regular series of workshops and webinars and a special allocation program called the ALCF Data Science Program (ADSP), the facility has been working to build a community of scientists who can employ these methods at a scale that requires the DOE’s leadership-class computing resources.
“Ten years ago, people were primarily only using our supercomputers for numerical simulations. These large simulations output a lot of data, but they didn't need a lot of data input to do their analysis,” said Taylor Childers, ALCF computer scientist. “With AI, deep learning and machine learning bursting onto the scene in the last five years, we’ve put a lot of effort into onboarding new research teams that are not accustomed to using HPC.”
Childers was one of the organizers of the ALCF’s recent Simulation, Data, and Learning Workshop, an annual event focused on helping participants improve the performance and productivity of data-intensive and machine learning applications. Held virtually in December, the event welcomed more than 100 attendees from across the world.
“Our workshops not only provide guidance on using AI and HPC for science, they also give people an opportunity to engage with us and find out how ALCF resources can potentially benefit their research,” Childers said.
In addition to training events, the ALCF has been building the hardware and software infrastructure needed to support research enabled by the confluence of simulation, data, and learning methods. On the hardware side, the facility recently deployed ThetaGPU, an upgrade to its Theta supercomputer that is powered by graphics processing units (GPUs). The augmented system provides enhanced capabilities for data analytics and AI training and learning. The ALCF’s next-generation systems, Polaris and Aurora, will also be hybrid CPU-GPU machines designed to handle AI and data-intensive workloads.
In the software space, the ALCF has been building up its support for machine learning frameworks, including TensorFlow, PyTorch, Horovod, and Python-based modules, libraries, and performance profilers, as well as tools for data-intensive science, such as JupyterHub and MongoDB. In addition, the facility has developed and deployed its own tools, such as the Balsam HPC workflow and edge service, and DeepHyper, a scalable, automated machine learning package for hyperparameter optimization and neural architecture search. Together with the expertise of ALCF staff, these hardware and software tools are helping researchers open new frontiers in scientific computing.
In 2020, the ALCF’s traditionally in-person workshops transitioned to virtual events due to the COVID-19 pandemic. While the facility has been hosting online training webinars for years, the ALCF Computational Performance Workshop in May was its first large-scale user workshop to go completely virtual. Leveraging tools like Zoom for video conferencing and Slack for instant messaging and collaboration, the ALCF’s virtual workshops have been successful in recreating the collaborative, hands-on nature of its on-site training events. For the Simulation, Data, and Learning Workshop, the ALCF employed a GitHub repository that contained all of the code and instructions for the planned activities.
To make the experience more engaging, the workshop was structured entirely around hands-on tutorials with opportunities to interact throughout. Day one was designed to help participants get distributed deep learning code and data pipelines running on ALCF systems; day two covered using DeepHyper for hyperparameter optimization and how to profile and improve application performance; and day three was dedicated to getting the attendees’ deep learning networks deployed at scale in a simulation.
“The first day covered what you need to scale up a deep learning problem from your laptop to ALCF resources,” Childers said. “The next two days were successive advancements, focused on improving performance and ultimately on how to use the trained model in your research.”
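For readers new to the idea, hyperparameter optimization amounts to trying many training configurations and keeping the one that scores best. The sketch below illustrates that loop with a plain random search in standard-library Python. It is not DeepHyper's API or the workshop's material; the `validation_loss` function, the search ranges, and the parameter names are hypothetical stand-ins for a real training run.

```python
import random


def validation_loss(learning_rate, batch_size):
    """Hypothetical stand-in for training a model and returning its validation loss."""
    # Toy surface with its minimum at learning_rate=0.01 and batch_size=64.
    return (learning_rate - 0.01) ** 2 * 1e4 + ((batch_size - 64) / 64) ** 2


def random_search(num_trials, seed=0):
    """Sample configurations at random and keep the best one seen."""
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(num_trials):
        config = {
            # Log-uniform sampling between 1e-4 and 1e-1 (assumed range).
            "learning_rate": 10 ** rng.uniform(-4, -1),
            "batch_size": rng.choice([16, 32, 64, 128, 256]),
        }
        loss = validation_loss(**config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


if __name__ == "__main__":
    config, loss = random_search(num_trials=50)
    print(f"best configuration: {config}, loss: {loss:.4f}")
```

In a real workflow each trial is a full training run, which is why tools that distribute trials across many nodes become attractive at scale.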
The virtual format has the added benefit of making events more accessible to a larger base of participants. While a majority of attendees were from U.S. institutions, the workshop also welcomed international participants from England, Argentina, and Ghana.
“Virtual events have always meant an open chance to participate in events that I would never have been able to otherwise,” said Dario Dematties, a postdoctoral researcher at the CONICET Mendoza Technological Scientific Center in Argentina.
Attendees take their research to the next level
For Dematties, the workshop presented an opportunity to advance his work involving contrastive learning for visual representations, an approach that uses machine learning to identify similarities and differences in images. He was conducting his research with a Director’s Discretionary allocation on Cooley, the ALCF’s visualization and analysis cluster, but was having trouble getting his workflow to run on multiple nodes. Working with ALCF staff at the workshop, Dematties transitioned his work to the larger ThetaGPU system.
“After attending the first section on distributed deep learning, I decided to approach some experts. We didn't succeed on the first try, but we did on the last day of the workshop,” Dematties said. “I had never run a PyTorch machine learning workflow on so many distributed GPUs. That was amazing. Thanks to this event, I now have an approved allocation for running my project on ThetaGPU.”
Maruti Mudunuru, an Earth scientist at DOE’s Pacific Northwest National Laboratory, attended the workshop to learn how distributed deep learning and hyperparameter optimization can be used to enhance watershed modeling.
“The hands-on sessions helped me a lot,” Mudunuru said. “I will be applying what I learned about developing scalable deep learning models that use Horovod and DeepHyper to my Earth science research.”
After the workshop, Mudunuru submitted a proposal for a Director’s Discretionary award to continue developing a scalable deep learning workflow that enables fast, accurate, and reliable calibration of watershed-based process models.
“The focus of my research at the ALCF is to advance watershed modeling at the system scale using machine learning and develop reduced-order models of river corridor processes,” Mudunuru said. “My goal is to test a proof of concept with my discretionary allocation. If my outcomes are successful, I plan to submit a proposal for the ALCF Data Science Program next year.”
Improving the performance of ongoing projects
In addition to helping new users scale up their research for ALCF systems, the workshop is also useful for existing facility users looking to employ new tools and techniques that can accelerate their research.
Argonne’s Ming Du, for example, attended the workshop to learn how machine learning frameworks can advance his project aimed at developing an accurate and efficient HPC framework for solving the dense 3D reconstruction problem in X-ray microscopy. The research began as part of an ADSP project and is now being pursued under a project awarded by DOE’s Advanced Scientific Computing Research (ASCR) Leadership Computing Challenge.
“The curriculum of the workshop aligned very well with our project needs,” said Du, a postdoctoral researcher at the Advanced Photon Source (APS), a DOE Office of Science User Facility located at Argonne.
During the hands-on sessions, Du worked with ALCF staff members to perform a scaling test for distributed training using the PyTorch DistributedDataParallel (DDP) module.
“This experience has pointed us to a clear pathway for improving the scaling performance of our framework in the future,” Du said. “We are planning to employ the more efficient DDP, which is expected to reduce the communication overhead of our application by a huge factor.”
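A scaling test of this kind usually comes down to timing the same training job on increasing numbers of GPUs and comparing the measured speedup with the ideal. The sketch below shows that arithmetic in plain Python. It does not use PyTorch or reproduce Du's framework, and the timings in the example are hypothetical numbers chosen for illustration, not results from the workshop.

```python
def scaling_report(timings):
    """Compute speedup and parallel efficiency relative to the smallest run.

    timings maps number of GPUs -> seconds per training epoch.
    """
    base_gpus = min(timings)
    base_time = timings[base_gpus]
    report = {}
    for gpus in sorted(timings):
        speedup = base_time / timings[gpus]
        ideal = gpus / base_gpus  # perfect scaling would match this factor
        report[gpus] = {"speedup": speedup, "efficiency": speedup / ideal}
    return report


if __name__ == "__main__":
    # Hypothetical measurements, not results from the workshop.
    measured = {1: 400.0, 2: 210.0, 4: 115.0, 8: 70.0}
    for gpus, row in scaling_report(measured).items():
        print(f"{gpus} GPUs: speedup {row['speedup']:.2f}x, "
              f"efficiency {row['efficiency']:.0%}")
```

Efficiency that falls as GPUs are added generally points to communication or data-loading overhead, the kind of cost that a more efficient distributed training module is meant to reduce.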
Du also benefited from the workshop session focused on coupling simulation in C++ with deep learning in Python. He plans to use the knowledge he picked up at the event to train a deep neural network surrogate model that will be used to perform wave propagation simulations more quickly and efficiently than his team’s current method.
The opportunity to use ThetaGPU at the workshop was another perk that will help Du and his colleagues prepare their research for the ALCF’s next-generation systems.
“This was the first time I got a chance to run applications on this machine, and that experience will be extremely helpful for us to optimize our framework for both ThetaGPU and the upcoming Aurora system, which will also be a GPU-accelerated machine,” Du said.
The Argonne Leadership Computing Facility provides supercomputing capabilities to the scientific and engineering community to advance fundamental discovery and understanding in a broad range of disciplines. Supported by the U.S. Department of Energy’s (DOE’s) Office of Science, Advanced Scientific Computing Research (ASCR) program, the ALCF is one of two DOE Leadership Computing Facilities in the nation dedicated to open science.
About the Advanced Photon Source
The U.S. Department of Energy Office of Science’s Advanced Photon Source (APS) at Argonne National Laboratory is one of the world’s most productive X-ray light source facilities. The APS provides high-brightness X-ray beams to a diverse community of researchers in materials science, chemistry, condensed matter physics, the life and environmental sciences, and applied research. These X-rays are ideally suited for explorations of materials and biological structures; elemental distribution; chemical, magnetic, electronic states; and a wide range of technologically important engineering systems from batteries to fuel injector sprays, all of which are the foundations of our nation’s economic, technological, and physical well-being. Each year, more than 5,000 researchers use the APS to produce over 2,000 publications detailing impactful discoveries, and solve more vital biological protein structures than users of any other X-ray light source research facility. APS scientists and engineers innovate technology that is at the heart of advancing accelerator and light-source operations. This includes the insertion devices that produce extreme-brightness X-rays prized by researchers, lenses that focus the X-rays down to a few nanometers, instrumentation that maximizes the way the X-rays interact with samples being studied, and software that gathers and manages the massive quantity of data resulting from discovery research at the APS.
This research used resources of the Advanced Photon Source, a U.S. DOE Office of Science User Facility operated for the DOE Office of Science by Argonne National Laboratory under Contract No. DE-AC02-06CH11357.
Argonne National Laboratory seeks solutions to pressing national problems in science and technology. The nation’s first national laboratory, Argonne conducts leading-edge basic and applied scientific research in virtually every scientific discipline. Argonne researchers work closely with researchers from hundreds of companies, universities, and federal, state and municipal agencies to help them solve their specific problems, advance America’s scientific leadership and prepare the nation for a better future. With employees from more than 60 nations, Argonne is managed by UChicago Argonne, LLC for the U.S. Department of Energy’s Office of Science.
The U.S. Department of Energy’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States and is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.