Aurora software development: Developing deep learning frameworks for exascale


In this series, we examine the range of activities and collaborations that ALCF staff undertake to guide the facility and its users into the next era of scientific computing.

Argonne National Laboratory's Corey Adams is leading efforts to deploy advanced deep learning frameworks on Aurora, the forthcoming exascale system set for delivery next year at the Argonne Leadership Computing Facility (ALCF), a U.S. Department of Energy (DOE) Office of Science User Facility.

Adams, a computer scientist in the ALCF’s Data Science Group, has a joint appointment with Argonne’s Physics Division. His research lies at the intersection of deep learning, AI, and fundamental physics. It encompasses contributions to Aurora non-recurring engineering (NRE) efforts that target Aurora Early Science Program (ESP) projects, including connectomics brain-mapping work and the CANDLE virtual drug response prediction and cancer treatment application, as well as applications for astrophysics, neutrino physics, lattice quantum chromodynamics, Argonne’s Advanced Photon Source (APS), and the Large Hadron Collider.

Collaborating with Intel to deliver Aurora

With the arrival of Aurora drawing closer, the Data Science Group is working to ensure that the AI applications set for deployment on the system will be fully performant on Day One, meaning that they run well and scale well in relatively bug-free implementations. To this end, Adams and his colleagues selected a number of Argonne workloads that represent innovative AI-for-science approaches poised to benefit from the Aurora architecture.

To grow application capabilities from a science perspective, they built on the computer vision benchmarks that Intel established while developing its various deep learning and AI frameworks; Adams serves as the ALCF point of contact for Intel's deep learning projects.

Performance tracking is twofold: Intel reports metrics for the selected applications, while the Argonne team uses GitLab CI/CD on Joint Laboratory for System Evaluation (JLSE) testbeds to track application performance and stability, conducting tests on a weekly basis.
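A minimal sketch of such a pipeline is shown below. The job name, runner tag, and benchmark script are hypothetical placeholders rather than the team's actual configuration; the weekly cadence is expressed through a scheduled pipeline.

```yaml
# Hypothetical .gitlab-ci.yml sketch; job names, runner tags, and scripts are
# illustrative placeholders, not the actual ALCF configuration.
stages:
  - benchmark

weekly_performance_check:
  stage: benchmark
  tags:
    - jlse-testbed                              # assumes a runner registered on a JLSE testbed node
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'   # run only from the weekly pipeline schedule
  script:
    - python benchmarks/run_workloads.py --report results.json
  artifacts:
    paths:
      - results.json                            # retained so performance and stability can be tracked over time
```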

Scaling up and scaling out with different deep learning frameworks

Deep learning frameworks can be scaled up or scaled out.

Scaling up means optimizing an application for the fastest possible performance on a single graphics processing unit (GPU). Scaling out, on the other hand, distributes an application across multiple GPUs. The ALCF anticipates that Aurora, like other upcoming exascale systems, will derive most of its computational power from GPUs.
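As a concrete illustration of scaling out, the sketch below wraps a toy PyTorch model in DistributedDataParallel so that each process trains on its own data while gradients stay synchronized. The tiny model, the CPU-only "gloo" backend, and the torchrun launch are illustrative assumptions, not an Aurora-specific recipe.

```python
# Minimal sketch of "scaling out": one process per device, with gradients
# synchronized automatically by DistributedDataParallel. The model, backend,
# and launch mechanism are illustrative; a GPU system would use a
# device-appropriate backend instead of "gloo".
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="gloo")      # rank and world size supplied by the launcher
    model = DDP(torch.nn.Linear(1024, 1024))     # stand-in for a real network
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(10):                          # toy training loop
        x = torch.randn(32, 1024)
        loss = model(x).square().mean()
        optimizer.zero_grad()
        loss.backward()                          # gradients are all-reduced across ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, for example, torchrun --nproc_per_node=4 scale_out.py, each process works through its own batches while the framework keeps the model replicas in sync.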

High-level frameworks in Python, such as TensorFlow and PyTorch, rely on Intel’s oneAPI Deep Neural Network Library (oneDNN) for computationally intensive GPU operations such as convolutions, whose complexity frustrates attempts at out-of-the-box performance; extensive iterations of development and testing are needed before an efficient kernel can be produced.
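The sketch below times a single convolution in PyTorch, the kind of operation the framework hands off to a vendor kernel library such as oneDNN on Intel hardware. The tensor shapes and iteration counts are arbitrary choices for illustration.

```python
# Sketch: timing one convolution, the kind of compute-heavy operation a
# framework delegates to a vendor kernel library such as oneDNN.
# Shapes and loop counts are illustrative.
import time
import torch

conv = torch.nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, padding=1)
x = torch.randn(16, 64, 128, 128)

# Warm up, then time repeated forward passes; on Intel hardware the framework
# typically dispatches this computation to oneDNN kernels.
for _ in range(3):
    conv(x)

start = time.perf_counter()
for _ in range(10):
    conv(x)
elapsed = time.perf_counter() - start
print(f"average conv2d forward time: {elapsed / 10 * 1e3:.2f} ms")
```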

Once optimal performance has been achieved on a single GPU, Intel’s oneAPI Collective Communications Library (oneCCL) helps deliver optimal performance on multiple GPUs by providing optimized communication patterns that distribute parallel model training across arbitrarily many nodes. The synchronization oneCCL provides thereby enables tasks such as the uniform collection of gradients from each training iteration.
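The sketch below shows the kind of gradient all-reduce that oneCCL carries out during distributed training. The oneccl_bindings_for_pytorch import, which registers a "ccl" backend for torch.distributed, is an assumption about the software stack; the fallback to the generic "gloo" backend keeps the sketch runnable elsewhere.

```python
# Sketch of the gradient all-reduce performed during distributed training.
# The "ccl" backend and the Intel binding package are assumptions about the
# setup; the fallback keeps the sketch runnable without them.
import torch
import torch.distributed as dist

try:
    import oneccl_bindings_for_pytorch  # noqa: F401  (assumed to register the "ccl" backend)
    backend = "ccl"
except ImportError:
    backend = "gloo"                    # generic CPU backend for illustration

dist.init_process_group(backend=backend)

# Pretend these are the local gradients from one training iteration.
grads = torch.randn(1024)

# Sum the gradients across all ranks, then average them so every worker
# applies the same update -- the uniform collection of gradients.
dist.all_reduce(grads, op=dist.ReduceOp.SUM)
grads /= dist.get_world_size()

dist.destroy_process_group()
```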

In other words, oneDNN provides fast computation on a single GPU, whereas oneCCL provides fast communication across multiple GPUs.

To obtain more detailed benchmarks, Adams and his team are collaborating with Intel to track the performance of oneDNN and oneCCL independently of each other and of other GPU operations.