Argonne’s Bethany Lusch and Murali Emani help enable machine learning capabilities on Aurora

In this series, we examine the range of activities and collaborations that ALCF staff undertake to guide the facility and its users into the next era of scientific computing.

Bethany Lusch and Murali Emani are computer scientists at the U.S. Department of Energy’s (DOE) Argonne National Laboratory, with a decade of high-performance computing (HPC) experience between them. Their current work includes leading efforts to prepare a programming library, oneDAL, and a machine learning package, scikit-learn, for the rollout of the upcoming exascale system, Aurora, at the Argonne Leadership Computing Facility (ALCF), a DOE Office of Science user facility.

Lusch and Emani, part of the ALCF’s Data Science group, answered questions about their highly collaborative research, which is helping to enable crucial machine learning and data science features on Aurora.

What are oneDAL and scikit-learn?

oneDAL is Intel's data analytics library, and it includes classical machine learning algorithms such as support vector machines and decision forests. Scikit-learn is a popular open-source Python package for machine learning. The Python package Intel Extension for scikit-learn enables acceleration of scikit-learn by employing oneDAL as a backend with only minimal code changes. oneDAL, which is being optimized for Intel GPUs (graphics processing units) so that it can be scaled for full deployment on Aurora, can also be used to speed up XGBoost, a popular open-source Python machine learning library.
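As a rough illustration of what "minimal code changes" can look like, the sketch below enables the oneDAL backend through the `patch_sklearn()` call provided by the `sklearnex` package (Intel Extension for scikit-learn) and then trains an ordinary scikit-learn support vector machine. The helper names `enable_onedal_backend` and `train_svm`, the synthetic dataset, and the model settings are assumptions made for this example only; they are not taken from the Aurora work, and the sketch runs on whatever device is the default rather than demonstrating GPU offload.

```python
import importlib.util


def enable_onedal_backend() -> bool:
    """Patch scikit-learn to use oneDAL when the extension is installed.

    Hypothetical helper for this example. Returns True if patching was
    applied and False if the sklearnex package is not available, in
    which case stock scikit-learn is used unchanged.
    """
    if importlib.util.find_spec("sklearnex") is None:
        return False
    from sklearnex import patch_sklearn

    patch_sklearn()  # swaps supported estimators for oneDAL-backed versions
    return True


def train_svm() -> float:
    """Train an SVM on synthetic data and return its training accuracy.

    scikit-learn is imported here, after any patching, so that the
    accelerated estimators are the ones picked up.
    """
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    # Synthetic data stands in for a real science dataset (assumption).
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    model = SVC(kernel="rbf").fit(X, y)
    return model.score(X, y)
```

Calling `enable_onedal_backend()` once at the top of a script and then `train_svm()` is the whole workflow; the training code itself is plain scikit-learn and does not change whether or not the extension is present.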

What are the challenges in preparing these libraries for Aurora?

Intel leads the effort to write the software and develop the necessary interfaces; our role is a form of evaluation. We provide input and feedback to Intel, help prioritize various aspects of development, and communicate with the science teams at Argonne to help ensure that the most useful possible product is built. Neither oneDAL nor scikit-learn had GPU implementations when we started. Other major challenges include prioritizing support for the most widely or heavily used algorithms, enabling distributed implementation across multiple GPUs and at full scale on Aurora, and facilitating interoperability with other data science libraries. We also want users to be able to use oneDAL from multiple programming languages, such as Python and C++. In addition, new interfaces must be created where none currently exist, and they need to feel familiar and intuitive so that it is easy for users to port to Intel hardware. We rely on input from the science teams to determine which features are most important.

How does this work build on prior development or research you’ve done?

We’ve always made use of traditional machine learning algorithms in our research; the difference is that in the past we ran them on CPUs, whereas the work being done in preparation for Aurora extends the existing oneDAL implementations to run on Intel GPUs. Running on Intel GPUs will help accelerate the machine learning training process by leveraging their massive compute capacity. Furthermore, we want to develop the interfaces of the required APIs in a way that maintains continuity with existing interfaces; this helps make GPUs accessible to users and minimizes the difficulties associated with code refactoring.

Who do you collaborate with for this work?

We work closely with the science teams, including projects supported by the Aurora Early Science Program, that use ML algorithms in their research and are interested in integrating their codes with these libraries for deployment on Aurora. We also meet regularly with Intel’s oneDAL team to provide feedback; another part of our work involves testing their software on early hardware at Argonne’s Joint Laboratory for System Evaluation. Because oneDAL interacts with other Python data science libraries, we meet with the relevant Intel teams and the corresponding groups at Argonne.

How has your approach to preparing these libraries changed or evolved throughout the development process?

We started by developing a consensus, driven by science use cases, about which algorithms would be most crucial. We then refined those initial discussions by collecting and examining specific case studies from the science teams. More recently, we’ve been focused on providing feedback to Intel on advanced features, scaling across GPUs, scaling across nodes, APIs, and general performance.

==========

The Argonne Leadership Computing Facility provides supercomputing capabilities to the scientific and engineering community to advance fundamental discovery and understanding in a broad range of disciplines. Supported by the U.S. Department of Energy’s (DOE’s) Office of Science, Advanced Scientific Computing Research (ASCR) program, the ALCF is one of two DOE Leadership Computing Facilities in the nation dedicated to open science.

Argonne National Laboratory seeks solutions to pressing national problems in science and technology. The nation’s first national laboratory, Argonne conducts leading-edge basic and applied scientific research in virtually every scientific discipline. Argonne researchers work closely with researchers from hundreds of companies, universities, and federal, state and municipal agencies to help them solve their specific problems, advance America’s scientific leadership and prepare the nation for a better future. With employees from more than 60 nations, Argonne is managed by UChicago Argonne, LLC for the U.S. Department of Energy’s Office of Science.

The U.S. Department of Energy’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States and is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.
