Argonne-developed software enables on-the-fly analysis of APS data at the ALCF

Misha Salim, assistant computational scientist at the ALCF, delivers a talk on the Balsam workflow toolkit. (Image: Argonne National Laboratory).

Argonne researchers are using the Balsam service to connect ALCF supercomputers to the Advanced Photon Source, demonstrating its ability to enable near-real-time analysis of experimental data.

When the workflow management tool and edge service Balsam was originally developed at the U.S. Department of Energy’s (DOE) Argonne National Laboratory in 2015, it was created as part of a collaboration with physicists who used Argonne Leadership Computing Facility (ALCF) computing resources to simulate particle collision events from the ATLAS detector at CERN’s Large Hadron Collider, a campaign that ultimately consumed hundreds of millions of core hours on ALCF systems. Balsam has since been used by multiple other ALCF projects to manage similarly large simulation campaigns in computational chemistry, deep learning, and materials science.

As the ALCF seeks to accommodate the increasing computing needs of DOE experimental facilities, the Argonne researchers who helped develop Balsam have found a new application for it: enabling the near-real-time analysis of data collected at Argonne’s Advanced Photon Source (APS). The ALCF and APS are both DOE Office of Science User Facilities at Argonne National Laboratory.

“Users of DOE scientific facilities require increasingly powerful computers to keep pace with accelerating data rates,” said Misha Salim, an assistant computational scientist at the ALCF and the lead author of a paper demonstrating Balsam’s utility in this realm. “As facilities like the APS undergo significant upgrades in the coming years, these data rates will continue to speed up, making supercomputers necessary to perform effective data analysis.”

Indeed, once data become available, the beamlines at most experimental facilities require computational resources for processing almost immediately—that is, within minutes or even seconds. The processed data must then be relayed back to the experimental facility.

In particular, the forthcoming APS-Upgrade (APS-U) will yield large amounts of data for X-ray photon correlation spectroscopy (XPCS), a novel technique for leveraging beam coherence to measure the dynamics of a given material.
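
To give a rough sense of what an XPCS analysis computes, the sketch below evaluates the normalized intensity autocorrelation, g2(τ), for a single detector pixel using synthetic photon counts. It is an illustration only; production XPCS pipelines operate on full detector frames binned by wavevector q, a step this toy example omits.

```python
# Toy illustration of the core XPCS quantity: the normalized intensity
# autocorrelation g2(tau) for one detector pixel, using synthetic counts.
# Real pipelines process full detector frames binned by wavevector q.
import numpy as np

rng = np.random.default_rng(0)
intensity = rng.poisson(lam=5.0, size=10_000).astype(float)  # fake photon counts over time

def g2(I: np.ndarray, tau: int) -> float:
    """g2(tau) = <I(t) * I(t + tau)> / <I(t)>^2."""
    return float(np.mean(I[:-tau] * I[tau:]) / np.mean(I) ** 2)

for tau in (1, 10, 100):
    print(f"g2({tau}) = {g2(intensity, tau):.3f}")  # ~1 for uncorrelated synthetic counts
```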

“Real-time, and near-real-time, data processing is a critical capability needed for the APS-U era facility because it enables experimentalists essentially to respond to feedback—that is, to tweak parameters on the fly as dictated by what the data are saying,” said Nicholas Schwarz, principal computer scientist and leader of the X-Ray Science Division (XSD) Scientific Software Engineering and Data Management Group at the APS.

Balsam was designed to ease the onboarding of experimental workloads onto ALCF supercomputers, making it simpler for experimenters to wrap their applications and manage large job campaigns while packing tasks onto compute nodes in a way that maximizes utilization and throughput.
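
That campaign-management idea can be pictured with a minimal sketch: each analysis task is described as a lightweight record and handed to a client that collects it for a launcher to run later. The class and method names below are hypothetical stand-ins, not Balsam’s actual API.

```python
# Minimal, self-contained stand-in for a Balsam-like campaign client.
# Class and method names are hypothetical, not Balsam's actual API.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AnalysisJob:
    """One task in a campaign: which application to run, on what data, at what size."""
    app: str                      # registered application name, e.g. an XPCS solver
    workdir: str                  # working directory holding the transferred dataset
    num_nodes: int = 1            # compute nodes requested for this task
    params: Dict[str, str] = field(default_factory=dict)
    state: str = "PENDING"        # PENDING -> RUNNING -> DONE


class CampaignClient:
    """Collects jobs so a launcher can later pack them onto idle compute nodes."""

    def __init__(self) -> None:
        self.jobs: List[AnalysisJob] = []

    def submit(self, job: AnalysisJob) -> None:
        self.jobs.append(job)     # in Balsam, submissions land in a persistent database


if __name__ == "__main__":
    client = CampaignClient()
    # Enqueue one analysis task per newly acquired dataset.
    for i in range(3):
        client.submit(AnalysisJob(app="xpcs_analysis", workdir=f"/data/scan_{i:04d}"))
    print(f"{len(client.jobs)} tasks queued for the launcher")
```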

For the APS workloads, the researchers identified three core criteria necessary to perform real-time analysis of experimental data on leadership-class computing resources: high-speed networking between the experimental and computing facilities; infrastructure capable of handling a diverse array of jobs; and a service to manage incoming datasets and the applications that operate on them. Balsam meets these criteria by design.

“Key to Balsam’s usefulness in this regard is the fact that its reservoir of tasks can grow almost limitlessly,” said Thomas Uram, an ALCF computer scientist and co-author of the paper. “When data rates exceed the availability of compute nodes, Balsam buffers pending tasks in its database and dispatches analysis as resources free up.”

Submitting a Balsam job, he explained, incurs negligible latency, so task submission itself does not become a bottleneck. Interacting directly with HPC job schedulers, by contrast, limits queue depth and requires manual throttling of job submission.
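
The buffer-and-dispatch behavior Uram describes can be pictured with a simplified sketch: pending tasks wait in a queue, and a dispatcher launches only as many as the currently free nodes can accommodate. This illustrates the general pattern, not Balsam’s implementation.

```python
# Simplified illustration of the buffer-and-dispatch pattern described above,
# not Balsam's implementation: tasks wait in a queue until nodes free up.
from collections import deque


def dispatch(pending: deque, free_nodes: int) -> list:
    """Launch as many pending tasks as the available nodes allow."""
    launched = []
    while pending and free_nodes >= pending[0]["num_nodes"]:
        task = pending.popleft()
        free_nodes -= task["num_nodes"]
        launched.append(task)
    return launched


if __name__ == "__main__":
    # Incoming datasets outpace capacity; the surplus simply buffers in the queue.
    pending = deque({"name": f"scan_{i}", "num_nodes": 1} for i in range(10))
    running = dispatch(pending, free_nodes=4)
    print(f"launched: {len(running)}, still buffered: {len(pending)}")
```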

To demonstrate Balsam’s ability to facilitate large-scale data transfers between the ALCF and the APS, Argonne researchers performed two experiments using real XPCS datasets, as detailed in the paper, "Balsam: Near Real-Time Experimental Data Analysis on Supercomputers," presented at the SC19 International Conference for High Performance Computing, Networking, Storage, and Analysis. The paper was recognized with the Best Presentation award at SC19's 1st Annual Workshop on Large-scale Experiment-in-the-Loop Computing (XLOOP).

The first experiment used a large number of datasets of varying size, with APS hardware submitting an XPCS analysis task to ALCF resources every 15 seconds. Overall, 159 tasks were completed, with 81 gigabytes of data transferred from the APS to the ALCF and back. The researchers measured the time spent sending the data from the APS, processing it on the ALCF’s Theta supercomputer, and returning the processed data to the APS.

The results showed that the number of submitted tasks grew linearly over time, indicating an unobstructed pipeline, while completed tasks closely trailed submissions, demonstrating that the network and computing systems kept pace. Each analysis took roughly 40 seconds and a new dataset arrived approximately every 20 seconds, so only a small number of compute nodes was needed.

In the second experiment, the researchers stress-tested their configuration with deliberately large datasets, a significant test of Balsam’s viability given the far larger volumes of XPCS data the APS-U is expected to generate. In this case, data submissions were scheduled so that the pipeline stayed essentially full for the entire run. Each dataset required a much longer runtime on ALCF hardware (some 150 minutes on a single Theta node), so a much greater number of compute nodes was needed for processing to keep pace with submissions. The longer ramp-up led to a longer lag between task submission and completion than in the first scenario. Nevertheless, with sufficient bandwidth and node availability, Balsam was still shown to enable on-the-fly data analysis.
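
A back-of-the-envelope calculation ties the two experiments together: the number of analyses that must be in flight at once is roughly the per-dataset runtime divided by the interval between submissions. The first experiment’s figures come from the numbers above; the stress test’s submission interval is not stated here, so the value below is a placeholder.

```python
# Rough concurrency estimate for both experiments: tasks in flight at once is
# approximately (per-dataset runtime) / (interval between dataset submissions).
import math


def concurrent_tasks(runtime_s: float, interval_s: float) -> int:
    return math.ceil(runtime_s / interval_s)


# Experiment 1: ~40 s analyses arriving roughly every 20 s -> about 2 at a time.
print(concurrent_tasks(runtime_s=40, interval_s=20))

# Experiment 2: ~150-minute analyses; the 5-minute interval is a placeholder,
# not a figure from the paper, but it shows why far more nodes are needed.
print(concurrent_tasks(runtime_s=150 * 60, interval_s=5 * 60))
```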

Beyond the APS results presented in the paper, Balsam has been leveraged for at least two projects researching the SARS-CoV-2 coronavirus and the disease it causes, COVID-19. In one, molecular docking simulations modeled the binding of over 300,000 small molecules to 30 protein targets in the coronavirus. Because these binding-energy simulations require approximately 15 minutes per ligand per GPU on ThetaGPU (the recent extension of Theta built from NVIDIA DGX A100 nodes), the researchers first trained artificial intelligence (AI) models on less expensive binding-energy estimates and then used those models to screen candidates for the full simulations.
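
That screening strategy can be sketched as a surrogate-model workflow: fit a fast regressor to inexpensive binding-energy estimates, rank every candidate with it, and send only the most promising ligands to the costly docking simulations. The descriptors, data, and cutoff below are placeholders for illustration, not the project’s actual pipeline.

```python
# Sketch of surrogate-model screening: fit a cheap regressor to approximate
# binding-energy estimates, then pass only top-ranked ligands to expensive docking.
# Descriptors, data, and the 1% cutoff are placeholders, not the project's pipeline.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_ligands, n_features = 5_000, 32
X = rng.normal(size=(n_ligands, n_features))   # stand-in molecular descriptors
cheap_energy = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n_ligands)

# Train the surrogate on the inexpensive estimates.
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, cheap_energy)

# Rank all candidates by predicted binding energy (lower = stronger binding here)
# and keep the top 1% for the ~15-minute-per-ligand docking simulations.
predicted = model.predict(X)
top_candidates = np.argsort(predicted)[: n_ligands // 100]
print(f"{len(top_candidates)} ligands selected for full docking")
```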

Balsam also helped accelerate a simulation modeling the spread of the coronavirus. Designed to investigate the impacts of various lockdown mitigation strategies, the study simulated 459 statewide scenarios, with ten runs per scenario. By packing single-node runs into larger job submissions, Balsam achieved a 30 percent speedup.
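
The arithmetic behind that packing is simple: many short single-node runs share a handful of large batch allocations rather than each waiting in the scheduler queue as a separate job. The 128-node allocation size below is a placeholder, not a figure from the project.

```python
# Counting how single-node runs could be packed into larger batch allocations.
# The 128-node allocation size is a placeholder, not a figure from the project.
import math

scenarios, runs_per_scenario = 459, 10
total_runs = scenarios * runs_per_scenario   # 4,590 single-node runs
nodes_per_allocation = 128                   # hypothetical batch-job size

allocations = math.ceil(total_runs / nodes_per_allocation)
print(f"{total_runs} runs packed into ~{allocations} batch submissions "
      f"instead of {total_runs} individual scheduler jobs")
```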

As the researchers look ahead, they are working to port Balsam services to other high-performance computing environments, a task made easier by Balsam’s original modular design.

“Balsam is entirely application-agnostic,” said Salim. “We’ve already demonstrated Balsam job submission for multiple APS applications, including ptychography and electron microscopy, furthering support for workloads typically found at APS and other DOE light sources.”

Balsam documentation is available online.

The Argonne Leadership Computing Facility provides supercomputing capabilities to the scientific and engineering community to advance fundamental discovery and understanding in a broad range of disciplines. Supported by the U.S. Department of Energy’s (DOE’s) Office of Science, Advanced Scientific Computing Research (ASCR) program, the ALCF is one of two DOE Leadership Computing Facilities in the nation dedicated to open science.

About the Advanced Photon Source

The U.S. Department of Energy Office of Science’s Advanced Photon Source (APS) at Argonne National Laboratory is one of the world’s most productive X-ray light source facilities. The APS provides high-brightness X-ray beams to a diverse community of researchers in materials science, chemistry, condensed matter physics, the life and environmental sciences, and applied research. These X-rays are ideally suited for explorations of materials and biological structures; elemental distribution; chemical, magnetic, and electronic states; and a wide range of technologically important engineering systems from batteries to fuel injector sprays, all of which are the foundations of our nation’s economic, technological, and physical well-being. Each year, more than 5,000 researchers use the APS to produce over 2,000 publications detailing impactful discoveries, and solve more vital biological protein structures than users of any other X-ray light source research facility. APS scientists and engineers innovate technology that is at the heart of advancing accelerator and light-source operations. This includes the insertion devices that produce extreme-brightness X-rays prized by researchers, lenses that focus the X-rays down to a few nanometers, instrumentation that maximizes the way the X-rays interact with samples being studied, and software that gathers and manages the massive quantity of data resulting from discovery research at the APS.

This research used resources of the Advanced Photon Source, a U.S. DOE Office of Science User Facility operated for the DOE Office of Science by Argonne National Laboratory under Contract No. DE-AC02-06CH11357.

Argonne National Laboratory seeks solutions to pressing national problems in science and technology. The nation’s first national laboratory, Argonne conducts leading-edge basic and applied scientific research in virtually every scientific discipline. Argonne researchers work closely with researchers from hundreds of companies, universities, and federal, state and municipal agencies to help them solve their specific problems, advance America’s scientific leadership and prepare the nation for a better future. With employees from more than 60 nations, Argonne is managed by UChicago Argonne, LLC for the U.S. Department of Energy’s Office of Science.

The U.S. Department of Energy’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States and is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.