In this series, we examine the range of activities and collaborations that ALCF staff undertake to guide the facility and its users into the next era of scientific computing.
By coordinating efforts to improve early exascale hardware stability at the Argonne Leadership Computing Facility (ALCF), computer scientist Servesh Muralidharan is working to make it easier for application developers to use Aurora testbeds at Argonne’s Joint Laboratory for System Evaluation (JLSE). His work will facilitate a faster transition to the Aurora system upon its delivery and help enable science from Day One.
JLSE, a collaboration between Argonne’s computing divisions, has seen researchers refine several generations of Intel GPU testbeds as the arrival of Aurora approaches. These testbeds consist of preproduction GPU/CPU samples often designed for verification and validation of key architectural features.
The testbeds under Muralidharan's purview are used to develop applications that can eventually run on the Intel Xe GPU and Sapphire Rapids CPU being targeted for Aurora.
As with any early hardware, making these parts usable (that is, ensuring they can run applications with good performance) is a challenging process. Intel provides specialized driver components and software development kits (SDKs), including the compilers, to run applications on early GPU silicon. These components and SDKs require customization to work in the JLSE environment and to be usable by application developers participating in the ALCF’s Early Science Program and the U.S. Department of Energy’s (DOE) Exascale Computing Project. Muralidharan accomplishes this by collaborating with multiple teams at Intel.
Drawing on his computer science background, Muralidharan has experience working with a variety of early hardware, including characterizing its stability and performance to validate its intended behavior. His work over the last year with multiple silicon revisions of the Intel GPU testbed hardware has given him a deeper understanding of the system’s components and their interactions, allowing him to maintain and, when necessary, reconfigure them.
Muralidharan’s role in packaging and early hardware, in which he coordinates efforts to deploy early hardware and software with JLSE’s System Operations teams, is twofold.
The first part begins once the System Operations teams install and configure a server: Muralidharan builds custom driver stacks and validates hardware behavior, then takes on the challenge of building usable software stacks on top of the hardware. He works through the different components in turn until reaching the SDK, where the compilers reside.
Second, after a usable testbed is in place, Muralidharan helps diagnose low-level issues that arise from daily system use. These issues range from a specific code causing a hardware fault to unexpected performance degradation resulting from driver problems. Once the problematic hardware is identified, Muralidharan works with the corresponding Intel team to triage the issue and evaluate suitable patches on the JLSE testbed hardware.