This project is training AI coding assistants built for supercomputing that can read huge codebases, generate and optimize parallel programs across diverse hardware, and explain their choices, with the goal of speeding up and strengthening scientific software development.
Large Language Models (LLMs) have become an important part of the software development toolchain, but existing LLMs are not designed for the specialized tasks of High-Performance Computing (HPC), such as parallel code generation, performance optimization, and management of hardware heterogeneity.
This project will train next-generation LLMs on the Department of Energy (DOE) supercomputers Aurora, Frontier, and Perlmutter to support HPC software development at scale. The models will be designed to handle very long code contexts and to incorporate multiple sources of information, including performance traces and documentation. Beyond generating parallel code, the models will be trained to reason about the performance and correctness of the code they generate. Additionally, new attribution techniques will be developed to better understand how specific training examples influence a model’s output, improving transparency and trustworthiness. Although broadly applicable, this work will focus on DOE’s Extreme-scale Scientific Software Stack (E4S), a flagship software ecosystem developed under the Exascale Computing Project (ECP), to demonstrate how LLMs can accelerate software development across complex and heterogeneous systems.
The resulting multi-modal, performance-aware, explainable LLMs are expected to revolutionize scientific software development and boost HPC developer productivity by significantly reducing the manual effort required to port, tune, and maintain software across existing and emerging hardware platforms. This work will enhance the long-term sustainability and usability of DOE’s software ecosystem, accelerate the adoption of E4S across the scientific community, and advance Artificial Intelligence (AI) for HPC.