HPC Colony: Adaptive System Software for Improved Resiliency and Performance

Project Description

HPC Colony, a community code project funded by the Office of Science, received three million hours to test newly developed system software essential for harnessing the power of next-generation exascale supercomputers and for making these leading-edge resources more accessible to researchers throughout the scientific community.

Operating and runtime systems provide the mechanisms that manage hardware and software resources so that scientific applications execute efficiently. They are essential to the success of both high-performance systems and complex applications. By the end of this decade, exascale computers with unprecedented processor counts and complexity will require new levels of scalability and fault management. Supercomputing faces serious challenges as system software designed decades ago for a single user is stretched to serve parallel applications running on machines with hundreds of thousands of cores. Substantial improvements are required to enable a broader set of applications.

HPC Colony adapts operating systems to the needs of applications, rather than requiring applications to be constantly retooled for tomorrow’s leadership-class machines. This means dealing effectively with fault resilience and load balancing without reducing the capabilities of the development environment. Further, HPC Colony leverages and preserves time-tested and effective parallel program development practices (e.g., asynchronous I/O, overlapped communication and computation, efficient thread spawning, and thread-based tools) without incurring the problems normally associated with such practices when they are applied to petascale environments and beyond. Finally, HPC Colony provides much-needed support by removing the onerous burden of handling load imbalance explicitly.

Argonne Leadership Computing Facility
