HPC Colony: Adaptive System Software for Improved Resiliency and Performance

PI Name: 
Terry Jones
Institution: 
Oak Ridge National Laboratory
Allocation Program: 
Allocation Hours at ALCF: 
3 Million
Research Domain: 
Computer Science

HPC Colony, a community code project funded by the Office of Science, received three million hours to test newly developed system software essential for harnessing the power of next-generation exascale supercomputers and for making these leading-edge resources more accessible to researchers throughout the scientific community.

Operating and runtime systems provide mechanisms to manage system hardware and software resources for the efficient execution of scientific applications. They are essential to the success of both high-performance systems and complex applications. By the end of this decade, exascale computers with unprecedented processor counts and complexity will require significant new levels of scalability and fault management. System software designed decades ago for a single user is now being stretched to serve parallel applications running on machines with hundreds of thousands of cores, which poses serious challenges for supercomputing. Substantial improvements are required to enable a broader set of applications.

HPC Colony adapts operating systems to the needs of applications rather than requiring applications to be constantly retooled for tomorrow’s leadership-class machines. This means dealing effectively with fault resilience and load balancing without reducing the capabilities of the development environment. Further, HPC Colony leverages and preserves time-tested and effective parallel program development practices (e.g., asynchronous I/O, overlapped communication and computation, efficient thread spawning, and thread-based tools) without incurring the problems normally associated with such practices when they are applied to petascale environments and beyond. Finally, HPC Colony relieves application developers of the onerous burden of explicitly handling load imbalance.