Scalable System Software for Parallel Programming

PI Name: Robert Latham
PI Email:
Institution: Argonne National Laboratory
Allocation Program:
Allocation Hours at ALCF: 20 Million
Research Domain: Computer Science

The purpose of this project is to improve the performance and productivity of key system software components on leadership-class platforms. High-end computing systems such as those being deployed at the Argonne Leadership Computing Facility consist of hundreds of thousands of processors, terabytes of memory, exotic high-speed networks, and petabytes of storage. While the capabilities of such machines have grown rapidly over the past several years, that growth has come at the cost of increasing system complexity.

To keep individual component costs from overshadowing overall system cost, modern architectures increasingly rely on hardware sharing, including shared caches, shared memory and memory management devices, and shared network infrastructure. Multi-core architectures, processors capable of simultaneous multi-threading, and flat torus-like network architectures are examples of this increasingly prevalent sharing. As hardware complexity skyrockets in leadership-class systems, it becomes difficult for applications to take full advantage of the available system resources and avoid potential bottlenecks.

Specifically, the researchers propose studying four classes of system software: Message Passing Libraries, to increase productivity while achieving high performance; Parallel Input/Output (I/O), to manage the complexity of computational hardware for high performance and to provide effective interfaces for scientific application data models; Data Analysis and Visualization, to target the post-processing and co-processing of computed data in addition to simulation efficiency; and Operating Systems, to more effectively manage the growing number of nodes on leadership platforms, especially for many-task computing (MTC) and high-throughput computing (HTC) jobs, which require more advanced caching of input and output data to offer adequate performance.