Developing and Testing Future Applications and Operating Systems for Exascale

PI Maya Gokhale, Lawrence Livermore National Laboratory
Project Description

The U.S. Department of Energy (DOE) has begun to identify the biggest problems likely to arise on exascale systems, machines with on the order of 100 million cores. Two areas of concern are runtime systems and adaptability.

Exascale computing systems will provide a thousand-fold increase in parallelism, challenging scalability and adaptability of software stacks and applications. Systems software for exascale machines must provide the infrastructure to support existing applications while simultaneously enabling efficient execution of new programming models that naturally express dynamic, adaptive, irregular computation; coupled simulations; and massive data analysis in an unreliable hardware environment with billions of threads of execution.

Project researchers will develop systems software and runtime support for a new approach to data and work distribution based on task queues and dynamic adaptation for load balancing and fault mitigation. The project work includes adaptive, application-tailored OS services optimized for multi-core to many-core processors; advanced virtualization infrastructure; a distributed, fault-tolerant key-value store that is the basis for task and data distribution; and fault-tolerant load balancing schemes for massively task-parallel applications.
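As a rough illustration of the task-queue idea described above, the following Python sketch has a fixed pool of workers pull tasks from one shared queue, so a worker that finishes a cheap task immediately picks up the next one instead of waiting on a static assignment. It is a single-node toy under stated assumptions, not the project's software: the function name run_task_queue and its parameters are hypothetical, threads stand in for distributed processes, and fault mitigation is not modeled.

```python
import queue
import threading


def run_task_queue(tasks, num_workers=4):
    """Run zero-argument callables from one shared queue.

    Hypothetical helper for illustration only. Returns a dict mapping
    task index to result, plus the number of tasks each worker ran.
    """
    work = queue.Queue()
    for task_id, task in enumerate(tasks):
        work.put((task_id, task))

    results = {}
    counts = [0] * num_workers
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                # Pull the next available task; idle workers get work
                # as soon as it exists, which balances uneven task costs.
                task_id, task = work.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            value = task()
            with lock:
                results[task_id] = value
                counts[worker_id] += 1

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, counts


# Tasks with deliberately uneven cost.
sizes = [10, 100000, 10, 1000, 10, 50000]
tasks = [lambda n=n: sum(range(n)) for n in sizes]
results, counts = run_task_queue(tasks, num_workers=3)
```

How the tasks end up split across workers depends on scheduling and will vary from run to run; only the set of results is deterministic.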

Project researchers will test:

  • Quantum chemistry kernel implementations using a work queue model (PNNL, Sandia, and OSU)
  • An adaptive load balancing task distribution library (PNNL, OSU)
  • Scalable implementations of a distributed data store (Boston University)
  • New OS virtualization environments and runtime (Boston University)
  • Asynchronous graph traversal algorithms based on a distributed work-queue model (LLNL)
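
The last item in the list, asynchronous graph traversal driven by a work queue, can be sketched in a few lines. The version below is a single-process toy, assuming a label-correcting scheme in which each queued work item carries a tentative distance and is applied only if it improves on the best distance seen so far. Under that assumption the final distances do not depend on the order in which items are processed, which is what would let the work be spread across many queues. The function name async_bfs and the example graph are hypothetical, and this is not presented as the LLNL algorithm.

```python
from collections import deque


def async_bfs(graph, source):
    """Hop distances from source, computed from queued visit items.

    graph maps each vertex to a list of its out-neighbors.
    Hypothetical helper for illustration only.
    """
    best = {}
    work = deque([(source, 0)])
    while work:
        # Pop from the right (LIFO) on purpose: the result does not
        # rely on breadth-first processing order.
        vertex, dist = work.pop()
        if dist >= best.get(vertex, float("inf")):
            continue  # stale item, a shorter path is already known
        best[vertex] = dist
        for neighbor in graph.get(vertex, ()):
            work.append((neighbor, dist + 1))
    return best


graph = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": ["e"],
    "e": ["a"],
}
distances = async_bfs(graph, "a")
```

The loop terminates because a vertex's neighbors are queued only when its recorded distance strictly decreases, which can happen only finitely many times.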
Allocations