Scalable Fault Tolerance at the Extreme Scale

Zizhong "Jeff" Chen
Seminar

Extreme scale supercomputers available before the end of this decade are expected to have 100 million to 1 billion computing cores. Due to the large number of components involved, extreme scale scientific applications must be protected from errors. When an error occurs, the affected application either continues or stops. If the application continues, we call it a fail-continue error. Otherwise, we call it a fail-stop error. In this talk, I will discuss our recent work on scalable fault tolerance at the extreme scale. We have developed highly efficient techniques that allow selected widely used scientific algorithms to tolerate both fail-continue and fail-stop errors by exploiting their specific algorithmic characteristics. The algorithms we consider include direct methods for solving dense linear systems and eigenvalue problems, iterative methods for solving sparse linear systems and eigenvalue problems, and Newton's method for solving systems of non-linear equations and optimization problems. Because they exploit these algorithmic characteristics, the proposed techniques can achieve much higher efficiency than the traditional general techniques (i.e., Triple Modular Redundancy for fail-continue errors and checkpointing for fail-stop errors) and therefore have the potential to scale to exascale and beyond. A highly scalable checkpointing scheme is also developed for general applications.

Bio:
Zizhong Chen is an Assistant Professor of Computer Science at the University of California, Riverside. He specializes in high performance computing and numerical linear algebra algorithms and software. His specific research topics include algorithm-based fault tolerance for matrix operations, fault tolerant and energy efficient linear algebra software, highly scalable checkpointing techniques for high performance scientific applications, performance and resilience issues of message passing interface (MPI) libraries, floating-point erasure and error correction codes, and random matrices and their applications in high performance computing. He is a Senior Member of the IEEE and a recipient of the U.S. National Science Foundation CAREER Award. He serves as an Associate Editor for the Elsevier Parallel Computing journal.