System Reliability at Scale: Challenges, Insights, and Opportunities

Mathematics and Computing Science Seminar
Oct 19 2015 - 10:30am
Building 240/Room 4301
Argonne National Laboratory
Devesh Tiwari
Oak Ridge National Laboratory
Franck Cappello

System reliability is one of the major exascale challenges. Lower system reliability can lead to lower scientific productivity and operational efficiency. Unfortunately, traditional resilience techniques such as checkpoint/restart impose significant performance and I/O overhead. As the number of system components continues to grow and applications become more complex, this overhead will increase even further. Furthermore, the adoption of new computing devices at large scale (e.g., GPUs and SSDs) requires developing new understanding and devising new mitigation strategies.

In this talk, Devesh will share insights learned from the Titan supercomputer, the fastest supercomputer in the USA, and the innovative techniques that were designed to overcome some of these challenges. Devesh will show how the temporal and spatial characteristics of system errors can be exploited to improve the overall efficiency of large-scale computing facilities. Finally, Devesh will discuss some of the open problems in this domain and possible approaches toward solving them.

Devesh Tiwari is a Research Scientist in the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory. He primarily focuses on making large-scale computing facilities more energy efficient and reliable while expediting the process of scientific discovery. Devesh graduated with a bachelor's degree in Computer Science and Engineering from the Indian Institute of Technology (IIT) Kanpur, India. He then earned a PhD in Computer Engineering from North Carolina State University. His research on performance modeling, reliability, and energy efficiency has been covered by media outlets (including Slashdot and HPCWire) and published at several computer systems conferences, including USENIX FAST, Supercomputing (SC), DSN, HPCA, IPDPS, and ICAC. His papers have earned best paper award nominations at the SC, DSN, and IPDPS conferences. More information can be found here: