System Monitoring, Diagnosing, and Predicting for Extreme-Scale Computing

Zhiling Lan
Seminar

As supercomputers continue to grow in size and complexity, system mean-time-between-failure (SMTBF) decreases dramatically, resulting in more frequent system-wide interruptions. Studies have shown that the SMTBF of production systems is only on the order of 10-100 hours, even for systems based on ultra-reliable components. Extrapolating from current performance and scale, the SMTBF may fall to only 1.25 hours of useful computation on exascale systems. In this talk, I will present our ongoing research to address the resilience problem from three aspects: (1) online failure prediction, (2) automatic failure diagnosis, and (3) online data filtering. Fundamentally, our approach explores advanced data analytics techniques from information fusion, data mining/machine learning, and statistical modeling, and combines them with HPC-specific domain knowledge. I will also present case studies that apply our work to real system logs collected from various production HPC systems, including the Blue Gene/P system at Argonne. This work is conducted in collaboration with several national laboratories, including Argonne, Sandia, and ORNL.

Bio:
Dr. Zhiling Lan is an associate professor of Computer Science at Illinois Institute of Technology. She received her PhD in Computer Engineering from Northwestern University in 2002. Her research interests are in high performance computing, in particular fault tolerance, resource management and scheduling, and performance analysis and modeling. She has authored or coauthored over 50 papers in leading refereed journals and conferences. One of her recent papers on job scheduling received a Best Paper Award at IPDPS’10. Her work on online failure prediction for supercomputers was selected as one of the Top-10 Data Mining Case Studies at the 10th ICDM conference. More information about her and her research can be found at http://www.cs.iit.edu/~lan.