Log Analysis for Reliability Management in Large-Scale Systems

Ziming Zheng
Seminar

Abstract:

As HPC systems grow in scale and complexity, reliability is becoming a critical concern. System logs are the primary source of information for understanding and analyzing system problems. Nevertheless, few studies have addressed automated log analysis for HPC systems. In this talk, I will summarize our study of system logs collected from production HPC systems, which applies data mining and statistical learning techniques.

Our work can be broadly divided into four parts: log pre-processing, online failure prediction, automatic root cause diagnosis, and reliability modeling. This work can greatly improve our understanding of the faults, errors, and failures that arise from hardware and software components and their interactions in HPC systems, and can further facilitate resilience research for large-scale systems.