Black-Box Problem Diagnosis in Parallel File Systems

Event Sponsor: Mathematics and Computer Science Division
Start Date: May 18, 2010, 10:00am to 11:00am
Building 240, TCS Conference Center (1404 & 1405)
Argonne National Laboratory
Michael Kasick, Carnegie Mellon University
Speaker(s) Title: Sam Lang

This talk discusses black-box problem diagnosis in parallel file systems: automatically diagnosing performance problems by identifying, gathering, and analyzing OS-level, black-box performance metrics on every node in the cluster. Our peer-comparison diagnosis approach compares the statistical attributes of these metrics across I/O servers to identify the faulty node. We develop a root-cause analysis procedure that further analyzes the affected metrics to pinpoint the faulty resource (storage or network), and we show that the approach applies generally across stripe-based parallel file systems. We demonstrate it on realistic storage and network problems injected into three file-system benchmarks (dd, IOzone, and PostMark) on both PVFS and Lustre clusters.
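
The peer-comparison idea in the abstract can be illustrated with a minimal sketch. This is not the speaker's actual procedure, which the abstract does not specify; it assumes a simple median-based outlier test, and the function name, server names, metric values, and thresholds are all hypothetical. Each server's samples of one metric are summarized by their median, and a server is flagged when its summary lies far from the median of its peers.

```python
from statistics import median


def flag_peer_outliers(samples_by_server, threshold=3.0, rel_floor=0.05):
    """Return names of servers whose metric deviates strongly from their peers.

    Illustrative assumption only: a median/MAD outlier test, not the
    method presented in the talk.
    """
    # Summarize each server's samples by the median, which tolerates brief spikes.
    summary = {name: median(vals) for name, vals in samples_by_server.items()}
    peer_median = median(summary.values())
    # Median absolute deviation (MAD) of the per-server summaries.
    mad = median(abs(v - peer_median) for v in summary.values())
    # Floor the scale so near-identical healthy peers are not flagged for noise.
    scale = max(mad, rel_floor * abs(peer_median))
    return sorted(
        name for name, v in summary.items()
        if abs(v - peer_median) > threshold * scale
    )


# Hypothetical samples of one OS-level metric (e.g. I/O wait time in ms).
samples = {
    "ios0": [10.2, 9.8, 10.1],
    "ios1": [10.0, 10.3, 9.9],
    "ios2": [9.7, 10.1, 10.0],
    "ios3": [41.0, 39.5, 40.2],  # server with an injected storage problem
}
print(flag_peer_outliers(samples))  # ['ios3']
```

The sketch relies on the premise stated in the abstract: in a stripe-based file system the I/O servers see similar load, so a healthy server's metrics should resemble those of its peers. Deciding whether storage or network is at fault would require examining which specific metrics deviate, which this sketch does not attempt.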

