Partial Support Vector Regression to Mitigate Silent Errors in the Exascale Era

Omer Subasi
Seminar

As the Exascale era approaches, the growing capacity of high performance computing (HPC) systems, together with the power and energy budgets targeted for these systems, poses challenges for reliability. In particular, silent data corruptions (SDCs), or silent errors, corrupt the results of HPC applications without being noticed. Consequently, they are a significant threat to the correctness of these applications' computations.

In this work, we re-purpose and redesign epsilon-insensitive support vector machine regression to detect and correct SDCs that occur in HPC applications which can be characterized by an error bound. Our design uses spatial features, i.e. neighboring data values, as training data and, as a result, incurs low memory overhead. Experimental results show that our detector achieves more than 90% recall and a false positive rate of less than 1% in most cases. Moreover, our detector incurs low performance overhead for all benchmarks studied. Comparison to other state-of-the-art techniques indicates that our detector provides the best trade-off between detection performance and incurred overheads.
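
To make the general idea concrete, the following is a minimal sketch, not the speaker's implementation. It assumes a linear model without an intercept, trained by plain subgradient descent on the epsilon-insensitive loss, with the left and right neighbors of a value as its only features. The function names, the synthetic sine-wave data, and all parameter values (eps, lam, lr, epochs, bound) are hypothetical choices made for illustration.

```python
import math

def train_linear_svr(features, targets, eps=0.01, lam=1e-4, lr=0.01, epochs=500):
    """Fit weights w by full-batch subgradient descent on
    lam/2 * ||w||^2 + mean(max(0, |y - w.x| - eps)).
    Assumption: linear model with no intercept, for brevity."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [lam * wj for wj in w]  # gradient of the L2 penalty
        for x, y in zip(features, targets):
            res = y - sum(wj * xj for wj, xj in zip(w, x))
            if abs(res) > eps:  # only points outside the eps-tube contribute
                sign = 1.0 if res > 0 else -1.0
                for j in range(dim):
                    grad[j] -= sign * x[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def predict(w, x):
    """Predict a value from its neighboring values."""
    return sum(wj * xj for wj, xj in zip(w, x))

def check_value(w, neighbors, observed, bound):
    """Flag `observed` as corrupted when it deviates from the prediction by
    more than `bound`; if flagged, return the prediction as the correction.
    Assumption: the neighboring values themselves are not corrupted."""
    expected = predict(w, neighbors)
    if abs(observed - expected) > bound:
        return True, expected
    return False, observed

# Hypothetical smooth 1-D data standing in for an application's state.
field = [math.sin(0.05 * i) for i in range(200)]
features = [(field[i - 1], field[i + 1]) for i in range(1, 199)]
targets = [field[i] for i in range(1, 199)]
w = train_linear_svr(features, targets)

i = 50
neighbors = (field[i - 1], field[i + 1])
clean = check_value(w, neighbors, field[i], bound=0.05)
corrupted = check_value(w, neighbors, field[i] + 1.0, bound=0.05)
print("clean value flagged:", clean[0])
print("corrupted value flagged:", corrupted[0], "corrected to", corrupted[1])
```

Because each check needs only the weight vector and the immediate neighbors of a value, the memory footprint of such a detector stays small, which is consistent with the low memory overhead described above.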

Bio
Omer Subasi is a final-year PhD student at Barcelona Supercomputing Center and Polytechnic University of Catalonia. He received his BS degree in Computer Engineering, with a double major in Mathematics, from Koc University, Istanbul, Turkey in 2009. He received his Master's degree in Computer Science and Engineering from Koc University, Istanbul, Turkey in 2012. His research interests are reliability for high performance computing (HPC) and exascale systems, programming models, and reliability modelling. He is currently working on techniques to mitigate silent data corruptions in HPC applications.