A Holistic Approach to Resiliency Exploiting Applications Structure and Hierarchy

Event Sponsor: Computation Institute Presentation
Start Date: Nov 12 2015 - 12:00pm
Location: Searle 240A, The University of Chicago, 5735 S. Ellis Ave., webcast via Blue Jeans (see below)
Speaker(s): Anshu Dubey
Speaker(s) Title: Argonne National Laboratory - MCS
Ian Foster

Resilience is a growing concern for large-scale simulations. As failures become more frequent, alternatives to global checkpointing that limit the extent of needed recovery become more desirable. Additionally, platforms differ in both error rates and error types, so a flexible and customizable recovery strategy can be very helpful to the applications running on them. Applications often have structures that provide logical confinement spaces that can be exploited for this purpose. We investigate a customizable recovery strategy in the context of structured adaptive mesh refinement (SAMR). We exploit the inherent granularities and hierarchy in SAMR to limit the impact of faults through localized recovery, and we identify tunable parameters for customizing the strategy depending on application and platform behavior. As our resiliency interface, we use the Global View Resilience (GVR) library, which provides global versioning arrays for application-controlled state saving.
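
As a rough illustration of the localized-recovery idea in the abstract, the following Python sketch keeps per-block snapshots and, after a simulated fault, restores and replays only the affected block instead of rolling back the whole mesh. Everything here is an assumption made for illustration: `VersionedBlockStore`, `advance`, `run`, and the parameters `max_versions` and `save_interval` are hypothetical names, not the GVR API or the speaker's implementation. The toy update also ignores neighbor (ghost-cell) dependencies, which a real SAMR recovery would have to handle.

```python
import copy


class VersionedBlockStore:
    """Hypothetical per-block versioned store (illustrative only; not the GVR API)."""

    def __init__(self, max_versions=2):
        if max_versions < 1:
            raise ValueError("max_versions must be at least 1")
        # Tunable parameter: how many snapshots to retain per block.
        self.max_versions = max_versions
        self._versions = {}  # block_id -> list of (step, data), oldest first

    def save(self, block_id, step, data):
        history = self._versions.setdefault(block_id, [])
        history.append((step, copy.deepcopy(data)))
        # Drop the oldest snapshots beyond the retention limit.
        del history[:-self.max_versions]

    def restore(self, block_id):
        """Return (step, data) of the most recent snapshot of one block."""
        step, data = self._versions[block_id][-1]
        return step, copy.deepcopy(data)


def advance(data):
    """Toy update standing in for one solver step on a single block."""
    return [x + 1.0 for x in data]


def run(num_steps, save_interval, fault_step, fault_block):
    """Advance three toy blocks, injecting and repairing one fault."""
    blocks = {"b0": [0.0, 0.0], "b1": [10.0, 10.0], "b2": [20.0, 20.0]}
    store = VersionedBlockStore(max_versions=2)
    restored = []
    for step in range(num_steps):
        # Tunable parameter: how often block state is saved.
        if step % save_interval == 0:
            for block_id, data in blocks.items():
                store.save(block_id, step, data)
        if step == fault_step:
            # Simulate corruption confined to a single block.
            blocks[fault_block] = [float("nan")] * 2
            saved_step, data = store.restore(fault_block)
            # Replay the lost steps for the faulty block only.
            for _ in range(step - saved_step):
                data = advance(data)
            blocks[fault_block] = data
            restored.append(fault_block)
        for block_id in blocks:
            blocks[block_id] = advance(blocks[block_id])
    return blocks, restored


if __name__ == "__main__":
    final_blocks, restored_blocks = run(
        num_steps=5, save_interval=2, fault_step=3, fault_block="b1"
    )
    print(final_blocks, restored_blocks)
```

In this sketch the save interval and the number of retained versions play the role of the tunable parameters: saving more often shortens the replay after a fault but costs more time and memory during fault-free execution.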

Anshu Dubey received her Ph.D. in computer science from Old Dominion University in 1993. She then joined the University of Chicago Astronomy & Astrophysics Department as a research associate. In 2001 she joined the ASC/Flash Center, where she was associate director from 2009 to 2013. From 2013 to 2015 she was on the staff at Lawrence Berkeley National Laboratory in the Applied Numerical Algorithms Group. In 2015 she joined the Mathematics and Computer Science Division at Argonne as a computer scientist. She has two decades of experience in the development of complex scientific software and has earned wide recognition for her contributions.

Miscellaneous Information: 

Lunch will be provided in Searle conference room 240A.

Webcast: This talk will be broadcast to Argonne National Laboratory, TCS Building 240, Room 5172. You may join the broadcast via Blue Jeans at http://tinyurl.com/TCS-CI. You will have to install and approve a browser plug-in. Upon entering the meeting, please select "Don't Send, Only Receive".