Extending the Binomial Checkpointing Technique for Resilience

Event Sponsor: Materials Science Division Seminar
Start Date: Nov 4 2016 - 10:30am
Building 240/Room 1406-1407
Argonne National Laboratory
A. Walther
S.H.K. Narayanan
Paul Hovland

In terms of computing time, adjoint methods offer a very attractive alternative for computing gradient information, as required, e.g., for optimization. However, this very favorable temporal complexity comes with a memory requirement that is essentially proportional to the operation count of the underlying function, e.g., if algorithmic differentiation is used to provide the adjoints. For this reason, checkpointing approaches in many variants have become popular. We analyze an extension of the so-called binomial approach that also covers possible failures of the computing system. Such a precaution is of special interest for massively parallel simulations and adjoint calculations in which the mean time between failures of the large-scale computing system is shorter than the time needed to complete the calculation of the adjoint information. We describe the extensions of standard checkpointing approaches required for such resilience, provide a corresponding implementation, and discuss first numerical results.
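
As background on why the approach is called binomial, the following minimal sketch illustrates the standard bound from the checkpointing literature, which this abstract itself does not state: with s checkpoints and at most r repeated forward evaluations of any time step, up to C(s + r, s) time steps can be reversed. The function names max_steps and min_repetitions are hypothetical and chosen only for illustration. The sketch covers the failure-free case only and is not the resilient extension discussed in the talk.

```python
from math import comb


def max_steps(checkpoints: int, repetitions: int) -> int:
    """Largest number of time steps reversible with the given number of
    checkpoints and repeated forward evaluations per step, using the
    binomial bound C(s + r, s). Failure-free setting assumed."""
    return comb(checkpoints + repetitions, checkpoints)


def min_repetitions(steps: int, checkpoints: int) -> int:
    """Smallest repetition count r such that C(s + r, s) >= steps."""
    if checkpoints < 1 and steps > 1:
        # With no checkpoints the bound stays at 1, so no r would suffice.
        raise ValueError("at least one checkpoint is needed for more than one step")
    r = 0
    while max_steps(checkpoints, r) < steps:
        r += 1
    return r


if __name__ == "__main__":
    # Illustrative values only: 10 checkpoints and 1000 time steps.
    print(min_repetitions(1000, 10))
```

The point of the bound is that the number of reversible steps grows quickly in both s and r, so a modest number of checkpoints can cover long computations at the cost of a few recomputations per step.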