Asynchronous Two-Level Checkpointing Scheme for Large-Scale Adjoints using Nek5000

Michel Schanen
Seminar

Adjoints are an important computational tool for large-scale sensitivity evaluation, uncertainty quantification, and derivative-based optimization. Essential to their performance is an efficient checkpointing scheme that recovers the primal values during the adjoint run, which involves a trade-off between memory requirements and recomputation. An asynchronous two-level adjoint checkpointing scheme has been implemented for multistep numerical time discretizations, targeted at large-scale numerical simulations. The scheme combines bandwidth-limited disk checkpointing with binomial memory checkpointing. Based on assumptions about the target petascale system (50k+ cores on the IBM Blue Gene/Q system Mira), the checkpointing approach and its performance model have been validated using the spectral element solver Nek5000.
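
To make the two-level idea concrete, the following is a minimal Python sketch, not the implementation presented in the talk. It assumes a hypothetical single-step toy primal (explicit Euler for dx/dt = -x^2) rather than a multistep Nek5000 discretization, simulates the disk level with a dictionary, writes checkpoints synchronously rather than asynchronously, and stores every state of a segment in memory instead of using a binomial schedule. All function names are illustrative. The helper binomial_capacity only evaluates the standard binomial bound: with s memory checkpoints and a repetition number r, at most C(s + r, s) steps can be reversed.

```python
from math import comb


def step(x, dt=0.01):
    """One explicit Euler step of the toy ODE dx/dt = -x**2 (hypothetical primal)."""
    return x - dt * x * x


def step_derivative(x, dt=0.01):
    """Derivative of step() with respect to x, used to propagate the adjoint."""
    return 1.0 - 2.0 * dt * x


def binomial_capacity(snaps, sweeps):
    """Binomial bound: steps reversible with `snaps` checkpoints and repetition number `sweeps`."""
    return comb(snaps + sweeps, snaps)


def two_level_adjoint(x0, n_steps, stride):
    """Return (x_N, d x_N / d x_0) using coarse 'disk' and fine in-memory checkpoints."""
    # Level 1: forward run, storing a coarse checkpoint every `stride` steps.
    disk = {}  # stand-in for disk storage (assumption: real code would write files)
    x = x0
    for n in range(n_steps):
        if n % stride == 0:
            disk[n] = x
        x = step(x)
    x_final = x

    # Level 2: reverse run, one segment at a time, from the last segment to the first.
    adjoint = 1.0  # seed: d x_N / d x_N
    for start in sorted(disk, reverse=True):
        end = min(start + stride, n_steps)
        # Restore from 'disk' and recompute the segment's primal states in memory.
        states = [disk[start]]
        for _ in range(start, end - 1):
            states.append(step(states[-1]))
        # Reverse through the segment using the recovered primal values.
        for xs in reversed(states):
            adjoint *= step_derivative(xs)
    return x_final, adjoint


if __name__ == "__main__":
    value, gradient = two_level_adjoint(1.0, 100, 16)
    print(value, gradient)
    print(binomial_capacity(3, 2))
```

In this sketch, stride sets the trade-off described above: a larger stride means fewer disk checkpoints but more states held in memory per segment. A binomial schedule would replace the store-everything inner loop when a segment does not fit in memory.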