Extending Transparent Checkpointing to Hybrid Architectures: The Split-Process Approach

Gene Cooperman, Northeastern University

Transparent (or system-level) checkpointing is the ability to checkpoint MPI or other applications to stable storage without modifying the binaries of the target application. The DMTCP (Distributed MultiThreaded CheckPointing) project for transparent checkpointing began in 2004. DMTCP's flexible plugin model has been used to integrate DMTCP with application-specific checkpointing using the DOE VeloC software. DMTCP is also widely used: over 100 refereed publications describe its use. Another use case is the continuing five-year collaboration with Intel Corp.

For scalability, DMTCP has been shown to robustly checkpoint HPCG and NAMD running on up to 32,000 CPU cores. Another extension allows for transparently checkpointing NVIDIA CUDA using UVM (Unified Virtual Memory), with future plans to support Kokkos. More recently, a new split-process model, "MANA for MPI", was developed. MANA checkpoints an MPI application running under a given MPI library, network interconnect, and number of cores per node, and can then restart it under a different configuration on different hardware. The key to MANA is that the MPI application code is isolated to "upper-half memory", while the MPI/network libraries reside in "lower-half memory". Only the upper half is checkpointed, and yet the tight coupling between the two halves delivers low runtime overhead (e.g., less than 1% for GROMACS). The talk concludes with speculation on a future split-process approach for the Nemesis component of MPICH. Isolating Nemesis to lower-half memory would enable simple and robust switching at runtime among the alternative communication methods of Nemesis.
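The split-process idea can be illustrated with a toy model. The sketch below is not MANA code and uses no MANA or DMTCP API: the class names UpperHalf and LowerHalf, the functions checkpoint and restart, and the backend labels are all hypothetical, chosen only for illustration. MANA itself operates on the memory of a running process, whereas this model only mirrors the idea at the level of Python objects: application state is saved, the communication layer is not, and a fresh communication layer is attached at restart.

```python
import json


class UpperHalf:
    """Hypothetical application state: the only part that is checkpointed."""

    def __init__(self, step=0, log=None):
        self.step = step
        self.log = list(log or [])


class LowerHalf:
    """Hypothetical stand-in for the MPI/network libraries: never saved."""

    def __init__(self, backend):
        self.backend = backend

    def send(self, value):
        # A real lower half would call into the MPI library here.
        return f"{self.backend}:{value}"


def checkpoint(upper):
    # Serialize upper-half state only; nothing from the lower half is saved.
    return json.dumps({"step": upper.step, "log": upper.log})


def restart(image, backend):
    # Rebuild the upper half from the image and attach a fresh lower half,
    # which may use a different backend than the one active at checkpoint.
    return UpperHalf(**json.loads(image)), LowerHalf(backend)


# Run one step, checkpoint, then restart on a different (made-up) backend.
upper, lower = UpperHalf(), LowerHalf("backend-A")
upper.step += 1
upper.log.append(lower.send(upper.step))
image = checkpoint(upper)

upper, lower = restart(image, "backend-B")
upper.step += 1
upper.log.append(lower.send(upper.step))
```

Because the checkpoint image holds no lower-half state, the restarted application continues from its saved step while talking to whichever communication layer is supplied at restart.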

Miscellaneous Information: 

This seminar will be streamed. See details at https://anlpress.cels.anl.gov/cels-seminars/
