Study of the Behavior of HPC Applications under Checkpointing

Shu-Mei Tseng
Seminar

This talk presents experiences and lessons learned from studying the behavior of several HPC applications with and without checkpointing. After a brief introduction to checkpointing, it focuses on several approaches for profiling applications and extracting behavior patterns with respect to CPU, memory, and network utilization. These approaches are tailored for HPC machines such as Theta, where monitoring certain resources, such as the network, is non-trivial. The second part of the talk presents the results obtained by applying these approaches to two representative HPC applications (HACC, LatticeQCD). In particular, it discusses several findings related to the periodicity of resource utilization and to interference. It concludes with a series of future research directions that exploit these findings.