A similarity based distributed checkpointing algorithm for reducing checkpoint image size in parallel computing systems

Abstract:
With the ever-increasing number of components in high-end computing systems, component failures have become commonplace rather than exceptional. Currently, the most prevalent approach to overcoming such failures on large-scale machines is to periodically checkpoint applications to stable storage. It has been reported that the overhead imposed by the checkpoint operation is already about 20% of the overall execution time, and it is anticipated that in the coming years this ratio may exceed 50%. This substantial overhead stems from a significant imbalance between the computational power and the available I/O bandwidth of current high-end computing systems. Reducing checkpoint image size without significant computational overhead is therefore a major concern.

In this talk, we first present a similarity-based checkpointing mechanism in the context of virtual machine replication and show how the same approach may be adapted to large-scale HPC systems. We detail the design and implementation of a distributed checkpoint compression algorithm that aims to find similar patterns within the memory content of the whole parallel application. We show preliminary results on the degree of similarity we found and on the computational overhead of the proposed compression method for various applications running on BlueGene/P.
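To make the idea of finding shared content across a parallel application's memory more concrete, the following is a minimal sketch, not the algorithm presented in the talk. It assumes exact duplicate detection on fixed-size blocks, which is a simpler stand-in for similarity detection, and it runs in a single process rather than being distributed. The names `BLOCK_SIZE`, `dedup_checkpoint`, `restore`, and `stored_bytes` are hypothetical and chosen only for illustration.

```python
import hashlib

# Assumed block granularity (a typical page size); a hypothetical parameter.
BLOCK_SIZE = 4096


def split_blocks(image, block_size=BLOCK_SIZE):
    """Split one process's memory image into fixed-size blocks."""
    return [image[i:i + block_size] for i in range(0, len(image), block_size)]


def dedup_checkpoint(images, block_size=BLOCK_SIZE):
    """Store each distinct block once across all process images.

    Returns a block store (digest -> block bytes) and, per process,
    a recipe: the ordered list of digests needed to rebuild its image.
    """
    store = {}
    recipes = []
    for image in images:
        recipe = []
        for block in split_blocks(image, block_size):
            digest = hashlib.sha256(block).digest()
            store.setdefault(digest, block)  # keep only the first copy
            recipe.append(digest)
        recipes.append(recipe)
    return store, recipes


def restore(store, recipe):
    """Rebuild one process image from the shared block store."""
    return b"".join(store[digest] for digest in recipe)


def stored_bytes(store):
    """Total payload bytes that would be written to stable storage."""
    return sum(len(block) for block in store.values())


if __name__ == "__main__":
    # Two made-up process images that share zero-filled blocks.
    images = [bytes(8192) + b"a" * 4096, bytes(4096) + b"b" * 4096]
    store, recipes = dedup_checkpoint(images)
    raw = sum(len(image) for image in images)
    print(f"raw bytes: {raw}, stored block bytes: {stored_bytes(store)}")
```

In this toy input, the zero-filled blocks appear three times across the two images but are stored once. A real system would also have to account for the size of the recipes and for the cost of hashing and of exchanging digests between nodes, which is the computational overhead the abstract refers to.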

Bio:
Balazs Gerofi is a second-year Ph.D. student at The University of Tokyo. His main interests are operating systems and fault-tolerant computing. He received his MS degree in Computer Science from Vrije Universiteit Amsterdam in 2006.

Argonne Leadership Computing Facility

09/10/2010, 5:30am CT