Error Estimation for Fault Tolerance in Numerical Integration Solvers

Data corruption can arise from many sources, from aging hardware to ionizing radiation, and the risk of corruption increases with the scale of the computation. A corruption may cause a failure, when the execution crashes, or it may be silent, when it remains undetected. In this seminar, we study solutions to failures and silent data corruptions for numerical integration solvers, which are particularly sensitive to corruptions. Numerical integration solvers are step-by-step methods that approximate the solution of a differential equation. Because each step builds on the previous one, a corruption not only propagates through the rest of the integration but can even cause the solution to diverge.

In numerical integration solvers, the approximation error can be estimated at low cost. We use these error estimates for three applications in fault tolerance.
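To illustrate why such an estimate can be cheap, here is a minimal sketch using an embedded Heun-Euler pair. This is a textbook technique chosen for illustration, not necessarily the estimator used in the work presented; the function names and the test equation y' = -y are assumptions.

```python
def heun_euler_step(f, t, y, h):
    """One Heun step (order 2) with an embedded Euler (order 1) error estimate."""
    k1 = f(t, y)
    k2 = f(t + h, y + h * k1)
    y_high = y + 0.5 * h * (k1 + k2)  # Heun's method, second order
    y_low = y + h * k1                # explicit Euler, first order
    # The estimate reuses k1 and k2, so it needs no extra evaluation of f.
    err_est = abs(y_high - y_low)
    return y_high, err_est


def decay(t, y):
    """Illustrative test problem y' = -y (an assumption, not from the talk)."""
    return -y


y_next, err_est = heun_euler_step(decay, 0.0, 1.0, 0.1)
print(y_next, err_est)
```

The difference between the two solutions approximates the local error of the lower-order one, and it comes from stage values the step has already computed.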

Concerning silent data corruptions, I will demonstrate a new lightweight detector for solvers with a fixed integration step size. We have shown mathematically that our method detects all corruptions that affect the accuracy of a simulation.

Solvers with a variable integration step size can naturally reject silent data corruptions when they select the next step size. I will show that this mechanism alone can miss too many corruptions, and I will present a mechanism we developed to improve it.
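The rejection behaviour described above can be sketched with a toy step-acceptance test. This is only an illustration under stated assumptions: the `corrupt` argument is a hypothetical way to inject a perturbation into one stage value, the tolerance is arbitrary, and the improved mechanism from the talk is not reproduced here.

```python
def heun_euler_step(f, t, y, h, corrupt=0.0):
    """Heun step with embedded Euler estimate; `corrupt` perturbs stage k2."""
    k1 = f(t, y)
    k2 = f(t + h, y + h * k1) + corrupt  # hypothetical injected corruption
    y_high = y + 0.5 * h * (k1 + k2)
    y_low = y + h * k1
    return y_high, abs(y_high - y_low)


def accept(err_est, tol):
    """Standard acceptance test: keep the step only if the estimate is within tol."""
    return err_est <= tol


def decay(t, y):
    """Illustrative test problem y' = -y (an assumption, not from the talk)."""
    return -y


tol = 1e-2  # arbitrary tolerance chosen for this example
y_clean, err_clean = heun_euler_step(decay, 0.0, 1.0, 0.1)
y_big, err_big = heun_euler_step(decay, 0.0, 1.0, 0.1, corrupt=10.0)
y_small, err_small = heun_euler_step(decay, 0.0, 1.0, 0.1, corrupt=0.05)

for name, err in [("clean", err_clean), ("large", err_big), ("small", err_small)]:
    print(name, "accepted" if accept(err, tol) else "rejected")
```

In this toy setting the large perturbation inflates the error estimate and the step is rejected, while the small one stays below the tolerance and is accepted even though it changes the computed solution.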

Concerning failures, the classic checkpoint-restart mechanism can be a bottleneck because of data movement costs. Lossy compression is a promising solution because it reduces I/O bandwidth needs. However, it is unclear what level of error from lossy compression is acceptable for a given application. Using the error estimates that integration solvers already provide is one possible way to guarantee that the compression loss is no worse than the error of the solver.
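As a rough sketch of the idea, the solver's error estimate can serve as the absolute error bound of an error-bounded quantizer applied to the checkpointed state. The uniform quantizer below stands in for a real lossy compressor, and the state values and the error estimate are made-up numbers for illustration.

```python
def quantize(values, error_bound):
    """Map each value to an integer code on a grid of width 2 * error_bound."""
    step = 2.0 * error_bound
    return [round(v / step) for v in values]


def dequantize(codes, error_bound):
    """Reconstruct values; each differs from the original by at most
    error_bound, up to floating-point rounding."""
    step = 2.0 * error_bound
    return [c * step for c in codes]


state = [0.9048, 0.8187, 0.7408]  # hypothetical solver state to checkpoint
solver_err = 5e-3                 # hypothetical error estimate from the solver

codes = quantize(state, solver_err)
restored = dequantize(codes, solver_err)
max_loss = max(abs(a - b) for a, b in zip(state, restored))
print(codes, max_loss)
```

The integer codes are what a checkpoint would store, typically after further entropy coding. Tying the grid width to the solver's own estimate keeps the loss introduced at restart on the same scale as the error the solver already makes.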

Experiments were done in PETSc on the Blues cluster with 4096 cores.

Bio:
Pierre-Louis Guhur is a master's student at Ecole Normale Superieure in Cachan, France. He also studied at ETH Zurich, Switzerland. He has worked at Argonne as a trainee for nine months, under the supervision of Franck Cappello and Tom Peterka. Pierre-Louis was lead author on two paper submissions, one of which has been accepted at Euro-Par 2016. Previously, he worked for eight months on tracking for medical image analysis.

Argonne Leadership Computing Facility

06/20/2016, 5:30am CT