Improving the Trustability and Usability of Scientific Data: From Error Detection to Lossy Compression

Franck Cappello
Seminar

The first exascale systems will arrive in the United States in 2021. Other systems will follow with even higher computing and data analytics capabilities. At these scales, faults are frequent, and their consequences (errors) can lead to execution failure (the application crashes) or, even worse, incorrect results (due to data corruption). Although generic and optimized techniques exist to tolerate failures and corruptions, the resources available on extreme-scale systems are limited and devoted mainly to computation. Only a small portion of these resources can be used for fault tolerance. In such a scarce environment, innovative fault tolerance approaches are needed that are effective yet frugal in their resource consumption.

In the course of the LDRD Paris project, we have developed new, effective, low-overhead fault tolerance techniques by focusing on the scientific data computed by the applications. We have developed a series of silent data corruption detectors capable of detecting 99% of significant corruptions while imposing only 5% computational overhead on the application; compare this with the 100% (or more) overhead required by the classic corruption detection technique, replication. We have also developed a series of lossy compression algorithms that can reduce scientific data sets by a factor of 5, or even by orders of magnitude, while respecting a user-defined pointwise accuracy. Such compression algorithms can reduce checkpoint time by 20x.
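
To give a flavor of data-driven corruption detection (a minimal, illustrative sketch only; the project's actual detectors are more elaborate, and every name and parameter below is an assumption rather than the real implementation): predict each value of an evolving field from its own recent history, and flag values that deviate implausibly from the prediction.

    import numpy as np

    def detect_sdc(prev2, prev1, current, k=5.0):
        # Predict each grid point by linear extrapolation of its own
        # history across the two previous time steps.
        predicted = 2.0 * prev1 - prev2
        residual = np.abs(current - predicted)
        # Tolerate deviations proportional to the field's normal
        # step-to-step change; the tiny floor avoids a zero tolerance.
        tolerance = k * (np.abs(prev1 - prev2) + 1e-12)
        return residual > tolerance  # boolean mask of suspect points

    # Usage: inject one large error into a smoothly evolving field.
    t = np.linspace(0.0, 1.0, 10_000)
    prev2, prev1, current = np.sin(t), np.sin(t + 0.01), np.sin(t + 0.02)
    current[1234] += 0.5
    suspects = detect_sdc(prev2, prev1, current)
    assert suspects[1234] and suspects.sum() == 1

A detector of this shape costs a few operations per grid point per time step, consistent with the low overhead cited above, and by construction it flags only deviations large enough to matter.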
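
Likewise, "user-defined pointwise accuracy" means every reconstructed value is guaranteed to lie within a user-supplied error bound eps of the original. Below is a minimal sketch of that guarantee, assuming a simple running predictor with residual quantization; production compressors use better predictors and entropy-code the integer output, and all names here are illustrative.

    import numpy as np

    def compress(data, eps):
        # Quantize each value's residual against a running predictor so
        # that decompression stays within +/- eps of the original.
        codes = np.empty(data.size, dtype=np.int64)
        prev = 0.0  # predictor state: the last *reconstructed* value
        for i, x in enumerate(data):
            q = int(round((x - prev) / (2.0 * eps)))  # quantization bin
            codes[i] = q
            prev += 2.0 * eps * q  # mirror what the decompressor sees
        return codes

    def decompress(codes, eps):
        out = np.empty(codes.size)
        prev = 0.0
        for i, q in enumerate(codes):
            prev += 2.0 * eps * q
            out[i] = prev
        return out

    # Usage: every value is reconstructed within the requested bound.
    x = np.sin(np.linspace(0.0, 10.0, 1_000))
    assert np.max(np.abs(decompress(compress(x, 1e-3), 1e-3) - x)) <= 1e-3

The per-value guarantee holds because each residual is rounded to the nearest multiple of 2*eps, so the rounding error never exceeds eps; the subsequent lossless encoding of the (mostly small) integer codes is what yields the large compression factors.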

Our results in the two main areas of fault tolerance for scientific computing show that exploiting knowledge from data is an important source of performance improvements for scientific applications. The LDRD Paris project has had numerous impacts in research (including a follow-up NSF project and extensions under way at several institutions worldwide: PNNL, the Barcelona Supercomputing Center, IIT, and ENS-Lyon) and in development (the compressor is part of the DOE-NNSA Exascale Computing Project).

Speaker's Bio: Franck is a senior computer scientist at Argonne National Laboratory and an adjunct associate professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign. He is the director of the Joint Laboratory on Extreme Scale Computing, which gathers six of the leading high-performance computing institutions in the world: Argonne National Laboratory (ANL), the National Center for Supercomputing Applications (NCSA), Inria, the Barcelona Supercomputing Center (BSC), the Jülich Supercomputing Centre (JSC), and RIKEN AICS. Franck is an expert in resilience and fault tolerance for scientific computing and data analytics. He recently started investigating lossy compression for scientific data sets to respond to the pressing needs of scientists performing large-scale simulations and experiments. Franck is a member of the editorial board of IEEE Transactions on Parallel and Distributed Systems and of the ACM HPDC and IEEE CCGRID steering committees. He is a fellow of the IEEE.