Optimization and Usage of Lossy Compression for Scientific Applications

Dingwen Tao
Seminar

With HPC applications producing ever-increasing volumes of scientific data, significantly reducing data size is critical because storage capacity is limited and I/O or network bandwidth can become a bottleneck when writing and reading the data.

In this presentation, (1) we first introduce our meta-compression framework, which addresses the problem that existing error-controlled lossy compressors exhibit widely varying compression quality across data sets. The framework predicts the lossy compression quality of a series of compatible compression techniques in the early processing stages and then selects the best-fit strategy for each data set. We investigate the principles of many state-of-the-art lossy compressors, analyze their pros and cons in detail, and develop a generic compressor based on the framework that optimizes lossy compression quality for HPC scientific data sets. Our evaluation in a parallel environment with 1,024 cores shows that our solution can improve the compression ratio by 20% thanks to highly accurate selection of the best compressor (around 99%), with little analysis cost (less than 7% in our experiments).

(2) We then introduce our novel execution framework, which adopts lossy-compressed checkpoints to significantly improve overall fault-tolerance performance for iterative methods. We formulate a lossy checkpointing performance model and derive an upper bound on the number of extra iterations caused by compression errors in lossy checkpoints, which we weigh against the reduced checkpointing overhead. We analyze the impact of lossy checkpointing (i.e., the number of extra iterations) on multiple types of iterative methods. Our experiments with 2,048 cores show that, in the presence of system failures, our optimized lossy checkpointing framework can improve the overall performance of iterative methods by 40+% compared with traditional checkpointing and by 20+% compared with lossless-compressed checkpointing.
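
The selection idea in part (1), estimating each candidate's compression quality cheaply and then committing to the best one, can be illustrated with a minimal Python sketch. It is not the framework from the talk. It assumes a simple block-sampling estimator, and it uses the lossless codecs from the Python standard library as stand-ins for error-controlled lossy compressors. The names `CANDIDATES`, `sample_blocks`, `estimate_ratios`, and `select_compressor`, as well as the block size and stride, are hypothetical and chosen only for illustration.

```python
import bz2
import lzma
import math
import struct
import zlib

# Stand-in candidates (assumption): lossless stdlib codecs, NOT the
# error-controlled lossy compressors discussed in the talk.
CANDIDATES = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def sample_blocks(data: bytes, block_size: int = 4096, stride: int = 8) -> bytes:
    """Keep one block out of every `stride` blocks as a cheap sample."""
    step = block_size * stride
    return b"".join(data[i:i + block_size] for i in range(0, len(data), step))


def estimate_ratios(data: bytes, block_size: int = 4096, stride: int = 8) -> dict:
    """Estimate each candidate's compression ratio from the sample only."""
    sample = sample_blocks(data, block_size, stride)
    return {
        name: len(sample) / len(compress(sample))
        for name, compress in CANDIDATES.items()
    }


def select_compressor(data: bytes):
    """Return the candidate with the highest estimated ratio, plus all estimates."""
    ratios = estimate_ratios(data)
    best = max(ratios, key=ratios.get)
    return best, ratios


if __name__ == "__main__":
    # Synthetic smooth field standing in for one scientific data set.
    values = [math.sin(i / 500.0) for i in range(50_000)]
    data = struct.pack(f"{len(values)}d", *values)

    best, ratios = select_compressor(data)
    payload = CANDIDATES[best](data)  # compress the full data set only once
    print("estimated ratios:", ratios)
    print("selected:", best, "actual ratio:", len(data) / len(payload))
```

Only the sampled blocks are compressed by every candidate, so the analysis cost stays a fraction of a full compression pass. A real selector for lossy compressors would also have to compare distortion at a given error bound, which this sketch leaves out.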