Fault Tolerant MapReduce-MPI using User Level Failure Mitigation

MapReduce-MPI (MR-MPI) provides an easy way to run MapReduce jobs on HPC clusters, but it lacks the fault tolerance that MapReduce was designed to provide. Without an effective way to detect and recover from failures, it is hard to build a fault-tolerant MR-MPI. This talk will present my research on the development of a fault-tolerant MapReduce-MPI for Big Data in HPC. We designed two different models to achieve fault tolerance in MR-MPI: checkpoint-restart and detect-resume.

The checkpoint-restart model is implemented using the current MPI standard. It allows the state of a MapReduce job to be saved when a failure occurs, and the job to restart from the last successful checkpoint when it is resubmitted. The detect-resume model exploits User Level Failure Mitigation (ULFM) to isolate the failed process and repair the failed communicator, so that the MapReduce job can keep running. We redesigned the architecture and rewrote most of the MR-MPI code to support these fault tolerance features. We evaluate our implementation using representative MapReduce jobs to study its performance and overhead.
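As a rough illustration of the checkpoint-restart idea only, the sketch below runs a serial word count in plain Python and records its progress after each completed input chunk, so that a resubmitted run picks up from the last successful checkpoint. It does not use MPI or the MR-MPI API and is not the implementation presented in the talk; the function `run_wordcount`, the `fail_at` parameter (which simulates a failure), and the JSON checkpoint layout are all assumptions made for this example.

```python
import json
import os
import tempfile
from collections import Counter


def run_wordcount(chunks, ckpt_path, fail_at=None):
    """Count words in `chunks`, checkpointing after each completed chunk.

    `fail_at` is a hypothetical knob: it simulates a process failure just
    before the chunk with that index is processed.
    """
    # Restart path: load the last successful checkpoint if one exists.
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
    else:
        state = {"next_chunk": 0, "counts": {}}

    counts = Counter(state["counts"])
    for i in range(state["next_chunk"], len(chunks)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError(f"simulated failure before chunk {i}")
        # Map and local reduce for one chunk.
        counts.update(chunks[i].split())
        # Write to a temporary file, then rename it over the checkpoint,
        # so a crash mid-write never leaves a partial checkpoint behind.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_chunk": i + 1, "counts": dict(counts)}, f)
        os.replace(tmp_path, ckpt_path)
    return dict(counts)


if __name__ == "__main__":
    chunks = ["a b a", "b c", "a c c"]
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "job.ckpt")
        try:
            run_wordcount(chunks, ckpt, fail_at=2)  # first submission fails
        except RuntimeError as err:
            print("job failed:", err)
        # Resubmission resumes at chunk 2 instead of starting over.
        print(run_wordcount(chunks, ckpt))
```

In this sketch the second call skips the two chunks that were already processed and only handles the remaining one. The detect-resume model described above differs in that the job is not resubmitted at all: the surviving processes repair the communicator and continue.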

Biography:
Yanfei Guo is a Ph.D. candidate working on MapReduce and Cloud Computing at the University of Colorado, Colorado Springs. He received his B.S. in Computer Science from Huazhong University of Science and Technology, Wuhan, China, in 2010. His research interests lie in MapReduce and Big Data Processing, Cloud Computing, and Automated Computing in Virtualized Environments.

Argonne Leadership Computing Facility

12/12/2014, 4:30am CT