Moving MapReduce into the Cloud: Elasticity, Flexibility and Scalability

Yanfei Guo
Seminar

MapReduce, a parallel and distributed programming model for clusters of commodity hardware, has emerged as the de facto standard for big data analytics. However, big data analytics usually requires distributed computing at scale, which is often not affordable for small businesses. Large MapReduce clusters also suffer from issues such as low hardware efficiency. Cloud computing, enabled by virtualization technology, allows the creation and configuration of dynamic virtual clusters with elastic resource allocation. Moving MapReduce into the cloud thus promises efficient and affordable big data analytics. However, simply migrating MapReduce into the cloud does not automatically exploit the benefits that cloud computing can offer. The overhead of virtualization, the architectural bottlenecks of the MapReduce implementation, and the semantic gap between the MapReduce runtime and the resource manager of the cloud platform make building efficient and scalable virtual MapReduce clusters a challenging task. To this end, we propose to address these problems with a synergistic approach combining a flexible MapReduce implementation with elastic resource management: designing a flexible MapReduce runtime that works more efficiently in the cloud; developing a novel resource management mechanism that fuses with the elastic resource allocation of the cloud; and optimizing the Hadoop framework to provide a MapReduce runtime that scales seamlessly in the cloud.
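As background, the MapReduce programming model mentioned above can be illustrated with a minimal word-count sketch. This is a local, single-process simulation of the map, shuffle, and reduce phases; the function names are illustrative and are not part of the systems described in this abstract.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit an intermediate (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as the framework
    # would do between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate all values for one key (here, sum the counts).
    return (key, sum(values))

def word_count(documents):
    # Run the three phases over a list of input documents.
    pairs = [p for doc in documents for p in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(word_count(["big data", "big clusters"]))
# → {'big': 2, 'data': 1, 'clusters': 1}
```

In a real deployment such as Hadoop, the map and reduce phases run as parallel tasks across cluster nodes and the shuffle moves data over the network, which is where the virtualization overhead and resource-management issues discussed above arise.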

Furthermore, we designed and developed the first fault-tolerant MapReduce system for HPC clusters. To provide a fault-tolerant MapReduce job execution environment, we developed a new job execution framework for MapReduce in HPC clusters and designed a work-conserving fault tolerance model that is compatible with HPC schedulers.

We implemented and evaluated the proposed techniques on private-cloud-based testbeds with the PUMA MapReduce benchmark suite and Intel HiBench. To evaluate the MapReduce system for HPC clusters, we used a 320-node HPC cluster. This thesis provides novel and effective systematic solutions that improve performance and resource-utilization efficiency for MapReduce clusters in the cloud.

Bio:
Yanfei Guo is a Ph.D. candidate working on MapReduce and cloud computing at the University of Colorado, Colorado Springs. He received his B.S. in Computer Science from Huazhong University of Science and Technology (HUST), Wuhan, China, in 2010. His research interests lie in Distributed and Parallel Computing, MapReduce and Big Data Processing, Cloud Computing, and Automated Computing in Virtualized Environments.