Building an Integrated Big Data Analysis Platform for Genomic Sciences and Addressing the Resource Management Challenges in the Cloud

Wei Tang
Seminar

Next-Generation Sequencing (NGS) has dramatically cut the cost of DNA sequencing and thus shifted the bottleneck of genomic sciences from data generation to data analysis, which demands ever-greater computing capacity. Meanwhile, the resulting data deluge poses challenges for people in several roles in computational genomics, including bioinformatics tool developers, workflow builders, data analysis service operators, and computing resource administrators. To address these problems, we have developed an integrated platform, comprising the Shock data management system and the AWE workload management system, which supports reusable sequence data management and accelerated workflow execution on scalable, distributed computing resources. With Shock/AWE, we have ported the MG-RAST pipeline, a popular metagenome analysis service, into the cloud and achieved scalable throughput. However, resource management challenges remain in the cloud, especially when data movement between multiple sites plays an important role.

In this talk, I will first describe the data deluge problems in genomic sciences and our open-source data analysis platform, which supports integrated management of applications, services, data, and computing resources. I will then focus on the resource management aspect, describing the problems we have observed and our efforts to address them, including 1) MG-RAST workload characterization to understand what the application needs from the cloud, and 2) a data-aware distributed workflow scheduling mechanism, along with a workflow simulator built on top of the CODES/ROSS simulation framework, which can provide effective capacity planning and task allocation in multi-cloud environments.