A Data-Locality Aware Mapping and Scheduling Framework for Data-Intensive Computing

Event Sponsor: 
Argonne Leadership Computing Facility Seminar
Start Date: 
Jul 7 2008 (All day)
Location: 
Building 221, Conference Room A216
Argonne National Laboratory
Speaker(s): 
Gaurav Khanna
Speaker(s) Title: 
Computer Science and Engineering Department, The Ohio State University

Science is becoming increasingly data-driven. With technological advances such as sensing instruments that rapidly capture data at high resolution, and Grid technologies that enable increasingly realistic simulation of complex numerical models, scientific applications have become very data-intensive, storing and accessing large amounts of data. The LHC experiment at CERN is an example of a high-energy physics initiative that stores data at the petabyte scale. The end goal of collecting petabytes of data is a better understanding of the problem under study. Reaching that goal requires collaborative analysis by scientists across the world, which fits a distributed data-intensive computing paradigm in which a set of compute, storage, and network resources is used collectively to advance science. Effective scheduling and resource management for such data-intensive applications on distributed resources is critical to meeting their performance requirements.

Efficient scheduling in this scenario encompasses two inter-related problems. The first is the data staging problem: moving data from the simulation or experimental sites to the computational sites where the analysis will run. The second is the job mapping problem: assigning data-analysis jobs to compute resources so as to maximize the locality of data usage.

Traditional batch job schedulers are designed for compute-intensive jobs running at supercomputer centers. They take into account CPU-related metrics (e.g., user-estimated job run times) and system state (e.g., queue wait times) when making scheduling decisions, but they ignore data-related metrics. There is therefore a need for scheduling mechanisms for data-analysis jobs that account not only for the computation time of the jobs, but also for the overhead of retrieving the files those jobs request.
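The contrast above can be sketched as a simple cost model. The snippet below is a hypothetical illustration (not the speaker's actual algorithm): it estimates a job's completion time at each candidate site as queue wait plus data stage-in plus run time, so a site that already holds the input data can win even with a longer queue. All names, units, and the replica-catalog/bandwidth structures are illustrative assumptions.

```python
def stage_in_time(job, site, replica_sites, bandwidth_mbps):
    """Seconds to pull each input file from its best replica to `site`."""
    total = 0.0
    for fname, size_mb in job["inputs"].items():
        if site in replica_sites[fname]:
            continue  # file already local: zero retrieval overhead
        best_bw = max(bandwidth_mbps[(src, site)] for src in replica_sites[fname])
        total += size_mb * 8.0 / best_bw  # megabits / Mbps = seconds
    return total

def data_aware_cost(job, site, replica_sites, bandwidth_mbps, queue_wait):
    """Estimated completion time: queue wait + stage-in + run time."""
    return (queue_wait[site]
            + stage_in_time(job, site, replica_sites, bandwidth_mbps)
            + job["run_time_s"])

def pick_site(job, sites, replica_sites, bandwidth_mbps, queue_wait):
    """Map the job to the site with the lowest estimated completion time."""
    return min(sites, key=lambda s: data_aware_cost(
        job, s, replica_sites, bandwidth_mbps, queue_wait))

# The site holding the 800 MB input wins despite a longer queue:
replicas = {"big.dat": {"S1"}}
bw = {("S1", "S2"): 100.0}          # Mbps
wait = {"S1": 50.0, "S2": 0.0}      # seconds
job = {"inputs": {"big.dat": 800.0}, "run_time_s": 100.0}
print(pick_site(job, ["S1", "S2"], replicas, bw, wait))  # S1 (150 s vs 164 s)
```

A compute-only scheduler using queue wait alone would have chosen S2; accounting for the 64-second retrieval overhead flips the decision.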

In our work, we address the problems of data staging and job mapping for data-intensive jobs in both homogeneous and heterogeneous environments, taking into account the effects of data staging, end-point contention, and data locality. For the mapping problem, we propose algorithms for mapping data-intensive jobs in both offline and online settings. We also study the interplay between job mapping and data replication, and propose algorithms that perform the two in a coordinated manner. For the data staging problem, we propose efficient staging mechanisms for data centers consisting of coupled collections of storage and compute clusters. We further extend this work to heterogeneous distributed systems such as the Grid, employing multi-hop path splitting and multi-pathing optimizations to improve wide-area file transfer throughput.
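The intuition behind the last two optimizations can be illustrated with a small back-of-the-envelope model (a sketch under assumed link bandwidths, not the speaker's implementation): a pipelined transfer split across the hops of one path is bounded by its slowest hop, so routing through a well-connected intermediate host can beat a congested direct link, and striping a file over multiple disjoint paths lets their bottleneck throughputs add up.

```python
def path_throughput(path, link_bw):
    """Effective throughput of a split (pipelined) transfer along one
    multi-hop path: bounded by the bottleneck hop."""
    return min(link_bw[(a, b)] for a, b in zip(path, path[1:]))

def multipath_throughput(paths, link_bw):
    """Striping a file across multiple (assumed disjoint) paths lets
    their bottleneck throughputs add up."""
    return sum(path_throughput(p, link_bw) for p in paths)

# Illustrative link bandwidths in Mbps (hypothetical topology):
link_bw = {
    ("src", "dst"): 10.0,   # congested direct wide-area link
    ("src", "mid"): 40.0,   # detour through an intermediate host
    ("mid", "dst"): 30.0,
}
print(path_throughput(["src", "dst"], link_bw))         # 10.0 (direct)
print(path_throughput(["src", "mid", "dst"], link_bw))  # 30.0 (multi-hop)
print(multipath_throughput(
    [["src", "dst"], ["src", "mid", "dst"]], link_bw))  # 40.0 (multi-path)
```

In this toy topology, multi-hop path splitting alone triples throughput over the direct link, and using both paths concurrently adds their rates.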
