Periodic I/O scheduling for supercomputers

Event Sponsor: 
Argonne Leadership Computing Facility Seminar
Start Date: 
Jul 14 2017 - 10:30am
Building/Room: 
Building 240/Room 4301
Location: 
Argonne National Laboratory
Speaker(s): 
Guillaume Aupy
Speaker(s) Title: 
Inria Bordeaux
Host: 
Venkat Vishwanath

With the ever-growing need for data in HPC applications, congestion at the I/O level is becoming critical in supercomputers. Architectural enhancements such as burst buffers and prefetching have been added to machines, but they are not sufficient to prevent congestion. Recent online I/O scheduling strategies have been put in place, but they introduce an additional congestion point and add overhead to the applications' computation.
   
In this work, we show how to take advantage of the periodic nature of HPC applications in order to develop efficient periodic scheduling strategies for their I/O transfers.
 
Our strategy computes a pattern once, during the job scheduling phase, that defines the I/O behavior of each application. After that, the applications run independently, transferring their I/O at the specified times. Our strategy limits the amount of I/O congestion at the I/O node level and can be easily integrated into current job schedulers. We validate this model through extensive simulations and experiments, comparing it to state-of-the-art online solutions.
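
As a rough illustration of the idea of a pattern computed once and then repeated, the toy Python sketch below places one I/O window per application inside a fixed period and lets each application derive its own transfer times from it. This is not the algorithm presented in the talk: the `App` type, the `build_pattern` and `io_windows` helpers, the first-fit placement, and the assumption of a single I/O resource used exclusively by one application at a time are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class App:
    name: str      # hypothetical application label
    io_time: int   # time units of exclusive I/O needed once per period


def build_pattern(apps, period):
    """Assign each app one non-overlapping I/O window inside [0, period).

    Toy first-fit placement, computed once up front (an assumption for
    illustration, not the method from the talk).
    Returns {name: (start, end)}; raises ValueError if the windows do not fit.
    """
    pattern = {}
    cursor = 0
    for app in apps:
        end = cursor + app.io_time
        if end > period:
            raise ValueError("I/O demand exceeds the period")
        pattern[app.name] = (cursor, end)
        cursor = end
    return pattern


def io_windows(pattern, name, period, count):
    """First `count` I/O windows of one app, repeating the pattern every period."""
    start, end = pattern[name]
    return [(start + k * period, end + k * period) for k in range(count)]


apps = [App("A", 3), App("B", 2), App("C", 4)]
pattern = build_pattern(apps, period=10)
windows_b = io_windows(pattern, "B", period=10, count=3)
```

Because the windows are disjoint within one period, they stay disjoint in every repetition, so in this simplified model no run-time coordination between applications is needed once the pattern is fixed.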
 
Short Bio:
I am currently a tenured research scientist (CR2) in the Tadaam Team at Inria Bordeaux - Sud Ouest. My main focus is on data movement at different levels (I/O, cache, network). Until November 2016, I was a Research Assistant Professor at Vanderbilt University (and before that at Penn State University), where I worked with the Scalable Computing Lab headed by Padma Raghavan. Before that, I did a brief postdoc at Argonne National Laboratory (working with Paul Hovland), and earlier I was a PhD student with Anne Benoit and Yves Robert in the ROMA team at the LIP.

I am interested in any new scheduling problems (after playing for a while with energy- and reliability-related problems, I started working on automatic differentiation), but also in algorithmic problems such as finding good approximation algorithms. I have also done some work on checkpointing strategies for HPC systems.