An Architecture for Scaling Spark on HPC Systems

Event Sponsor: 
Mathematics and Computer Science Division Seminar
Start Date: 
Dec 1 2017 - 10:30am
Building 240/Room 4301
Argonne National Laboratory
Silvina Caíno-Lores
Speaker(s) Title: 
University of Madrid
Tom Peterka

The growing demand for efficient data analysis and visualization of modern HPC applications has increased efforts to leverage data-centric frameworks from the Big Data ecosystem. Nevertheless, these platforms must be adapted to obtain the full benefits of HPC infrastructures. This seminar explores the possibility of layering the popular Spark application model and its higher-level tools (e.g., GraphX, Streaming, MLlib) on top of a highly scalable MPI-based library (DIY). We will present an architecture that maps the RDD data abstraction of Spark and its task-oriented execution model onto the block-based nature of DIY and the underlying fabric of MPI processes. As a result, the data-intensive communication patterns implemented in DIY are transparently supported with minor additions to the Spark programming model.

Silvina Caíno-Lores is a PhD candidate at Carlos III University of Madrid under the supervision of Prof. Jesús Carretero and Prof. Florin Isaila.