An Architecture for Scaling Spark on HPC Systems

Silvina Caíno-Lores
Seminar

Abstract:
The growing demand for efficient data analysis and visualization in modern HPC applications has increased efforts to leverage data-centric frameworks from the Big Data ecosystem. Nevertheless, we need to adapt these platforms to get the full benefit of HPC infrastructures. This seminar explores the possibility of layering the popular Spark application model and its higher-level tools (e.g. GraphX, Streaming, MLlib) on top of a highly scalable MPI-based library (DIY). We will present an architecture that maps the RDD data abstraction of Spark and its task-oriented execution model onto the block-based nature of DIY and the underlying fabric of MPI processes. As a result, the data-intensive communication patterns implemented in DIY are transparently supported with minor additions to the Spark programming model.

Bio:
Silvina Caíno-Lores is a PhD candidate at Carlos III University of Madrid under the supervision of Prof. Jesús Carretero and Prof. Florin Isaila.