Quick Start: Using Apache Spark for Large-Scale Data Processing

Start Date: Mar 27 2019 - 11:00am
Speaker(s): Xiao-Yong Jin, Argonne Leadership Computing Facility
Speaker(s) Title: Assistant Computational Scientist

ALCF Developer Sessions - March 2019


Apache Spark provides a high-level framework for parallel data processing, with built-in support for fault tolerance and data replication. It has become increasingly popular in cloud computing environments and commercial data centers due to its ease of use and integration with existing libraries in Java, Scala, Python, R, and SQL. The ALCF supports Apache Spark on its Cooley and Theta systems, with plans to support the framework on its future systems as well. This webinar will present a brief tutorial on Apache Spark, provide instructions for running it on ALCF systems, discuss the unique characteristics of Theta, and recommend a few tuning parameters for better Apache Spark performance on Theta.

About the Speaker

Xiao-Yong Jin is an Assistant Computational Scientist in Argonne's Computational Science Division and at the Argonne Leadership Computing Facility. He obtained his PhD in Lattice Field Theory, studying quarks and gluons on a rack of the QCDOC computer, and later worked on the K computer before coming to Argonne. Recently, he has been continuing his research in Lattice Field Theory and exploring new directions for HPC applications in machine learning, big data, and quantum information science.