Enabling Efficient and Scalable Deep Learning on Supercomputers

Zhao Zhang
Seminar

Recent years have seen the fusion of Deep Learning (DL) and High Performance Computing (HPC). Domain scientists are exploring and exploiting DL techniques for classification, prediction, and simulation dimensionality reduction. These DL applications are naturally supercomputing applications given their computation, communication, and I/O characteristics. In this talk, I will present two pieces of work that enable highly scalable distributed DL training. The first focuses on the Layer-wise Adaptive Rate Scaling (LARS) algorithm and its application to ImageNet training on thousands of compute nodes with state-of-the-art validation accuracy. The second enables efficient and scalable I/O for DL applications on supercomputers with FanStore, with which we are able to scale real-world applications to hundreds of nodes on CPU and GPU clusters with over 90% scaling efficiency.
 
Bio:
Dr. Zhao Zhang is a computational scientist at the Texas Advanced Computing Center. His current research focuses on scalable deep learning on supercomputers. Dr. Zhang's past work includes astronomy data processing with Apache Spark, machine learning diagnostics, and I/O optimization for many-task computing applications on supercomputers, such as Argonne's IBM Blue Gene/P. Before joining TACC, Dr. Zhang was a postdoctoral researcher in the AMPLab and a data science fellow at the Berkeley Institute for Data Science at the University of California, Berkeley, working with Prof. Michael J. Franklin. He received his Ph.D. in computer science from the University of Chicago in 2014 under the supervision of Prof. Ian T. Foster.