Improving Data Loading and Communication Performance for Large-Scale Distributed Training

Baixi Sun, Indiana University Luddy School of Informatics, Computing, and Engineering; summer student at Argonne National Laboratory (ANL), 2023

Large-scale distributed training of deep neural network (DNN) models exposes performance bottlenecks on high-performance computing (HPC) clusters. On the one hand, the effectiveness of DNN models depends heavily on large training datasets (e.g., terabyte-scale), which makes data loading a major challenge in today's distributed training. On the other hand, second-order optimizers offer faster convergence and better generalization than stochastic gradient descent (SGD), but they incur extra communication overhead. Reducing this communication cost is therefore crucial to the performance of second-order optimizers.
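To see where the extra communication comes from, compare the two update rules schematically (a simplified sketch; the actual curvature approximation used, e.g., a Fisher or Hessian block, varies by method):

SGD update:
$$ w_{t+1} = w_t - \eta \, \nabla L(w_t) $$

Second-order update:
$$ w_{t+1} = w_t - \eta \, H^{-1} \nabla L(w_t) $$

Besides the gradient $\nabla L(w_t)$, workers must also construct and exchange information for the curvature term $H$ (or its inverse), whose blocks can be far larger than the gradient itself, which is why the communication volume grows relative to SGD.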

To address these problems, I will discuss two system-level optimizations: SOLAR and SSO. SOLAR uses offline and online scheduling strategies to reduce the cost of loading data from parallel file systems into device memory (e.g., GPU memory). SSO avoids latency-bound communication and integrates lossy compression algorithms to shrink communication message sizes while preserving the benefits of second-order optimizers, such as faster convergence than SGD-based optimizers. Specifically, I will describe the challenges of data loading and communication in large-scale distributed training, share our insights on performance improvements, and explain how SOLAR and SSO address these challenges.
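To give a flavor of how lossy compression can shrink communicated tensors, here is a minimal sketch using uniform scalar quantization on a gradient-like array. This is an illustrative assumption, not SSO's actual algorithm: the function names (`compress`, `decompress`) and the 8-bit uniform scheme are hypothetical stand-ins for whatever error-bounded compressor a real system would use.

```python
import numpy as np

def compress(grad: np.ndarray, bits: int = 8):
    """Quantize float64 values into `bits`-bit unsigned integers.

    Returns the quantized payload plus (lo, scale) needed to reconstruct.
    """
    lo, hi = grad.min(), grad.max()
    # Guard against a constant tensor (hi == lo would give scale 0).
    scale = (hi - lo) / (2 ** bits - 1) or 1.0
    q = np.round((grad - lo) / scale).astype(np.uint8)
    return q, lo, scale

def decompress(q: np.ndarray, lo: float, scale: float) -> np.ndarray:
    """Reconstruct an approximation of the original values."""
    return q.astype(np.float64) * scale + lo

rng = np.random.default_rng(0)
g = rng.normal(size=1024)            # stand-in for a gradient/curvature block
q, lo, scale = compress(g)
g_hat = decompress(q, lo, scale)

# The 8-bit payload is 8x smaller than float64, and the per-element
# reconstruction error is bounded by half the quantization step.
print("compression ratio:", g.nbytes / q.nbytes)
print("error bounded:", bool(np.abs(g - g_hat).max() <= scale))
```

In a distributed setting, each worker would compress its message before the collective communication and decompress after, trading a bounded approximation error for a smaller payload on the wire.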