Optimization of MPI Collectives on GPU Clusters

Lukasz Wesolowski
Seminar

Abstract: Collective communication is a key aspect of large-scale parallel application performance, and it is becoming increasingly important as core counts on supercomputers approach a million. Although optimization of collectives is a well-studied topic and the collective implementations in major MPI libraries are mature, the recent trend toward supercomputers with accelerator devices raises the less familiar problem of performing collectives on data residing in accelerator memory. This work examines how to optimize MPI collectives on data distributed across system and accelerator memories. The work is done in the context of the MPICH implementation of MPI running on a cluster accelerated by CUDA GPUs. Our implementation of collectives integrates efficient data movement between system and accelerator memories and schedules MPI inter-process communication with awareness of the processes that must copy data from the accelerator device before participating in the collective.
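The staging the abstract alludes to can be illustrated with a small toy model. This is only an illustrative sketch, not the implementation discussed in the talk: the dictionaries stand in for device and host memory spaces, `device_to_host` stands in for a device-to-host copy (e.g. cudaMemcpy), and message passing is modeled as direct assignment rather than MPI calls.

```python
def device_to_host(rank, device_mem, host_mem):
    """Simulate a device-to-host copy for one rank's buffer."""
    host_mem[rank] = device_mem[rank]

def staged_bcast(nranks, device_mem, host_mem):
    """Binomial-tree broadcast rooted at rank 0, with the root's data
    staged from (simulated) device memory into host memory first.

    Returns the (sender, receiver) pairs in the order the corresponding
    point-to-point messages would be issued.
    """
    root = 0  # fixed root, for simplicity
    # The root must stage its device buffer into host memory before
    # it can send; a scheduler aware of this copy can overlap it with
    # other work instead of serializing the whole collective behind it.
    device_to_host(root, device_mem, host_mem)

    schedule = []
    have_data = {root}
    step = 1
    while step < nranks:
        # Every rank that already holds the data forwards it one
        # "step" away; the set of holders doubles each round.
        for sender in sorted(have_data):
            receiver = sender + step
            if receiver < nranks and receiver not in have_data:
                host_mem[receiver] = host_mem[sender]
                schedule.append((sender, receiver))
        have_data = set(host_mem)
        step *= 2

    # Receivers would finally copy the host buffer back to device
    # memory; modeled here as a plain assignment.
    for r in range(nranks):
        device_mem[r] = host_mem[r]
    return schedule
```

For 8 ranks this issues the usual 7 messages of a binomial broadcast; the point of the model is that the device-to-host copy at the root (and the host-to-device copies at the leaves) are explicit steps a collective scheduler can account for.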

Bio: Lukasz is a Ph.D. student in computer science at the University of Illinois at Urbana-Champaign, where he also received his B.S. and M.S. degrees in computer science. His areas of interest include optimization of communication patterns on supercomputers, tools and abstractions for parallel programming, general-purpose graphics processing units, and large-scale parallel applications. His adviser is Laxmikant Kale.