Accelerating Movement of Non-contiguous Data in Hybrid MPI+GPU Environments

John Jenkins
Seminar

Graphics Processing Units (GPUs) are increasingly being used in high-performance computing, enabling the acceleration of large-scale scientific and engineering simulations. Currently, there is little meaningful integration between communication libraries such as MPI and the GPU; the prevailing model is for the CPU to drive the GPU's execution flow and data management when GPU data is used in MPI communication routines. This model is unlikely to change as long as GPUs use a discrete memory space, which introduces numerous challenges for efficient communication between the different memory spaces. One particular use case is the communication of non-contiguous data, such as column vectors and 3D array slices. As a step toward a more efficient integration of MPI with discrete GPU hardware, we discuss and provide an implementation of MPI datatype processing on the GPU. We provide a parallel extension to the current tree-based MPICH "dataloops" implementation, which allows arbitrary point-wise packing and retrieval. We evaluate data movement under a number of MPI datatypes, losing little performance compared to "close to the metal" CUDA API functions, and demonstrate the efficiency of 3D array slice packing for halo exchange, accelerating Y-Z face transfer to the CPU by up to an order of magnitude for larger messages. Finally, we examine the effect of various types of resource contention on the GPU, identifying the scenarios in which contention degrades the performance of the packing operation or of the resident computation.
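
As a concrete illustration of the kind of non-contiguous layouts under discussion (this is a minimal sketch, not code from the talk), the example below uses standard MPI derived-datatype calls to describe a column vector and a Y-Z face of a 3D array; the extents NX, NY, NZ and the storage order (x fastest-varying) are assumed for the example. Communicating with such a datatype forces the MPI implementation to gather widely strided elements into a contiguous buffer, and it is this packing step that the talk moves onto the GPU.

#include <mpi.h>

/* Assumed extents, for illustration only. */
#define NX 64
#define NY 64
#define NZ 64

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Column of a row-major NX x NY matrix: NX elements,
       each separated by a stride of NY elements. */
    MPI_Datatype column;
    MPI_Type_vector(NX, 1, NY, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* Y-Z face of an NX x NY x NZ array stored with x fastest-varying:
       NY*NZ elements, none of them adjacent in memory. */
    int sizes[3]    = {NZ, NY, NX};   /* full extents (C order, x fastest) */
    int subsizes[3] = {NZ, NY, 1};    /* one Y-Z face (fixed x index)      */
    int starts[3]   = {0, 0, 0};      /* face at x = 0                     */
    MPI_Datatype yz_face;
    MPI_Type_create_subarray(3, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &yz_face);
    MPI_Type_commit(&yz_face);

    /* A send such as
         MPI_Send(buf, 1, yz_face, dest, tag, MPI_COMM_WORLD);
       requires packing the strided elements into a contiguous buffer;
       performing that packing on the GPU is the subject of the talk. */

    MPI_Type_free(&column);
    MPI_Type_free(&yz_face);
    MPI_Finalize();
    return 0;
}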

Bio:
John Jenkins is a second-year PhD student at North Carolina State University, advised by Dr. Nagiza Samatova. He received his Bachelor's degree in Computer Science from Lafayette College in 2010. His current research lies in HPC and large-scale data analytics, though he has not yet finalized a main thrust of research.