GPU Accelerated Task Parallelism in a Global Address Space Model

Humayun Arafat
Seminar

In this work, I will present an approach to accelerating task-parallel computations using GPUs in the context of the Global Arrays parallel programming model. Task parallelism is an effective technique for expressing parallelism in irregular programs. We extend the Scioto task-parallel programming library to efficiently offload task execution to GPU accelerators. Executing Scioto tasks on the GPU requires moving data through three layers: the global address space, host memory, and device memory. We propose an automated, pipeline-based approach to managing data movement across these memory spaces. Data transfer is made transparent to the user, providing opportunities to hide its overhead through optimizations such as pipelining. On-device caching and task sequencing are also leveraged to exploit data locality. We evaluate our work using a block-sparse tensor contraction kernel. Tensor contractions, which generalize matrix multiplication to multidimensional arrays, are widely used in quantum chemistry. Experiments show that the proposed techniques yield significant performance gains by hiding the cost of data movement.
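As a concrete illustration of the evaluation kernel (a sketch, not code from the talk), a block-sparse contraction stores only the nonzero blocks of each tensor, and each pairwise block contraction is itself a small dense matrix multiplication; in the Scioto model, each such block product is the natural unit of work to offload to the GPU. The block size and dictionary-of-blocks layout below are illustrative assumptions.

```python
import numpy as np

BS = 4  # illustrative block size

def block_sparse_contract(A_blocks, B_blocks):
    """Blocked contraction C[i,j] += A[i,k] * B[k,j], where i, j, k index
    blocks rather than scalar elements. A_blocks/B_blocks map
    (block_row, block_col) -> dense NumPy block; only nonzero blocks
    are stored."""
    C_blocks = {}
    for (i, k), a in A_blocks.items():
        for (k2, j), b in B_blocks.items():
            if k != k2:
                continue
            # Each dense block product is one "task" that could be
            # offloaded to the GPU in the approach described above.
            c = C_blocks.setdefault((i, j), np.zeros((BS, BS)))
            c += a @ b
    return C_blocks

# Example: two nonzero blocks in each operand tensor.
rng = np.random.default_rng(0)
A = {(0, 0): rng.standard_normal((BS, BS)),
     (0, 1): rng.standard_normal((BS, BS))}
B = {(0, 0): rng.standard_normal((BS, BS)),
     (1, 0): rng.standard_normal((BS, BS))}
C = block_sparse_contract(A, B)
# C[(0, 0)] accumulates A[(0,0)] @ B[(0,0)] + A[(0,1)] @ B[(1,0)]
```

Because each block product is independent apart from the accumulation into C, the block operands can be staged from the global address space to host and device memory in a pipeline while earlier blocks are still being multiplied, which is the overlap the abstract describes.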

Bio: Humayun Arafat is a PhD student in the Dept. of Computer Science and Engineering at The Ohio State University, advised by Prof. P. Sadayappan. He received his Bachelor's degree from Bangladesh University of Engineering and Technology. His research interests are in high-performance and parallel computing. His previous work includes improving the load balancing of Dynamic Nucleation Theory Monte Carlo (DNTMC), which is used in NWChem.