Leveraging Many GPUs in Parallel for Lattice QCD

Ron Babich, Boston University
Seminar

Quantum Chromodynamics (QCD) is the fundamental theory that describes the interactions of quarks and gluons. In the lattice formulation of QCD, the equations governing these interactions are solved numerically on a four-dimensional space-time grid, a process carried out in two basic stages. The first stage, "lattice generation," typically demands leadership-class machines, while the subsequent "analysis" stage is often farmed out to smaller resources, each sustaining perhaps a Tflop/s. Recently, graphics processing units (GPUs) have proven to be extremely cost-effective for many workloads in lattice QCD, especially the latter "analysis" stage of large-scale computations. As a result, it is now possible to tackle much more challenging problems than would be feasible on more traditional architectures.
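To make this concrete: the dominant cost in both stages is repeatedly solving a large, sparse linear system defined by a discretized Dirac operator. With Wilson fermions, for example, the system takes the form

\[
D\,\psi = \eta, \qquad
(D\psi)(x) = \psi(x) - \kappa \sum_{\mu=1}^{4}
\left[ (1-\gamma_\mu)\, U_\mu(x)\, \psi(x+\hat\mu)
     + (1+\gamma_\mu)\, U_\mu^\dagger(x-\hat\mu)\, \psi(x-\hat\mu) \right],
\]

where \(\psi\) is a quark field, \(\eta\) a source, \(U_\mu(x)\) the SU(3) gauge links, \(\gamma_\mu\) the Dirac matrices, and \(\kappa\) a parameter related to the quark mass. Because the operator couples only nearest-neighbor sites, applying it maps naturally onto massively parallel hardware.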

This talk will focus on lattice QCD as a case study in GPU computing, drawing lessons that may be relevant to other fields. In particular, I will discuss "QUDA," a library of linear solvers tailored for QCD and developed using NVIDIA's "C for CUDA" API. Recently, QUDA has been extended to support multiple GPUs in parallel, with communications handled via MPI. I will discuss some of the strategies we employed to obtain high performance on a single GPU and to maintain that efficiency on GPU-enabled clusters. Finally, I will describe some of the challenges we face as we seek to scale up to systems of hundreds or even thousands of GPUs, with possible implications for the design of future heterogeneous systems on the road to the exascale.
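As a rough sketch of the multi-GPU strategy (illustrative only; the kernel and buffer names below are hypothetical and do not reflect QUDA's actual interface), the key idea is to overlap the interior stencil computation with the MPI halo exchange. Here it is shown for a one-dimensional domain decomposition with a halo of one site per face:

/* halo.cu -- compile with, e.g., nvcc halo.cu -lmpi (paths depend on the MPI install) */
#include <mpi.h>
#include <cuda_runtime.h>

#define N (1 << 20) /* local sites per GPU (illustrative) */

/* Bulk update: touches only sites whose neighbors are locally resident. */
__global__ void interior(float *out, const float *in, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= 1 && i < n - 1) out[i] = 0.5f * (in[i - 1] + in[i + 1]);
}

/* Edge update: consumes the halo values received over MPI. */
__global__ void boundary(float *out, const float *in,
                         float halo_l, float halo_r, int n) {
  out[0]     = 0.5f * (halo_l + in[1]);
  out[n - 1] = 0.5f * (in[n - 2] + halo_r);
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  int left = (rank - 1 + size) % size, right = (rank + 1) % size;

  float *in, *out;
  cudaMalloc(&in, N * sizeof(float));
  cudaMalloc(&out, N * sizeof(float));
  cudaMemset(in, 0, N * sizeof(float));

  cudaStream_t compute, comm;
  cudaStreamCreate(&compute);
  cudaStreamCreate(&comm);

  /* 1. Copy the two boundary sites to the host for sending. */
  float send[2], recv[2];
  cudaMemcpyAsync(&send[0], in,         sizeof(float), cudaMemcpyDeviceToHost, comm);
  cudaMemcpyAsync(&send[1], in + N - 1, sizeof(float), cudaMemcpyDeviceToHost, comm);

  /* 2. Start the interior update; it needs no remote data, so it
        overlaps with the halo exchange below. */
  interior<<<(N + 255) / 256, 256, 0, compute>>>(out, in, N);

  /* 3. Exchange halos with the two neighbors while the bulk runs. */
  cudaStreamSynchronize(comm);
  MPI_Request req[4];
  MPI_Irecv(&recv[0], 1, MPI_FLOAT, left,  0, MPI_COMM_WORLD, &req[0]);
  MPI_Irecv(&recv[1], 1, MPI_FLOAT, right, 1, MPI_COMM_WORLD, &req[1]);
  MPI_Isend(&send[0], 1, MPI_FLOAT, left,  1, MPI_COMM_WORLD, &req[2]);
  MPI_Isend(&send[1], 1, MPI_FLOAT, right, 0, MPI_COMM_WORLD, &req[3]);
  MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

  /* 4. Finish the two edge sites once the halos have arrived. */
  boundary<<<1, 1, 0, compute>>>(out, in, recv[0], recv[1], N);
  cudaStreamSynchronize(compute);

  cudaStreamDestroy(compute);
  cudaStreamDestroy(comm);
  cudaFree(in);
  cudaFree(out);
  MPI_Finalize();
  return 0;
}

The same pattern generalizes to the four-dimensional lattice, where the faces are large enough that dedicated packing kernels and pinned host buffers become important; the essential point is that communication latency is hidden behind the bulk of the computation.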