**Tier 2 Code Development Project**

**Numerical Methods/Algorithms**

Two codes are involved in this project: MILC and the Columbia Physics System (CPS). Both operate on a four-dimensional hypercubic lattice (grid), with quark data associated with the lattice sites and gluon data associated with the links between neighboring sites. The team’s calculations work with gluon field configurations that can be thought of as snapshots of the QCD vacuum. They generate and store large statistical ensembles of these configurations, which form the basis for all of their physics studies.

The most time-consuming part of the team’s projects is usually the solution of a large sparse matrix problem. This is done using variants of the conjugate gradient or biconjugate gradient algorithm, refined with various preconditioning schemes. Another time-consuming part of the calculation is the generation of the gluon field configurations themselves, which involves evolving a pseudo-dynamical system in a manner similar to molecular dynamics. This generation of gauge field configurations has become increasingly sophisticated in order to permit calculations that use a physical light quark mass and large physical volumes. It is a multiscale problem, requiring the inversion of poorly conditioned sparse matrices and the evolution of a billion degrees of freedom that change on very different time scales.

The hadronic light-by-light (HLbL) calculation also uses the ensembles of gauge configurations described above, but it departs in important respects from the usual lattice QCD calculation because it also involves an electromagnetic (EM) gauge field and the muon field.
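To make the solver step concrete, the following is a minimal sketch of a matrix-free conjugate gradient solve on a small periodic four-dimensional lattice. It is not taken from MILC or CPS: the operator is a toy nearest-neighbor stencil (a mass term minus the lattice Laplacian) standing in for the much more involved Dirac operator, and the names `make_lattice`, `apply_operator`, and `conjugate_gradient`, the 4^4 lattice size, and the value of `mass_sq` are all assumptions chosen for illustration.

```python
import itertools
import math

def make_lattice(dims):
    """Return the site list and a nearest-neighbor table for a periodic lattice."""
    sites = list(itertools.product(*(range(n) for n in dims)))
    index = {site: i for i, site in enumerate(sites)}
    neighbors = []
    for site in sites:
        nbrs = []
        for mu in range(len(dims)):
            for step in (+1, -1):
                shifted = list(site)
                shifted[mu] = (shifted[mu] + step) % dims[mu]  # periodic wrap
                nbrs.append(index[tuple(shifted)])
        neighbors.append(nbrs)
    return sites, neighbors

def apply_operator(x, neighbors, mass_sq):
    """Toy sparse operator: (mass_sq - Laplacian) x, symmetric positive definite."""
    diag = mass_sq + len(neighbors[0])  # 2 * (number of dimensions) neighbors
    return [diag * x[i] - sum(x[j] for j in nbrs)
            for i, nbrs in enumerate(neighbors)]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def conjugate_gradient(apply_a, b, tol=1e-10, max_iter=1000):
    """Solve A x = b using only applications of A (A is never stored as a matrix)."""
    x = [0.0] * len(b)
    b_norm = math.sqrt(dot(b, b))
    if b_norm == 0.0:
        return x, 0
    r = list(b)   # residual b - A x for the initial guess x = 0
    p = list(r)   # search direction
    rr = dot(r, r)
    for iteration in range(1, max_iter + 1):
        ap = apply_a(p)
        alpha = rr / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = dot(r, r)
        if math.sqrt(rr_new) <= tol * b_norm:
            return x, iteration
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iter

# Hypothetical problem: point source at the origin of a 4^4 lattice.
dims = (4, 4, 4, 4)
mass_sq = 0.1
sites, neighbors = make_lattice(dims)
source = [0.0] * len(sites)
source[0] = 1.0
solution, iterations = conjugate_gradient(
    lambda v: apply_operator(v, neighbors, mass_sq), source)
check = apply_operator(solution, neighbors, mass_sq)
residual = [bi - yi for bi, yi in zip(source, check)]
```

In this toy operator, lowering `mass_sq` pushes the smallest eigenvalue toward zero and worsens the conditioning, which loosely mirrors why solves at a physical light quark mass are expensive and why preconditioning matters.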

**Parallelization**

High-level CPS code is written in C++ and uses MPI between nodes and OpenMP on each node. On a 16-core Blue Gene/Q node, 64 threads are typically used (four hardware threads per core). On the Blue Gene machines, the researchers use SPI for internode communication, and they would use similarly low-level, higher-performance network software on Theta if it were available. The overall code base uses a variety of methods to implement parallelism and vectorization and to optimize heavily for specific architectures. One is the BAGEL code generator, which, for example, generates assembly language for Knights Corner (KNC). Another is the use of highly tuned libraries, such as the USQCD SciDAC libraries and the QUDA library for GPUs.
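As a rough illustration of how a lattice might be divided among MPI ranks, the sketch below does only the bookkeeping: it splits a global lattice over a four-dimensional process grid, counts the boundary sites a rank would exchange for a nearest-neighbor stencil, and maps a linear rank to process-grid coordinates. It assumes a plain even block decomposition, which is not necessarily the layout CPS uses, and the function names, lattice size, and process grid are hypothetical.

```python
def local_dims(global_dims, proc_grid):
    """Split each lattice direction evenly across the process grid."""
    for extent, procs in zip(global_dims, proc_grid):
        if extent % procs != 0:
            raise ValueError(f"extent {extent} is not divisible by {procs}")
    return tuple(n // p for n, p in zip(global_dims, proc_grid))

def halo_sites(local, proc_grid):
    """Count face sites sent to neighbors for a nearest-neighbor stencil."""
    volume = 1
    for extent in local:
        volume *= extent
    total = 0
    for extent, procs in zip(local, proc_grid):
        if procs > 1:  # only directions split across ranks need communication
            total += 2 * (volume // extent)  # forward and backward faces
    return total

def rank_coords(rank, proc_grid):
    """Map a linear rank to process-grid coordinates (last index fastest)."""
    coords = []
    for procs in reversed(proc_grid):
        coords.append(rank % procs)
        rank //= procs
    return tuple(reversed(coords))

# Hypothetical example: a 32^3 x 64 lattice on a 2 x 2 x 2 x 4 process grid.
global_dims = (32, 32, 32, 64)
proc_grid = (2, 2, 2, 4)
local = local_dims(global_dims, proc_grid)
local_volume = local[0] * local[1] * local[2] * local[3]
halo = halo_sites(local, proc_grid)
surface_to_volume = halo / local_volume
```

In a hybrid scheme of the kind described above, each rank would own one such local block and its threads would share the loop over the local sites. The surface-to-volume ratio gives a quick feel for how much of the work touches the network as the local block shrinks.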

**Application Development**

- MILC
  - Optimize the sparse matrix solver for the HISQ formulation
  - Optimize the calculation of the quark force and gluon force, and link smearing
  - Add OpenMP directives and experiment with data layouts for better vectorization
  - The team expects to be memory-bandwidth limited, so it will optimize the use of MCDRAM in flat memory mode

- CPS (aspects not addressed by NESAP: electromagnetism and the muon)
  - Develop/exploit efficient, network-aware FFT algorithms
  - Evaluate and compare the BAGEL and Grid strategies for developing highly efficient QED+QCD code that includes the muon
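One reason FFTs on a distributed lattice are sensitive to the network is that a multidimensional transform factors into one-dimensional transforms along each axis, and any axis that is split across nodes needs data movement (typically a transpose) before its pass can run. The sketch below shows that factorization in two dimensions using a direct O(n^2) transform for clarity. It is only an illustration of the idea, not the team's algorithm, and the function names and the small test grid are assumptions.

```python
import cmath

def dft_1d(values):
    """Direct discrete Fourier transform of a 1D sequence (O(n^2), for clarity)."""
    n = len(values)
    return [
        sum(values[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]

def dft_2d_by_axis(grid):
    """2D transform as two passes of 1D transforms with a transpose in between."""
    rows = [dft_1d(row) for row in grid]               # pass along axis 1
    cols = [dft_1d(list(col)) for col in zip(*rows)]   # transpose, pass along axis 0
    return [list(row) for row in zip(*cols)]           # transpose back

def dft_2d_direct(grid):
    """Reference 2D transform evaluated straight from the definition."""
    n0, n1 = len(grid), len(grid[0])
    return [
        [
            sum(
                grid[a][b] * cmath.exp(-2j * cmath.pi * (a * k0 / n0 + b * k1 / n1))
                for a in range(n0)
                for b in range(n1)
            )
            for k1 in range(n1)
        ]
        for k0 in range(n0)
    ]

# Hypothetical 4 x 3 input used only to compare the two routes.
grid = [[float(a * 3 + b) + 0.5j * (a - b) for b in range(3)] for a in range(4)]
by_axis = dft_2d_by_axis(grid)
direct = dft_2d_direct(grid)
max_diff = max(
    abs(by_axis[k0][k1] - direct[k0][k1]) for k0 in range(4) for k1 in range(3)
)
```

In a distributed setting, the two `zip(*...)` transposes are where the communication would occur, so the choice of which axes are kept local and how the transposes are scheduled is what a network-aware design would tune.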

**Portability**

The team’s approach to portability to date has been to provide high-performance libraries, such as BAGEL BFM, QOPQDP, and QPhiX, targeting multi-core processors such as Blue Gene/Q, Xeon, and Xeon Phi, and QUDA, targeting GPUs. While these libraries are not portable across all systems, they do provide similar or near-equivalent capabilities to the calling codes on a wide range of platforms. Beyond libraries, the codes have taken different approaches to making the rest of the code portable. As an example, the MILC code focused on incrementally porting components (force terms, calculation of the actions, etc.) to GPUs, folding these into the QUDA library. Chroma has taken a different approach, porting its data-parallel layer using QDP-JIT/PTX, which allows all of Chroma that is written in terms of QDP++ to run with good efficiency on GPUs.