Speeding up Nek5000 with Autotuning and Specialization

Event Sponsor: 
Mathematics and Computer Science Division Seminar
Start Date: 
Oct 29 2009 - 10:30am to 11:30am
Building/Room: 
Bldg: 240, Conference Room 1404, 1405, and 1406
Location: 
Argonne National Laboratory
Speaker(s): 
Jaewook Shin
Speaker(s) Title: 
Mathematics and Computer Science Division
Host: 
Paul Hovland

Autotuning has recently emerged as a systematic process for evaluating alternative implementations of a computation and selecting the best-performing one for a particular architecture. At a LANS seminar in May, I introduced compiler-based empirical performance tuning and presented its successful application to a dense matrix-multiply kernel for small, rectangular matrices.
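The core idea of empirical autotuning can be sketched in a few lines. The toy below is only an illustration of the general approach, not the speaker's compiler-based system: it generates candidate implementations of a small matrix multiply (here, two hand-written loop orderings), times each on representative inputs, and keeps the fastest.

```python
import time

def matmul_ijk(A, B, C, n):
    # Naive i-j-k loop order: accumulate the dot product per output element.
    for i in range(n):
        for j in range(n):
            s = C[i][j]
            for k in range(n):
                s += A[i][k] * B[k][j]
            C[i][j] = s

def matmul_ikj(A, B, C, n):
    # i-k-j loop order: streams through rows of B, often friendlier to caches.
    for i in range(n):
        for k in range(n):
            a = A[i][k]
            for j in range(n):
                C[i][j] += a * B[k][j]

def autotune(variants, n=32, reps=5):
    """Empirically time each candidate variant and return the fastest one."""
    A = [[1.0] * n for _ in range(n)]
    B = [[1.0] * n for _ in range(n)]
    best, best_t = None, float("inf")
    for f in variants:
        C = [[0.0] * n for _ in range(n)]
        t0 = time.perf_counter()
        for _ in range(reps):
            f(A, B, C, n)
        t = (time.perf_counter() - t0) / reps
        if t < best_t:
            best, best_t = f, t
    return best

best = autotune([matmul_ijk, matmul_ikj])
```

A production autotuner searches a far larger space (tiling factors, unrolling, vectorization) generated by a compiler rather than written by hand, but the select-by-measurement loop is the same.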

In this talk, I will begin with a summary of that earlier talk and then present my progress since then. A major result is that the same technique can be used to tune a higher-level kernel: a loop containing a call to a dense matrix-multiply routine for small matrices. The tuned kernel achieves up to 82% of peak performance on an AMD Phenom processor. With this tuned higher-level kernel and the library of tuned matrix-multiply routines produced earlier, the whole Nek5000 program achieves a 21% speedup on 256 nodes of the Cray XT5 at Oak Ridge National Laboratory. I will also discuss the overheads and fluctuations in performance measurements and how I overcame them in this experiment.
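One common way to suppress the measurement fluctuations mentioned above (the abstract does not say which method the speaker used) is to repeat each timing several times and report the minimum, since the fastest observed run is the one least perturbed by OS noise and other transient overheads. A minimal sketch, with a hypothetical helper name:

```python
import time

def best_time(fn, reps=10):
    """Time fn() reps times and return the minimum elapsed time.

    The minimum is the run least affected by transient interference,
    so it is a more stable estimate than a single measurement.
    """
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return min(times)

t = best_time(lambda: sum(range(100000)))
```

An autotuner built on such a measurement harness compares variants by their best observed times, so a single noisy run cannot flip the ranking.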

Miscellaneous Information: 

There will be coffee before the seminar with the new LANS espresso machine.