Building Algorithm-Based Energy Efficient High Performance Computing Systems with Resilience

Event Sponsor: Mathematics and Computing Science Seminar
Start Date: May 12 2015 - 10:30am
Building 240/Room 4301
Argonne National Laboratory
Speaker(s): Li Tan
Speaker(s) Title: Postdoc Interviewee - MCS
Stefan Wild

With exascale supercomputers expected to arrive around 2020, today's supercomputers are growing rapidly in size and in duration of use, which places demanding requirements on energy efficiency and resilience. These requirements are becoming more pressing and challenging for two reasons: (a) the cost of powering a supercomputer grows substantially as its scale expands, and (b) the mean time to failure of large-scale high performance computing (HPC) systems shortens dramatically because a large number of compute nodes are interconnected as a whole. It is thus desirable to consider both dimensions when building scalable, cost-efficient, and robust HPC systems in this era. Specifically, our goal is to achieve the optimal performance-power-failure ratio while exploiting parallelism during HPC runs.

Within a wide range of HPC applications, numerical linear algebra (LA) operations are fundamental and are used extensively in science and engineering. Saving energy for LA operations thus contributes significantly to the energy efficiency of scientific computing. Existing OS-level, machine-learning-based solutions are highly general and can effectively save energy for some applications in a black-box fashion. However, they fall short for applications with variable workloads such as LA operations: optimal energy savings cannot be achieved because the workload prediction they rely on is potentially inaccurate and costly. Therefore, we propose to use the algorithmic characteristics of LA operations to accurately predict processor idle time, i.e., slack, and thus maximize potential energy savings. Typically, when processors experience slack during HPC runs, energy can be saved by using power-aware hardware techniques to scale down processor frequency during underused execution phases. For HPC systems, we further propose to decrease the supply voltage associated with a given operating frequency (undervolting), which reduces power consumption further at the cost of increased failure rates. We leverage mainstream resilience techniques to tolerate the additional failures caused by undervolting. Our strategy is theoretically validated and empirically evaluated to save more energy than a state-of-the-art energy-efficient solution, while guaranteeing correctness.
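
The idea of reclaiming slack by frequency scaling, and then undervolting at a fixed frequency, can be illustrated with a minimal Python sketch. It is not the speaker's implementation. Everything in it is an assumption made for illustration: the frequency/voltage table, the names `Level`, `reclaim_slack`, and `undervolt`, and the simplified model in which dynamic power is proportional to V^2 * f and run time is inversely proportional to f. Static power and the failure-rate increase caused by undervolting are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    """One hypothetical processor operating point."""
    freq_ghz: float
    volt: float


# Hypothetical frequency/voltage pairs, not taken from any real processor.
LEVELS = [Level(2.4, 1.30), Level(1.8, 1.15), Level(1.2, 1.00), Level(0.8, 0.90)]
CAPACITANCE = 1.0  # arbitrary constant; only relative energies matter here


def run_time(gcycles: float, level: Level) -> float:
    """Seconds needed to execute `gcycles` billion cycles at this level."""
    return gcycles / level.freq_ghz


def dynamic_energy(gcycles: float, level: Level) -> float:
    """Energy under the assumed model P = C * V^2 * f, so E = P * t."""
    power = CAPACITANCE * level.volt ** 2 * level.freq_ghz
    return power * run_time(gcycles, level)


def reclaim_slack(gcycles: float, slack_s: float, levels=LEVELS) -> Level:
    """Pick the lowest-energy level that still finishes within the
    fastest-level run time plus the predicted slack."""
    fastest = max(levels, key=lambda lv: lv.freq_ghz)
    deadline = run_time(gcycles, fastest) + slack_s
    feasible = [lv for lv in levels if run_time(gcycles, lv) <= deadline + 1e-12]
    return min(feasible, key=lambda lv: dynamic_energy(gcycles, lv))


def undervolt(level: Level, delta_v: float) -> Level:
    """Lower the supply voltage while keeping the frequency unchanged."""
    return Level(level.freq_ghz, level.volt - delta_v)


if __name__ == "__main__":
    task = 2.4  # billion cycles: 1 s at the fastest level
    chosen = reclaim_slack(task, slack_s=1.0)
    print("chosen level:", chosen)
    print("energy at chosen level:", dynamic_energy(task, chosen))
    print("energy after undervolting:", dynamic_energy(task, undervolt(chosen, 0.1)))
```

In this sketch the slack is passed in as a known quantity, standing in for the algorithm-based prediction described above. A task with no slack stays at the fastest level, and undervolting lowers the modeled energy further without changing the run time.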

Miscellaneous Information: 

Bio: Li Tan is a Ph.D. candidate in the Department of Computer Science and Engineering at the University of California, Riverside. His chief research interest is High Performance Computing (HPC), in particular improving energy and power efficiency, fault tolerance, and reliability for high performance numerical linear algebra algorithms and applications, as well as software debugging in large-scale HPC environments. He has served as an external reviewer for prestigious conferences on high performance parallel and distributed computing, such as SC, IPDPS, PACT, and CCGrid. He received the Dean's Distinguished Fellowship from the University of California, Riverside in 2010. He is a Student Member of the IEEE and a Student Member of the ACM.