Building Algorithm-Based Energy Efficient High Performance Computing Systems with Resilience

Li Tan
Seminar

As exascale supercomputers are expected to arrive around 2020, today's supercomputers are growing rapidly in both scale and duration of use, which imposes demanding requirements on energy efficiency and resilience. These requirements are becoming both prevalent and challenging, given two crucial facts: (a) the cost of powering a supercomputer grows substantially with its expanding scale, and (b) the mean time between failures of large-scale high performance computing (HPC) systems shrinks dramatically as a large number of compute nodes are interconnected into a single system. It is thus desirable to consider both dimensions when building scalable, cost-efficient, and robust HPC systems in this era. Specifically, our goal is to achieve an optimal performance-power-failure trade-off while exploiting parallelism during HPC runs.

Numerical linear algebra (LA) operations are fundamental to a wide range of HPC applications and are used extensively across science and engineering. Saving energy for LA operations therefore contributes significantly to the energy efficiency of scientific computing today. Although existing OS-level, machine-learning-based solutions are highly general and can effectively save energy for some applications in a black-box fashion, they fall short for applications with variable workloads, such as LA operations: optimal energy savings cannot be achieved because the workload prediction they rely on is potentially inaccurate and costly. We therefore propose to exploit algorithmic characteristics of LA operations to accurately predict processor idle time, i.e., slack, and thus maximize potential energy savings. When processors experience slack during HPC runs, energy can be saved by leveraging hardware power-aware techniques to scale down processor frequency during these underused execution phases. Beyond frequency scaling, we propose to decrease the supply voltage associated with a given operating frequency (undervolting) to further reduce power consumption, at the cost of increased failure rates. We leverage mainstream resilience techniques to tolerate the additional failures caused by undervolting. Our strategy is theoretically validated and empirically shown to save more energy than a state-of-the-art energy-efficient solution, while guaranteeing correctness.
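To make the mechanism concrete, the sketch below is our own illustration (not code from the talk) of how predicted slack can drive slack-directed frequency scaling combined with undervolting: the task is stretched into its available slack at the lowest feasible frequency, and a reduced supply voltage is then selected for that frequency. The operating points, runtimes, and the simple 1/f runtime model are all assumed for illustration only.

# Minimal sketch of slack-directed DVFS with undervolting.
# Given a task's predicted runtime at peak frequency and the slack predicted
# from the algorithm's structure, pick the lowest frequency that still finishes
# within the time budget, then use a reduced ("undervolted") supply voltage.
# All frequency/voltage values are hypothetical.

# Available operating points: (frequency GHz, nominal voltage V, undervolted voltage V),
# sorted from highest to lowest frequency.
OP_POINTS = [
    (2.4, 1.30, 1.15),
    (2.0, 1.20, 1.05),
    (1.6, 1.10, 0.95),
    (1.2, 1.00, 0.85),
]

F_MAX = OP_POINTS[0][0]

def choose_operating_point(task_time_at_fmax, predicted_slack):
    """Return (frequency, voltage) so the task stretches into its slack.

    Assumes a CPU-bound phase whose runtime scales roughly as 1/frequency.
    """
    budget = task_time_at_fmax + predicted_slack
    best = OP_POINTS[0]                      # default: run at peak frequency
    for freq, v_nom, v_low in OP_POINTS:
        stretched = task_time_at_fmax * F_MAX / freq
        if stretched <= budget:
            best = (freq, v_nom, v_low)      # keep the lowest feasible frequency
    freq, v_nom, v_low = best
    # Undervolting: use the reduced voltage; resilience techniques (not shown)
    # would detect and correct the soft errors this makes more likely.
    return freq, v_low

if __name__ == "__main__":
    # A task predicted to run 2.0 s at peak frequency, followed by 1.0 s of slack
    # (e.g., waiting on a communication partner in a distributed LA routine).
    f, v = choose_operating_point(task_time_at_fmax=2.0, predicted_slack=1.0)
    print("run at", f, "GHz,", v, "V")  # 1.6 GHz: 2.0 * 2.4 / 1.6 = 3.0 s <= 3.0 s budget

In this hypothetical example, the task is slowed from 2.4 GHz to 1.6 GHz and absorbs its entire slack, so total execution time is unchanged while both frequency and voltage (and hence power) are reduced.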