Scalable, Flexible Tools for Understanding the HPC Environment

Jim Brandt
Seminar

Understanding application behavior and its interactions with system software and hardware is becoming increasingly difficult as the complexity of all three components increases. Tools for understanding these behaviors and interactions, in the contexts of both failure and performance, are therefore becoming more important. In the case of failure, early detection and attribution enable a quick response, which can increase the productivity of both platform and user. In the context of performance, understanding how resources are being used can likewise increase productivity through more intelligent resource requests, allocations, and use. This talk will present work being done at Sandia on scalable, lightweight tools for HPC monitoring and analysis of all three components, as well as for feedback to drive application load balancing.