Description: Building efficient and scalable system software for large-scale systems, especially for performance analysis and monitoring, is increasingly important both for the developers of parallel applications and for the designers of next-generation HPC systems. However, conventional performance tools suffer from significant time and space overhead due to ever-increasing problem sizes and system scales. In contrast, the cost of source code analysis is independent of the problem size and system scale, which makes it very appealing for large-scale performance analysis. Inspired by this observation, we have designed a series of lightweight system software tools for HPC systems, such as a memory access monitoring tool, a performance variance detection tool, and a communication trace compression tool. In this talk, I will share our experience in building these tools by combining static analysis and runtime analysis, and I will also point out the main challenges in this direction.
Bio: Jidong Zhai is a Tenured Associate Professor in the Computer Science Department of Tsinghua University. He is a recipient of the Siebel Scholar award, the CCF Outstanding Doctoral Dissertation Award, and the NSFC Young Career Award. He was a Visiting Professor at Stanford University (2015-2016) and a Visiting Scholar at MSRA (Microsoft Research Asia) in 2013. His research interests include high performance computing, performance evaluation, compilers, and heterogeneous computing. He has published more than 40 papers in prestigious refereed conferences and top journals, including SC, PPOPP, ASPLOS, ICS, ATC, MICRO, IEEE TPDS, and IEEE TC. His research was a Best Paper Finalist at SC’14. He is the advisor of the Tsinghua Student Cluster Team. The team led by him has won 9 international championships in student supercomputing challenges at SC, ISC, and ASC. In 2015 and 2018, the team swept all three championships at SC, ISC, and ASC. He was a program co-chair of NPC 2018 and a program co-chair of the ICPP PASA 2015 workshop. He has served or is now serving as a TPC member of SC, ICS, PPOPP, ICPP, NAS, LCPC, and Euro-Par. He is the general secretary of ACM SIGHPC China. He is currently on the editorial board of IEEE Transactions on Parallel and Distributed Systems.