An Integrated Resource Management and Scheduling Framework for Production Supercomputers

Dr. Wei Tang
Seminar

Job scheduling is essential on large-scale computing systems. While research on scheduling has been conducted for decades, our work is motivated by emerging practical issues observed in current production supercomputers, which stem from factors such as human behavior, new workload characteristics, and increasing system complexity. Further, several challenges have been identified in building extreme-scale supercomputers, such as reliability, I/O performance, and energy efficiency, which, from our perspective, can be mitigated by appropriate job scheduling strategies. In this talk, Dr. Tang will introduce an integrated resource management and scheduling framework that aims to address these issues and challenges in resource management for large-scale production supercomputers. In this work, he has designed a set of new schemes, implemented them in a production resource manager named Cobalt, and evaluated them with real job traces from the production Blue Gene/P system at Argonne National Laboratory. Experimental results show that the schemes can effectively improve job scheduling with respect to both user satisfaction and system utilization.