Intelligent Job Scheduling on HPC Systems

Yuping Fan, Ph.D., Illinois Institute of Technology
Seminar

The job scheduler is a crucial component of high-performance computing (HPC) systems. It plays an important role in both the efficient use of system resources and user satisfaction. Existing HPC job schedulers typically rely on simple heuristics to schedule jobs. However, the rapid growth in system infrastructure and the introduction of diverse workloads pose serious challenges to these traditional heuristic approaches. We propose an intelligent scheduling framework that leverages machine learning and optimization techniques to address these emerging challenges.

In this framework, we design a deep reinforcement learning agent that automatically learns efficient workload- and system-specific scheduling policies. In addition, we conduct a comprehensive analysis of allocating hybrid workloads (i.e., on-demand, rigid, and malleable jobs) within a single HPC system. Future work includes extending the intelligent scheduler to multi-agent reinforcement learning for scheduling across diverse resources, and optimizing the scheduling of deep learning workloads.