Batsim: a Realistic Language-Independent Resources and Jobs Management Systems Simulator + HPC and Big Data Scheduling Convergence

Michael Mercier
Seminar

As large-scale computation systems grow toward exascale, Resources and Jobs Management Systems (RJMS) need to evolve to handle this change in scale. However, they are difficult to study because they are critical production systems, where experimenting is extremely costly due to downtime and energy costs. Meanwhile, many scheduling algorithms emerging from theoretical studies have not been transferred to production tools for lack of realistic experimental validation. To tackle these problems we propose Batsim, an extensible, language-independent and scalable RJMS simulator. It allows researchers and engineers to test and compare any scheduling algorithm through a simple event-based communication interface that supports different levels of realism.

In this seminar we will present how Batsim works and how to use it, followed by our experiment showing that Batsim's behavior matches that of the real RJMS OAR. Our evaluation process was designed with reproducibility in mind, and all the experiment material is freely available. Finally, I will present my PhD research topic, HPC and Big Data Scheduling Convergence, that is, how to make RJMS and Big Data schedulers interact properly.