Abstract: Compute-intensive machine learning (ML) applications are becoming one of the most popular workloads running atop cloud infrastructure. To meet the growing demand for ML workloads, cloud service providers such as Amazon AWS, Microsoft Azure, and Google Cloud offer a specialized service called machine learning as a service (MLaaS). While training and serving ML applications using MLaaS, practitioners face the challenge of tuning application-specific hyperparameters such as batch size, learning rate, and the choice of optimization algorithm. In addition, developers must manage various system-level parameters, such as the number of training nodes, the communication topology during training, the instance type, and the number of serving nodes, to meet service-level objective (SLO) requirements under bursty inference workloads. This talk will discuss the key challenges in existing ML systems and present high-performing, efficient ML systems that speed up training and inference tasks while enabling automated and robust system management.
Please use this link to attend the virtual seminar: