Performance Observability of Service Architectures and Their Integration in High Performance Computing Workflows

Traditionally, High Performance Computing (HPC) software has been built and deployed as bulk-synchronous, parallel executables based on the Message Passing Interface (MPI) programming model. The rise of data-oriented computing paradigms and an explosion in the variety of applications that need to be supported on HPC platforms have forced a rethink of the programming and execution models needed to integrate this new functionality. Service-oriented architectures and a broader class of in situ workflows mark a paradigm shift in HPC software development methodologies, one that has enabled a range of new applications, from user-level data services to machine learning (ML) workflows that run alongside traditional scientific simulations. In this research work, we propose techniques and accompanying tools to enable the performance observability and monitoring of in situ HPC workflows that involve distributed services. Conversely, we also demonstrate the value of deploying performance monitoring and visualization as shared services within an in situ workflow.

The results from this work suggest that: (1) integration of performance data from different sources is vital to understanding the performance of service components, (2) the in situ (online) analysis of this performance data is needed to enable adaptivity of distributed components, and (3) services are a promising architecture choice for deploying in situ performance monitoring and visualization functionality.

Argonne Leadership Computing Facility

04/20/2022, 3 – 4pm CT