Parallel I/O performance is crucial to scientific applications on large-scale HPC systems. However, I/O load imbalance in the underlying distributed and shared storage systems can significantly reduce overall application performance. There are two conflicting challenges in mitigating this load imbalance: (i) optimizing the system-wide data placement to maximize the bandwidth advantages of distributed storage servers (i.e., allocating I/O resources efficiently across applications and job runs); and (ii) optimizing the client-centric data movement to minimize the I/O request latency between clients and servers (i.e., allocating I/O resources efficiently in service of a single application and job run).
This talk highlights the research journey towards a transparent, resource-aware load balancing framework for large-scale parallel file systems. The first part focuses on TAPP-IO, a client-based load balancing library that relies on a tunable, weighted cost function over the available storage system components. TAPP-IO transparently intercepts metadata operations at runtime and balances the workload across all available storage targets. The second part proposes iez, an "end-to-end control plane" in which clients transparently and adaptively write to a set of selected I/O servers to achieve balanced data placement. The control plane leverages real-time load information for global data placement on the distributed storage servers, while the design model leverages trace-based optimization techniques to minimize the I/O request latency between clients and servers.
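To make the idea of a tunable, weighted cost function concrete, the following is a minimal sketch of TAPP-IO-style target selection. The metric names, weights, and the `StorageTarget` structure are illustrative assumptions for this sketch, not TAPP-IO's actual API or cost model.

```python
# Hypothetical sketch: choose a storage target by a tunable, weighted
# cost function over per-target load metrics (assumed metrics, not
# TAPP-IO's real implementation).
from dataclasses import dataclass

@dataclass
class StorageTarget:
    name: str
    used_capacity: float    # fraction of capacity in use, in [0, 1]
    pending_requests: int   # outstanding I/O requests on this target

def weighted_cost(t: StorageTarget,
                  w_cap: float = 0.5,
                  w_load: float = 0.5,
                  max_pending: int = 100) -> float:
    """Tunable weighted cost: combines capacity usage and queue depth.

    The weights w_cap and w_load let an administrator bias placement
    toward free capacity or toward lightly loaded request queues.
    """
    load = min(t.pending_requests / max_pending, 1.0)
    return w_cap * t.used_capacity + w_load * load

def pick_target(targets: list[StorageTarget]) -> StorageTarget:
    # Place the next stripe/file on the target with the lowest cost,
    # analogous to balancing new writes across storage targets.
    return min(targets, key=weighted_cost)

targets = [
    StorageTarget("ost0", used_capacity=0.9, pending_requests=80),
    StorageTarget("ost1", used_capacity=0.4, pending_requests=10),
    StorageTarget("ost2", used_capacity=0.6, pending_requests=50),
]
print(pick_target(targets).name)  # -> ost1 (lowest combined cost)
```

In a real interposition library, `pick_target` would be invoked transparently when file-creation metadata operations are intercepted, so applications need no source changes.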