Optimizing I/O for Large-Scale Scientific Applications using I/O Delegation

Arifa Nisar
Seminar

Ever-increasing research and development in science and engineering relies on I/O- and data-intensive simulations and/or analysis of observational data, requiring the use of high-performance computing (HPC) systems. Traditional
file system and storage system interfaces are designed to handle worst-case scenarios for conflicts, synchronization, locking, coherence checks, and related issues, which adversely affects I/O performance. In
many cases, the problem is not insufficient I/O capacity or bandwidth, but rather the excessive synchronization at the I/O layer of the accesses generated by massively parallel applications.

We have designed and implemented an I/O delegate system that uses a subset of processes to carry out the I/O tasks for an application. By placing the I/O system close to the applications and allowing the applications to pass high-level data access information, the I/O system has more opportunity to provide better performance. One of the most important features of the I/O delegate system is that it allows communication among delegates and enables them to collaborate on further optimizations, such as collaborative caching, I/O aggregation, load balancing, and request alignment. We achieved I/O bandwidth improvements ranging from 25% to 260% by allocating an additional 2-3% of compute nodes as I/O delegate nodes.
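
As a rough illustration of the delegation idea (a sketch, not the implementation described above), the MPI program below reserves the last rank(s) of a job as I/O delegates: compute ranks ship each write request, i.e. a buffer and a file offset, to an assigned delegate, which performs the file access on their behalf. DELEGATE_COUNT, CHUNK, and the file name delegated.out are arbitrary placeholders chosen for the example.

    /* Minimal sketch of I/O delegation with MPI, assuming a hypothetical
     * layout in which the last DELEGATE_COUNT ranks act as I/O delegates
     * and the remaining ranks are compute processes. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    #define DELEGATE_COUNT 1          /* in practice, a few percent of the nodes */
    #define CHUNK          (1 << 20)  /* 1 MiB write request per compute rank    */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int ncompute    = nprocs - DELEGATE_COUNT;
        int is_delegate = (rank >= ncompute);

        /* Split into compute and delegate communicators so delegates can later
         * collaborate (caching, aggregation, load balancing, alignment). */
        MPI_Comm role_comm;
        MPI_Comm_split(MPI_COMM_WORLD, is_delegate, rank, &role_comm);

        if (!is_delegate) {
            /* Compute process: instead of writing directly, ship the buffer
             * and its file offset to the responsible delegate. */
            char *buf = malloc(CHUNK);
            memset(buf, 'a' + rank % 26, CHUNK);
            long offset = (long)rank * CHUNK;
            int  target = ncompute + rank % DELEGATE_COUNT;

            MPI_Send(&offset, 1, MPI_LONG, target, 0, MPI_COMM_WORLD);
            MPI_Send(buf, CHUNK, MPI_BYTE, target, 1, MPI_COMM_WORLD);
            free(buf);
        } else {
            /* Delegate process: collect requests from its compute clients and
             * perform the actual file I/O on their behalf.  A real delegate
             * would also cache, coalesce, and align the requests here. */
            MPI_File fh;
            MPI_File_open(role_comm, "delegated.out",
                          MPI_MODE_CREATE | MPI_MODE_WRONLY,
                          MPI_INFO_NULL, &fh);

            int my_clients = 0;
            for (int c = 0; c < ncompute; c++)
                if (ncompute + (c % DELEGATE_COUNT) == rank) my_clients++;

            char *buf = malloc(CHUNK);
            for (int i = 0; i < my_clients; i++) {
                long offset;
                MPI_Status st;
                MPI_Recv(&offset, 1, MPI_LONG, MPI_ANY_SOURCE, 0,
                         MPI_COMM_WORLD, &st);
                MPI_Recv(buf, CHUNK, MPI_BYTE, st.MPI_SOURCE, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_File_write_at(fh, offset, buf, CHUNK, MPI_BYTE,
                                  MPI_STATUS_IGNORE);
            }
            free(buf);
            MPI_File_close(&fh);
        }

        MPI_Comm_free(&role_comm);
        MPI_Finalize();
        return 0;
    }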

Using the I/O delegate system as the basic infrastructure, we have developed a method that assigns disjoint file regions to the I/O delegates so that lock conflicts and overlapping accesses are resolved within the delegate system.
For example, because the Lustre file system uses a server-based locking protocol, we configure the I/O delegates with a one-to-one or one-to-many mapping to the I/O servers. This strategy of persistent pairing between servers and clients reduces the number of lock acquisitions to only one, eliminates lock contention altogether, enables more effective data prefetching, and lowers cache coherence control overhead. By allocating only an additional one-eighth of the
compute nodes as I/O delegates, we achieved I/O bandwidth improvements ranging from 100% to 10,000% for applications using MPI independent I/O operations.
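
To make the disjoint-region idea concrete, the sketch below shows one plausible partitioning rule, assuming (hypothetically) a file striped round-robin across the I/O servers and a one-to-one delegate-to-server mapping; STRIPE_SIZE, STRIPE_COUNT, and the routing functions are illustrative placeholders, not parameters or code from the work above.

    /* Sketch of a file-domain partitioning rule for disjoint delegate regions,
     * assuming a Lustre-style file striped round-robin across STRIPE_COUNT
     * I/O servers with stripe size STRIPE_SIZE, and one delegate per server.
     * Because every byte a given delegate touches lives on "its" server, each
     * delegate's lock never conflicts with those of its peers. */
    #include <stdio.h>
    #include <stdint.h>

    #define STRIPE_SIZE  (1 << 20)   /* 1 MiB stripes, an assumed configuration */
    #define STRIPE_COUNT 8           /* 8 I/O servers and therefore 8 delegates */

    /* Which server (and hence which delegate) owns this file offset? */
    static int owner_delegate(uint64_t offset)
    {
        return (int)((offset / STRIPE_SIZE) % STRIPE_COUNT);
    }

    /* Given a client request [offset, offset+len), split it along stripe
     * boundaries and route each piece to the delegate that owns it. */
    static void route_request(uint64_t offset, uint64_t len)
    {
        while (len > 0) {
            uint64_t in_stripe = STRIPE_SIZE - offset % STRIPE_SIZE;
            uint64_t piece     = in_stripe < len ? in_stripe : len;
            printf("bytes [%llu, %llu) -> delegate %d\n",
                   (unsigned long long)offset,
                   (unsigned long long)(offset + piece),
                   owner_delegate(offset));
            offset += piece;
            len    -= piece;
        }
    }

    int main(void)
    {
        /* A 3.5 MiB request starting mid-stripe spans four stripes and is
         * split among the four delegates that own those stripes. */
        route_request(512 * 1024, 3 * STRIPE_SIZE + STRIPE_SIZE / 2);
        return 0;
    }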