Improving Reliability and QoS in CIFTS

Ziming Zheng
Seminar

Abstract: As system size and complexity continue to grow, fault management has become a critical challenge for high-end computing. Increasing attention is being paid to coordination among multiple software components as a way to enhance system-wide resilience. The Coordinated Infrastructure for Fault Tolerant Systems (CIFTS) enables system software components to share fault information with each other and adapt to faults in a holistic manner. However, CIFTS does not currently provide a reliable messaging mechanism, which is critical for the correctness, effectiveness, and robustness of coordination among multiple software components.

In this talk, we first describe the reliability mechanisms of other messaging frameworks and discuss the requirements and challenges of reliable messaging in CIFTS. We then introduce the design of a reliable (guaranteed-delivery) messaging mechanism for the Fault Tolerance Backplane (FTB), and discuss potential extensions to the FTB API to support reliable event publishing. Our mechanism does not block publishers, which makes reliable messaging simple and efficient. Finally, preliminary experimental results show that our mechanism improves the reliability of CIFTS while imposing low overhead on system software components.

Bio: Ziming Zheng is a PhD candidate at the Illinois Institute of Technology. He received his BS and MS degrees from the College of Computer Science and Engineering at the University of Electronic Science and Technology of China. His research interest is fault resilience for large-scale systems. He was an intern at Argonne National Laboratory and Oak Ridge National Laboratory, working on the CIFTS project. More details about Ziming Zheng are available at http://www.iit.edu/~zzheng11/.