Containment Domains for Scalable and Efficient Resilience

Mattan Erez
Seminar

In this talk, I will present a scalable and efficient resilience scheme based on the concept of Containment Domains. Containment domains are programming and system constructs that encapsulate and express an application's resilience needs and interact with the system to tune and specialize error detection, state preservation and restoration, and recovery schemes. Containment domains have weak transactional semantics and are nested to take advantage of the machine hierarchy and to enable distributed and hierarchical state preservation, restoration, and recovery as an alternative to non-scalable and inefficient checkpoint-restart and its variants. One of the key motivations behind this work is the idea of proportionality, where the resources devoted to a feature are proportional to the needs of the application and scenario. Proportionality is critical to continued scaling and performance under the increasing constraints of bandwidth, power, and energy. Essentially, one-size-fits-all and worst-case design approaches are no longer sufficient for building reliable and efficient systems. Time permitting, I will describe additional projects in my group that enable proportional resilience and bandwidth usage in the memory system.
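
As a rough illustration of the nesting idea in the abstract, the Python sketch below shows one way a domain might preserve state on entry, run its body, check for errors, and either re-execute locally or escalate to its parent. It is not the Containment Domains API: `run_domain`, `DomainError`, the `check` callback, and the injected fault are all hypothetical names and assumptions made for this example.

```python
import copy


class DomainError(Exception):
    """Hypothetical signal that a domain could not recover locally."""


def run_domain(state, body, check, max_retries=2):
    """Run `body` on the dict `state` inside one containment domain (sketch).

    The state is preserved on entry. After the body runs, `check` acts as the
    error detector. On a detected error, or when a nested domain escalates,
    the preserved copy is restored and the body is re-executed. Once the
    retries are exhausted, the error escalates to the enclosing domain.
    """
    preserved = copy.deepcopy(state)  # state preservation
    for _ in range(max_retries + 1):
        try:
            body(state)
            if check(state):  # error detection
                return state
        except DomainError:
            pass  # a child domain escalated; recover at this level
        state.clear()
        state.update(copy.deepcopy(preserved))  # state restoration
    raise DomainError("escalating to parent domain")


faults = [True]  # one transient fault to inject (assumption for the demo)


def leaf_body(state):
    state["x"] += 1
    if faults:
        faults.pop()
        state["x"] = -999  # simulate a corrupted result


def leaf_check(state):
    return state["x"] >= 0


def parent_body(state):
    # The child domain recovers from the transient fault on its own,
    # so the parent never has to roll back.
    run_domain(state, leaf_body, leaf_check)
    state["y"] = state["x"] * 2


result = run_domain({"x": 1, "y": 0}, parent_body, lambda s: True)
```

In this sketch the injected fault is contained by the inner domain, which restores its own preserved copy and re-executes; the outer domain only gets involved if the inner one exhausts its retries. This contrasts with a global checkpoint-restart, where any error would roll back the whole computation.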