Robust System Design

Event Sponsor: Mathematics and Computing Science Seminar
Start Date: May 5 2015 - 2:00pm
Building 240/Room 1406-1407
Argonne National Laboratory
Speaker(s): Yanjing Li
Speaker(s) Title: Intel Labs and Stanford University
Marc Snir

Malfunctions in electronic systems can have major consequences, ranging from loss of data and services to financial and productivity losses and even loss of human life. Such impacts continue to increase as systems become more complex, interconnected, and pervasive. Hardware failures are a particularly fast-growing concern because:
1. Existing test and validation methods barely cope with today’s complexity. New techniques will be essential to minimize the effects of defects and design flaws.
2. In coming generations of silicon technologies, several failure mechanisms that were largely benign in the past are becoming visible at the system level. A large class of future systems will need to tolerate hardware errors during operation.

Robust system design is required to ensure that future electronic systems, from supercomputers all the way to embedded systems, perform correctly despite rising levels of complexity and disturbances. Traditional fault-tolerant computing techniques are generally very expensive, and often inadequate, for this purpose.

In this talk, I will present a new online self-test and diagnostics technique, called CASP, that is essential for robust system design. CASP enables a system to test itself thoroughly during normal operation so that hardware failures are detected and localized quickly. CASP achieves high coverage across a wide variety of test coverage metrics (96-99.5%) while incurring only 1% area and power costs and a 3% performance cost. In contrast, existing techniques suffer from low coverage (e.g., 70%), high area costs (e.g., 20%), or significant performance penalties (e.g., 30%), including possible system unresponsiveness. A key aspect of our approach is the orchestration across multiple abstraction layers: physical design, architecture, and system software. I will demonstrate the effectiveness and practicality of our technique using results from the industrial OpenSPARC T2 multi-core design and an Intel hardware platform.

Miscellaneous Information: 

Short Bio: Yanjing Li is a senior research scientist at Intel Labs and a visiting scholar at Stanford University. She received her Ph.D. in Electrical Engineering from Stanford University. Her research interests include robust system design, energy-efficient systems, system validation and test, computer architecture, and system software. Dr. Li received the European Design and Automation Association Outstanding Dissertation Award, the IEEE International Test Conference Best Student Paper Award, and the IEEE VLSI Test Symposium Best Paper Award for novel research on robust system design, and three Intel Divisional Recognition Awards for mobile processor designs that have been adopted by product groups at Intel.