Keeping Science on Keel when Software Errs or Moves

Ganesh L. Gopalakrishnan, University of Utah
Seminar

Significant investments are made in the creation and maintenance of high-performance computing software. Over the lifetime of this software, the underlying computing hardware keeps changing, especially with "End of Moore's Law" heterogeneity. Unfortunately, moving software across these changes can introduce bugs, change computed numerical answers, or increase susceptibility to soft errors.

Through collaborations with National Labs, our group has developed solutions addressing these contingencies, embodied in the following tools. The first three below are available from PRUNERS/GitHub and the rest are under active development. My talk will give an overview of all of these tools and then take a deep dive into one of them, chosen by audience interest (default: DiffTrace, on which I am hoping for feedback).

  • Archer (IPDPS'16): State-of-the-art OpenMP race checker that is in production use.
  • Sword (IPDPS'18): A memory-efficient race checker for OpenMP.
  • FLiT (HPDC'19): A tool that can diagnose why a compiler optimization changes answers unacceptably.
  • FailAmp: A facility that helps reduce the extent of simulation state corruption following certain types of soft errors.
  • FPDetect: A novel approach to detect soft errors in the data space through rigorous floating-point roundoff analysis.
  • DiffTrace: A Pin-based facility that collects whole-program function-call traces for offline analysis.
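
As a small illustration of why a compiler optimization can change computed answers (the problem FLiT targets), the sketch below shows that floating-point addition is not associative, so an optimization that reorders a sum can alter the result. This is not FLiT's algorithm or interface; the variable names are hypothetical and chosen only for this example.

```python
# Illustration only: this is NOT how FLiT works internally.
# It shows that regrouping a floating-point sum, as an aggressive
# compiler optimization may do, can change the computed value.

# Same three operands, two different groupings.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# The two results differ in the last bits of the mantissa.
print(left_to_right == right_to_left)  # False
print(repr(left_to_right), repr(right_to_left))
```

The difference here is a single rounding error, but in a long-running simulation such discrepancies can accumulate or be amplified, which is why it helps to pinpoint which optimization introduced them.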

Miscellaneous Information: 

This seminar will be streamed. See details at https://anlpress.cels.anl.gov/cels-seminars/
