
Cappello Gives Tutorial on Resilience for HPC

Franck Cappello co-presented a tutorial on resilience for high-performance computing on July 1 at the 46th annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2016).

Cappello, a senior computer scientist in Argonne’s Mathematics and Computer Science Division and project manager of research on resilience at the extreme scale, has long been interested in the problem of fault management in high-performance computing. 

“Faults will be an inevitable – and frequent – part of the computing environment on emerging exascale platforms,” said Cappello. “Vendors, software developers, and high-performance application users all need to acquire a deeper understanding of faults, their consequences and an awareness of the various techniques to mitigate them.”

Cappello and his colleague George Bosilca of the University of Tennessee addressed these needs in their tutorial titled “Understanding Fault Management in HPC.” The tutorial included four main topics:

  • An overview of failure types (software/hardware, transient/fail-stop) observed in the field and typical probability distributions (e.g., exponential, log-normal) used to model the interarrival times of failures
  • General-purpose techniques, including several fault tolerance protocols, replication, prediction, and silent error detection
  • Application-specific techniques, such as algorithm-based fault tolerance for grid-based algorithms or fixed-point convergence for iterative applications
  • Practical deployment of fault-tolerant techniques in a hands-on session, including examples based on computational solver routines with a mix of traditional and advanced recovery techniques
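
To make the first topic concrete, the sketch below draws failure interarrival times from the two distributions named above, exponential and log-normal, both set to the same mean time between failures. It is an illustration only and is not taken from the tutorial material; the function names, the 10-hour mean, and the log-normal shape parameter are assumptions chosen for the example.

```python
import math
import random


def exponential_interarrivals(mtbf_hours, n, rng):
    """Draw n interarrival times (hours) from an exponential distribution
    whose mean is mtbf_hours (rate = 1 / mean)."""
    return [rng.expovariate(1.0 / mtbf_hours) for _ in range(n)]


def lognormal_interarrivals(mtbf_hours, sigma, n, rng):
    """Draw n interarrival times (hours) from a log-normal distribution
    whose mean is mtbf_hours.

    The mean of a log-normal is exp(mu + sigma**2 / 2), so mu is chosen
    as ln(mean) - sigma**2 / 2 to match the requested mean.
    """
    mu = math.log(mtbf_hours) - sigma ** 2 / 2.0
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


def sample_mean(xs):
    """Arithmetic mean of a non-empty list."""
    return sum(xs) / len(xs)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    mtbf = 10.0             # assumed mean time between failures, in hours
    exp_times = exponential_interarrivals(mtbf, 20000, rng)
    logn_times = lognormal_interarrivals(mtbf, 1.0, 20000, rng)
    print("exponential sample mean:", sample_mean(exp_times))
    print("log-normal sample mean:", sample_mean(logn_times))
```

Both samples share the same mean, but the log-normal one has a heavier tail, so the same average failure rate can hide much longer quiet periods and tighter bursts of failures.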

Cappello noted that for a long time, the field of recovery techniques was centered on rollback recovery approaches, usually based on simple coordinated checkpoint/restart. Newly formulated general-purpose methods, advanced fault-tolerant protocols, algorithm-based fault tolerance, programming models for fault tolerance, and various verification mechanisms for silent error detection and correction now provide a much richer environment to choose from.
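
For readers unfamiliar with rollback recovery, the following is a minimal single-process sketch of the checkpoint/restart idea: state is saved periodically, and after a failure the computation resumes from the last saved state rather than from the beginning. It is not the tutorial's code and does not show coordination across processes, which a real HPC implementation would need. The file name, the toy workload (summing step indices), and the `SimulatedFailure` exception are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


class SimulatedFailure(Exception):
    """Hypothetical stand-in for a fail-stop error."""


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it over the old
    checkpoint so a crash mid-write cannot corrupt the last good copy."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, checkpoint_every, fail_at=None):
    """Toy workload: add up the step indices 0..n_steps-1, saving a
    checkpoint every checkpoint_every completed steps."""
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise SimulatedFailure(f"injected failure at step {fail_at}")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]


def run_with_restart(n_steps=10, checkpoint_every=3, fail_at=7):
    """Run once with an injected failure, then restart from the last
    checkpoint; work done after that checkpoint is redone."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "ckpt.pkl")
        try:
            return run(path, n_steps, checkpoint_every, fail_at)
        except SimulatedFailure:
            return run(path, n_steps, checkpoint_every, fail_at=None)
```

With the default arguments, the failure strikes at step 7, the restart resumes from the checkpoint taken after step 6, and the final result matches a failure-free run. The cost of the approach is visible here too: any work completed since the last checkpoint is lost and must be repeated.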

“By identifying the qualitative benefits and drawbacks of these different methods, we hope to enable the attendees to determine, integrate, and adapt the technique that best suits their application or platform,” said Cappello.

DSN 2016 is an international forum for presenting research results, problem solutions, practices, and insights on new challenges in the field of dependable computing and security. The 2016 meeting took place in Toulouse, France, June 28 to July 1. For further information, see the website.