SLIDE 95
Bibliography
Modeling of System-Level Checkpointing
[BBB+12b] G. Bosilca, A. Bouteiller, E. Brunet, F. Cappello, J. Dongarra, A. Guermouche, T. Herault, Y. Robert, F. Vivien, and D. Zaidouni. Unified model for assessing checkpointing protocols at extreme-scale. Technical report, Innovative Computing Laboratory, University of Tennessee, Jun. 2012.
[BCD+13] A. Bouteiller, F. Cappello, J. Dongarra, A. Guermouche, T. Herault, and Y. Robert. Multi-criteria checkpointing strategies: optimizing response-time versus resource utilization. Technical report, Innovative Computing Laboratory, University of Tennessee, Feb. 2013.
Checkpoint on Failure
[BDB+12] W. Bland, P. Du, A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra. A checkpoint-on-failure protocol for algorithm-based recovery in standard MPI. Aug. 2012.