Fault Tolerance in Multicore Clusters. Techniques to Balance Performance and Dependability

  • Hugo Meyer, Computer Architecture and Operating Systems Department (CAOS), Universitat Autònoma de Barcelona, Barcelona, Spain

Abstract

In High Performance Computing (HPC), the demand for more performance is satisfied by increasing the number of components. With the growing scale of HPC applications has come an increase in the number of interruptions caused by hardware failures. The remarkable decrease in the Mean Time Between Failures (MTBF) of current systems encourages research into suitable Fault Tolerance (FT) solutions that make it possible to guarantee the successful completion of parallel applications. When executing applications on HPC systems, we aim to improve performance despite the failures that may affect those systems. Our research focuses on analyzing and reducing the impact of scalable FT techniques based on rollback-recovery (e.g., uncoordinated checkpointing). As message logging is normally the main source of overhead in uncoordinated checkpoint approaches, we concentrate in particular on current pessimistic receiver-based message logging techniques. Taking into account the advent of multicore machines, our main contributions aim to make efficient use of the parallel environment by considering the interaction between application processes and fault tolerance tasks. The main contributions of this research are described below.

Published
2016-04-01
How to Cite
Meyer, H. (2016). Fault Tolerance in Multicore Clusters. Techniques to Balance Performance and Dependability. Journal of Computer Science and Technology, 16(01), pp. 59-60. Retrieved from https://journal.info.unlp.edu.ar/JCST/article/view/511
Section
Thesis Overview