Fault Tolerance in Multicore Clusters. Techniques to Balance Performance and Dependability
In High Performance Computing (HPC), the demand for more performance is satisfied by increasing the number of components. As HPC applications have grown in scale, the number of interruptions caused by hardware failures has grown with them. The marked decrease in Mean Time Between Failures (MTBF) in current systems motivates research into suitable Fault Tolerance (FT) solutions that can guarantee the successful completion of parallel applications. When executing applications on HPC systems, we aim to improve performance despite the failures that may affect those systems. Our research focuses on analyzing and reducing the impact of scalable FT techniques based on rollback-recovery (e.g., uncoordinated checkpointing). Because message logging is normally the main source of overhead in uncoordinated checkpoint approaches, we concentrate in particular on current pessimistic receiver-based message logging techniques. Taking into account the advent of multicore machines, our main contributions aim to make efficient use of the parallel environment by considering the interaction between application processes and fault tolerance tasks. The main contributions of this research are described below.
 “Managing Receiver-Based Message Logging Overheads in Parallel Applications,” XIX Congreso Argentino de Ciencias de la Computación. Mar del Plata, Argentina, pp. 204–213, 2013.
 H. Meyer, R. Muresano, D. Rexachs, and E. Luque, “Tuning SPMD Applications in order to Increase Performability,” The 11th IEEE International Symposium on Parallel and Distributed Processing with Applications, Melbourne, Australia, pp. 1170–1178, 2013.
 “A Framework to write Performability-Aware SPMD Applications,” The 2013 International Conference on Parallel and Distributed Processing Techniques and Applications, Las Vegas, USA, pp. 350–356, 2013.