A tool for detecting transient faults in execution of parallel scientific applications on multicore clusters

Authors

  • Diego Miguel Montezanti III-LIDI, School of Computer Science, Universidad Nacional de La Plata, La Plata (Buenos Aires), Argentina
  • Enzo Rucci III-LIDI, School of Computer Science, Universidad Nacional de La Plata, La Plata (Buenos Aires), Argentina
  • Dolores Rexachs del Rosario Department of Computer Architecture and Operating Systems, Universitat Autònoma de Barcelona, Campus UAB, Barcelona, Spain
  • Emilio Luque Fadón Department of Computer Architecture and Operating Systems, Universitat Autònoma de Barcelona, Campus UAB, Barcelona, Spain
  • Marcelo Naiouf III-LIDI, School of Computer Science, Universidad Nacional de La Plata, La Plata (Buenos Aires), Argentina
  • Armando Eduardo De Giusti III-LIDI, School of Computer Science, Universidad Nacional de La Plata, La Plata (Buenos Aires), Argentina

Keywords:

Transient fault, Parallel scientific application, Soft error detection tool, Message content validation

Abstract

Transient faults are becoming a critical concern in the design of current general-purpose multiprocessors. Because they can corrupt program outputs, their impact is especially significant for long-running parallel scientific applications, given the high cost of relaunching execution from the beginning when results are incorrect. This paper introduces the SMCV tool, which improves reliability for high-performance systems. SMCV replicates application processes and validates the contents of messages before they are sent, preventing errors from propagating to other processes and limiting the latency of detection and notification. To assess its utility, the overhead of the SMCV tool is evaluated with three computationally intensive, representative parallel scientific applications. The results obtained demonstrate the efficiency of the SMCV tool in detecting transient fault occurrences.

Published

2014-04-01

How to Cite

Montezanti, D. M., Rucci, E., Rexachs del Rosario, D., Luque Fadón, E., Naiouf, M., & De Giusti, A. E. (2014). A tool for detecting transient faults in execution of parallel scientific applications on multicore clusters. Journal of Computer Science and Technology, 14(01), p. 32–38. Retrieved from https://journal.info.unlp.edu.ar/JCST/article/view/586

Section

Original Articles
