High availability for parallel computers

Authors

  • Dolores Rexachs del Rosario, Computer Architecture and Operating System Department, Universidad Autónoma de Barcelona, Barcelona 08193, Spain
  • Emilio Luque Fadón, Computer Architecture and Operating System Department, Universidad Autónoma de Barcelona, Barcelona 08193, Spain

Keywords:

Fault tolerance, Availability, RADIC, Transient faults, Performability

Abstract

Fault tolerance has become an important issue for parallel applications in recent years. Users of parallel systems expect them to be reliable along two main dimensions: availability and data consistency. Availability can be provided by solutions such as RADIC, a fault-tolerant architecture with different protection levels that offers high availability with transparency, decentralization, flexibility, and scalability for message-passing systems. Transient faults may cause an application running on a computer system to be removed from execution; however, their biggest risk is undetected data corruption that changes the final result of the application without anyone knowing. To evaluate the effects of transient faults on the robustness of applications and to validate new fault detection mechanisms and strategies, we have developed a full-system simulation fault injection environment.

References

[1] Argollo, E., Falcón, A., Faraboschi, P., Monchiero, M., & Ortega, D.: COTSon: infrastructure for full system simulation. SIGOPS Oper. Syst. Rev., Vol. 43 (Ed 1), pp. 52-61, 2009.
[2] Bouteiller A., Herault T., Krawezik G., Lemarinier P., and Cappello F.: MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI. Int. J. High Perform. Comput. Appl. Vol. 20, no.3, pp. 319-333, 2006.
[3] Chakravorty, S., Mendes, C., and Kale, L.V.: Proactive fault tolerance in large systems. HPCRI Workshop in conjunction with HPCA 2005, pp. 363-372, 2005.
[4] Duarte, A., Rexachs, D., Luque, E.: Increasing the cluster availability using RADIC. Cluster Computing, 2006 IEEE International Conference on, pp. 1-8, 2006.
[5] Duarte, A., Rexachs, D., Luque, E.: An Intelligent Management of Fault Tolerance in Cluster Using RADICMPI. LNCS Vol. 4192, Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp. 150-157, 2006.
[6] Elnozahy E., Alvisi L., Wang Y., and Johnson D.: A Survey of Rollback-Recovery Protocols in Message Passing Systems. ACM Computing Surveys, vol. 34, no. 3, pp. 375-408, 2002.
[7] Fialho L., Santos G., Duarte, A., Rexachs, D., Luque, E.: Challenges and Issues of the Integration of RADIC into Open MPI. LNCS Vol. 5759, Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp. 73-83, 2009.
[8] Fialho L., Duarte, A., Rexachs, D., Luque, E.: Outcomes of the Fault Tolerance Configuration. CACIC 2009.
[9] Engelmann C. and Geist A. Development of naturally fault tolerant algorithms for computing on 100,000 processors. http://www.csm.ornl.gov/~geist. 2002
[10] Gropp, W., Lusk, E.: Fault Tolerance in Message Passing Interface Programs. Int. J. High Perform. Comput. Appl. 18(3), pp. 363–372, 2004.
[11] Kalaiselvi S. and Rajaraman V.: A survey of checkpointing algorithms for parallel and distributed computers. Sadhana, vol. 25, no. 5, pp. 489-510, 2000.
[12] Mukherjee, S. S., Emer, J., & Reinhardt, S. K.: The Soft Error Problem: An Architectural Perspective. HPCA '05: Proceedings of the 11th International Symposium on High-Performance Computer Architecture, pp. 243-247, 2005.
[13] Nagaraja, K., Gama, G., Bianchini, R., Martin, R. P., Meira Jr., W., and Nguyen: Quantifying the Performability of Cluster-Based Services. IEEE Trans. Parallel Distrib. Syst. 16, 5, pp. 456-467, 2005.
[14] Santos G., Duarte, A., Rexachs, D., Luque, E.: Providing Non-stop Service for Message-Passing Based Parallel Applications with RADIC. LNCS Vol. 5168, Euro-Par 2008, pp. 58-67, 2008.

Published

2010-10-01

How to Cite

Rexachs del Rosario, D., & Luque Fadón, E. (2010). High availability for parallel computers. Journal of Computer Science and Technology, 10(03), p. 110–116. Retrieved from https://journal.info.unlp.edu.ar/JCST/article/view/697

Section

Invited Articles
