H-RADIC: A Fault Tolerance Framework for Virtual Clusters on Multi-Cloud Environments

  • Ambrosio Royo CAOS – Computer Architecture and Operating Systems, Universitat Autònoma de Barcelona, Bellaterra (Cerdanyola del Vallès), Barcelona 08193, Spain
  • Jorge Villamayor CAOS – Computer Architecture and Operating Systems, Universitat Autònoma de Barcelona, Bellaterra (Cerdanyola del Vallès), Barcelona 08193, Spain
  • Marcela Castro-León CAOS – Computer Architecture and Operating Systems, Universitat Autònoma de Barcelona, Bellaterra (Cerdanyola del Vallès), Barcelona 08193, Spain
  • Dolores Rexachs CAOS – Computer Architecture and Operating Systems, Universitat Autònoma de Barcelona, Bellaterra (Cerdanyola del Vallès), Barcelona 08193, Spain
  • Emilio Luque CAOS – Computer Architecture and Operating Systems, Universitat Autònoma de Barcelona, Bellaterra (Cerdanyola del Vallès), Barcelona 08193, Spain
Keywords: Cloud, Fault-Tolerance, High-Performance Computing, RADIC

Abstract

Although cloud platforms promise to be reliable, several availability incidents show that they are not always so. How can we be sure that a parallel application finishes its execution even if a site is affected by a failure? This paper presents H-RADIC, an approach based on the RADIC architecture that executes parallel applications protected by RADIC in at least three different virtual clusters or sites. The execution state of each site is saved periodically at another site and is recovered in case of failure. The paper details the configuration of the architecture and the experimental results obtained with three clusters running NAS parallel applications protected with DMTCP, a well-known distributed multi-threaded checkpointing tool. Our experiments show that adding a cluster protector makes it possible to implement the next level in the hierarchy, where the first level in the RADIC hierarchy works as an observer at the site level. In addition, the experiments showed that the protection implementation is out of the critical path of the application and that it depends on the resources used.
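The cross-site protection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the ring-shaped assignment of protectors, the `CheckpointStore` class, and the site names are all assumptions made for illustration, and checkpoints are stand-in strings rather than DMTCP images.

```python
from typing import Dict, List, Optional


def assign_protectors(sites: List[str]) -> Dict[str, str]:
    """Map each site to the site that holds its checkpoints.

    Assumption: a ring layout (each site is protected by the next one).
    The abstract only states that state is saved "in another site".
    """
    if len(sites) < 3 or len(set(sites)) != len(sites):
        raise ValueError("need at least 3 distinct sites")
    return {site: sites[(i + 1) % len(sites)] for i, site in enumerate(sites)}


class CheckpointStore:
    """Hypothetical in-memory model of per-site checkpoint storage."""

    def __init__(self, sites: List[str]) -> None:
        self.protectors = assign_protectors(sites)
        # _held[p][s] is the latest checkpoint of site s kept at protector p.
        self._held: Dict[str, Dict[str, str]] = {site: {} for site in sites}

    def save(self, site: str, state: str) -> None:
        """Store the latest state of `site` at its protector site."""
        self._held[self.protectors[site]][site] = state

    def recover(self, failed_site: str) -> Optional[str]:
        """Return the last saved state of a failed site, or None if none exists."""
        return self._held[self.protectors[failed_site]].get(failed_site)


store = CheckpointStore(["site-a", "site-b", "site-c"])
store.save("site-a", "ckpt-1")
store.save("site-a", "ckpt-2")  # a periodic save replaces the older checkpoint
```

Here a failure of `site-a` would be handled by calling `store.recover("site-a")`, which reads the most recent checkpoint from the protecting site. Because no site stores its own state, the loss of one site does not destroy that site's checkpoint.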

Published
2018-12-12
How to Cite
Royo, A., Villamayor, J., Castro-León, M., Rexachs, D., & Luque, E. (2018). H-RADIC: A Fault Tolerance Framework for Virtual Clusters on Multi-Cloud Environments. Journal of Computer Science and Technology, 18(03), e24. https://doi.org/10.24215/16666038.18.e24
Section
Original Articles