IaaS Cloud as a virtual environment for experimentation in checkpoint analysis
Cloud computing offers on-demand computing resources, allowing remote access to software, storage and data processing through the Internet. Infrastructure as a Service (IaaS) provides a flexible space that can be used as an experimental environment, in which experiments can be carried out much as they would be in a real environment such as a cluster. Before making installations and changes in a production cluster, or selecting resources in the cloud, it is important to analyze the impact of the change. For this reason, we propose using the cloud to carry out a preliminary viability study. In this paper, we examine the viability of using the cloud to analyze the behavior of checkpointing, one of the fault tolerance strategies, and establish the differences between the information generated in a real environment (cluster) and in a virtual environment (cloud). The results show that, because of the variability of the cloud, the impact on performance cannot be analyzed there. However, the cloud is suitable for extracting the spatial and temporal behavior pattern of the checkpoint, which helps to characterize it. This characterization helps in choosing the right configuration and in developing methodologies and tools that simulate and predict the execution of the checkpoint in a real environment.
Copyright (c) 2019 Betzabeth León, Pilar Gomez-Sanchez, Daniel Franco, Dolores Rexachs, Emilio Luque
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.