A single-version scheme of fault tolerant computing

Authors

  • Goutam Kumar Saha, Scientist-F, Centre for Development of Advanced Computing, Kolkata, India

Keywords:

bit errors in memory and register, single-version scheme, fail-stop, fault tolerance

Abstract

This paper describes how to design low-cost, reliable computing software for a range of application systems by combining a single-version fault-tolerant scheme with run-time, signature-based control-flow checking. Most ordinary systems lack any software fault-tolerance mechanism. Conventional fault-tolerant approaches such as Recovery Block (RB) and N-Version Programming (NVP) are too costly for an ordinary low-cost application system, because both RB and NVP rely on multiple (at least three) versions of both the software and the computing machines. The proposed approach, by contrast, needs only a single version (SV) of an enhanced application program, executed on one computing machine. Ordinary, cheaper application systems often suffer interrupted service during their service delivery period, caused by an intermittent fault either in the application program or in the hardware. Execution of an application program often malfunctions or is interrupted because of memory bit errors. Error Correction Codes (ECC), such as parity, Hamming codes, and CRC, that are used in memory are less effective at online correction of multiple-bit errors than they are at detecting a few bit errors. Moreover, software-implemented ECC carries significant overhead in both time and code redundancy. In other words, ECC built into memory can detect bit errors but cannot recover from all of them. As a result, when ECC detects an error in a microprocessor-based application system, the application program has to be restarted and executed afresh. ECC alone is therefore useful for designing a fail-stop system, but it suffers from high time redundancy. Other software-implemented fault-tolerance schemes also tend toward the fail-stop kind. The proposed SV-based approach, however, can tolerate such errors without stopping the execution of an application.
This SV Scheme (SVS) aims to provide uninterrupted service at no extra monetary cost, at the price of an acceptable increase in execution time and memory space. SVS is a non-fail-stop fault-tolerance scheme that can be implemented in various computing systems without additional expenditure, so that a large part of society can obtain reliable service from a low-cost, SV-based computing system.
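
The abstract does not spell out the mechanism, so the following Python sketch is only an illustration under stated assumptions, not the paper's implementation. It contrasts a parity check, which detects a single bit flip but cannot repair it, with a single program that keeps three copies of a word and repairs it by bitwise majority vote, and it adds a simple run-time check of block signatures against an expected sequence. The names `GuardedWord` and `FlowChecker`, the use of three copies, and the signature values are all hypothetical.

```python
from typing import List


def parity(word: int) -> int:
    """Return the even-parity bit of a non-negative integer."""
    return bin(word).count("1") & 1


def majority(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three copies of the same word."""
    return (a & b) | (a & c) | (b & c)


class GuardedWord:
    """Hypothetical helper: one value held as three copies inside one program."""

    def __init__(self, value: int) -> None:
        self.copies: List[int] = [value, value, value]

    def read(self) -> int:
        # Vote, then rewrite all copies so the program keeps running
        # instead of stopping and restarting (assumed recovery policy).
        voted = majority(self.copies[0], self.copies[1], self.copies[2])
        self.copies = [voted, voted, voted]
        return voted


class FlowChecker:
    """Hypothetical run-time check of block signatures in expected order."""

    def __init__(self, expected: List[int]) -> None:
        self.expected = expected
        self.position = 0

    def enter(self, signature: int) -> bool:
        # A block is accepted only if its signature is the next expected one.
        ok = (self.position < len(self.expected)
              and self.expected[self.position] == signature)
        if ok:
            self.position += 1
        return ok


stored = 0b10110010
flipped = stored ^ (1 << 3)                      # one bit flipped in memory
detected = parity(flipped) != parity(stored)     # parity notices the flip
missed = parity(stored ^ 0b11) == parity(stored) # a two-bit flip goes unnoticed

word = GuardedWord(stored)
word.copies[1] ^= 1 << 3                         # corrupt one of three copies
recovered = word.read()                          # vote restores the value

flow = FlowChecker([0xA1, 0xB2, 0xC3])
in_order = flow.enter(0xA1)                      # expected first block
out_of_order = flow.enter(0xC3)                  # jump that skips block 0xB2
```

In this sketch the cost of tolerance is extra memory (three copies) and extra time (a vote on each read), which mirrors the trade-off the abstract describes; a vote over three copies recovers a bit position only when at most one copy is wrong at that position.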

Published

2006-04-03

How to Cite

Saha, G. K. (2006). A single-version scheme of fault tolerant computing. Journal of Computer Science and Technology, 6(01), p. 22–27. Retrieved from https://journal.info.unlp.edu.ar/JCST/article/view/825

Issue

Section

Original Articles