
440 Part 2 | Regions of Computer Space Section 6 | Fault-Tolerant Systems

designed value. Faults may be permanent (caused by a physical failure), intermittent (recurring, probably because a component is on its way to a permanent failure), or transient (induced by something in the outside environment, such as electromagnetic noise).

Error

The first noticeable manifestation of a fault.

 

Since fault-tolerant computers have usually been custom-designed and are one of a kind, there are not yet enough examples to densely populate a design space. Hence Table 1 is sparsely populated and deals primarily with the desired attributes of the final design. Any fault-tolerant design is heavily influenced by the assumed failure model (i.e., fault type and extent) and the system goal. There are two major redundancy techniques: spatial for surviving permanent faults and temporal for cost-effectively surviving transient faults.
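As an illustration of spatial redundancy by replication, the following is a minimal sketch, assuming three interchangeable units whose outputs can be compared exactly. The names `majority_vote`, `run_replicated`, `healthy`, and `stuck_at_zero` are hypothetical and do not come from the text.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas.

    Raises ValueError when no value has a strict majority, i.e. the
    fault extent exceeded what this amount of replication can mask.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replica outputs")
    return value


def run_replicated(replicas, x):
    """Apply every replica to the same input and vote on the results."""
    return majority_vote([replica(x) for replica in replicas])


# Hypothetical replicas: two healthy units and one with a permanent fault.
def healthy(x):
    return x * x


def stuck_at_zero(x):
    return 0  # models a unit whose output is permanently stuck


result = run_replicated([healthy, stuck_at_zero, healthy], 7)
```

With one faulty unit out of three, the vote masks the permanent fault. If two units failed in different ways, the voter would raise an error rather than mask it, which reflects how the assumed fault extent bounds what a given design can survive.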

Rather than attempt a concise description of the fault-tolerant space, we shall present a brief description of fault-tolerant techniques. These techniques will be introduced with respect to the three functions: detection, diagnosis, and isolation and corrective action.

Various techniques exist for each activity, their use depending upon the allowable period between error generation and error detection. The longer an error, and hence a physical fault, goes undetected, the more data structures in the system may be polluted. The situation is even more critical in a multiprocessor, where memory and data structures are shared by several concurrently executing processes. Errors can be multiplied by nonfailed components that make incorrect decisions or initiate incorrect operations based on the erroneous information. The longer an error goes undetected, the more difficult recovery becomes; eventually it is impossible. Thus the techniques are aimed at detecting errors at well-defined conceptual boundaries. Generally, smaller boundaries are more costly in terms of hardware or time but allow for more complete recovery. Consider the following conceptual boundaries:

Table 1 Fault-Tolerant Dimensions

Fault type          Permanent
                    Intermittent
                    Transient

Fault extent        Single, multiple
                    Local, distributed

System measures     Availability
                    Reliability
                    Data integrity

Redundancy type     Spatial: replication, coding
                    Temporal

recovery is by retry. The goal is to effect recovery without program intervention.
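Recovery by retry can be sketched as follows. This is a minimal illustration, assuming that detection is signaled by an exception and that re-executing the operation is safe. `TransientFault`, `with_retry`, and `make_flaky_read` are hypothetical names introduced only for this example.

```python
class TransientFault(Exception):
    """Hypothetical signal that an operation detected an error."""


def with_retry(operation, max_attempts=3):
    """Re-execute `operation` until it succeeds or attempts run out.

    The caller sees only the final result, so a transient fault that
    clears on a later attempt never reaches the calling program.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientFault:
            if attempt == max_attempts:
                raise  # fault persisted across all attempts: escalate


def make_flaky_read(failures, value):
    """Build a read that fails `failures` times, then returns `value`."""
    remaining = [failures]

    def read():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientFault("noise on the bus")
        return value

    return read


data = with_retry(make_flaky_read(failures=2, value=0x5A))
```

Here the read fails twice and succeeds on the third attempt, so the calling program receives the data without ever seeing the fault. A fault that outlasts every attempt is passed upward, where it would have to be handled as something other than transient.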

The next three subsections will briefly discuss detection, diagnosis, and isolation and corrective action at each of three conceptual boundaries. For a more thorough discussion of techniques the reader should consult Neumann [1973] and Avizienis [1975].


Detection

The percentage of faults detected is the single most important factor in successful recovery. An undetected error usually allows incorrect information to cross conceptual boundaries, which ultimately leads to a system failure.[1] Detection techniques can be continuous (online) or periodic (offline).


[1] While there is a certain class of applications that rely on statistical properties of data and can function at an acceptable level with internally incorrect data (e.g., speech understanding systems frequently depend on the redundancy in natural speech), those properties have not been applied to computer organization in general and will not be considered here.
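The difference between continuous and periodic detection can be sketched with a parity-protected memory. This is an illustrative assumption rather than a technique described in the text: the class `CheckedMemory` and its methods are hypothetical, and a single parity bit detects only an odd number of flipped bits in a word.

```python
def parity(word):
    """Even-parity bit of a non-negative integer word."""
    return bin(word).count("1") % 2


class CheckedMemory:
    """Hypothetical memory that stores a parity bit beside each word."""

    def __init__(self, size):
        self.words = [0] * size
        self.parity_bits = [0] * size

    def write(self, addr, word):
        self.words[addr] = word
        self.parity_bits[addr] = parity(word)

    def read(self, addr):
        # Continuous (online) detection: checked on every access.
        if parity(self.words[addr]) != self.parity_bits[addr]:
            raise RuntimeError(f"parity error at address {addr}")
        return self.words[addr]

    def scrub(self):
        # Periodic (offline) detection: sweep all words, report bad addresses.
        return [addr for addr, word in enumerate(self.words)
                if parity(word) != self.parity_bits[addr]]


mem = CheckedMemory(4)
mem.write(2, 0b1011)
mem.words[2] ^= 0b0100  # inject a single-bit fault behind the checker's back
bad = mem.scrub()
```

The online check in `read` catches the error only when the corrupted word is next used, while the offline `scrub` finds it even if the word is never read, at the cost of a full sweep of memory.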
