
440 Part 2 | Regions of Computer Space Section 6 | Fault-Tolerant Systems

	designed value. Faults may be permanent (caused by a physical failure), intermittent (recurring, probably because a component is on its way to a permanent failure), or transient (induced by something in the outside environment, such as electromagnetic noise).
Error	The first noticeable manifestation of a fault.

Since fault-tolerant computers have usually been custom-designed and are one of a kind, there are not yet enough examples to densely populate a design space. Hence Table 1 is sparsely populated and deals primarily with the desired attributes of the final design. Any fault-tolerant design is heavily influenced by the assumed failure model (i.e., fault type and extent) and the system goal. There are two major redundancy techniques: spatial for surviving permanent faults and temporal for cost-effectively surviving transient faults.

Rather than attempt a concise description of the fault-tolerant space, we shall present a brief description of fault-tolerant techniques. These techniques will be introduced with respect to the three functions: detection, diagnosis, and isolation and corrective action.

Various techniques exist for each activity, their use depending upon the allowable period between error generation and error detection. The longer an error, and hence a physical fault, goes undetected, the more data structures in the system may be polluted. The situation is even more critical in a multiprocessor, where memory and data structures are shared by several concurrently executing processes. Errors can be multiplied by nonfailed components that make incorrect decisions or initiate incorrect operations based on the erroneous information. The longer an error goes undetected, the more difficult the recovery is; eventually recovery becomes impossible. Thus the techniques are aimed at detecting errors at well-defined conceptual boundaries. Generally, smaller boundaries are more costly in terms of hardware or time but allow for more complete recovery. Consider the following conceptual boundaries:

Hardware subsystem. Typical subsystems may range in size from an arithmetic unit to processors, memories, and buses. Error detection is performed by hardware, and recovery is by retry. The goal is to effect recovery without program intervention.

Table 1 Fault-Tolerant Dimensions

Fault type
	Permanent
	Intermittent
	Transient
Fault extent
	Single / Multiple
	Local / Distributed
System measures
	Availability
	Reliability
	Data integrity
Redundancy type
	Spatial
		Replication
		Coding
	Temporal

Task. A dynamic program environment spread across several hardware subsystems. Error detection can be performed at task boundaries by software. Intermediate data may be incorrect, but data passed between task boundaries is correct.

System. The total hardware/software environment. At this level, application-dependent characteristics are used to simplify the detection/recovery functions. (The previous two levels were application-independent.) Here the focus is on continuous service, as opposed to having totally correct data crossing the system boundary. An example might be a sonar signal processor and display. Data arrive continuously, but errors can be detected on a millisecond basis and recovery can be a cold restart.
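Recovery by retry at the hardware subsystem boundary can be illustrated with a minimal Python sketch. It is only an analogy in software: a real subsystem would perform both detection and retry in hardware, without program intervention. All names here (`run_with_retry`, `residue_check`, `make_flaky_add`, `PermanentFault`) and the retry limit of three are hypothetical and chosen for illustration. The check uses a residue modulo 3, one simple form of arithmetic code, and the simulated fault flips a single bit on the first attempts.

```python
class PermanentFault(Exception):
    """Raised when every retry fails, so the fault is treated as permanent."""


def run_with_retry(operation, check, max_attempts=3):
    """Temporal redundancy: repeat an operation until its result passes a check.

    operation: zero-argument callable that produces a result.
    check: callable returning True when the result is acceptable.
    Returns a (result, attempts_used) pair.
    """
    for attempt in range(1, max_attempts + 1):
        result = operation()
        if check(result):
            return result, attempt
    raise PermanentFault(f"still failing after {max_attempts} attempts")


def residue_check(a, b, modulus=3):
    """Build a check for a + b based on residues (assumed modulus of 3)."""
    expected = (a % modulus + b % modulus) % modulus
    return lambda result: result % modulus == expected


def make_flaky_add(a, b, bad_attempts):
    """Hypothetical adder whose first `bad_attempts` results are corrupted."""
    state = {"calls": 0}

    def add():
        state["calls"] += 1
        total = a + b
        if state["calls"] <= bad_attempts:
            return total ^ 1  # simulate a transient single-bit error
        return total

    return add


if __name__ == "__main__":
    # One corrupted attempt: the retry masks the transient fault.
    print(run_with_retry(make_flaky_add(20, 22, 1), residue_check(20, 22)))
```

A single retry masks the transient fault; if the corruption persists past the retry limit, the sketch raises `PermanentFault`, which stands in for handing the problem to a larger boundary (task or system) for recovery.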

The next three subsections will briefly discuss detection, diagnosis, and isolation and corrective action at each of the three conceptual boundaries. For a more thorough discussion of techniques the reader should consult Neumann [1973] and Avizienis [1975].

Detection

The percentage of faults detected is the single most important factor in successful recovery. An undetected error usually has the result that incorrect information crosses conceptual boundaries and ultimately leads to a system failure.¹ Detection techniques can be continuous (online) or periodic (offline).

Hardware subsystem. Detection techniques include replication (duplication and comparison [Downing, Nowak, and Tuomenoksa, 1964; Vance, 1957]) and coding for data transmission/storage (parity; arithmetic codes [Rao, 1974; Avizienis, 1971]); self-checking checkers [Anderson and Metze, 1973; Carter et al., 1971; Carter and Schneider, 1969]; cyclic redundancy codes [Peterson and Weldon, 1972]; and hardware processor checks (generated by the hardware subsystem level [IBM, 1972a]).
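Three of the detection techniques named above (parity, duplication and comparison, and cyclic redundancy codes) can be sketched in Python as follows. These are software illustrations under stated assumptions, not the hardware mechanisms of the cited designs: the function names are hypothetical, even parity is an arbitrary choice, and `zlib.crc32` from the Python standard library stands in for whatever CRC polynomial a real subsystem would use.

```python
import zlib


def add_even_parity(bits):
    """Append one parity bit so that the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """True when the word (data bits plus parity bit) has an even number of 1s."""
    return sum(word) % 2 == 0


def duplicate_and_compare(unit_a, unit_b, *args):
    """Run two copies of a unit on the same inputs.

    Returns (output_of_unit_a, outputs_agree). A mismatch signals an error
    but does not say which copy is at fault.
    """
    out_a = unit_a(*args)
    out_b = unit_b(*args)
    return out_a, out_a == out_b


def crc_matches(payload, stored_crc):
    """Recompute the CRC-32 of a byte string and compare with the stored value."""
    return zlib.crc32(payload) == stored_crc


if __name__ == "__main__":
    word = add_even_parity([1, 0, 1, 1])
    word[2] ^= 1  # inject a single-bit error
    print("parity check passes:", parity_ok(word))

    good = lambda x, y: x + y
    faulty = lambda x, y: x + y + 1  # hypothetical failed copy
    print("copies agree:", duplicate_and_compare(good, faulty, 2, 3)[1])

    frame = b"sonar frame"
    crc = zlib.crc32(frame)
    print("crc check passes:", crc_matches(b"sonar frbme", crc))
```

Single-bit parity catches any odd number of flipped bits but misses an even number, and duplication detects a disagreement without identifying the failed copy, which is why detection is followed by separate diagnosis and isolation steps.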

Task. Detection techniques include audit programs (checking the integrity of data structures); checksums; memory violations, in which the task attempts to access a memory which does not exist or which it has no right to access [Schroeder and Saltzer, 1972] or is attempting an incorrect

¹While there is a certain class of applications that rely on statistical properties of data and can function at an acceptable level with internally incorrect data (e.g., speech understanding systems frequently depend on the redundancy in natural speech), those properties have not been applied to computer organization in general and will not be considered here.
