
Chapter 27 | The STAR (Self-Testing And Repairing) Computer 457

Reconfiguration processing is required for memory replacement, since software assistance is needed to load a newly activated memory unit. All programs running on the STAR computer require rollback (recovery) points. The resident executive provides rollback status storage and controls events which are nonrepeatable, i.e., events that may not occur more than once even if a rollback takes place. Finally, it implements diagnosis for faulty units to determine the cause and extent of failures for possible partial reuse. The present application programs module includes floating-point arithmetic subroutines and test and demonstration programs. The application programs which will be required for space missions are a part of the TOPS control computer subsystem project discussed later in this paper.
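The interplay between rollback points and nonrepeatable events can be illustrated with a minimal Python sketch. It is not the STAR executive: the class `RollbackExecutive`, its methods `set_rollback_point`, `rollback`, and `do_once`, and the event name `fire_thruster` are hypothetical names introduced here. The sketch assumes that program state is a small dictionary and that the record of completed nonrepeatable events is kept outside the state that a rollback restores.

```python
import copy


class RollbackExecutive:
    """Toy model of rollback points plus nonrepeatable events (hypothetical)."""

    def __init__(self, state):
        self.state = state                   # working program state
        self._saved = copy.deepcopy(state)   # contents of the last rollback point
        self._done = set()                   # completed one-shot events; never rolled back

    def set_rollback_point(self):
        """Save a copy of the current state as the recovery point."""
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        """Restore the state saved at the last rollback point."""
        self.state = copy.deepcopy(self._saved)

    def do_once(self, event_id, action):
        """Run `action` only if `event_id` has not been performed before."""
        if event_id in self._done:
            return False
        action()
        self._done.add(event_id)
        return True


def run_segment(executive, log):
    """A program segment with one repeatable step and one nonrepeatable event."""
    executive.state["count"] += 1
    executive.do_once("fire_thruster", lambda: log.append("thruster fired"))


executive = RollbackExecutive({"count": 0})
log = []
executive.set_rollback_point()
run_segment(executive, log)   # first attempt
executive.rollback()          # assume a fault was detected: restore saved state
run_segment(executive, log)   # repeat the segment from the rollback point
```

In this sketch the repeated segment recomputes the counter from the saved state, while the nonrepeatable event is suppressed on the second pass because its record survives the rollback.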

Extension of STAR Techniques to Peripheral Systems

The STAR techniques of fault tolerance can be systematically extended beyond the boundaries of the computer to effect automatic maintenance of various peripheral systems that communicate with the computer. The case which was investigated in connection with the STAR computer development is the implementation of automatic maintenance for a simplified model of the JPL thermoelectric outer planet spacecraft (TOPS) which is being proposed for the exploration of the outer planets [Astronaut., 1970]. The potentially lower failure rates of unpowered spare units and the constant power demand of a replacement system are exceptionally important in missions requiring a ten-year survival of the spacecraft under very strict power constraints.

The methodology of extending the STAR techniques consists of several steps: (1) identification of the replaceable peripheral units; (2) selection of internal error detection functions which are economically feasible within the units themselves; (3) identification of possible functional redundancy, in which either another type of peripheral unit or the computer itself can take over the function of a failed unit; (4) algorithmic description of the monitoring and recovery procedures to be performed for each unit by the computer; (5) development of fault-tolerant communication between the peripheral units and the I/O and interrupt processors of the computer; (6) translation of the monitoring and recovery procedures which have been assigned to the computer into computational requirements: speed, instruction set, storage size, input/output and interrupt system complexity; and (7) estimation of reliability and mean life attainable for each peripheral unit. Several iterations of the design process lead to a system for which a balanced gain in reliability has been attained by means of computer-controlled automatic maintenance. A detailed case study of the application of these techniques is presented in Gilley [1970].
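Steps (3) and (4) can be made concrete with a small Python sketch of one monitoring pass over a replaceable unit. This is an illustration under stated assumptions, not a procedure taken from the TOPS study: the `PeripheralUnit` record, the `recover` function, the action strings, and the unit name are all hypothetical. It assumes that a detected error is handled first by switching in an unpowered spare and, once the spares are exhausted, by handing the function to a functionally redundant substitute.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PeripheralUnit:
    name: str
    spares_left: int                 # unpowered spares still available
    fallback: Optional[str] = None   # functionally redundant substitute, if any


def recover(unit, error_detected):
    """Return the action chosen for one monitoring pass over `unit`."""
    if not error_detected:
        return "ok"
    if unit.spares_left > 0:
        unit.spares_left -= 1        # consume one spare
        return "switch to spare"
    if unit.fallback is not None:
        return "use " + unit.fallback
    return "unit lost"


unit = PeripheralUnit("unit A", spares_left=1, fallback="computer")
actions = [recover(unit, error) for error in (False, True, True)]
```

With one spare and the computer as the assumed substitute, the three passes yield a normal report, a switch to the spare, and finally the handover to the computer.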

The investigation has identified and quantified the computing capability required from the STAR computer in order to effect the automatic maintenance of the TOPS spacecraft. Furthermore, the results have shown that: (1) the fully automatic maintenance of a complex long-life spacecraft is feasible through a systematic extension of STAR techniques, and (2) the automatic maintenance requirements of the spacecraft systems can be algorithmically described to the detail required to produce computer programs for their implementation. The results of the investigation have systematically extended dynamic redundancy to various peripheral subsystems of an information processing system. Beyond the specific example of a spacecraft, the methodology is applicable to computer-controlled automatic maintenance of other complex data processing, communication, and control systems.

Design of the TOPS Control Computer

The most recent step in the development of the STAR computer concept has been the design of a control computer subsystem (CCS) for the thermoelectric outer planet spacecraft (TOPS) [Astronaut., 1970]. After the TOPS requirements were quantified as described in the preceding section, the CCS design had still to meet four major externally imposed constraints: (1) the weight of the subsystem was not to exceed 40 lb; (2) power consumption was not to be greater than 40 W; (3) probability of successfully completing a 100,000 h mission was to be equal to or greater than 0.95 (using TOPS approved part failure rates); and (4) it could not, as a consequence of any single internal fault, result in a failure mode catastrophic to the mission.
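To give a sense of scale for constraint (3), the following Python sketch computes the constant failure rate that a single nonredundant unit would need in order to reach the same target. The exponential model R(t) = exp(-lambda * t) is an assumption introduced here for illustration only; it is not the reliability model of the CCS, which relies on replacing failed units with spares.

```python
import math

MISSION_HOURS = 100_000      # mission duration from constraint (3)
TARGET_RELIABILITY = 0.95    # required probability of success from constraint (3)

# Assumed model: one nonredundant unit with constant failure rate lam,
# so that R(t) = exp(-lam * t). Solving R(t) >= target for lam gives
# lam <= -ln(target) / t.
max_failure_rate = -math.log(TARGET_RELIABILITY) / MISSION_HOURS

# Mission length expressed in years, using 365.25 days per year.
mission_years = MISSION_HOURS / (24 * 365.25)

print(f"maximum failure rate: {max_failure_rate:.3e} per hour")
print(f"mission length: {mission_years:.1f} years")
```

Under this assumed model the whole subsystem would have to stay near 5 x 10^-7 failures per hour over a mission of more than eleven years, which suggests why the design depends on redundancy and repair rather than on part reliability alone.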

Because of these constraints, it was not possible merely to "shrink" the STAR computer into a flight package. The STAR design was simplified by retaining only the capabilities needed to meet the TOPS functional requirements. The entire self-test and repair ability of the larger machine has been retained; in fact, the TOPS CCS has expanded failure detection and recovery capability. A variety of advances arising from the years of work on the STAR computer that preceded the TOPS effort have been incorporated into its design.

The CCS operates at a clock frequency of 500 kHz. The CCS word is the same length as the STAR word, 32 bits. The word-processing cycle, ten byte-times long in the STAR computer, has been reduced to nine in the CCS: eight for processing or transferring information and one (two in STAR) for the messages and decision making between words. The execution (including fetch) of an instruction requires one to three cycles. The STAR instruction set with over 200 variants has been reduced to fewer than 100. To detect word errors, the CCS uses the same residue code as the STAR computer. Unlike the STAR computer, however, the CCS also applies the residue encoding to the operation codes of instructions. In addition to these failure detection measures, the CCS
