
324 THE PDP-11 FAMILY

RELIABILITY AND MAINTAINABILITY

Design decisions to allocate a portion of the cost of the 11/60 to reliability and maintainability, rather than to further improving performance, were motivated by user and market needs. Prime considerations were the increasing labor cost associated with maintenance and the growing use of minicomputers in applications demanding more reliability.

The first goal was to increase the mean time between failures (MTBF) by: (1) reducing the occurrence and impact of normally fatal hardware malfunctions, (2) providing error statistics, and (3) providing operating alternatives to keep the system running after failures occur, albeit at a lower performance.

The second goal was to reduce the mean time to repair (MTTR) when hardware malfunctions occur by: (1) hardware design and packaging that facilitate error diagnosis and repair during scheduled and nonscheduled maintenance, (2) continuous logging of hardware errors during system operation, and (3) provision of software and microdiagnostic tools for problem isolation.
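The two goals interact. As a rough illustration that is not taken from the 11/60 design documents, the standard steady-state availability relation, MTBF / (MTBF + MTTR), shows that doubling MTBF and halving MTTR raise the fraction of time a system is usable by the same amount. The function name and the hour figures below are assumptions chosen only for the example.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is operational."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative numbers only; these are not 11/60 measurements.
baseline = availability(1000.0, 4.0)      # reference system
longer_mtbf = availability(2000.0, 4.0)   # first goal: failures half as often
shorter_mttr = availability(1000.0, 2.0)  # second goal: repairs twice as fast

for label, value in [("baseline", baseline),
                     ("longer MTBF", longer_mtbf),
                     ("shorter MTTR", shorter_mttr)]:
    print(f"{label:>12}: {value:.4%}")
```

Under this simple model, either improvement alone raises availability, which is consistent with pursuing both goals rather than only one.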

MTBF

Reducing the incidence of fatal hardware malfunctions was a joint effort by engineering and manufacturing. The Schottky transistor-transistor logic (TTL) used in the machine, having been in widespread use for over five years, is a well-proven family of devices. Moreover, conservative electrical design practices were followed.

Plotted against time, chip failure rate tends to follow a bathtub-shaped curve, high at either end of the life cycle. The 11/60 production process includes extensive thermal cycling to ensure that "infant mortality" cases are discovered early during manufacturing.

The cabinet is designed to minimize buildup of hot air over the processor boards. Power supplies are mounted at the rear of the cabinet, away from the logic, so that radiant heating effects are minimized. A blower system cools the logic card cage by drawing fresh, filtered air down over the printed circuit boards such that no board receives exhaust air from another.

Other physical packaging features that reduce hardware problems include cable troughs, impact-absorbing casters, and special cabinet grounding. A filter is attached to the maintenance console to reduce electrostatic noise interference.

Console microcode double-checks every keypad entry to verify the data received. A significant proportion of the 11/60 microcode (Table 1) is devoted to logging microlevel state when an error is detected. This logged state can be accessed via a maintenance examine and deposit (MED) instruction. An operating system uses the logged information to compile error records, which aid in tracking down intermittent errors.
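How an operating system might turn logged state into error records that expose intermittent faults can be sketched as follows. This is a minimal illustration, not the 11/60 operating system's actual record format: the `ErrorRecord` fields, the unit names, and the `recurring_sources` helper are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorRecord:
    source: str     # hypothetical unit name, e.g. "cache" or "wcs"
    kind: str       # hypothetical error kind, e.g. "parity"
    timestamp: int  # hypothetical tick count when the error was logged


def recurring_sources(records, threshold=2):
    """Return (source, count) pairs for units that logged at least
    `threshold` errors, most frequent first."""
    counts = Counter(r.source for r in records)
    return [(src, n) for src, n in counts.most_common() if n >= threshold]


# Made-up log entries for illustration only.
log = [
    ErrorRecord("cache", "parity", 10),
    ErrorRecord("wcs", "parity", 25),
    ErrorRecord("cache", "parity", 310),
    ErrorRecord("cache", "parity", 905),
]
suspects = recurring_sources(log)
print(suspects)
```

Grouping records by source is one simple way to separate a unit that fails repeatedly at long intervals from a one-time event; a real error logger would likely retain far more of the microlevel state.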

To reduce the impact of hardware malfunctions on the user environment, a number of fail-soft capabilities have been implemented.

1. If the cache fails, it is turned off and the still-functioning primary memory is used to keep the system running.

2. If a parity error occurs in WCS, the processor disables that control store. Then the operating system is notified, and program execution can continue using the basic PDP-11 instructions.

3. Systems can be programmed to fall back onto the integral floating-point unit if an error is detected in the floating-point processor.

4. The bootstrap loader permits system loading from an alternative device if the primary bootstrapping device is disabled.
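The four fail-soft capabilities share one pattern: a detected fault switches off the failed resource and leaves a slower but working configuration. The sketch below models that pattern in software purely for illustration. On the 11/60 these actions are taken by hardware, microcode, and the operating system; the fault names, the `SystemState` fields, and the `apply_fail_soft` function are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SystemState:
    cache_enabled: bool = True
    wcs_enabled: bool = True       # writable control store
    use_fpp: bool = True           # separate floating-point processor
    boot_device: str = "primary"


# Hypothetical fault names mapped to the degraded setting each one selects.
FALLBACKS = {
    "cache_failure": {"cache_enabled": False},        # run from primary memory
    "wcs_parity_error": {"wcs_enabled": False},       # basic instructions only
    "fpp_error": {"use_fpp": False},                  # integral floating point
    "boot_device_disabled": {"boot_device": "alternate"},
}


def apply_fail_soft(state: SystemState, fault: str) -> SystemState:
    """Return a new state with the failed resource disabled."""
    try:
        changes = FALLBACKS[fault]
    except KeyError:
        raise ValueError(f"no fail-soft action for fault {fault!r}") from None
    return replace(state, **changes)


state = SystemState()
for fault in ("cache_failure", "fpp_error"):
    state = apply_fail_soft(state, fault)
print(state)
```

Each fault changes only its own setting, so several independent failures can accumulate while the system keeps running at reduced performance.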

MTTR

Error diagnosis is the most time-consuming problem facing the field service engineer. Special diagnostic tools, both hardware and software, have been designed to reduce the time spent in error isolation.
