previous | contents | next
An Efficient Algorithm for Exploiting Multiple Arithmetic Units1
B. M. Tomasulo
Abstract This paper describes the methods employed in the floating-point area of the System/360 Model 91 to exploit the existence of multiple execution units. Basic to these techniques is a simple common data busing and register tagging scheme which permits simultaneous execution of the independent instructions while preserving the essential precedences inherent in the instruction stream. The common data bus improves performance by efficiently utilizing the execution units without requiring specially optimized code. Instead, the hardware, by "looking ahead" about eight instructions, automatically optimizes the program execution on a local basis.
The application of these techniques is not limited to floating-point arithmetic or System/360 architecture. It may be used in almost any computer having multiple execution units and one or more "accumulators." Both of the execution units, as well as the associated storage buffers, multiple accumulators and input/output buses, are extensively checked.
After storage access time has been satisfactorily reduced through the use of buffering and overlap techniques, even after the instruction unit has been pipelined to operate at a rate approaching one instruction per cycle [Anderson, Sparacio, and Tomasulo, 1967], there remains the need to optimize the actual performance of arithmetic operations, especially floating-point. Two familiar problems confront the designer in his attempt to balance execution with issuing. First, individual operations are not fast enough2 to allow simple serial execution. Second, it is difficult to achieve the fastest execution times in a universal execution unit. In other words, circuitry designed to do both multiply and add will do neither as fast as two units each limited to one kind of instruction.
The first step toward surmounting these obstacles has been presented in Anderson, Earle, Goldschmidt, and Powers , i.e., the division of the execution function into two independent parts, a fixed-point execution area and a floating-point execution area. While this relieves the physical constraint and makes concurrent execution possible, there is another consideration. In order to secure a performance increase the program must contain an intimate mixture of fixed-point and floating-point instructions. Obviously, it is not always feasible for the programmer to arrange this and, indeed, many of the programs of greatest interest to the user consist almost wholly of floating-point instructions. The subject of this paper, then, is the method used to achieve concurrent execution of floating-point instructions in the IBM System/360 Model 91. Obviously, one begins with multiple execution units, in this case an adder and a multiplier/divider [Anderson, Sparacio, and Tomasulo, 1967].
It might appear that achieving the concurrent operation of these two units does not differ substantially from the attainment of fixed-floating overlap. However, in the latter case the architecture limits each of the instruction classes to its own set of accumulators and this guarantees independence.3 In the former case there is only one set of accumulators, which implies program-specified sequences of dependent operations. Now it is no longer simply a matter of classifying each instruction as fixed-point or floating-point, a classification which is independent of previous instructions. Rather, it is a question of determining each instruction's relationship with all previous, incompleted instructions. Simply stated, the objective must be to preserve essential precedences while allowing the greatest possible overlap of independent operations.
This objective is achieved in the Model 91 through a scheme called the common data bus (CDB). It makes possible maximum concurrency with minimal effort (usually none) by the program mer or, more importantly, by the compiler. At the same time, the hardware required is small and logically simple. The CDB can function with any number of accumulators and any number of execution units. In short, it provides a hardware algorithm for the automatic, efficient exploitation of multiple execution units.
The next section of this paper will discuss the physical framework of registers, data paths and execution circuitry which is implied by the architecture and the overall CPU structure presented in Anderson, Sparacio, and Tomasulo . Within this framework one can subsequently discuss the problem of precedence, some possible solutions, and the selected solution, the CDB. In conclusion will be a summary of the results obtained.
Definitions and Data Paths
While the reader is assumed to be familiar with System/360 architecture and mnemonics, the terminology as modified by the
1IBM Journal, vol. 11, January 1967, pp. 25-33.
2During the planning phase, floating-point multiply was taken to be six cycles, divide as eighteen cycles and add as two cycles. Anderson, Earle, Goldschmidt, and Powers  explains how times of 3, 12, and 2 were actually achieved. This permitted the use of only one, instead of two, multipliers and one adder, pipelined to start an add cycle.
3Such dependencies as exist are handled by the store-fetch sequencing of the storage bus and the condition code control described in Anderson, Earle, Goldschmidt, and Powers .
previous | contents | next