C:\BELLBOOK\P001-100\HTMFILES\CSP0336.HTM

320 Part 2½ Regions of Computer Space Section 3 ½ Concurrency: Single-Processor System

x_i¬ f (x_i, f_i) (where y_i = 1)

x_{i ¬} x_i (where y_i = 0)

y_i ¬ f (y_i, f_i)

In this case, the old state of Y (before modification by f ) is used as the mask for the X operation.

For a programming example, the basic loop of an unmasked add fields operation is selected. This operation adds the contents of a Field A of all memory words to the contents of a Field B of the words and stores the sum in a Field S of the words. For n-bit fields, the operation executes the basic loop n times. During each execution of the loop, a bit-slice (a) of Field A is read from memory, a bit-slice (b) of Field B is read, and a bit-slice (s) of Field S is written into memory. The operation starts at the least significant bits of the fields and steps through the fields to the most significant bits. At the beginning of each loop execution, the carry (c) from the previous bits is stored in Y, and X contains zeroes:

x_i = 0

y_i = c_i

The loop has four steps:

Step 1: Read Bit-slice a and exclusive-or (Å ) it to X selectively and also to Y:

x_i¬x_iÅy_ia_i

y_i¬y_iÅa_i

The states of X and Y are now:

x_i= a_ic_i

y_i =y_iÅc_i

Step 2: Read Bit-slice b and exclusive-or it to X selectively and also to Y:

x_i¬x_iÅy_ib_i

y_i¬y_iÅb_i

Registers X and Y now contain the carry and sum bits:

x_i=a_i c_iÅa_ib_i Å b_ic_i = c'_i

y_i= a_iÅb_iÅc_i = s_i

Step 3: Write the sum bit from Y into Bit-slice s and also complement X selectively:

s_i¬y_i

x_i¬x_iÅy_i

The states of X and Y are now:

x_i=c'_iÅs_i

y_i = s_i

Step 4: Read the X-register and exclusive-or it into both X and Y:

x_{i ¬}x_iÅx_i

y_i¬y_iÅx_i

This clears X and stores the carry bit into Y to prepare the registers for the next execution of the loop:

x_i= 0

y_i= c'_i

Step 3 takes less than 250nsec, while Steps 1, 2, and 4 each take less than 150 nsec. Hence, the time to execute the basic loop once is less than 700 nsec. If the field length is 32 bits, the add operation takes less than 22.4 microsec plus a small amount of setup time. The operation performs 256 additions in each array module. This amounts to 1024 additions, if four array modules are enabled, to achieve a processing power of approximately 40 MIPS (million-instructions-per-second).

The array module components communicate through a network called the flip network. A selector chooses a 256-bit source item from the MDA memory read bus, the M register, the X register, the Y register, or an outside source. The bits of the source item travel through the flip network, which may shift and permute the bits in various ways. The permuted source item is presented to the MDA memory write bus, M register, X register, Y register, and an outside destination.

The permutations of the flip network allow inter-PE communication, A PE can read data from another PE either directly from its registers or indirectly from the MDA memory. One can permute the 256-bit data item as a whole or divide it into groups of 2, 4, 8, 16, 32, 64, or 128 bits and permute within groups.

The permutations allowed include shifts of 1, 2, 4, 5, 16, 32, 64, or 128 places. One also can mirror the hits of a group (invert the left-right order) while shifting it. A positive shift of mirrored data is equivalent to a negative shift of the unmirrored data. To shift data a number of places, multiple passes through the flip network may be required. Mirroring can be used to reduce the number of passes. For example, a shift of 31 places can be done in two passes: mirror and shift 1 place on the first pass, and then remirror and shift 32 places on the second pass.

The flip network permutations are particularly useful for Fast Fourier transforms (FFT's). A 2ⁿ point FFT requires n steps, where each step pairs the 2ⁿ points in a certain way and operates on the two points of each pair arithmetically to form two new

previous | contents | next