previous | contents | next

320 Part 2½ Regions of Computer Space Section 3 ½ Concurrency: Single-Processor System


xi ¬ f (xi, fi) (where yi = 1)

xi ¬ xi (where yi = 0)

yi ¬ f (yi, fi)

In this case, the old state of Y (before modification by f ) is used as the mask for the X operation.

For a programming example, the basic loop of an unmasked add fields operation is selected. This operation adds the contents of a Field A of all memory words to the contents of a Field B of the words and stores the sum in a Field S of the words. For n-bit fields, the operation executes the basic loop n times. During each execution of the loop, a bit-slice (a) of Field A is read from memory, a bit-slice (b) of Field B is read, and a bit-slice (s) of Field S is written into memory. The operation starts at the least significant bits of the fields and steps through the fields to the most significant bits. At the beginning of each loop execution, the carry (c) from the previous bits is stored in Y, and X contains zeroes:

xi = 0

yi = ci

The loop has four steps:

Step 1: Read Bit-slice a and exclusive-or (Å ) it to X selectively and also to Y:

xi¬ xiÅ yiai

yi¬ yiÅ ai

The states of X and Y are now:

xi = aici

yi =yiÅ ci

Step 2: Read Bit-slice b and exclusive-or it to X selectively and also to Y:

xi¬ xiÅ yibi

yi¬ yiÅ bi

Registers X and Y now contain the carry and sum bits:

xi=ai ciÅ aibi Å bici = c'i

yi= aiÅ biÅ ci = si

Step 3: Write the sum bit from Y into Bit-slice s and also complement X selectively:

si¬ yi

xi¬ xiÅ yi

The states of X and Y are now:

xi=c'iÅ si

yi = si

Step 4: Read the X-register and exclusive-or it into both X and Y:

xi ¬ xiÅ xi

yi¬ yiÅ xi

This clears X and stores the carry bit into Y to prepare the registers for the next execution of the loop:

xi= 0

yi= c'i

Step 3 takes less than 250nsec, while Steps 1, 2, and 4 each take less than 150 nsec. Hence, the time to execute the basic loop once is less than 700 nsec. If the field length is 32 bits, the add operation takes less than 22.4 microsec plus a small amount of setup time. The operation performs 256 additions in each array module. This amounts to 1024 additions, if four array modules are enabled, to achieve a processing power of approximately 40 MIPS (million-instructions-per-second).

The array module components communicate through a network called the flip network. A selector chooses a 256-bit source item from the MDA memory read bus, the M register, the X register, the Y register, or an outside source. The bits of the source item travel through the flip network, which may shift and permute the bits in various ways. The permuted source item is presented to the MDA memory write bus, M register, X register, Y register, and an outside destination.

The permutations of the flip network allow inter-PE communication, A PE can read data from another PE either directly from its registers or indirectly from the MDA memory. One can permute the 256-bit data item as a whole or divide it into groups of 2, 4, 8, 16, 32, 64, or 128 bits and permute within groups.

The permutations allowed include shifts of 1, 2, 4, 5, 16, 32, 64, or 128 places. One also can mirror the hits of a group (invert the left-right order) while shifting it. A positive shift of mirrored data is equivalent to a negative shift of the unmirrored data. To shift data a number of places, multiple passes through the flip network may be required. Mirroring can be used to reduce the number of passes. For example, a shift of 31 places can be done in two passes: mirror and shift 1 place on the first pass, and then remirror and shift 32 places on the second pass.

The flip network permutations are particularly useful for Fast Fourier transforms (FFT's). A 2n point FFT requires n steps, where each step pairs the 2n points in a certain way and operates on the two points of each pair arithmetically to form two new

previous | contents | next