Cambridge Cluster
| Machine | CG | EP | FT | IS | MG | BT | LU | SP |
|---|---|---|---|---|---|---|---|---|
| CC: Myrinet, HPVM, MFP | 231 | 11.4 | 217 | 18.5 | 504 | 496 | n/a | 343 |
| CC: Fast Ethernet, MPI/Pro, MFP | 26 | 11.4 | 174 | 4.1 | 116 | 251 | 213 | 124 |
| CC: Fast Ethernet, MPI/Pro, DVF | 19 | 12.4 | 236 | 4.1 | 123 | 242 | 247 | 125 |
| CC: Fast Ethernet, HPVM, MFP | 49 | 10.6 | 163 | n/a | 341 | 526 | n/a | 360 |
| Loki: PPro-200, Fast Ethernet | n/a | 8.8 | 251 | 15.1 | 281 | 359 | 453 | 242 |
| RWC: PPro-200, Myrinet | 156 | 7.0 | 247 | 26.7 | 334 | 290 | 448 | 239 |
| Hitachi SR2201 | n/a | n/a | n/a | n/a | 974 | 494 | 710 | n/a |
| Cray T3E-900 | 299 | 41.6 | 648 | 35.0 | 1256 | 879 | 1022 | 643 |
| SGI Origin 2000-195 | 576 | 69.7 | 604 | 31.1 | 973 | 1042 | 1770 | 1074 |
| IBM SP2-66WN | 314 | 10.5 | 710 | 29.7 | 923 | 899 | n/a | 606 |
To further illustrate the effect of compilers, I also show results on a single PII-300 node for the serial NAS benchmarks (version 2.3, class W problems). For all these benchmarks I used the optimization flags "/Ox /G5" for Microsoft Fortran Powerstation, and "/Ox /fast" for Digital Visual Fortran.
| Compiler | CG | EP | FT | IS | MG | BT | LU | SP |
|---|---|---|---|---|---|---|---|---|
| Microsoft Fortran Powerstation 4 | 21.9 | 0.76 | 18.3 | (6.0) | 38.7 | 48.0 | 57.5 | 34.0 |
| Digital Visual Fortran 5.0a | 23.1 | 0.77 | 32.9 | (6.0) | 39.1 | 36.9 | 41.0 | 32.6 |
While the two compilers perform similarly on most of the benchmarks, on the FT benchmark the DEC compiler is 1.8 times faster, whereas on the BT and LU benchmarks the Microsoft compiler is 1.3 and 1.4 times faster, respectively. Seen another way: the DEC compiler is at least as fast on all five kernels (CG, EP, FT, IS, and MG, with IS tied), while the Microsoft compiler is consistently faster on the three application benchmarks (BT, LU, and SP).
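The ratios quoted above can be rechecked directly from the table. The following is a minimal sketch, assuming the table entries are rates (higher is faster), which is what the comparison in the text implies; the names `MFP`, `DVF`, and `rate_ratios` are illustrative and not part of any benchmark tooling.

```python
# Serial class W results on a single PII-300 node, copied from the table above.
# Assumption: the entries are rates, so a larger number means a faster run.
BENCHMARKS = ["CG", "EP", "FT", "IS", "MG", "BT", "LU", "SP"]
MFP = [21.9, 0.76, 18.3, 6.0, 38.7, 48.0, 57.5, 34.0]  # Microsoft Fortran Powerstation 4
DVF = [23.1, 0.77, 32.9, 6.0, 39.1, 36.9, 41.0, 32.6]  # Digital Visual Fortran 5.0a


def rate_ratios(dvf, mfp):
    """Return {benchmark: DVF rate / MFP rate}; values above 1 favour DVF."""
    return {name: d / m for name, d, m in zip(BENCHMARKS, dvf, mfp)}


ratios = rate_ratios(DVF, MFP)

for name, ratio in ratios.items():
    if ratio > 1:
        print(f"{name}: DVF faster by a factor of {ratio:.2f}")
    elif ratio < 1:
        print(f"{name}: MFP faster by a factor of {1 / ratio:.2f}")
    else:
        print(f"{name}: tied")
```

Printing every ratio, rather than only the three largest, also makes the small differences on CG, EP, MG, and SP visible.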
For each of the NAS parallel benchmarks, and for both problem-size classes (A and B), I list the node counts on which I have actually run it with each combination of network, MPI implementation, and Fortran compiler. Where a run failed unexpectedly, I list the symptoms instead. Otherwise, I include a link to the results file for that run.
| Benchmark | Myrinet, HPVM, MFP | Fast Ethernet, HPVM, MFP | Fast Ethernet, MPI/Pro 1.1.3, MFP | Fast Ethernet, MPI/Pro 1.1.3, DVF | Fast Ethernet, MPI/Pro 1.2.0, MFP | Fast Ethernet, MPI/Pro 1.2.0, DVF |
|---|---|---|---|---|---|---|
| CG-A | 16, 8, 4, 2, 1 | 16 | 16, 8, 4, 2 | 16, 8, 4, 2, 1 | 16 | 16 |
| CG-B | Hangs | 16 | | | | |
| EP-A | 16, 8, 4, 2 | 16 | 16, 8, 4, 2 | 16, 8, 4, 2 | 16 | 16 |
| EP-B | 16, 8, 4 | | | | | |
| IS-A | 16, 8, 4 | Crashes | Access violation | Access violation | 16 | 16 |
| IS-B | 16, 8 | | | | | |
| FT-A | 16, 8, 4 | 16 | Verification fails | Verification fails | 16 | 16 |
| FT-B | 16 swaps | | | | | |
| MG-A | 16, 8, 4 | 16 | 16, 8 | 16, 8 | 16 | 16 |
| MG-B | 16, 8 | | | | | |
| BT-A | 16, 9, 4 | 16 | 16, 9, 4 | 16, 9 | 16 | 16 |
| BT-B | 16 | | | | | |
| LU-A | 16 swaps | 16 fails (step 120) | 16 hangs (step 40) | 16 hangs (step 40) | 16 | 16 |
| LU-B | | | | | | |
| SP-A | 16 swaps | 16 | 16 | 16 | 16 | 16 |
| SP-B | | | | | | |
Note the improvement in reliability from MPI/Pro 1.1.3 to 1.2.0: IS-A, FT-A, and LU-A, which all failed under 1.1.3, run on 16 nodes under 1.2.0. However, there is a corresponding decrease in performance.
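The reliability difference can be tallied from the class A rows of the table. The sketch below is only an illustration: the cell strings are transcribed from the four MPI/Pro columns, and the rule that a cell holding nothing but node counts means a clean run is my assumption about how to read the table, not something the results files state. The names `CLASS_A`, `ran_cleanly`, and `clean_counts` are hypothetical.

```python
# Class A cells for the Fast Ethernet, MPI/Pro columns, transcribed from the table.
BENCHMARKS = ["CG", "EP", "IS", "FT", "MG", "BT", "LU", "SP"]
CLASS_A = {
    "MPI/Pro 1.1.3, MFP": ["16, 8, 4, 2", "16, 8, 4, 2", "Access violation",
                           "Verification fails", "16, 8", "16, 9, 4",
                           "16 hangs (step 40)", "16"],
    "MPI/Pro 1.1.3, DVF": ["16, 8, 4, 2, 1", "16, 8, 4, 2", "Access violation",
                           "Verification fails", "16, 8", "16, 9",
                           "16 hangs (step 40)", "16"],
    "MPI/Pro 1.2.0, MFP": ["16"] * 8,
    "MPI/Pro 1.2.0, DVF": ["16"] * 8,
}


def ran_cleanly(cell):
    """Assumed rule: a cell is a clean run if it lists only node counts."""
    parts = [p.strip() for p in cell.split(",")]
    return bool(cell.strip()) and all(p.isdigit() for p in parts)


# Number of class A benchmarks with no reported failure, per configuration.
clean_counts = {
    config: sum(ran_cleanly(cell) for cell in cells)
    for config, cells in CLASS_A.items()
}

for config, count in clean_counts.items():
    failed = [b for b, c in zip(BENCHMARKS, CLASS_A[config]) if not ran_cleanly(c)]
    print(f"{config}: {count} of {len(BENCHMARKS)} clean, failures: {failed or 'none'}")
```

This counts only whether a run completed without a reported symptom; it says nothing about the performance decrease, which the table does not record.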