Cambridge Cluster
NAS Parallel Benchmarks

Jonathan Hardwick, October 1998

This page shows the performance of the Cambridge cluster on the NAS parallel benchmarks (version 2.3), which are designed to represent the kind of problems that NASA is interested in solving on parallel computers. The Cambridge cluster consists of 16 Windows NT nodes connected by both a Myrinet interconnect and switched Fast Ethernet. Each node contains two PII-300 CPUs (although only one was used for these benchmarks), 128 MB of SDRAM, and a 4 GB EIDE disk.

Summary

The table below summarizes the performance, in aggregate Mops/s, of 16 processors of various machines on the NAS parallel benchmarks (version 2.3, class A problems). The first four entries are for the Cambridge cluster (CC) using different interconnects (Myrinet versus Fast Ethernet), MPI implementations (HPVM 2.1 versus MPI/Pro 1.2.0), and compilers (Microsoft Fortran PowerStation 4.0, abbreviated MFP, versus Digital Visual Fortran 5.0a, abbreviated DVF). The cluster was swapping on the FT and SP benchmarks due to lack of memory. The Loki and RWC machines are Beowulf-class clusters, while the Hitachi, Cray, SGI and IBM machines are current-generation supercomputers.

Machine                            CG    EP    FT    IS    MG    BT    LU    SP
CC: Myrinet, HPVM, MFP            231  11.4   217  18.5   504   496   n/a   343
CC: Fast Ethernet, MPI/Pro, MFP    26  11.4   174   4.1   116   251   213   124
CC: Fast Ethernet, MPI/Pro, DVF    19  12.4   236   4.1   123   242   247   125
CC: Fast Ethernet, HPVM, MFP       49  10.6   163   n/a   341   526   n/a   360
Loki: PPro-200, Fast Ethernet     n/a   8.8   251  15.1   281   359   453   242
RWC: PPro-200, Myrinet            156   7.0   247  26.7   334   290   448   239
Hitachi SR2201                    n/a   n/a   n/a   n/a   974   494   710   n/a
Cray T3E-900                      299  41.6   648  35.0  1256   879  1022   643
SGI Origin 2000-195               576  69.7   604  31.1   973  1042  1770  1074
IBM SP2-66WN                      314  10.5   710  29.7   923   899   n/a   606
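
As a rough illustration (not part of the original measurements), the two Cambridge cluster rows that share the same MPI implementation and compiler (HPVM, MFP) can be compared to isolate the effect of the interconnect. The sketch below simply divides the Myrinet figures by the Fast Ethernet figures from the table above; the variable and function names are my own, and "n/a" entries are represented as None and skipped.

```python
# Aggregate Mops/s on 16 processors (class A), copied from the summary table.
# None stands for an "n/a" entry in the table.
BENCHMARKS = ["CG", "EP", "FT", "IS", "MG", "BT", "LU", "SP"]
myrinet_hpvm_mfp = [231, 11.4, 217, 18.5, 504, 496, None, 343]
fast_ethernet_hpvm_mfp = [49, 10.6, 163, None, 341, 526, None, 360]


def ratios(numerator, denominator):
    """Return {benchmark: numerator / denominator}, skipping n/a entries."""
    return {
        name: round(n / d, 2)
        for name, n, d in zip(BENCHMARKS, numerator, denominator)
        if n is not None and d is not None
    }


interconnect_ratio = ratios(myrinet_hpvm_mfp, fast_ethernet_hpvm_mfp)
for name, ratio in interconnect_ratio.items():
    print(f"{name}: Myrinet / Fast Ethernet = {ratio}")
```

With these figures the ratio is largest for CG (about 4.7) and falls just below 1 for BT and SP. IS and LU drop out because at least one of the two rows has no result for them.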

To further illustrate the effect of compilers, I also show results on a single PII-300 node for the serial NAS benchmarks (version 2.3, class W problems). For all these benchmarks I used the optimization flags "/Ox /G5" for Microsoft Fortran PowerStation, and "/Ox /fast" for Digital Visual Fortran.

Compiler                               CG     EP     FT     IS     MG     BT     LU     SP
Microsoft Fortran PowerStation 4.0   21.9   0.76   18.3  (6.0)   38.7   48.0   57.5   34.0
Digital Visual Fortran 5.0a          23.1   0.77   32.9  (6.0)   39.1   36.9   41.0   32.6

While the two compilers perform similarly on the majority of benchmarks, on the FT benchmark the Digital (DEC) compiler is 1.8 times faster, whereas on the BT and LU benchmarks the Microsoft compiler is 1.3 and 1.4 times faster, respectively. Seen another way: the DEC compiler is at least as fast on all five kernels (CG, EP, FT, IS, and MG, with IS a tie), while the Microsoft compiler is consistently faster on the three application benchmarks (BT, LU, and SP).
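
The ratios quoted above can be reproduced from the serial table with a few lines of Python. This is only a sketch for checking the arithmetic: the names are my own, the parenthesized IS entries are entered as plain numbers, and the kernel/application split follows the text (BT, LU, and SP are the applications, the other five are kernels).

```python
# Serial class W results on one PII-300 node, copied from the table above.
# The IS values appear in parentheses in the table; the parentheses are dropped here.
BENCHMARKS = ["CG", "EP", "FT", "IS", "MG", "BT", "LU", "SP"]
KERNELS = {"CG", "EP", "FT", "IS", "MG"}  # BT, LU, SP are the application benchmarks
mfp = [21.9, 0.76, 18.3, 6.0, 38.7, 48.0, 57.5, 34.0]  # Microsoft Fortran PowerStation
dvf = [23.1, 0.77, 32.9, 6.0, 39.1, 36.9, 41.0, 32.6]  # Digital Visual Fortran


def faster_compiler(mfp_rate, dvf_rate):
    """Return (winner, speedup), where speedup is the faster rate over the slower one."""
    if dvf_rate > mfp_rate:
        return "DVF", dvf_rate / mfp_rate
    if mfp_rate > dvf_rate:
        return "MFP", mfp_rate / dvf_rate
    return "tie", 1.0


comparison = {
    name: faster_compiler(m, d) for name, m, d in zip(BENCHMARKS, mfp, dvf)
}
for name, (winner, speedup) in comparison.items():
    kind = "kernel" if name in KERNELS else "application"
    print(f"{name} ({kind}): {winner}, x{speedup:.2f}")
```

Rounded to one decimal place, the speedups for FT, BT, and LU come out as 1.8, 1.3, and 1.4, matching the figures in the text, and IS is reported as a tie.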

Full Results

For each of the NAS parallel benchmarks, and for the two classes of problem size (A and B), I give the node counts on which I have actually run it for each combination of network, MPI implementation, and Fortran compiler. If there was an unexpected failure I've listed the symptoms. Otherwise, I include a link to the results file for that run.

Network:  | Myrinet        | Fast Ethernet       | Fast Ethernet      | Fast Ethernet      | Fast Ethernet | Fast Ethernet
MPI:      | HPVM           | HPVM                | MPI/Pro 1.1.3      | MPI/Pro 1.1.3      | MPI/Pro 1.2.0 | MPI/Pro 1.2.0
Compiler: | MFP            | MFP                 | MFP                | DVF                | MFP           | DVF
CG-A      | 16, 8, 4, 2, 1 | 16                  | 16, 8, 4, 2        | 16, 8, 4, 2, 1     | 16            | 16
CG-B Hangs 16
EP-A      | 16, 8, 4, 2    | 16                  | 16, 8, 4, 2        | 16, 8, 4, 2        | 16            | 16
EP-B 16, 8, 4
IS-A      | 16, 8, 4       | Crashes             | Access violation   | Access violation   | 16            | 16
IS-B 16, 8
FT-A      | 16, 8, 4       | 16                  | Verification fails | Verification fails | 16            | 16
FT-B 16 swaps
MG-A      | 16, 8, 4       | 16                  | 16, 8              | 16, 8              | 16            | 16
MG-B 16, 8
BT-A      | 16, 9, 4       | 16                  | 16, 9, 4           | 16, 9              | 16            | 16
BT-B 16
LU-A      | 16 swaps       | 16 fails (step 120) | 16 hangs (step 40) | 16 hangs (step 40) | 16            | 16
LU-B
SP-A      | 16 swaps       | 16                  | 16                 | 16                 | 16            | 16
SP-B

Note the improvement in reliability from MPI/Pro 1.1.3 to 1.2.0. However, there is a corresponding decrease in performance.

