A Brief History of Supercomputing: "the Crays", Clusters and Beowulfs, Centers. What Next?

Gordon Bell
Bay Area Research Center
Microsoft Research



After all the attempts and diversions in building high performance computers, "the way" and market reward should be clear: it should be split among the Japanese vector or "Cray" supercomputer suppliers; and proprietary multi-computers a.k.a. clusters, a.k.a. massively parallel special purpose computers, from the few suppliers (e.g. Compaq, Cray, IBM, Intel, SGI, and Sun) that remained after the 20 year search for a new parallel computing paradigm. It isn’t.

After 20 years of false starts and dead ends in high-performance computer architecture, "the way" is now clear: Beowulf clusters are becoming the platform for many scientific, engineering, and commercial applications. Cray-style supers from Japan are still used for legacy or un-partitionable application code, but this is a shrinking fraction of supercomputing because such architectures aren’t scalable or affordable. If the code cannot be ported or partitioned, however, vector supers at larger centers are required. Likewise, the Top500 share of proprietary MPPs (massively parallel processors), SMPs (shared memory, multiple vector processors) and DSMs (distributed shared memory) that came from the decade-long, government-sponsored hunt for "the" scalable computer is declining. Unfortunately, the architectural diversity created by the hunt assured that a standard platform and programming model could not form. Each platform had low volume, huge sunk costs in software development, and lock-in to its vendor.

Just two Moore’s-Law generations ago (Bell, 1995), a plethora of vector supers, non-scalable multiprocessors, and MPP clusters built from proprietary nodes and networks formed the market. That made me realize the error of an earlier prediction that these exotic shared memory machines were supercomputing’s inevitable future. At the time, several promising Commercial Off-The-Shelf (COTS) technology clusters using standard microprocessors and networks were finally beginning to be built. Wisconsin’s Condor, which harvests workstation cycles, and Berkeley’s NOW (Network of Workstations) were my favorites. They provided one to two orders of magnitude improvement in performance/price over the proprietary systems, even when their higher operational overhead is included.

In 2001, the March announcement by "Cray" prompted me to look back at the two main paths for technical computing: "the Cray" vector multiprocessors, and clusters. Beowulf (Sterling, et al, 1999) was finally and clearly succeeding in cluster standardization. This has given users a common platform and programming model that is independent of proprietary processors and networks, building on a common operating system and tools. It enables almost any high school to build its own computer for a supercomputer application. An applications base and industry is finally forming almost two decades after DARPA initiated its SCI (Strategic Computing Initiative) search in 1983 for a parallel processing paradigm based on many, low cost "killer microprocessors"!

The following sections describe "the Cray" path, Beowulf as the way of the cluster, and the implications for centers. Finally, we speculate about whether the future holds a single path: are there any discontinuities, aside from potential bio, optical, and quantum computers, to take us off the main path? How will the situation look in two or four generations of Moore’s Law as we approach the petaflops due by 2010? Will it just be larger clusters? Or will clusters have peaked, limited to 10,000 processors, so that a petaflops will only be available from the international computing "Grid"?

The "Crays"

For more than a quarter of a century, "Cray" has been synonymous with supercomputers. Starting with Seymour Cray’s tenure at CDC, the Cray brand has moved from one organization to another (Figure 1). In March 2001, the "brand" was transformed to become the U.S. distributor for NEC supercomputers. Announcements of the agreement between Cray Inc. and NEC suggested the merger’s goals are a stronger business model, market presence, and stopping the worldwide decline of vector supercomputers. On the surface the transformation seemed to be a quid-pro-quo: Cray Inc. gets $25 million plus a competitive product to sell in the U.S., and in return, Cray and the U.S. Government will withdraw claims of Japanese supercomputer dumping, making it possible for U.S. users to buy NEC’s highest performance vector processors. The Cray Inc. announcement at www.cray.com was clear: "NEC to Invest $25 Million Cash in Cray Inc.; Long-Term Pact Gives Cray Exclusive Distribution in North America, Non-Exclusive Rights Worldwide; and Deal Aims to Maximize Global Sales of Cray and NEC Vector Products and Address Needs of Underserved U.S. Vector Market."

The NEC agreement adds a fourth arrow to a Cray Inc. salesperson’s bulging technical computer quiver (see Figure 1) that holds:

Cray’s SV1 & SV2 (to come) vector processors that date back to Cray Research aka "the Cray" of 1975

NEC’s line of multiple vector processor computers that can be scaled up to several dozen teraflops by clustering

Cray’s T3E MPP and the follow-on announced in February 2001. These machines are 10% of the Top500 (the world’s 500 fastest computers) and constitute 80% of the Alpha microprocessors in the Top500.

Tera’s MTA1 (GaAs) and the future MTA2 (CMOS) multi-threaded, general purpose architecture. Tera is the survivor of the merger with Cray Research.

Given the relative sizes of these product lines, the merger is really a transformation of the Cray brand into an NEC distributorship. This chapter is likely to parallel the Amdahl transformation into Fujitsu. The NEC investment also lets Cray Inc. a.k.a. Tera remain on its path to build another multi-threaded architecture computer. If successful, perhaps the technology can be transferred to NEC. Many architects believe (Sterling et al, 1995) that multi-threading is inevitable for future architectures.

The "Cray" has been unlucky at several decision points in its life! This latest move comes near the "three strikes and you’re out" rule. The supercomputer business, like most others, tolerates only a few errors or hobbies, even if there's a profitable business to pay for them. Let’s look at the evolution of the Cray brand… and then what it is facing!

Cray Research forms from Control Data Corporation (CDC)

1960s: CDC was formed in 1957 in Minneapolis, MN. Seymour Cray joined shortly thereafter and was the architect and designer of the CDC 1604 (1962). He moved the lab to Chippewa, WI to design the 6600 (1964) and 7600 (1969), defining the word "supercomputer" as the highest performance scientific computer of the day. In 1963, IBM’s President, Thomas Watson Jr., commented on the announcement of the 6600: "I understand that in a laboratory developing this system there are only 34 people, including the janitor. Of these, 14 are engineers and 4 are programmers. Contrasting this modest effort with our vast developing activities, I fail to understand why we have lost our industry leadership position by letting someone else offer the world’s most powerful computer." The 6600 was a landmark, innovative design that included: dense packaging and cooling for the fast clock, multiple parallel function units, timesharing a single processor to provide multiple processors (the basic idea of a multi-threaded architecture), I/O computers, and RISC (in contrast to the recently announced IBM System 360). The 7600 improved on the concepts by introducing pipelining, giving it a 20 times speed-up over the 6600 that operated at 1-3 Megaflops.

By 1965 Seymour was clearly recognized as THE world’s supercomputing architect, a title that he clearly held until his 1996 death.

But Cray was at odds with the CDC management, and so decided to start his own company when he abandoned the multiprocessor 8600. When he left, CDC lost its right brain, and it never recovered. CDC attempted to regain its technical capability with the Star 100 (1974) project, and the Cyber 205 (1981). In 1983 it spun out the project forming wholly owned ETA Systems that shipped one nitrogen cooled, CMOS ETA 10 in 1987. CDC shut down ETA in 1989.

1970s: Seymour left CDC, forming Cray Research. In 1976 it delivered the first practical vector processor (Cray 1) using the latest ECL gates and memories, innovative cooling, and the first vectorizing compilers. The Cray 1 operated at a peak speed of 100 Mflops and cost roughly three million dollars. In 2001, micros operate at ten times this speed, and the largest scalable computer, using several thousand micros, operates at a peak of 10 Teraflops, or 100,000 times faster. Centers cost up to 100 times as much ($300M).
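As a rough check on these ratios, the following minimal sketch recomputes them from the figures quoted above. The constant names are illustrative, and the performance-per-dollar figure is simply derived from the quoted peak rates and prices; it is not a measured result.

```python
# Figures quoted in the text (peak rates and approximate prices).
CRAY1_PEAK_FLOPS = 100e6          # Cray 1: 100 Mflops peak
CRAY1_COST_USD = 3e6              # roughly three million dollars
LARGEST_2001_PEAK_FLOPS = 10e12   # largest scalable computer in 2001: 10 Teraflops peak
CENTER_2001_COST_USD = 300e6      # up to $300M for a center

# Ratio of peak speeds: 2001 machine versus the Cray 1.
speedup = LARGEST_2001_PEAK_FLOPS / CRAY1_PEAK_FLOPS

# Ratio of prices over the same period.
cost_growth = CENTER_2001_COST_USD / CRAY1_COST_USD

# Derived (assumption: peak flops per dollar is a fair first-order comparison).
perf_per_dollar_gain = speedup / cost_growth

print(f"peak speedup:        {speedup:,.0f}x")
print(f"cost growth:         {cost_growth:,.0f}x")
print(f"peak flops/$ growth: {perf_per_dollar_gain:,.0f}x")
```

On these numbers, peak performance per dollar at the high end improved by about three orders of magnitude, considerably less than the raw speedup.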

1985: Seymour’s Cray 2 was introduced. Seymour then started the Cray 3, using GaAs to exploit its faster gate speed. The GaAs and packaging technology development was costing too much and in 1989 he was thrown overboard to form Cray Computer (see the next section).

1980-1995: Working at Cray Research, Steve Chen and Les Davis morphed the vector processor into shared memory, vector multiprocessors: X (82), Y (88), C, J (minisuper), T (94), and the current SV line. In the process, Cray Research made at least one too many ECL based computers while the Japanese switched to CMOS, exploiting Moore’s Law.

With waning vector performance because it waited too long to utilize CMOS, Cray and the U.S. Government shut off the import of NEC’s SX-series computers. Several users (e.g. in 1997 Bill Buzbee at NCAR) attempted to buy the SX-4, but lobbying and "policy" were successful and NSF disallowed the purchase. Shortly thereafter, the Dept. of Commerce filed dumping charges against NEC. Others suggested a merger: NEC would do high end vectors, and Cray would do MPP… thereby segmenting to focus on the small market.

In the late 80s, Cray Research responded to the call from the ARPA Strategic Computing Initiative to build future computers as massively parallel arrays of rapidly evolving "killer" microprocessors. Furthermore, most of its government users went into a holding pattern awaiting MPPs because money was being earmarked for these new weird machines in order to develop a market and software. Cray went after MPP with a vengeance, creating the T3D line of MPPs using Digital’s Alpha CMOS microprocessor in 1993 and a proprietary fast, low latency, interconnecting network. The T3D and T3E computers dominated the Top500 computer installations until the late 90s. In 1995 an "E" was the first computer to reach one TeraFlops.

Not wanting to bet it all on just two paths (old fashioned vector multiprocessors with legacy software and a programming model, plus an MPP with no software or programming model), Cray Research bought Floating Point Systems, maker of a large scale, shared memory multiprocessor that used SUN’s Sparc micro. Thus, it had nearly all bases covered except systems built from commodity micros and networks, and the scalable multiprocessors that SGI had bet on.

The Cray Research decision to evolve to three architectures is most likely the single biggest error that caused their downward spiral. A platform requires the continued development of hardware and software to be competitive. Platforms are only valuable if apps are available, requiring extensive porting, testing, training, marketing, sales, and customer support. Once a customer starts using an app, the investment in data and user training swamps nearly all other costs! This implies that unless a company falters badly in successor platform introduction, a customer is "locked in" to a processor x O/S platform (e.g. Sparc x Solaris) for the life of an application.

Meanwhile Fujitsu, Hitachi, and NEC adopt the "Cray" Recipe

With the 1970s introduction of the Cray vector architectures and a good programming model based on parallelizing, vectorizing Fortran compilers, the Japanese computer companies saw a clear plan forward. They simply adopted the vector-super model and built machines for this Fortran programming model. They refined their architectures as technology improved. When CMOS became viable, they were quick to adopt it. In the late 90s, Fujitsu produced the first distributed vector clusters to achieve over 100 Gflops.

In 2001 the various Japanese vector computers dominate technical computing worldwide for apps where legacy programs exist and where fine grain parallelism is required. This amounts to only xx computers in the world’s Top 500 list.

NEC continues to build and market the vector multiprocessor and clustered supercomputer, introducing the SX-5 in 2000 as the world’s fastest vector supercomputer. They promise to deliver a 40 Teraflops computer in 2002.

Cray Research becomes SGI

In 1996 Cray Research "merged with" (was acquired by) SGI because, they rationalized, "both companies sold to the same users" and the market was static or shrinking. Given the product-line duplications, the combined company was headed for bankruptcy, with five platforms to support.

The first act by SGI was to sell the Sparc based multiprocessor to SUN. This act gave true meaning to the words: "A capitalist is someone who’ll sell you the rope to hang him". SUN proceeded to showcase the 64 processor multiprocessor as the Enterprise 10,000. SUN’s marketing created a demand for their premium-priced E10000 platform as the worldwide web server market took off. This took potential sales from SGI, which was beginning to enter the low end enterprise and web market. In essence SGI "hung itself".

SGI made several attempts to transform, or kill the other parts of the Cray product line, and to merge them with SGI’s MIPS-based scalable multiprocessors. In August 1999, it gave up and Cray Research was made an independent business unit to focus on the high-end supercomputer market and potentially let their MPP and vector lines compete with their MIPS-based, scalable multiprocessor.

Assets of this part of SGI were sold to Tera Computer Company in March 2000. SGI, having bought the Intel story, started to move its product line to the 64-bit Intel micros based on the Itanium. In 2001, SGI languishes awaiting a competitive Intel micro.

Meanwhile Cray Computer and SRC Company are formed

In 1989, Seymour Cray founded Cray Computer and moved to Colorado Springs, CO. Cray Computer was able to make a few processors that operated at a 2 ns (500 MHz) clock. In 1993, a system was delivered to NCAR for test, and he moved on to the Cray 4, which would operate at a 1 ns clock. A prototype processor was made, but no full system was developed. Cray Computer folded in xx.

By 19xx, Seymour R. Cray saw the light, creating the SRC Company with Jim Guzy to build a large scale, shared memory multiprocessor using Intel processors. Unfortunately, they believed the schedule for Itanium, Intel and HP’s extension to the x86 architecture.

Seymour Cray tragically died in an auto accident in 1996. In 2000, SRC delivered Seymour’s last computer design and is awaiting a 64-bit micro in order to build the next machine and realize Seymour’s design.

Cray Inc. springs from Tera Computer’s Cray Research acquisition

Burton Smith joined Denelcor in 1974 to build the Heterogeneous Element Processor (HEP), extending the basic idea of the 6600 I/O computers to multiprocessing. This first massively multi-threaded architecture (MTA) was delivered in 1982. By 1985, the lack of performance overcame the fact that the HEP was just "interesting" and the company folded. Burton went to SRC, NSA’s research center, where he continued to refine MTA.

In 1987 Burton and Rottsolk founded Tera Computer to build another MTA, and in 1995 the company went public. After 10 years the company delivered its first MTA, built on GaAs. An 8 processor system was delivered and used at the San Diego Supercomputer Center.

By the late 90s Tera also "saw the CMOS light" and started to develop a CMOS-based HEP2. The world still awaits this third instantiation of an MTA. The company’s web site reports: "Now, after more than 10 years in development, the MTA systems are on the verge of full scale commercialization."

After several years of zero revenue, a dwindling cash supply, and no new story, Tera acquired Cray Research from SGI in March 2000. This allowed the SGI management to focus on its main business. It also provided Tera with a new non-zero revenue story and a new brand. Equally important, it provided Tera with a maintenance revenue stream and potential sales to the vector and MPP Cray users.

Cray Inc. back to the future: distributing vector supercomputers

In March of 2001 Tera became a distributor for NEC supercomputers. With four architectures, Cray Inc. has nearly all the bases covered except shared memory multiprocessors and low cost commodity clusters e.g. Beowulf. Here’s what we have learned and might expect.

A successful company is unlikely to be able to manage and support more than a single architecture. Cray Inc. will be challenged to support four architectures. Some massive change(s) will have to occur in order to succeed.

For whatever reasons, Cray Research, Cray Computer, and Tera picked the wrong technology (GaAs) or stayed with ECL too long rather than recognizing CMOS. This cost them the game, as they never mastered the design of VLSI like their Japanese competitors did. Clearly, they failed to understand Moore’s Law and the concept of high volume learning curves.

If what you are doing makes sense independent of government funding, keep on a path and refine like NEC did with their SX series. Software investment dominates the latest, greatest hardware architecture. While the U.S. bought the MPP argument, Japanese vendors stayed their vector course, allowing them to become the vector supplier, to a small niche market. However, the Japanese government involvement to maintain three vendors assures a net negative return.

SGI sold the FPS portion of their Cray Research acquisition to SUN, thereby creating a competitive commercial server at a time when SGI was beginning to enter this market. Dumb luck and good marketing in SUN’s acquisition of the FPS multiprocessor compensated for lack of planning and product development. However, large, shared memory multiprocessors aren’t scalable and they are substantially higher priced than smaller multiprocessors and clusters, hence they are limited to buyers who insist on high prices and a single computer. Furthermore, it kept SUN from evolving to cluster computing.

SGI planned to move from its MIPS architecture to the Intel architecture. Intel is several years behind schedule with the Itanium, though, so SGI is struggling. Similarly SRC Company awaits the Intel micro. Putting all your chips on a promised technology is likely to be fatal. More importantly, server companies trying to build large scale servers are waiting for a 64 bit processor.

Maintaining a monopoly with government support is risky as monopolies are likely to be incompetent or inefficient. By declaring dumping, Japanese supers were effectively shut out of the U.S. market for almost two decades. This deprived users of the ability to run legacy programs and forced them into an immature parallel programming model. In the very long term, the world market (i.e. computing needs) will prevail and cause change.

Competition is better than government action; but in the long term, some rationality may come to the aid of users. After huge subsidies of research and mandated purchasing of U.S.-supplied experimental high performance computers of all types, beginning with the 1983 DARPA SCI program and continuing through the 1996 DOE ASCI program, plus restricted imports of supers to support legacy and fine grain apps that U.S. companies have failed to support, the end of all the government action is the NEC SX-5! Outside of the U.S., where there is no U.S. Government control, the NEC machines are not big sellers compared to SP2s. Most of the non-U.S. Top500s are SP2s (not vector machines).

No matter how much government and industry action (e.g. lobbying), planning, research, development, product strategy, and installed base there is, serendipity may win! In this case, Beowulf, using COTS (commercial off-the-shelf) technology composed of PCs, Linux and Windows, makes it possible for any group to buy and build their own supercomputer. Once a foothold is gained, the world may "tip" to change the course of computing.

Beowulfs: The recipe for building do-it-yourself supers

Beginning with ARPA’s 1983 Strategic Computing Initiative (SCI) to research, design, build, and buy scalable, parallel computers, a recipe has emerged from the incredible effort to exploit parallel processing, including two score failed startups who gave their lives in the search. Looking at the evolution, contributions came from the users of the plethora of past and current clusters and MPPs (clusters using unique or proprietary processors and/or interconnection networks), including those from Compaq (and Digital), the Cray T3 series, IBM SPx, Intel, and, to a limited extent, SGI.

From all the attempts to build supers using microprocessors, one small (i.e. two-plus person) effort, Beowulf, outside the standard parallel processing research funding track, succeeded. On the other hand, Beowulf utilized the two decades of research and applications from the plethora of products and research that various agencies and users had funded.

Beowulfs have been the enabler of do-it-yourself cluster computing using commodity microprocessors; the Linux O/S with Gnu tools and, most recently, the Windows 2000 O/S; tools that have evolved from the MPP research community; and a single platform standard that finally allows applications to be written that will run on more than one computer. A single platform allows a community to form such that users can be trained and software developed that will operate on more than a single vendor’s machine. Since the Beowulf platform has become a standard, with hardware and software available from multiple sources and a constant programming model, users have an alternative to "the Cray supercomputer".

The Beowulf Project was based on commodity technology. The project was started in 1993 based on NASA’s requirement for a one Gflops workstation costing under $50,000. The recipe for building your own Beowulf is contained in "How to Build a Beowulf" (Sterling, et al 1999). A 16 node, $40,000 cluster built from Intel 486 computers ran in 1994. In 1997, a Beowulf cluster won the Gordon Bell Prize for performance/price. By 2000, several thousand-node computers were operating. In December 2000, 28 Beowulfs were among the world’s top 500 supercomputers, and the Beowulf population is estimated to be several thousand since any technical high school can buy and assemble its standard parts.

Beowulf’s success goes beyond the creation of an "open source" model for the scientific software community. It is based on the two decades of parallel processing research and many attempts to apply loosely coupled computers to a variety of apps. Some of the components include:

Operating system primitives in Linux and Gnu tools that support the platform and networking hardware to provide basic functions

Windows Beowulf to provide a more robust and complete commercial standard

Message passing interface (MPI) programming model

Various parallel programming paradigms, including Linda, the parallel virtual machine (PVM), and Fortran dialects for both message passing and single stream

Parallel file systems, awaiting transparent database technology

Monitoring, debugging, and control tools

Scheduling and resource management, e.g. Wisconsin’s Condor and the Maui scheduler

Higher level libraries, e.g. Linpack, BLAS
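To make the message passing model in this list concrete, here is a minimal sketch of the scatter, compute, and gather pattern that a typical cluster program follows. It is an illustration only: threads and queues from the Python standard library stand in for cluster nodes and the network, and the names `worker` and `parallel_sum` are hypothetical. A real Beowulf program would use an MPI library rather than anything shown here.

```python
import queue
import threading


def worker(rank, inbox, outbox):
    """One 'node': receive a chunk of numbers, send back (rank, partial sum)."""
    chunk = inbox.get()                # blocking receive of this node's data
    outbox.put((rank, sum(chunk)))     # send the partial result to the coordinator


def parallel_sum(data, n_workers=4):
    """Scatter data to workers by message, then gather and combine partial sums."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(rank, inboxes[rank], results))
        for rank in range(n_workers)
    ]
    for t in threads:
        t.start()

    # Scatter: each worker gets every n_workers-th element, starting at its rank.
    for rank in range(n_workers):
        inboxes[rank].put(data[rank::n_workers])

    # Gather: collect exactly one message per worker, in whatever order they arrive.
    partials = dict(results.get() for _ in range(n_workers))

    for t in threads:
        t.join()
    return sum(partials[rank] for rank in range(n_workers))


print(parallel_sum(list(range(1, 101))))  # prints 5050
```

The point of the sketch is that the workers share no variables; all coordination happens through explicit messages, which is what lets the same program structure run across separate machines.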

Centers versus "do-it-yourself supers": implications

Going back nearly 20 years, the initial NSF supercomputing centers program was established in response to Digital’s VAX minicomputers, which enabled each scientist to own and operate their own computer. Although the performance gap between the VAX and a "Cray" could be as large as 100, the performance per price was usually the reverse. Scientists left centers because they were unable to get sufficient computing power compared to a single user VAX. Supercomputer centers were supposedly only to run the giant jobs that were too large for these personal or departmental systems.

Various government studies bemoaned the lack of supercomputer centers. As a result, NSF bought time at existing supercomputer centers (e.g. U. of Minnesota). By 1988 NSF had established five centers, excluding NCAR, at Cornell, U. of IL, UC/San Diego, Pittsburgh (CMU, U. of Pittsburgh and Westinghouse), and Princeton. It was expensive to maintain these centers, and keeping all of them at the leading edge of technology was neither affordable nor justified, especially in light of the relatively small number of users. Each center is a national asset and, to be competitive, has to be comparable to the world’s largest computers. A center has to be at least two orders of magnitude larger than what a single researcher needs. Furthermore, these researchers were often competing to make breakthroughs with their counterparts in extremely well funded Dept. of Energy labs. The various DOE labs had been given the mandate to reach Teraflops and now Petaflops levels as soon as possible in order to fulfill their role as the nation’s nuclear stockpile steward. Finally, in 1999, in response to these realities, NSF reduced the number of supercomputing centers to two, in order to consolidate as much power as possible in a center and achieve several Teraflops of performance. The plan was that each year or so, one of the centers would leapfrog the other with new technology in order to keep centers at the forefront of technology and provide services that no single user could afford. In 2001, NSF got back on a path to create more, smaller centers, a policy that runs counter to supercomputing. Having more centers assures that users will be deprived of more powerful centers for a given budget level.

The centers idea may already be too late since a center user usually gets 64-128 nodes, comparable in size to the Beowulf that most researchers have or can easily build up in their labs (like their VAXen). To be competitive today, a supercomputer center needs to have at least 1,000 new (less than two years old) nodes.

So, a real question is how large is the market for the 100 million dollar computers that go into supercomputer centers, given the fact that hundreds of labs are building supercomputer clusters using the Beowulf recipe? For most applications, users can put together their own centers at performance benefits approaching 100x! So the situation favors the individual even more so than when VAXen undermined supers two decades ago. Perhaps another use of these scarce funds is funding a few vector supercomputers in one or more centers to run all the apps that users have invested billions trying, either successfully or unsuccessfully, to convert to massively parallel processing. Certainly the earth sciences community at NCAR attempted to go this way, but was rebuffed.

Future Directions

At a 2000 conference in Maui, Bill Buzbee, former director of NCAR, stated: "Some believe that capability computing is vital to U.S. national security. If this is correct, then capability computing is a strategic technology and should be accorded commensurate national priority… billions of government funding has gone to development of low-bandwidth-high-latency computing technology and relatively little has gone to development of high-bandwidth-low-latency technology. Consequently, the February, 1999, report from the President's Information Technology Advisory Committee http://www.itrd.gov/ac/report/pitac_report.pdf , pp 62 recommends:

'There is evidence that current scalable parallel architectures may not be well suited for all applications, especially where the computations' memory address references are highly irregular or where huge quantities of data must be transferred from memory… we need substantive research on the design of memory hierarchies that reduce or hide access latencies while they deliver the memory bandwidths required by current and future applications.'

If implemented, this recommendation will provide a long-term solution to the mismatch in simulation capability between U.S. scientists and their international colleagues. The U.S. computing industry can build capability-computers that match and surpass any in the world."

This is a strong argument for continued funding of research to assure that Moore’s Law will continue to be valid. Based on the last two decades, progress will continue for another decade through the continuation of Moore’s Law. Most likely, performance will come from a hierarchy of computers starting with multiprocessors on a chip. For example, several commodity chips with multiple processing units are being built to operate at 20 Gflops. As the performance of future chips approaches 100 Gflops, only 10,000 would be required for a petaflops.

It is unclear where we will be in another decade with respect to achieving petaflops level computing. In 2001, the world’s 500 top computers consist of about 100,000 processors, each operating at about one gigaflops, and together deliver slightly over 100 teraflops. Most certainly clusters will be one foundation. Similarly, the GRID will be operating using the Internet II.
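The processor counts behind these estimates follow from simple division. The sketch below, with illustrative names and peak rates only, reproduces the figures quoted above; it ignores interconnect and efficiency losses, so it is a lower bound on the parts needed rather than a design.

```python
GIGAFLOPS = 10**9
TERAFLOPS = 10**12
PETAFLOPS = 10**15


def chips_needed(target_flops, flops_per_chip):
    """Smallest whole number of chips whose combined peak reaches the target."""
    return -(-target_flops // flops_per_chip)   # integer ceiling division


# Chips required for one petaflops at the two per-chip rates mentioned in the text.
at_20_gflops = chips_needed(PETAFLOPS, 20 * GIGAFLOPS)
at_100_gflops = chips_needed(PETAFLOPS, 100 * GIGAFLOPS)

# Aggregate peak of the 2001 Top500: about 100,000 processors at about 1 Gflops each.
top500_aggregate = 100_000 * 1 * GIGAFLOPS

print(f"chips for 1 Pflops at  20 Gflops each: {at_20_gflops:,}")
print(f"chips for 1 Pflops at 100 Gflops each: {at_100_gflops:,}")
print(f"2001 Top500 aggregate: {top500_aggregate // TERAFLOPS} Tflops")
print(f"gap to one petaflops:  {PETAFLOPS // top500_aggregate}x")
```

Put differently, a single petaflops machine would need about ten times the combined peak of every computer on the 2001 list.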

On the other hand, it is hardly reasonable to expect a revolutionary technology within this time period because we see no laboratory results for a near term revolution. Certainly petaflops-level performance will be achieved by building special purpose computers of the kind that IBM has demonstrated.

An interesting scenario occurs when Gigabit and 10 Gigabit Ethernets become the de facto LAN. As network speed and latency improve more rapidly than processing, message passing looks like memory access, making data equally accessible to all nodes within a local area. These match the speed of the next generation Internet. This would mean any LAN based collection of PCs becomes a de facto Beowulf! Beowulfs and GRID computing technologies will become more closely related to each other than they are now. I can finally see the environment that I challenged the NSF Computer Science Research community to build in 1987!

By 2010, there are several very interesting paths for more power through parallelism that Beowulf could host:

In situ, Condor scheduled, workstations provide de facto clusters, providing scale up of 100-10,000x in many environments as described above.

Large, on chip caches, with multiple processors to give much more performance for single nodes.

Disks with embedded processors in a Network Attached Storage architecture, as opposed to Storage Area Networking that connects disks to nodes and requires a separate System Area Network to interconnect nodes.

In 2001 a relatively large number of apps can utilize Beowulf technology by "avoiding" parallel programming, including:

Web and Internet servers that run embarrassingly parallel to serve a large client base.

Commercial transaction processing, including inherently parallelized databases.

Monte Carlo simulation and image rendering that are embarrassingly parallel.
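The Monte Carlo case shows why such workloads need so little parallel programming: each task runs with its own random stream and the only communication is a final sum. Here is a minimal sketch, estimating pi by random sampling. The function names and sample counts are hypothetical, and a thread pool stands in for cluster nodes; on a real cluster each task would be dispatched to a separate machine, for example by a scheduler such as Condor.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def count_hits(seed, n_samples):
    """Independent task: count random points in the unit square that fall
    inside the quarter circle of radius 1. Uses its own seeded generator,
    so it shares no state with any other task."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(n_tasks=8, samples_per_task=20000):
    """Run the tasks independently, then combine their counts in one step."""
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        hits = pool.map(count_hits, range(n_tasks), [samples_per_task] * n_tasks)
        total_hits = sum(hits)
    # Area of quarter circle / area of unit square = pi / 4.
    return 4.0 * total_hits / (n_tasks * samples_per_task)


print(estimate_pi())
```

Because the tasks never talk to each other, doubling the number of nodes doubles the number of samples per unit time, which is what "embarrassingly parallel" means in practice.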

Progress has been great in parallelizing applications (e.g. n-body problems) that had challenged us in the past. The most important challenge is to continue on the course to parallelize those applications heretofore deemed the province of shared memory multiprocessors. These include problems requiring random variable access and adaptive mesh refinement. For example, automotive and aerodynamic engineering, climate and ocean modeling, and applications involving heterogeneous space remain the province of vector multiprocessors. It is essential to have "the list" of challenges to log progress – unfortunately, the vector-super folks have not provided this list.

Although great progress has been made by computational scientists working with computer scientists, the effort to adopt, understand, and train computer scientists in this form of parallelism has been minimal. Few computer science departments are prepared


Bill Buzbee, Larry Smarr, John Toole, Steve Squires, Tom Sterling, and Jack Worlton all made substantial comments on the manuscript. In addition, Peter Denning encouraged me to write the article after I had written the initial observations about the Cray Inc. deal with NEC. Finally, Jim Gray did the hard work of editing it several times in its evolution.


Bell, G., "The Future of High Performance Computers in Science and Engineering", Communications of the ACM, Vol 32, No. 9, September 1989, pp 1091-1101.

Bell, G., "Ultracomputers: A Teraflop Before Its Time", Communications of the ACM, Vol. 35, No. 8, August 1992, pp 27-45.

Bell, G., "1995 Observations on Supercomputing Alternatives: Did the MPP Bandwagon Lead to a Cul-de-Sac?", Communications of the ACM, Vol. 39, No. 3, March 1996, pp 11-15.


Foster, I. and Kesselman, C., editors, The GRID: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, San Francisco, 1999.

Sterling, T., J. Salmon, D.J. Becker, and D.V. Savarese. "How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters", MIT Press, Cambridge, MA, 1999.

Sterling, T. "Beowulf PC Cluster Computing with Windows" and "Beowulf PC Cluster Computing with Linux", MIT Press, Cambridge, MA, 2001.

Sterling, Thomas; Paul Messina; and Paul H. Smith, "Enabling Technologies for Petaflops Computing", MIT Press, Cambridge, MA, July 1995.

Figure 1. Timeline of the evolution of vector processing, "Cray Computers", and Cray Companies.