Astronomers at The Johns Hopkins University and protein scientists at the University of Washington are using inexpensive computer hardware combined with powerful computing and database software to help manage and analyze a growing volume of scientific data.
Managing Vast Quantities of Scientific Data
Just as the human eye cannot track every flicker of light or shift in color emanating from all the stars in the night sky, scientists in many fields are finding it difficult to comprehend the meaning of a vast and fast-growing universe of information that has become available in recent years as the result of data-intensive computing technologies.
The Sloan Digital Sky Survey is collecting images of many millions of galaxies, individual stars and astronomical spectra. The data is structured to allow scientists to explore the 2-D images and, as illustrated here, use it to create a 3-D map of the night sky. (The Sloan Digital Sky Survey)

Advances in high-performance computing (HPC) have enabled researchers to collect and store an enormous volume of data. Nonetheless, researchers in data-intensive areas such as digital astronomy, particle physics, genomics, and protein analysis increasingly need tools to help them process and analyze these large quantities of information.
“To put it simply, a scientist needs to be able to live within the data,” says Alexander Szalay, a cosmologist turned computer scientist at The Johns Hopkins University (JHU) in Baltimore, Maryland. The power of information, Szalay says, is determined not by its quantity so much as by how easy it is to access, manipulate, and analyze.
“It’s not just about doing the numerical calculations,” adds Andrew Simms, a biomedical health informatics graduate student working on protein structure analysis in Valerie Daggett’s bioengineering laboratory at the University of Washington (UW) in Seattle. “It’s also about assembling the data so we can run calculations while performing analyses and ad hoc explorations and then feed it all back into the data warehouse.”
Uniting High-Performance Computing with Data Analysis
As part of an effort to better marry advances in high-performance computing and data analysis, the JHU astronomers and UW protein researchers are partnering with Microsoft External Research to develop a set of software services and design principles known as GrayWulf, which is based on the use of commodity hardware, Windows HPC Server 2008 and Microsoft SQL Server 2008.
GrayWulf builds on the work of Jim Gray, a Microsoft Research scientist and pioneer in database and transaction processing research. It also pays homage to Beowulf, the original computer cluster developed at NASA using “off-the-shelf” computer hardware.
Beowulf helped to transform scientific computing in the mid-1990s by filling the gap between the desktop PC and the supercomputer. The commodity cluster approach allowed many more scientists, most of whom did not (and still do not) have access to costly supercomputers, to perform HPC tasks on a scalable, parallel-processing “virtual supercomputer” crafted from inexpensive hardware and software.
GrayWulf is an evolutionary extension of the Beowulf cluster and is aimed at helping scientists efficiently process and analyze the massive amounts of information they are collecting. By following Gray’s admonition to “move computation to the data rather than the data to the computation,” GrayWulf demonstrates the utility of databases in helping to manage large amounts of information.
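The practical difference between shipping data to the computation and pushing the computation into the database can be sketched in a few lines. The following is a minimal illustration only, using Python's built-in SQLite as a stand-in for the SQL Server databases GrayWulf actually uses; the table and column names are hypothetical, not taken from any survey schema.

```python
# Sketch of "move computation to the data": filter rows inside the
# database engine instead of fetching the whole table to the client.
# Hypothetical example; not the Sloan survey schema or GrayWulf code.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (obj_id INTEGER, mag REAL, color REAL)")
conn.executemany(
    "INSERT INTO objects VALUES (?, ?, ?)",
    [(i, 15.0 + i * 0.01, i * 0.01) for i in range(1000)],
)

# Data-to-computation: pull every row across, then filter in Python.
rows = conn.execute("SELECT mag, color FROM objects").fetchall()
bright_client = [r for r in rows if r[0] < 17.0 and r[1] > 0.5]

# Computation-to-data: the WHERE clause runs in the engine, so only
# qualifying rows ever leave the database.
bright_server = conn.execute(
    "SELECT mag, color FROM objects WHERE mag < 17.0 AND color > 0.5"
).fetchall()

assert bright_client == bright_server
```

Both approaches return the same rows, but at the scale of a sky survey the first one moves terabytes over the network while the second moves only the answer, which is the point of Gray's admonition.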
For example, Szalay and his colleagues at JHU recently used the GrayWulf system to win a supercomputing contest, the HPC Storage Challenge at SC08, the International Conference for High Performance Computing, Networking, Storage and Analysis. For the contest, GrayWulf sorted through massive astronomical datasets from the Sloan Digital Sky Survey to identify distant quasars in just 12 minutes, a process that had taken 13 days using traditional computational means.
It was the massive amount of data coming in from the Sloan survey that first inspired Gray, a decade ago, to begin searching for a way to more closely marry data and computation. “Jim [Gray] said he liked working with astronomers because their data is worthless, in the best possible sense,” Szalay says with a laugh. That is, astronomical data usually has no intellectual property issues, privacy concerns or other complications associated with its use. Moreover, there is lots of it, with more coming in all the time.
Applying GrayWulf to Astronomical Research and More
Szalay and Gray collaborated on a number of projects over the years, combining cutting-edge astronomy and computer science to bring out the best in both. One is the Virtual Observatory and the WorldWide Telescope, a Web-based archive of astronomical images made publicly available by JHU and Microsoft Research using Windows and Microsoft’s Visual Experience Engine. Pan-STARRS (the Panoramic Survey Telescope & Rapid Response System), now under construction at the University of Hawaii, has the world’s largest digital camera and will rely on GrayWulf to manage its data. One of the missions of Pan-STARRS will be to identify potential “killer asteroids” on a collision course with Earth.

The GrayWulf system was first applied in the field of astronomy, but it will have applications in many areas of science and beyond. Protein analysis, with its own space-time universe of complexity at the molecular and atomic scale, is another area of research already benefiting from GrayWulf. To study the function and structure of some 650 proteins, the so-called workhorses of the body, the Daggett Research Group at the University of Washington developed a data warehouse using SQL Server and Microsoft Analysis Services. However, as the group’s efforts have focused increasingly on larger and more complex analyses, it has moved to a hybrid HPC and data warehouse environment, a key concept of GrayWulf.
The function of a protein is determined as much by its three-dimensional shape (the result of a process known as “folding”) as by the linear sequence of amino acids in the molecule. A typical protein simulation system (a protein plus surrounding water) comprises at least 50,000 atoms arranged in a precise 3-D structure, which, in concert with how the protein moves, determines its function and dysfunction.
Rather than trying to write a program that re-creates how the amino acid sequence folds up into the functional structure, Daggett’s team is using GrayWulf to simulate “unfolding” the protein while rapidly monitoring changes in structure and function. The team has performed more than 5,000 protein simulations, currently the largest collection of protein simulations and structures in the world. “GrayWulf is allowing us to ask and answer questions relating to protein dynamics and disease that were impossible to tackle with conventional methods,” says Daggett. Jim Gray’s vision, made manifest by the GrayWulf system, offers potential benefits within and well beyond the research arena, wherever there is a need to meaningfully manage large datasets.
A Microsoft Research Connections-funded project supporting advanced technology research