A decade ago, Tony Hey saw the future of science.
Hey, now corporate vice president of Microsoft Research Connections, had just left a position as dean of engineering at the University of Southampton in the United Kingdom. His new role was with the U.K.’s e-Science Initiative, managing the British government’s efforts to provide scientists and researchers with access to key computing technologies.
Hey knew that physicists—physics was his original specialty—were on the verge of gathering enormous amounts of data from the Large Hadron Collider, then under construction in Geneva. His new role put him in touch with specialists in many other fields, as well.
“That’s when I really became aware of how much data people were starting to collect,” he says. “Particle physicists had lots of data, but it was all of the same character. In a research field such as bioinformatics, you have gene sequences, and you have gene-expression data that tell you what protein the genes express, and then you have protein-structure data, you have pathway data, and you're supposed to find insights and understanding from all those different types of data.”
What Hey realized is that the world is entering an age of “big data,” in which a significant fraction of the massive amounts of information generated by a new generation of experimental instruments, accelerators, and telescopes, collected by sensors, or produced by simulations on supercomputers will need to be stored in giant databases. Hey borrows the phrase “fourth paradigm” from prominent Microsoft Research scientist Jim Gray, who disappeared while sailing in 2007. The phrase means that big data is the fourth methodology for scientific exploration, after theory, experimentation, and computer simulation.
For scientists, access to massive amounts of data is like owning a gold mine. Fields once starved for information now are awash in it. But as with a gold mine, the hard part is finding the significant nuggets of information in the huge volumes of data. Big data is as much a challenge as an opportunity.
“When you have data sets as large as a petabyte, that’s always going to be difficult to move around and analyze,” Hey says. “You really need a lot of local processing power to do analysis on it, and maybe you put it in a database and need to organize it and reorganize it. Maybe you want to visualize it or do some analytics. There are a whole lot of skills you need to manage data of that order.
“Also, the science of big data is about asking the right question, so that scientists and computer scientists collect the right data. It’s not just sifting through data after the fact.”
That’s where Microsoft Research can help. Its researchers are working on tools that can help scientists manage big data and glean new insights in areas such as genetic studies, high-performance computing, and the environment.
Computer scientists at Microsoft Research work with scientists from the worlds of medicine, social science, economics, biology, and more.
As one example of those wide-ranging efforts, Hey cites work done in Microsoft Research’s eScience group by David Heckerman, a Microsoft distinguished scientist. Heckerman and colleagues have developed a machine-learning algorithm called FaST-LMM, for “Factored Spectrally Transformed Linear Mixed Model.” It is being used in genome-wide association studies that scan the genetic code in people to find genetic variations associated with diseases such as cancer and diabetes.
The algorithm is more effective than previous algorithms at crunching numbers from a large study because it scales in a linear fashion as the population increases, rather than cubically. The new algorithm can analyze data from 120,000 people in a matter of hours, compared with older algorithms that fail when attempting to evaluate data from 20,000 people.
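To see why the difference between linear and cubic scaling matters, consider a rough back-of-the-envelope sketch. It assumes the work grows purely as n or as n cubed with the same constant factor, which is a simplification for illustration and not a description of how FaST-LMM is implemented or of its measured running times. The function name `relative_cost` is hypothetical.

```python
def relative_cost(n_small: int, n_large: int, exponent: int) -> float:
    """Return how many times more work n_large needs than n_small
    when the cost grows as n ** exponent (an idealized model)."""
    return (n_large / n_small) ** exponent


# Study sizes mentioned in the article.
small, large = 20_000, 120_000

linear_growth = relative_cost(small, large, 1)  # cost grows as n
cubic_growth = relative_cost(small, large, 3)   # cost grows as n ** 3

print(f"linear scaling: {linear_growth:.0f}x more work")
print(f"cubic scaling:  {cubic_growth:.0f}x more work")
```

Under this idealized model, a study six times larger needs six times the work with linear scaling but 216 times the work with cubic scaling.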
To advance powerful computing that manages data, a Microsoft Research Silicon Valley team has developed a computing tool called Dryad that simplifies the use of thousands of multiprocessor computers in complex data-analysis applications. Dryad provides reliable computing across thousands of servers. A related technology, DryadLINQ, is built on Microsoft’s .NET Language Integrated Query (LINQ). It enables developers to write their applications in a SQL-like query language, using programming tools such as Microsoft Visual Studio.
Dryad has been used extensively within Microsoft for several years, as well as by university researchers. Earlier this year, in a collaboration between Microsoft Research and Microsoft’s High Performance Computing group, it became a commercial offering called LINQ to HPC as part of the Windows HPC Server 2008 R2 Suite high-performance computing line. For Microsoft, Dryad and DryadLINQ run on thousands of servers that power Bing, sifting through tens of petabytes each day to refine search results.
“It’s an example,” Hey says, “of how we can manage big data not just for academic users, but commercially, as well.”
Hey says environmental work is another area where Microsoft Research scientists are making an impact with big data. He cites work to manage large data sets collected from California rivers and aquifers. The goal is to create a water-management cyberstructure to improve understanding of California water quality and of how to manage this scarce resource. The project’s data synthesis supports the National Marine Fisheries Service’s salmon-recovery planning and U.S. Bureau of Reclamation conservation efforts.
The project works with data gathered from FLUXNET, a global network of weather towers that measure the exchanges of carbon dioxide, water vapor, and energy between the earth’s surface and the atmosphere.
FLUXNET towers in the United States have collected data for many years. Researchers have gathered the data in a database where, among other things, they can spot anomalies such as a tower reporting a wide temperature difference compared with nearby towers, indicating a possible sensor fault.
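The kind of check described above can be sketched in a few lines. This is a minimal illustration only: the tower names, readings, and 5-degree threshold are invented for the example, and the approach (comparing each tower with the median of the others) is an assumption, not the method used by the FLUXNET project.

```python
from statistics import median


def flag_suspect_towers(readings: dict[str, float],
                        max_deviation: float = 5.0) -> list[str]:
    """Return towers whose temperature differs from the median of the
    other towers by more than max_deviation degrees."""
    suspects = []
    for tower, temp in readings.items():
        # Compare against every tower except the one being checked.
        others = [t for name, t in readings.items() if name != tower]
        if others and abs(temp - median(others)) > max_deviation:
            suspects.append(tower)
    return suspects


# Hypothetical temperatures (degrees Celsius) from four nearby towers.
readings_c = {"tower_a": 14.2, "tower_b": 13.8, "tower_c": 27.5, "tower_d": 14.6}
print(flag_suspect_towers(readings_c))  # ['tower_c']
```

Using the median rather than the mean keeps one faulty sensor from distorting the baseline that the other towers are compared against.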
“If the data is dirty, it can be cleaned up,” Hey says. “And rather than send a graduate student out to spend a few miserable hours—or even weeks—retrieving the data, now it can be in one database where a professor can find it with a few keystrokes. That makes a big difference in the kind of science that can be done.”
Such work, which not only collects data but also determines how to make it readily available, will only become more important. Hey cites as an example the National Science Foundation’s Ocean Observatories Initiative. This $400 million project will cover a section of the Juan de Fuca tectonic plate, located off the coasts of Washington and British Columbia, with an array of sensors sending data to the mainland via as much as 1,500 kilometers of fiber-optic cable.
This will give oceanographers a continuous stream of data on temperature, salinity, sea-floor earthquakes, and much more.
“Instead of a ship that goes across the surface every few weeks and collects bits of data to take home and analyze,” Hey says, “now this data will pour in 24 hours a day, 365 days a year. So you’ve gone from being data-poor to being flooded with data. You need to have techniques and technologies to pick out the interesting signals.”
Microsoft is helping the ocean initiative’s scientists manage that data stream with Project Trident: A Scientific Workflow Workbench. It’s designed to make complex data visually manageable, enabling science to be conducted at a large scale.
Along with the emergence of data-intensive science, there also is continued exponential growth in scientific discovery and research. About 3,000 scientific articles are published each day, and over the course of a year, those papers will cite 5 million previous articles. Increasingly these articles will be linked to databases, and enabling exploration of this wealth of literature and data is yet another research challenge.
“No one can keep up with that amount of information,” Hey says. “You need to have some understanding when you’re searching about what papers are relevant: what sources, what databases or web sites.”
What’s needed, he says, is semantic computing, in which a computer has an idea of the context of a search or query and can point a researcher toward relevant information. That can help lead to search results within context, so if someone has been reading film reviews and searches for Casablanca, the results are based on the movie, not the city.
“That’s just the tip of the iceberg, and we’re exploring this in several ways,” Hey says. “How you link data sets, how you join them, how to make use of them, how to mash them up—semantics is going to be important in all of this.”
Hey mentions Gray’s last talk, in which he described a vision of scientific information as a pyramid. At the top of the pyramid sits the scientific literature, at the bottom the raw data.
“Jim’s view, and mine, too,” Hey says, “is that if you can go from the literature to the data, you can combine it with other data—your data, new data—and you can do these scientific mash-ups that lead to new discoveries.
“There’s a new type of scientist emerging—the data scientist,” he adds. “Instead of testing a hypothesis, you’ll find things about it in the data. We’re seeing that now in genomics, of course, but also in astronomy and in environmental science, where there are these huge data sets. This is like no other time in science. That’s why I’m excited to get to my job every day. I think we’re doing really exciting stuff, and it’s all part of realizing that there is one gigantic global library.”