Publication date: November 21, 2012
Today, researchers can collect, store, and analyze tremendous volumes of data; however, technological and storage limitations can severely impede the speed at which they can analyze these data. A new algorithm that was developed by Microsoft Research, called FaST-LMM (Factored Spectrally Transformed Linear Mixed Models), runs on Windows Azure in the cloud and expedites analysis time—reducing processing periods from years to just days or hours. An early application of FaST-LMM and Windows Azure helps researchers analyze data for the genetic causes of common diseases.
“We are taking on the challenge of taking what would be traditional high performance computing, one of the hardest workloads to move to the cloud, and moving it to the cloud.”
- Jeff Baxter, development lead, Windows HPC, Microsoft
Searching for DNA Clues to Disease
The Wellcome Trust in Cambridge, England, is researching the genetic causes of seven diseases—including hypertension, rheumatoid arthritis, and diabetes. The project involves searching for combinations of genomic information to gain insight into an individual’s likelihood to develop one of these diseases. With a database containing genetic information from 2,000 people and a shared set of approximately 13,000 controls for each of the seven diseases, they needed both massive storage and powerful computation capacity.
Jeff Baxter, development lead, MicrosoftThey are storing their vast database of genetic information in the Windows Azure cloud, instead of traditional hardware storage, which represents a profound shift in how big data are stored. ”We are taking on the challenge of taking what would be traditional high-performance computing, one of the hardest workloads to move to the cloud, and moving to the cloud,” observes Jeff Baxter, development lead in the Windows HPC team at Microsoft. “There’s a variety of both technical and business challenges, which makes it exciting and interesting.”
Exploring the Power of the Cloud
“With this new, huge amount of data that’s coming online, we’re now able to find connections between our DNA and who we are that we could never find before.”
— David Heckerman, distinguished scientist, Microsoft Research
Resource management is one of the primary issues associated with big data: not only determining how many resources are required for the project, but also identifying the right type of resources—within the available budget. For example, running a large project on fewer machines might save on hardware costs but result in substantial project delays. Researchers must find a balance that will keep their project on track while working with available resources.
David Heckerman, distinguished scientist, Microsoft ResearchThe FaST-LMM algorithm can analyze enormous datasets in less time than existing alternatives. Microsoft Research also has the infrastructure that is required to perform the computations, explains David Heckerman, distinguished scientist at Microsoft Research. With more CPUs dedicated to a job, computations that would ordinarily take years to finish can be completed in just hours.
For the Wellcome Trust project, the team’s available resources included a combination of Windows HPC Server, Windows Azure, and the FaST-LMM algorithm. The team knew that they had a powerful set of technologies. The question was, could it achieve the results required in the desired timeline?
“For this project, we would need to do about 125 compute years of work. We wanted to get that work done in about three days,” explains Baxter. By running FaST-LMM on Windows Azure, the team had access to tens of thousands of computer cores and an improved algorithm that was able to expedite the work. “You’re still doing hundreds of compute years of work,” he explains, “but with these resources, we can actually do hundreds of compute years in a couple of days.”
While the results were impressive, there was something that had an even bigger impact. “The most impressive thing was how quickly we could take this project from inception to actually completing it and generating new science,” Baxter notes. “This is stuff that, without both the improvements in the algorithms that the Microsoft Research guys had come up with and the ability for us to provide the tens to hundreds of thousands of cores, would have been infeasible.”
The Future for Big Data Research
The Wellcome Trust project is just the beginning of what could be a major shift in how research databases are stored and analyzed. “With this new, huge amount of data that’s coming online, we’re now able to find connections between our DNA and who we are that we could never find before,” Heckerman says. The ability to analyze that data more quickly, and with greater depth, could help scientists make faster breakthroughs in genetic research—and breakthroughs in critical genetic research. The FaST-LMM algorithm running on Windows Azure is helping to accelerate just such breakthroughs.
A Microsoft Research Connections project supporting advanced technology research