Researchers and businesspeople around the world now have at their disposal a new way to perform massive computations over large quantities of unstructured data more quickly and easily than they’ve ever imagined.
The reason: a Microsoft Research-developed computing tool called Dryad, a name derived from shy tree deities found in Greek mythology. Dryad and a related programming model called DryadLINQ constitute technology that simplifies running complex data-analysis applications across hundreds or even thousands of servers on familiar, widely used Windows software.
After nearly six years of research into Dryad and DryadLINQ, along with their in-house use on Microsoft projects such as Kinect and Bing, the two technologies are entering commercial use. Starting Jan. 26, a technology preview of Dryad and DryadLINQ will be built into the Windows HPC Server 2008 R2 high-performance computing line, and the technology eventually will be integrated with Microsoft SQL Server and Windows Azure. HPC Server is designed to give customers tremendous computing power and an easy management experience, all using off-the-shelf hardware.
“This is an opportunity to democratize large-scale, data-intensive computing,” he says. “In areas such as customer-relationship management, business intelligence, planning, and infrastructure—all those tasks where companies now have access to a vast amount of data—Dryad and DryadLINQ can make sense of that data.”
The Dryad project consists of two key components. The Dryad tool itself provides reliable computing across thousands of servers. DryadLINQ, built on Microsoft’s .NET Language Integrated Query (LINQ), enables developers to write their applications in a SQL-like query language, using familiar programming tools such as Microsoft Visual Studio. Most programmers will work only with DryadLINQ; once they have launched their application into the cloud, Dryad will do the rest, invisibly.
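To give a feel for the declarative, SQL-like style described above, here is a minimal sketch in Python. It is not DryadLINQ code and does not use any Dryad API; the function `top_products` and the sample purchase records are hypothetical, chosen only to show a filter, group, and rank query written as a single pipeline.

```python
from collections import Counter
from typing import Iterable


def top_products(purchases: Iterable[dict], min_amount: float, limit: int) -> list[tuple[str, int]]:
    """Filter, group, and rank purchases as one declarative pipeline."""
    # "where": keep only purchases at or above the threshold
    qualifying = (p["product"] for p in purchases if p["amount"] >= min_amount)
    # "group by ... count": tally purchases per product
    counts = Counter(qualifying)
    # "order by count descending, take N"
    return counts.most_common(limit)


# Hypothetical sample data for illustration only.
purchases = [
    {"customer": "a", "product": "tent", "amount": 120.0},
    {"customer": "b", "product": "lamp", "amount": 15.0},
    {"customer": "c", "product": "tent", "amount": 95.0},
    {"customer": "d", "product": "stove", "amount": 60.0},
]

print(top_products(purchases, min_amount=50.0, limit=2))
# [('tent', 2), ('stove', 1)]
```

The query says what result is wanted, not how to loop over machines, which is the property that lets a system like Dryad decide how to distribute the work.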
A third piece, the Distributed Storage Catalog (DSC), is a distributed file system built for Dryad. It manages the data that Dryad is processing, keeping it stored reliably and safely with user-configurable redundancy. The DSC also keeps the data close to the servers processing it, so time is not wasted transmitting the data to a server.
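The two ideas in this paragraph, configurable redundancy and keeping data close to the server that processes it, can be sketched in a few lines of Python. This is an illustration under assumptions, not how the DSC is implemented: the round-robin placement and the names `place_replicas` and `pick_worker` are hypothetical.

```python
def place_replicas(partitions: list[str], servers: list[str], copies: int) -> dict[str, list[str]]:
    """Store each partition on `copies` distinct servers, assigned round-robin."""
    if copies > len(servers):
        raise ValueError("cannot keep more copies than there are servers")
    return {
        part: [servers[(i + k) % len(servers)] for k in range(copies)]
        for i, part in enumerate(partitions)
    }


def pick_worker(partition: str, placement: dict[str, list[str]], busy: set[str]) -> str:
    """Prefer an idle server that already stores the partition."""
    holders = placement[partition]
    for server in holders:
        if server not in busy:
            return server  # local read: no data crosses the network
    return holders[0]  # every holder is busy: wait for the first one


placement = place_replicas(["p0", "p1", "p2"], ["s0", "s1", "s2"], copies=2)
print(placement["p0"])                          # ['s0', 's1']
print(pick_worker("p0", placement, {"s0"}))     # s1
```

With two copies of each partition, the work on `p0` can still run next to its data when server `s0` is occupied or offline.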
Dryad and DryadLINQ make it easier for programmers to take advantage of parallel computing, in which rows of servers, or the multiple cores within a single machine, tackle a single computing problem together. Such computing is especially powerful for so-called “unstructured” data: for example, information on buying habits that a retailer might collect from tens of thousands of customers but that has not been tagged or annotated. Structured data, by contrast, is the kind found in a SQL database.
It is difficult, though, to harness the power afforded by parallel computing. Most programmers are more familiar with writing sequential programs, in which Action A is followed by Action B, then Action C. It is challenging to think and program in parallel.
While DryadLINQ enables developers to write their applications in a query language using Visual Studio, Dryad breaks the program into pieces and distributes them across clusters of servers or processors. In effect, Dryad acts as a computing traffic cop, sending data down potentially millions of computing pathways. It helps make sure that when one server modifies a piece of data, other servers don’t also change that data. It balances the computing load among many computers, and it re-routes computing traffic if an error or communications problem temporarily takes one or even several servers offline.
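The traffic-cop role described above can be sketched in Python on a single machine. This is a simplified illustration, not Dryad's scheduler: threads stand in for servers, a simulated `ConnectionError` stands in for a server outage, and the names `split`, `run_with_retry`, `flaky_sum`, and `parallel_sum` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items: list[int], parts: int) -> list[list[int]]:
    """Break the input into disjoint slices so no item is handled twice."""
    return [items[i::parts] for i in range(parts)]


def run_with_retry(task, chunk, attempts: int = 3):
    """Re-run a failed task, as a scheduler might re-route it elsewhere."""
    for attempt in range(1, attempts + 1):
        try:
            return task(chunk, attempt)
        except ConnectionError:
            if attempt == attempts:
                raise  # give up after the last allowed attempt


def flaky_sum(chunk: list[int], attempt: int) -> int:
    """Simulated worker: a chunk containing 0 fails on its first attempt."""
    if attempt == 1 and 0 in chunk:
        raise ConnectionError("simulated server outage")
    return sum(chunk)


def parallel_sum(items: list[int], parts: int = 4) -> int:
    """Sum the items by farming disjoint chunks out to a pool of workers."""
    chunks = split(items, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partials = pool.map(lambda c: run_with_retry(flaky_sum, c), chunks)
        return sum(partials)


print(parallel_sum(list(range(100))))  # 4950
```

One of the four chunks fails once and is retried, yet the caller sees only the final answer. Because the chunks are disjoint, each piece of data is touched by exactly one task.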
That removes a huge burden from programmers and lets them focus on the problem they are trying to solve, not how the computers will act in parallel.
“We want programmers to be able to write their programs without having to think about things like fault tolerance [a byproduct of parallel computing’s complexity],” says Yuan Yu, a principal researcher at Microsoft Research Silicon Valley who led the creation of the DryadLINQ component.
“We want them to be able to write sequential and declarative code, and then, that same code can be run on a single machine, on a multicore machine, or on a cluster of machines. That’s the beauty of the DryadLINQ programming model.”
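Yu's point that the same code can run on one machine or many can be illustrated with a small Python sketch. It is an analogy rather than DryadLINQ itself: the hypothetical function `word_lengths` takes a `mapper` argument, so the query stays the same while the execution backend changes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def word_lengths(lines: Iterable[str], mapper: Callable = map) -> int:
    """Total characters across all words; `mapper` decides where the work runs."""
    per_line = mapper(lambda line: sum(len(w) for w in line.split()), lines)
    return sum(per_line)


lines = ["dryad runs jobs", "linq writes queries"]

# Built-in map: everything runs sequentially in the calling thread.
sequential = word_lengths(lines)

# Same query, now handed to a pool of workers.
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = word_lengths(lines, mapper=pool.map)

print(sequential, parallel)  # 30 30
```

Swapping in a cluster-backed mapper would follow the same pattern, with no change to the body of the query.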
A second benefit is that Dryad gives programmers supercomputer-level power with everyday programming tools and relatively inexpensive hardware.
“This is a much cheaper way of doing things,” Yu says. “Everything is a commodity—a commodity operating system, using commodity servers and switches. Dryad deals with the reliability and the bandwidth issues.”
Dryad also utilizes Microsoft’s big investment in the cloud. As Dryad is integrated with Azure, all a programmer will need to take advantage of Dryad is a client and an Azure connection. Whether they are working on a cluster or in the cloud, programmers can store their data and then manipulate it through their DryadLINQ-written applications. On a cluster, the DSC manages the data to keep it close to the processors working on it, so time is not lost in communicating data between servers.
“The only thing we’ll give the customer is some client software for writing DryadLINQ programs,” Isard says. “They’ll basically write the program on their machine and submit it to Windows Azure, where Dryad is running internally.”
Dryad has its roots in an idea developed in October 2004 by Isard, who was then working on search for Microsoft. He recognized the need for a large-scale, data-intensive computation platform and began discussions with researchers at Microsoft to build on the idea.
Not long afterward, the newly created Dryad came into widespread use within Microsoft’s search offering, where it was used on thousands of servers. But while the tool worked well, the programming interface was awkward. Yu recognized the potential of LINQ to serve as the front-end programming tool for Dryad, and started the DryadLINQ project in September 2006. By early 2008, the Dryad/DryadLINQ combination was made available within Microsoft. A release to a small collection of academic researchers followed. Dryad also was adopted as a key tool for the development of the Xbox 360 Kinect gaming device. The DryadLINQ research paper won a best-paper award in 2008 during the eighth USENIX Symposium on Operating Systems Design and Implementation.
“It was easily the largest project in our lab,” Yu says. “And this was a long-term project, so management had to believe in it. But they said, ‘We believe in you guys, so here is the money you need to build a server cluster to do the research.’ Also, the entire lab was very supportive—we built the (Dryad) system, and many researchers are using it for real work. Their feedback, in particular, has been invaluable in refining the DryadLINQ programming model.”
Isard adds that while it might seem Dryad had a long gestation, the market timing for its release is right.
“I think the HPC product group moved at the right time—when they saw the opportunity,” he says. “We were a year or two ahead of the curve on the research side, but we were ready when the product group saw a need for it.”
A big step is coming, as Dryad and DryadLINQ become fully productized as part of the Microsoft HPC Server suite. They also will be integrated with Microsoft SQL Server and Windows Azure to give customers from academia to the business community a new, powerful computing tool.
Isard is confident that Dryad’s ease of use and familiar Microsoft tools will win over developers.
“Dryad will particularly appeal to customers who would love to keep using Windows and Excel and Visual Studio and all the tools they already use,” he says, “and need a technology for unstructured data analysis that really scales.”
John Dunagan, a principal architect for Microsoft’s High Performance Computing group, thinks HPC Server customers who use Dryad will find that they now can solve problems that had been challenging.
“We’re convinced that we will delight our customers, both with the pure capability of the system, as well as its ease of use,” he says. “What I really like about Dryad is that it is not just about handling a problem in a better way, it is also about new possibilities in computing that you couldn’t imagine before.”
The Microsoft Research team that worked on Dryad is pleased to see its project in a position to seek a larger audience.
“Offering an easy-to-use but powerful, data-intensive computing tool is exciting to see,” Isard says. “It will benefit a whole new set of Microsoft customers.”