Dryad and DryadLINQ for Data Intensive Research

DryadLINQ: Making Large-Scale Distributed Computing Simple, Microsoft Research Silicon Valley

Dryad is a high-performance general-purpose distributed computing engine that is designed to simplify the task of implementing distributed applications on clusters of Windows-based computers. DryadLINQ allows developers to implement Dryad applications in managed code by using an extended version of the LINQ programming model and API.


The Windows High Performance Computing (HPC) team is in the process of adding Dryad and DryadLINQ to its product line. Join the beta program to download the software and give it a try.


Background: Dryad and DryadLINQ

The scientist's challenge

You have all the data for an important scientific problem. You just need to analyze it. With terabytes of data, you need a powerful data processing application in order to have results to talk about at the upcoming national meeting. The analysis might be straightforward in principle, but actually doing it is going to be tough.

For many important scientific investigations, efficiently analyzing large data sets is a major challenge. For example, astronomers use the Sloan Digital Sky Survey (SDSS) to investigate problems such as the distribution of “dark matter” around distant galaxies. The current data set, SDSS Data Release 7, covers more than a quarter of the sky and contains more than 50 TB of data representing 357 million unique objects.

Your best bet is to create a distributed application that runs on a cluster of relatively inexpensive networked PCs. However, implementing such an application is a non-trivial task: distributed applications must manage numerous threads, allocate resources across numerous individual multicore computers, handle hardware failures, and so on. Writing the code could take months, and you are a scientist, not a programming expert.

As an alternative to writing all the code yourself…

Microsoft Dryad is a high-performance, general-purpose distributed computing engine that handles some of the most difficult aspects of cluster-based distributed computing. It's powerful: Microsoft routinely uses Dryad applications to analyze petabytes of data on clusters of thousands of computers.

But Dryad applications are not easy to implement. To further simplify things, Microsoft has developed DryadLINQ, which allows developers to use an extended version of the LINQ programming model and API to implement Dryad applications. DryadLINQ code is similar to what you'll see in a conventional LINQ application, and the application core is often only a few lines of code. Behind the scenes, though, a DryadLINQ provider automatically converts the LINQ query into a Dryad job and executes the query as a distributed application on a cluster.

With DryadLINQ and a Dryad cluster…

Even a novice at parallel processing or cluster-based computing can implement a high-performance distributed application to efficiently analyze terabytes of data. As an example, consider one time-consuming problem: Q18 of the Sloan Digital Sky Survey, which searches the data set for possible gravitational lenses:

Q18: Find all objects within 1' (one arcminute) of one another that have similar colors, where the color ratios u-g, g-r, r-i are less than 0.05m. Magnitudes are logarithms, so these differences correspond to ratios.
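To make the Q18 condition concrete, here is a minimal Python sketch of the pair test. It is not DryadLINQ code and not the researchers' implementation. It assumes the color condition means that each color index (u-g, g-r, r-i) of the two objects differs by less than 0.05 magnitudes, and that "within 1'" is a great-circle separation of at most one arcminute. The field names (`id`, `ra`, `dec`, `u`, `g`, `r`, `i`), the helper names, and the sample values are hypothetical, and the all-pairs loop is only practical for tiny inputs.

```python
import math
from itertools import combinations

ARCMIN_DEG = 1.0 / 60.0   # one arcminute, expressed in degrees
COLOR_TOL = 0.05          # magnitude tolerance quoted in Q18


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def colors(obj):
    """Color indices as magnitude differences (hypothetical field names)."""
    return (obj["u"] - obj["g"], obj["g"] - obj["r"], obj["r"] - obj["i"])


def similar_colors(a, b, tol=COLOR_TOL):
    """True if every color index of a and b differs by less than tol."""
    return all(abs(ca - cb) < tol for ca, cb in zip(colors(a), colors(b)))


def q18_pairs(objects):
    """Brute-force O(n^2) scan over all pairs; illustration only."""
    return [
        (a["id"], b["id"])
        for a, b in combinations(objects, 2)
        if angular_separation_deg(a["ra"], a["dec"], b["ra"], b["dec"]) <= ARCMIN_DEG
        and similar_colors(a, b)
    ]


# Made-up sample rows: objects 1 and 2 are close together with similar colors,
# object 3 is close but has a different u-g color, object 4 is far away.
objects = [
    {"id": 1, "ra": 180.000, "dec": 0.000, "u": 19.50, "g": 18.40, "r": 17.90, "i": 17.60},
    {"id": 2, "ra": 180.010, "dec": 0.005, "u": 20.52, "g": 19.41, "r": 18.92, "i": 18.61},
    {"id": 3, "ra": 180.005, "dec": 0.002, "u": 19.50, "g": 18.00, "r": 17.90, "i": 17.60},
    {"id": 4, "ra": 185.000, "dec": 2.000, "u": 19.50, "g": 18.40, "r": 17.90, "i": 17.60},
]

print(q18_pairs(objects))
```

Object 2 is about one magnitude fainter than object 1 in every band, yet the pair still matches. Because magnitudes are logarithmic, a uniform offset changes the overall brightness but leaves the color indices, and therefore the flux ratios between bands, nearly unchanged.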

To address this problem, Microsoft researchers used DryadLINQ to run the query on a Dryad cluster of 40 off-the-shelf, networked, Windows-based computers. Dryad took about an hour to install.

  • The query itself is a three-way join over two input tables, one with 11 GB of data and the other with 41.8 GB.
  • To perform the query, the team used Microsoft Visual Studio and a standard Windows-based workstation to implement a DryadLINQ application that consists of approximately 100 lines of Microsoft Visual C# code.
  • The team manually distributed the data across the cluster and ran the application from the workstation. The DryadLINQ provider set up the Dryad job and ran the query on the cluster.

The results came back in under two minutes—not even enough time for a quick cup of coffee.
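The shape of a three-way join over two tables, with the data spread across nodes, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the team's C# application and not how Dryad schedules work. The table layout is assumed: a `photo_obj` table of per-object magnitudes and a `neighbors` table of precomputed nearby pairs, with the object table joined to the neighbor table and then to the object table again. All names and sample rows are hypothetical, and the "nodes" are simulated by a plain loop in one process.

```python
COLOR_TOL = 0.05  # magnitude tolerance quoted in Q18


def similar_colors(a, b, tol=COLOR_TOL):
    """True if the u-g, g-r and r-i color indices all differ by less than tol."""
    ca = (a["u"] - a["g"], a["g"] - a["r"], a["r"] - a["i"])
    cb = (b["u"] - b["g"], b["g"] - b["r"], b["r"] - b["i"])
    return all(abs(x - y) < tol for x, y in zip(ca, cb))


def hash_partition(rows, key, n_parts):
    """Assign each row to a partition by its (integer) join key."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[row[key] % n_parts].append(row)
    return parts


def local_join(left, right, left_key, right_key, merge):
    """Hash join of two partitions that would live on the same node."""
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    return [merge(l, r) for l in left for r in index.get(l[left_key], [])]


def q18_join(photo_obj, neighbors, n_nodes=4):
    obj_parts = hash_partition(photo_obj, "id", n_nodes)

    # Stage 1: neighbors JOIN photo_obj ON obj_id = id, one partition per node.
    nb_parts = hash_partition(neighbors, "obj_id", n_nodes)
    stage1 = []
    for node in range(n_nodes):
        stage1 += local_join(
            nb_parts[node], obj_parts[node], "obj_id", "id",
            lambda n, p: {"first": p, "neighbor_id": n["neighbor_id"]},
        )

    # Stage 2: repartition by neighbor_id, then join photo_obj a second time.
    s1_parts = hash_partition(stage1, "neighbor_id", n_nodes)
    matches = []
    for node in range(n_nodes):
        pairs = local_join(
            s1_parts[node], obj_parts[node], "neighbor_id", "id",
            lambda s, p: (s["first"], p),
        )
        matches += [(a["id"], b["id"]) for a, b in pairs if similar_colors(a, b)]
    return sorted(matches)


# Made-up sample tables.
photo_obj = [
    {"id": 1, "u": 19.50, "g": 18.40, "r": 17.90, "i": 17.60},
    {"id": 2, "u": 20.52, "g": 19.41, "r": 18.92, "i": 18.61},
    {"id": 3, "u": 19.50, "g": 18.00, "r": 17.90, "i": 17.60},
    {"id": 4, "u": 19.50, "g": 18.40, "r": 17.90, "i": 17.60},
]
neighbors = [
    {"obj_id": 1, "neighbor_id": 2},
    {"obj_id": 1, "neighbor_id": 3},
    {"obj_id": 2, "neighbor_id": 3},
]

print(q18_join(photo_obj, neighbors))
```

Partitioning both inputs on the join key means that rows that can match always land in the same partition, so each node can join its share independently. The repartitioning step between the two stages is the kind of data movement that the text says the DryadLINQ provider and Dryad handle for the developer.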