Big Data for Developers

Here at Microsoft Research Silicon Valley we have a history of building large-scale distributed systems. As much as we appreciate a nicely wrapped system, we also like to be able to get our hands dirty when the need arises. We've intentionally designed and built our systems to allow, even encourage, tapping into lower levels and making sure the systems and the programs they run are behaving exactly as intended.

This page collects several of our big data projects that fall under the heading of "Big Data for Developers": systems that offer not only pleasant high-level programming models, but also enough low-level abstraction layers to let developers interact directly with the implementations and adjust them when needed for specialized performance or functionality. These systems provide best-of-breed performance, are fundamentally more expressive than other popular systems, and are compatible with the .NET libraries and tools you know and love.

Using DryadLINQ and Naiad

These systems are beta-quality research prototypes under continuous development, but we've found them fit for purpose for many challenging data processing tasks. They are all open source under the Apache 2.0 license, available for commercial use, and we encourage you to try them out and give us feedback.

To get started using Dryad, visit the Getting Started with DryadLINQ instructions. To get started with Naiad, visit the instructions for writing your first Naiad program. Please send feedback and questions to the Dryad GitHub issues forum and the Naiad GitHub issues forum respectively.

Should I use DryadLINQ or Naiad?

DryadLINQ and Naiad are both .NET frameworks for data-parallel processing, and both can run either locally on a client computer or on an HDInsight 3.0 cluster in Azure. At first glance they have a similar LINQ-style programming model, but there are substantial differences in their implementations and functionality. The biggest practical difference is their level of maturity: DryadLINQ is a stable system whose interfaces are unlikely to change substantially, whereas Naiad is a research system whose programming model we are still evolving as we figure out improvements. Both systems are under active development in the sense that we are fixing bugs and tuning performance.

DryadLINQ is an evolution of code that has been running in various guises since 2005, and as such it is well tested and well understood, but is not up to date with the most recent research. DryadLINQ is a good choice for batch computations where the size of the initial dataset or intermediate state is large compared to the amount of RAM in the cluster. The underlying computational model of DryadLINQ is an acyclic dataflow graph, so while DryadLINQ has been used extensively for iterative computations, there are limitations to the kind of iteration it can efficiently support.
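To make the iteration limitation concrete, here is a minimal sketch in plain Python (not DryadLINQ code, and not its API). It assumes a toy connected-components computation; the names `propagate_once` and `connected_components_batch` are hypothetical. Each call to `propagate_once` stands in for one acyclic batch job, and the loop that decides whether to run another round lives in a driver outside the dataflow, so every round rescans the full input even when few records change.

```python
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[int, int]


def propagate_once(labels: Dict[int, int], edges: List[Edge]) -> Dict[int, int]:
    """One acyclic 'job': each node takes the minimum label seen across its edges."""
    new_labels = dict(labels)
    for a, b in edges:  # full scan of the input on every round
        low = min(labels[a], labels[b])
        new_labels[a] = min(new_labels[a], low)
        new_labels[b] = min(new_labels[b], low)
    return new_labels


def connected_components_batch(nodes: Iterable[int], edges: List[Edge]):
    """Driver loop: resubmit the whole job until the labels stop changing."""
    labels = {n: n for n in nodes}
    rounds = 0
    while True:
        rounds += 1
        new_labels = propagate_once(labels, edges)
        if new_labels == labels:  # fixed point reached
            return labels, rounds
        labels = new_labels


nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (4, 5)]
labels, rounds = connected_components_batch(nodes, edges)
```

On this tiny input the driver runs three full rounds, the last of which only confirms that nothing changed. That fixed per-round cost is the kind of overhead an acyclic model has trouble avoiding.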

Naiad is a more recent system that was designed from the start to support both iterative and incremental computation, and it gets excellent performance by holding all working state in RAM. When a computation will fit in memory, Naiad will run it much faster than DryadLINQ. As a result of lessons learned from DryadLINQ, Naiad also better exposes lower layers of abstraction, so those who want more control than the LINQ-like programming layers offer can more easily specialize and extend the basic Naiad computational operators. Naiad is newer code and we have less experience with using it for real applications, so we would love feedback on its strengths and weaknesses.
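As a rough illustration of what "incremental" means here, the following plain Python sketch (not Naiad code, and not its API) keeps all working state in memory and, when a new input record arrives, touches only the records whose results actually change. The class name `IncrementalComponents` and its `add_edge` method are hypothetical, and the toy problem (connected components by minimum label) is an assumption chosen for brevity.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Set


class IncrementalComponents:
    """Toy in-memory state: adjacency lists plus the current label of each node."""

    def __init__(self) -> None:
        self.neighbours: Dict[int, Set[int]] = defaultdict(set)
        self.labels: Dict[int, int] = {}

    def add_edge(self, a: int, b: int) -> int:
        """Apply one input change; return how many node labels were rewritten."""
        for n in (a, b):
            self.labels.setdefault(n, n)  # a new node starts as its own component
        self.neighbours[a].add(b)
        self.neighbours[b].add(a)

        # Start from the two endpoints and follow only the nodes that change.
        work: Deque[int] = deque([a, b])
        updated = 0
        while work:
            n = work.popleft()
            for m in self.neighbours[n]:
                if self.labels[n] < self.labels[m]:
                    self.labels[m] = self.labels[n]
                    updated += 1
                    work.append(m)
        return updated


cc = IncrementalComponents()
first = cc.add_edge(1, 2)    # merges nodes 1 and 2
second = cc.add_edge(2, 3)   # pulls node 3 into the same component
third = cc.add_edge(4, 5)    # a separate component
bridge = cc.add_edge(3, 4)   # joins the two components
```

The final `add_edge` call rewrites only the two labels in the smaller-numbered component's neighbour, rather than recomputing every label from scratch. A real system would also need to handle deletions and distribute the state, which this sketch does not attempt.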

For those who prefer to think in analogies: DryadLINQ is like a Chevy Suburban. It will get you where you are going in reasonable comfort and speed and can be relied on to get to its destination, even under heavy load. Naiad right now is more like a 1960s Ferrari fitted with a prototype helicopter rotor: when correctly tuned it can drive the same roads as the Suburban but get there a lot faster; it can't necessarily carry the same cargo and there's always a risk of getting stranded if it breaks down; but when you bring out the rotors it can do things the SUV couldn't even imagine. (For comparison, using Hadoop Streaming is like going on a cross-country family vacation in a tractor; HIVE is the same, but in an army jeep; and Spark is like a UK Range Rover—the same general idea as the Suburban but driving on the Java side of the road.)