Naiad is an investigation of data-parallel dataflow computation in the spirit of Dryad and DryadLINQ, but with a focus on incremental computation. Naiad introduces a new computational model, differential dataflow, operating over collections of differences rather than collections of records, and resulting in very efficient implementations of programming patterns that are expensive in existing systems.
News: Naiad is now available on Github under the Apache open-source license. The Naiad SOSP paper is also available. You can read more about Naiad on the MSR SVC Big Data blog, and watch an interview about Naiad on Channel 9.
Our goal with Naiad was to address one of the recurring requests for systems like Dryad and DryadLINQ, incremental recomputation, but in so doing found that the necessary mechanisms gave rise to a new computational model, timely dataflow, capable of efficiently processing substantially more complex computations than current systems support, namely incremental and arbitrarily nested iterative dataflow computation.
Consider the problem of computing the connected component structure of a graph. One natural iterative data-parallel approach has each vertex assume a label (initially their own name) and repeatedly share their label with their neighbors, assuming the least label in their neighborhood. Eventually, all vertices in the same component will be labeled with the name of the least vertex in their component. Several data-processing systems make this sort of iterative computation easy to write and efficient to execute.
However, what happens if the input changes? Perhaps a single edge is removed, which can result in the separation of two previously connected components. The labels above are invalidated, and it is not easy to determine how to unwind their propagation to return the computation to a state from which new correct labels can be determined. The data-processing systems alluded to above are forced to discard the results of their previous computation and start over from scratch.
Naiad, by comparison, represents a dataset in a compact form indicating where and when records have changed. The specific representation enables efficient combination of incremental and iteration computation, and allows us to update computations like the connected components example above in a fraction of a second. Naiad is currently capable of maintaining the strongly connected component structure (a doubly-nested loop) of a graph defined by a sliding window over edge stream with rates exceeding Twitter's full tweet volume, all with sub-second latency.
- Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martin Abadi, Naiad: A Timely Dataflow System, in Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP), ACM, 3 November 2013
- Martin Abadi, Frank McSherry, Derek G. Murray, and Thomas L. Rodeheffer, Formal Analysis of a Distributed Algorithm for Tracking Progress, in FMOODS-FORTE'13: 15th Formal Methods for Open Object-Based Distributed Systems and 33nd Formal Techniques for Networked and Distributed Systems, Springer, 3 June 2013
- Thomas L. Rodeheffer, The Naiad Clock Protocol: Specification, Model Checking, and Correctness Proof, no. MSR-TR-2013-20, 12 February 2013
- Frank McSherry, Derek G. Murray, Rebecca Isaacs, and Michael Isard, Differential dataflow, in Proceedings of CIDR 2013, 5 January 2013
- Frank McSherry, Rebecca Isaacs, Michael Isard, and Derek G. Murray, Composable Incremental and Iterative Data-Parallel Computation with Naiad, no. MSR-TR-2012-105, 9 October 2012