Data and computational resources in large data centers are often not managed efficiently. This inefficiency arises due to the vast amount of data in the data center and the complex interplay between the computations that run in the data center and the data they operate on. The Nectar project attempts to solve this problem by synthesizing ideas from a few different strands of prior research: SQL, scalable distributed data structures, and version management systems.
Managing data and the computational resources in a data center is a difficult problem. The typical solution to this problem is to use manual or semi-automated techniques for bookkeeping and managing resource usage. Such approaches tend to be tedious and inflexible at best, and error-prone and wasteful of resources at worst. Our research goal in this project is to explore ways to automate the management of computation and data.
In the case of computation, we endeavor to reuse results from earlier computations rather than run the computation again if the results of the computations are indistinguishable. Alternatively, we try to compute only incremental results of a computation if we have available partial results from an earlier computation. Clearly, avoiding unnecessary computations has a huge efficiency (and energy) win, but achieving it requires sophisticated analysis of the computations and their inputs to determine when it is safe to reuse results.
The key issue we face in the management of data is the large number of temporary results that languish in the system long after they have outlived their usefulness. Although storage costs are getting cheaper (and getting more so every year), the capital cost of wasted storage in a large data center is significant, particularly since each computational result can consume 10s of MBs of storage, as they typically do for many computations. We can minimize the amount of space wasted by reclaiming space occupied by items that we can determine will not be used in the future as well that occupied by seldom-used data that can be easily regenerated by rerunning a computation. Once again this type of data reclamation needs a garbage collector as well as a reliable way of tracking dependencies, which is a hard problem in general.
The Occam's razor that Nectar applies to the data center environment is the unification of all access to data and computation through a single interface. In a Nectar data center, programmers interact with the data center through a query language interface (e.g., .Net LINQ). The results of all queries, referred to as derived datasets, are automatically managed by Nectar, allowing it to identify any derived dataset with the query that produced it. Our belief is that this unification offers the benefits of (a) efficient space utilization, (b) reuse of results, (c) incremental computation, and (d) ease of storage management. Privileged users can import data into the Nectar system without using the language interface; such data is treated as immortal and immutable by Nectar.
Nectar attempts to solve the data and computation management problem within the context of distributed execution engines (e.g., MapReduce, Dryad, Hadoop) and high level language systems (e.g., Sawzall, Pig, SCOPE, DryadLINQ) that form the basis for many large scale data centers in the world today. We have designed and built a prototype that runs on a small cluster of DryadLINQ/Dryad machines. Our experience with this prototype has been positive. We are in the process of gaining more experience by deploying it in our lab's experimental cluster for our colleagues to use in their day-to-day research.
- Pradeep Kumar Gunda, Lenin Ravindranath, Chandramohan A. Thekkath, Yuan Yu, and Li Zhuang, Nectar: Automatic Management of Data and Computation in Data Centers, no. MSR-TR-2010-55, May 2010