Nectar: Automatic Management of Data and Computation in Datacenters

Managing data and computation is at the heart of datacenter

computing. Manual management of data can lead to data loss,

wasteful consumption of storage, and laborious bookkeeping. Lack of

proper management of computation can result in lost opportunities to

share common computations across multiple jobs or to compute

results incrementally.

Nectar is a system designed to address the aforementioned

problems. It automates and unifies the management of data and

computation within a datacenter. In Nectar, data and computation are treated interchangeably

by associating data with its computation.

Derived datasets, which are the results of computations, are uniquely identified by the

programs that produce them, and together with their programs, are

automatically managed by a datacenter wide caching service.

Any derived dataset can be transparently regenerated

by re-executing its program, and any computation can be transparently avoided

by using previously cached results. This enables us

to greatly improve datacenter management and resource

utilization: obsolete or infrequently used derived datasets

are automatically garbage collected, and shared common

computations are computed only once and reused by others.

This paper describes the design and implementation of Nectar, and reports on

our evaluation of the system using analytic studies of

logs from several production clusters and an actual deployment on a 240-node cluster.

nectar-final.pdf
PDF file

In  Proceedings of the 9th Symposium on Operating Systems Design and Implementation (OSDI)

Details

TypeInproceedings
> Publications > Nectar: Automatic Management of Data and Computation in Datacenters