The goal of this project is to construct scalable, distributed, and fault-tolerant data abstractions.
The goal of this project is to explore the utility of data structures or abstractions as storage infrastructure. Storage infrastructure that allows incremental expansion has obvious economic benefits. However, incremental expansion carries with it a slew of attendant problems, e.g., scalability, automatic reconfiguration for load and capacity balancing, fault tolerance, and availability. Recent trends in storage system infrastructure argue for using distribution and virtualization as a way of managing the complexity inherent in such systems.
We believe data abstractions is an appropriate mechanism to implement storage infrastructure while preserving the benefits of virtualization and distribution. In contrast to prior work on storage infrastructure that has focused on exposing storage as a simple linear abstraction (either a physical disk, or a logical disk, or a sparse virtual disk), we wish to export higher level abstractions, e.g., trees, hash tables, linked lists. Clients of the system, e.g., file systems, data bases, and indexing engines access the system over the network and can create multiple instances of each abstraction. Each instance allows a set of operations that are appropriate for the abstraction.
Higher-level abstractions to storage have some advantages over simple disk-like abstractions. First, it has the potential of reducing the bookkeeping done by the client (e.g., if the free list is maintained by the system, the client can be freed of this responsibility). Second, using the structural information inherent in the data abstraction can allow the system to do better load-balancing, data prefetching, and informed caching. These mechanisms can be implemented once in the system instead of having to be duplicated in each client.
We envision the abstractions to be implemented by a collection of "server nodes", where each server node contains a CPU, RAM, disk, and network interfaces. One can think of server nodes ranging from individual disk units, to disk controllers, to ordinary PCs. For example, if we have disks that are directly pluggable into the network, then one can imagine using the CPUs on the disk itself. A more realistic alternative may be to use the CPU in a disk controller that plugs into the network. Yet another alternative is to incorporate the software into a set of networked PCs attached to ordinary disks via SCSI or IDE.
Our early experience convinces us that there is no one universal abstraction that can serve the needs of all clients. We are currently implementing a small selection of abstractions including a B-Link Tree (a variant of the B-Tree data structure). Each B-Link tree allows the usual operations of addition, deletion, lookup, enumeration, etc. The utility of building such a system is that it can form the basis for many higher level services such as file systems, data base engines, and is widely believed to simplify the design of clustered Internet services. A second advantage of using a B-Link tree as an exemplar is that it is sufficiently complex that it exposes fundamental issues in the design of scalable, distributed, data abstractions.
Publications and Presentations
Click here for an early PowerPoint presentation about Boxwood.
Research Contributions and Related Work
Building scalable distributed data structures such as the ones we are proposing has received some attention in the literature. For the most part, hash tables have been the data structure considered. Various schemes for building hash tables that run on a loosely coupled set of machines were proposed by Litwin. However, Litwin primarily focuses on the algorithms rather than on dealing with the fault-tolerance of hash tables per se.
Gribble et al. describe a fault-tolerant and scalable hash table implementation. While a hash table is a useful data structure, it is not the only abstraction, and may not always be the best one for the application at hand. Gribble et al. propose building a B-tree as future work, but do not seem to have done so yet.
It is well known that it is hard to build even a centralized B-Link tree that works at a high level of concurrency. As far as we know there are no reports of distributed implementations of B-Link trees in the literature. Our strategy is to extend algorithms for a single-machine implementation of B-link trees to the distributed environment. B-link trees are a variant of B-trees that have some properties that simplify locking in a distributed environment. To our knowledge, no implementations of B-link tree algorithms have been described in the literature, and especially not in a distributed environment.
We initially completed single-machine implementations of our code in C# and in C. This was done to understand (a) the performance differences between programs written in C and C# and (b) the suitability of C# as a systems programming language. Our initial conclusion is that C# appears to be a suitable choice for us. To be sure, some parts of our system that deal with raw disks and the networks are written in C, but the rest of the system is written in C# and works quite well.
We have our code running on a small 8-node cluster. Each node is a PC class machine with hot-pluggable SCSI disks and Gigabit Ethernet links.
Nick Murphy built a multi-node NFS server that runs on the cluster using Boxwood as the basic store instead of a local file system.
Boxwood is also used as the file system for the Singularity kernel being built in Microsoft Research.
Several people have asked us for source code availability. Unfortunately, we don't have any source code we can give people outside Microsoft at the moment.