Koh-i-Noor is an investigation into, and an attempt to build, large reliable disk arrays. We hope to use some slightly novel erasure codes to provide adequate reliability in the presence of disk failures for storage systems comprising up to ten thousand disks.
As of August 2003, Koh-i-Noor is no longer under active development, although we continue to stumble on interesting related results and ideas.
In the Koh-i-Noor project, we are investigating how to construct large, inexpensive, reliable virtual disks. We are using somewhat-novel erasure codes (similar to standard Reed-Solomon codes) and parallel reconstruction techniques during repair. The goal is to allow the construction of extremely large (100 terabytes to 1 petabyte, using today's technology) virtual disks, built by organizing clusters of small disks into modest-sized groups that provide low-level reliability and reduce maintenance costs, without imposing high overhead on the cost of storage.
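To make the erasure-code idea concrete, here is a minimal sketch of the simplest member of this family: a single-parity code, where one XOR block lets a stripe survive the loss of any one disk. (Koh-i-Noor's actual codes correct up to three erasures and are Reed-Solomon-like; the function names below are illustrative, not the project's API.)

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

def encode(data_blocks):
    """Append one parity block: the XOR of all data blocks."""
    return data_blocks + [xor_blocks(data_blocks)]

def recover(stripe, lost_index):
    """Rebuild the block at lost_index by XORing all survivors.

    Works because parity = d0 ^ d1 ^ ... ^ dk, so XORing the
    survivors (including parity) yields exactly the missing block.
    """
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)

data = [b"disk0dat", b"disk1dat", b"disk2dat"]
stripe = encode(data)
assert recover(stripe, 1) == data[1]  # disk 1 lost, rebuilt from the rest
```

A triple-erasure code follows the same pattern but adds three independent check blocks per group, so any three lost blocks can be rebuilt from the survivors.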
In particular, we expect to build clusters of up to 256 disks, each attached to a separate processor. Each processor is connected to two independent networks, organized as trees using small, inexpensive switches. The goal is to provide reliability, not availability, so we assume that processor and/or network reboots will cure many transient failures. Blocks are allocated to the disks by a mapping function that distributes primary storage uniformly; blocks initially mapped to permanently-failed disks are remapped to vacant blocks on surviving disks. Erasure-recovery locations are also assigned by a mapping function. We limit the usable capacity of a cluster to less than 85% of its apparent capacity, both to leave room for correction blocks (~1.5%) and to leave excess capacity for re-vectoring blocks after failures. With roughly 15% overcapacity, we would expect even our most-unlucky cluster to still have spare capacity after five years, assuming an MTTF of 50 years for individual disk drives. If we suffer no cabling failures, no dependent hardware failures, and a CPU MTTF similar to that of the disks, we might hope to defer all maintenance on any cluster for up to five years.
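One simple way to realize such a mapping function is to hash each block identifier onto the disks and probe forward past any permanently-failed disks, so that blocks are automatically re-vectored to survivors. This is a sketch of the general technique under that assumption, not the project's actual mapping function:

```python
import hashlib

NUM_DISKS = 256  # disks per cluster, from the text above

def primary_disk(block_id, alive):
    """Map a block to a disk deterministically and uniformly.

    The block's id is hashed to a starting disk; if that disk has
    permanently failed (not in `alive`), probe forward until a
    surviving disk is found, re-vectoring the block automatically.
    (Illustrative sketch only.)
    """
    digest = hashlib.sha256(str(block_id).encode()).digest()
    h = int.from_bytes(digest[:8], "big")
    for probe in range(NUM_DISKS):
        d = (h + probe) % NUM_DISKS
        if d in alive:
            return d
    raise RuntimeError("no surviving disks in cluster")
```

Because the mapping is a pure function of the block id and the failure set, any processor can compute a block's location without consulting a central directory.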
Triple-erasure-correcting codes give us an expected time of 50,000 years until we experience data loss on any block in the petabyte.
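The intuition behind a figure of this magnitude is that with a triple-erasure-correcting code, data is lost only when a fourth disk in a group fails while three others are still awaiting repair. A standard Markov-style back-of-envelope approximation (our assumption for illustration, not the project's published model) captures this; with a 256-disk group, a 50-year disk MTTF, and a repair time on the order of a day, it lands in the same tens-of-thousands-of-years regime:

```python
def mttdl_hours(n, mttf_h, mttr_h):
    """Approximate mean time to data loss for an n-disk group
    tolerating 3 erasures: loss requires a 4th failure while 3
    repairs are in flight, giving MTTF^4 over the product of the
    four failure opportunities and MTTR^3.  Back-of-envelope only.
    """
    return (mttf_h ** 4) / (n * (n - 1) * (n - 2) * (n - 3) * mttr_h ** 3)
```

Shortening the repair window (e.g., via the parallel reconstruction techniques above) raises the MTTDL by the cube of the improvement, which is why fast distributed rebuild matters so much.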
We can attach a greater number of disks to a CPU by assigning them to independent clusters. A CPU or controller failure can then affect several disks, but never two disks that rely on the same set of erasure-correction disks. This does affect the statistics of total-system data loss, which argues for more frequent servicing.
- You can read some PowerPoint slides, or
- a note on a new construction of a triple-erasure-correcting code, or
- a paper on erasure codes and rapid distributed computation of XOR, or
- a simplified note on data layout in Koh-i-Noor, if you're not yet terminally bored.