Zhe Zhang, Amey Deshpande, Xiaosong Ma, Eno Thereska, and Dushyanth Narayanan
Today, replication has become the de facto standard for storing data within and across data centers that process data-intensive workloads. Erasure coding (a form of software RAID), although heavily researched and theoretically more space-efficient than replication, has complex tradeoffs that are not well understood by practitioners. Today's data centers run diverse foreground and background data-intensive workloads, and getting these tradeoffs right is becoming increasingly important. Through a series of realistic data center deployment scenarios and workload characteristics, coupled with the implementation of a prototype Hadoop library with erasure coding functionality, we revisit traditional metrics (performance and dollar cost), present new tradeoffs (power proportionality and complexity), and make recommendations on directions worth researching.
Publisher: Microsoft Research
© 2009 Microsoft Corporation. All rights reserved.