Characterizing Cloud Computing Hardware Reliability

Kashi Venkatesh Vishwanath and Nachiappan Nagappan

Abstract

Modern-day datacenters host hundreds of thousands of servers that coordinate tasks in order to deliver highly available cloud computing services. These servers consist of multiple hard disks, memory modules, network cards, processors, etc., each of which, while carefully engineered, is capable of failing. While the probability of seeing any such failure in the lifetime (typically 3-5 years in industry) of a server can be somewhat small, these numbers get magnified across all devices hosted in a datacenter. At such a large scale, hardware component failure is the norm rather than the exception.
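
The magnification effect can be illustrated with a small calculation. The sketch below is not taken from the paper: it assumes that devices fail independently and with the same probability, and the function names and the input values (a 1% lifetime failure probability, 100,000 devices) are hypothetical, chosen only to show the arithmetic.

```python
def prob_at_least_one_failure(p: float, n: int) -> float:
    """Probability that at least one of n devices fails.

    Assumes (for illustration only) that failures are independent and
    that every device has the same lifetime failure probability p.
    """
    return 1.0 - (1.0 - p) ** n


def expected_failures(p: float, n: int) -> float:
    """Expected number of failed devices under the same assumptions."""
    return p * n


if __name__ == "__main__":
    p = 0.01        # hypothetical per-device lifetime failure probability
    n = 100_000     # hypothetical number of devices in a datacenter
    print("P(at least one failure):", prob_at_least_one_failure(p, n))
    print("Expected failures:", expected_failures(p, n))
```

Under these assumed inputs, a failure somewhere in the fleet is a near certainty and the expected number of failed devices is in the hundreds or more, even though any single device is unlikely to fail.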

Hardware failure can lead to a degradation in performance for end-users and can result in losses to the business. A sound understanding of the numbers as well as the causes behind these failures helps improve operational experience, not only by allowing us to be better equipped to tolerate failures but also by bringing down the hardware cost through engineering, directly leading to savings for the company.

To the best of our knowledge, this paper is the first attempt to study server failures and hardware repairs for large datacenters. We present a detailed analysis of failure characteristics as well as a preliminary analysis of failure predictors. We hope that the results presented in this paper will serve as motivation to foster further research in this area.

Details

Publication type: Inproceedings
Published in: Symposium on Cloud Computing
Publisher: IEEE