Kashi Venkatesh Vishwanath and Nachiappan Nagappan
June 2010
Modern-day datacenters host hundreds of thousands of servers
that coordinate tasks in order to deliver highly available
cloud computing services. These servers consist of multiple
hard disks, memory modules, network cards, processors, etc.,
each of which, while carefully engineered, is capable of
failing. While the probability of seeing any such failure in
the lifetime (typically 3-5 years in industry) of a server
can be somewhat small, these numbers get magnified across
all devices hosted in a datacenter. At such a large scale,
hardware component failure is the norm rather than an
exception.
Hardware failure can degrade performance for end users and
can result in losses to the business. A sound
understanding of the numbers as well as the causes behind
these failures helps improve operational experience: it not
only leaves us better equipped to tolerate failures
but also helps bring down hardware cost through
engineering, directly leading to savings for the company.
To the best of our knowledge, this paper is
the first attempt to study server failures and hardware
repairs for large datacenters. We present a detailed
analysis of failure characteristics as well as a preliminary
analysis of failure predictors. We hope that the results
presented in this paper will motivate further research
in this area.
In Symposium on Cloud Computing
Publisher IEEE
© 2008 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
http://www.ieee.org/
Type Inproceedings