CloudClustering: Toward an iterative data processing pattern on the cloud

Ankur Dave, Roger Barga, Wei Lu, and Jared Jackson

Abstract

As the emergence of cloud computing brings the potential for large-scale data analysis to a broader community, architectural patterns for data analysis on the cloud, especially for iterative algorithms, are increasingly useful. MapReduce suffers from performance limitations for this purpose because it is not inherently designed for iterative algorithms.

In this paper we describe our implementation of CloudClustering, a distributed k-means clustering algorithm on Microsoft’s Windows Azure cloud. The k-means algorithm makes a good case study because its characteristics are representative of many iterative data analysis algorithms. CloudClustering adopts a novel architecture to improve performance without sacrificing fault tolerance. To achieve this goal, we introduce a distributed fault tolerance mechanism called the buddy system, and we make use of data affinity and checkpointing. Our goal is to generalize this architecture into a pattern for large-scale iterative data analysis on the cloud.
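
To illustrate why k-means is representative of iterative data analysis, the following is a minimal single-process sketch of its structure. It is not the CloudClustering implementation: the function names (`assign_partition`, `merge_partials`, `kmeans`), the in-memory lists standing in for worker partitions, and the convergence tolerance are all assumptions made for illustration. Each pass needs the centroids produced by the previous pass, which is the dependency that makes a one-shot batch model a poor fit.

```python
from math import dist  # Euclidean distance, Python 3.8+


def assign_partition(points, centroids):
    """Worker step (hypothetical): per-centroid coordinate sums and counts
    for the points in one data partition."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += p[d]
    return sums, counts


def merge_partials(partials, centroids):
    """Master step (hypothetical): combine partial results into new centroids."""
    k, dim = len(centroids), len(centroids[0])
    new_centroids = []
    for i in range(k):
        total = sum(counts[i] for _, counts in partials)
        if total == 0:
            # No points assigned: keep the previous centroid unchanged.
            new_centroids.append(list(centroids[i]))
            continue
        new_centroids.append(
            [sum(sums[i][d] for sums, _ in partials) / total for d in range(dim)]
        )
    return new_centroids


def kmeans(partitions, centroids, max_iters=100, tol=1e-9):
    """Repeat assign/merge until no centroid moves more than tol."""
    iteration = 0
    for iteration in range(1, max_iters + 1):
        # In a distributed setting each partition would be handled by a
        # separate worker; here the work is done sequentially.
        partials = [assign_partition(part, centroids) for part in partitions]
        new_centroids = merge_partials(partials, centroids)
        moved = max(dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if moved <= tol:
            break
    return centroids, iteration
```

In this sketch every iteration rereads the same partitions, so keeping each partition on the same worker across iterations avoids repeated data transfer. That is one plausible reading of why the abstract names data affinity as part of the design.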

Details

Publication type: Inproceedings
Published in: Proceedings of IEEE DataCloud 2011
Publisher: IEEE