Share on Facebook Tweet on Twitter Share on LinkedIn Share by email
Scaling Clustering Algorithms to Large Databases

P.S. Bradley, Usama Fayyad, and Cory Reina


Practical clustering algorithms require multiple data scans to achieve convergence. For large databases, these scans become prohibitively expensive. We present a scalable clustering framework applicable to a wide class of iterative clustering. We require at most one scan of the database. In this work, the popular K-Means clustering algorithm. The method is based on identifying regions of the data that are compressible, regions that must be maintained memory, and regions that are discardable. The algorithm operates within the confines of a limited memory butter. Empirical results demonstrate that the scalable scheme outperforms a sampling-based approach. In our scheme, data resolution is preserved to the extent possible based upon the size of the allocated memory butter and the fit of current clustering model to the data. The framework is naturally extended to update multiple clustering models simultaneously. We empirically evaluate on sysnthetic and publicly available data sets.


Publication typeInproceedings
InstitutionMicrosoft Research
PublisherAmerican Association for Artificial Intelligence
> Publications > Scaling Clustering Algorithms to Large Databases