Scaling Clustering Algorithms to Large Databases

  • P.S. Bradley ,
  • Usama Fayyad ,
  • Cory Reina

KDD'98 Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining |

Published by American Association for Artificial Intelligence

Practical clustering algorithms require multiple data scans to achieve convergence. For large databases, these scans become prohibitively expensive. We present a scalable clustering framework applicable to a wide class of iterative clustering. We require at most one scan of the database. In this work, the popular K-Means clustering algorithm. The method is based on identifying regions of the data that are compressible, regions that must be maintained memory, and regions that are discardable. The algorithm operates within the confines of a limited memory butter. Empirical results demonstrate that the scalable scheme outperforms a sampling-based approach. In our scheme, data resolution is preserved to the extent possible based upon the size of the allocated memory butter and the fit of current clustering model to the data. The framework is naturally extended to update multiple clustering models simultaneously. We empirically evaluate on sysnthetic and publicly available data sets.