Fast Dual-Tree k-means with Bounded Single-Iteration Runtime
k-means is a widely used clustering algorithm, but for k clusters and a dataset size of N, each iteration of Lloyd’s algorithm costs O(kN) time. Although there are a handful of techniques to accelerate single Lloyd iterations, none of these techniques are tailored to the case of large k, which is increasingly common as dataset sizes grow (one example: vector quantization), and none of these techniques have a worst-case runtime bound of less than O(kN) per iteration. I present a dual-tree algorithm with a worst-case runtime bound of O(N + k log k) under certain assumptions on the dataset, and show that this algorithm outperforms all other alternatives as k and N grow. An implementation is readily available in mlpack.
Speaker Details
Ryan Curtin is a final-year Ph.D. candidate at the Georgia Institute of Technology. His primary research interests are the acceleration of core machine learning tasks, generally via the use of trees. The abstractions that he has developed to understand dual-tree algorithms has enabled the development of mlpack, a fast C++ machine learning library. He does not enjoy writing biographies, and he is also an amateur metallurgist
- Date:
- Speakers:
- Ryan Curtin
- Affiliation:
- Georgia Institute of Technology
-
-
Jeff Running
-