Fast Dual-Tree k-means with Bounded Single-Iteration Runtime

k-means is a widely used clustering algorithm, but for k clusters and a dataset size of N, each iteration of Lloyd’s algorithm costs O(kN) time. Although there are a handful of techniques to accelerate single Lloyd iterations, none of these techniques are tailored to the case of large k, which is increasingly common as dataset sizes grow (one example: vector quantization), and none of these techniques have a worst-case runtime bound of less than O(kN) per iteration. I present a dual-tree algorithm with a worst-case runtime bound of O(N + k log k) under certain assumptions on the dataset, and show that this algorithm outperforms all other alternatives as k and N grow. An implementation is readily available in mlpack.

Speaker Details

Ryan Curtin is a final-year Ph.D. candidate at the Georgia Institute of Technology. His primary research interests are the acceleration of core machine learning tasks, generally via the use of trees. The abstractions that he has developed to understand dual-tree algorithms has enabled the development of mlpack, a fast C++ machine learning library. He does not enjoy writing biographies, and he is also an amateur metallurgist

Date:
Speakers:
Ryan Curtin
Affiliation:
Georgia Institute of Technology
    • Portrait of Jeff Running

      Jeff Running