The Learning-Curve Sampling Method Applied to Model-Based Clustering

Christopher Meek, Bo Thiesson, and David Heckerman

February 2001

We examine the learning-curve sampling method, an approach for applying machinelearning algorithms to large data sets. The approach is based on the observation that the computational cost of learning a model increases as a function of the sample size of the training data, whereas the accuracy of a model has diminishing improvements as a function of sample size. Thus, the learning-curve sampling method monitors the increasing costs and performance as larger and larger amounts of data are used for training, and terminates learning when future costs outweigh future benefits. In this paper, we formalize the learning-curve sampling method and its associated cost-benefit tradeoff in terms of decision theory. In addition, we describe the application of the learning-curve sampling method to the task of model-based clustering via the expectation-maximization (EM) algorithm. In experiments on three real data sets, we show that the learning-curve sampling method produces models that are nearly as accurate as those trained on complete data sets, but with dramatically reduced learning times. Finally, we describe an extension of the basic learning-curve approach for model-based clustering that results in an additional speedup. This extension is based on the observation that the shape of the learning curve for a given model and data set is roughly independent of the number of EM iterations used during training. Thus, we run EM for only a few iterations to decide how many cases to use for training, and then run EM to full convergence once the number of cases is selected.

Publication type | Article |

Published in | Journal of Machine Learning Research |

URL | http://www.ai.mit.edu/projects/jmlr/ |

Pages | 397-418 |

Volume | 2 |

Publisher | Journal of Machine Learning Research © Copyright 2001, JMLR |

- Founder Effects in the Assessment of HIV Polymorphisms and HLA Allele Associations
- Partitioned Logistic Regression for Spam Filtering
- Speeding up HMM algorithms for genetic linkage analysis

> Publications > The Learning-Curve Sampling Method Applied to Model-Based Clustering