Accelerating EM for large databases

Bo Thiesson, Christopher Meek, and David Heckerman

January 2001

The EM algorithm is a popular method for parameter estimation in a variety of problems involving missing data. However, the EM algorithm often requires significant computational resources and has been dismissed as impractical for large databases. We present two approaches that significantly reduce the computational cost of applying the EM algorithm. Both approaches are based on partial E-steps, for which we can use the results of Neal and Hinton (1998) to obtain the standard convergence guarantees of EM. The first approach is a version of the incremental EM algorithm, described by Neal and Hinton (1998), which cycles through data cases in blocks. The number of cases in each block dramatically affects the efficiency of the algorithm, and we provide a method for selecting a near-optimal block size. The second approach, which we call lazy EM, evaluates the significance of each case at scheduled iterations and then, for several subsequent iterations, actively uses only the significant cases. We demonstrate that both methods can significantly reduce computational costs through their application to high-dimensional real-world and synthetic mixture modeling problems for large databases.

| Publication type | Article |
| Published in | Machine Learning |
| URL | http://www.wkap.nl/kaphtml.htm/ |
| Pages | 279-299 |
| Volume | 45 |
| Publisher | Kluwer Academic |

All copyrights reserved by Kluwer Academic 2001.

