
Speaker Yuval Rabani Host Yuval Rabani Affiliation The Hebrew University of Jerusalem Duration 01:08:37 Date recorded 27 June 2013 We give an algorithm for learning a mixture of unstructured distributions. This problem arises in various unsupervised learning scenarios, for example in learning topic models from a corpus of documents spanning several topics. We show how to learn the constituents of a mixture of k arbitrary distributions over a large discrete domain [n]=1,2,…,n} and the mixture weights, using O(n polylog n) samples. (In the topicmodel learning setting, the mixture constituents correspond to the topic distributions.) This task is informationtheoretically impossible for k1 under the usual sampling process from a mixture distribution. However, there are situations (such as the abovementioned topic model case) in which each sample point consists of several observations from the same mixture constituent. This number of observations, which we call the “sampling aperture”, is a crucial parameter of the problem. We obtain the first bounds for this mixturelearning problem without imposing any assumptions on the mixture constituents. We show that efficient learning is possible exactly at the informationtheoretically leastpossible aperture of 2k1. Thus, we achieve nearoptimal dependence on n and optimal aperture. While the samplesize required by our algorithm depends exponentially on k, we prove that such a dependence is unavoidable when one considers general mixtures. A sequence of tools contribute to the algorithm, such as concentration results for random matrices, dimension reduction, moment estimations, and sensitivity analysis. Joint work with Leonard Schulman and Chaitanya Swamy.
©2013 Microsoft Corporation. All rights reserved.
