Share this page
Share this page E-mail this page Print this page RSS feeds
Home > Publications > Ranking through Random Sampling
Ranking through Random Sampling

We consider ranking of items with respect to their frequencies or values based on random sampling of items. The ranking by frequency includes variants such as identifying a set of top-k most frequent items, identifying and ranking the top-k most frequent items, and identifying the top-k most frequent items and estimating their frequencies. For the ranking by value, we consider creation of a histogram based on the values associated to items. We show how much sampling is needed to guarantee given ranking objective with the probability of error bounded by given parameter. We also identify sequential sampling algorithms that enable on-line deciding about the needed sample size, without a prior information about the underlying distribution from which the sampling is done. To the best of our knowledge, our results for ranking by frequency are novel and those for ranking by value extend the previously-known result by Chaudhuri et al (1998) to histograms of arbitrarily given widths, a case of interest in practice. Our results are applicable in the context of querying large-scale data sets.

MSR-TR-2009-08.pdf
PDF file

Publisher: Microsoft
© 2008 Microsoft Corporation. All rights reserved.

Details

Type: TechReport
Number: MSR-TR-2009-2015