Consistent Weighted Sampling

ABSTRACT

We describe an efficient procedure for sampling representatives from a weighted set such that for any weightings S and T, the probability that the two choose the same sample is equal to the Jaccard similarity between them:

Pr[sample(S) = sample(T)] = \sum_x min(S(x), T(x)) / \sum_x max(S(x), T(x))

where sample(S) = (x, y) with 0 < y<= S(x). The sampling process takes expected time linear in the number of non-zero weights in S, independent of the weights themselves.

We discuss and develop the implementation of our sampling schemes, reducing the requisite computation and randomness substantially. We additionally discuss a variety

of settings in which weighted sampling with highly divergent weights is useful, for example TF-IDF style weighting for determining similarity of web pages, and privacy-preserving auctions.

ConsistentWeightedSampling2.pdf
PDF file

Publisher  Microsoft Research
© 2008 Microsoft Corporation. All rights reserved.

Details

TypeTechReport
NumberMSR-TR-2010-73
> Publications > Consistent Weighted Sampling