Mark Manasse, Frank McSherry, and Kunal Talwar
We describe an efficient procedure for sampling representatives from a weighted set such that for any weightings S and T, the probability that the two choose the same sample is equal to the Jaccard similarity between them:
Pr[sample(S) = sample(T)] = sumx min(S(x), T(x)) / sumx max(S(x), T(x))
where sample(S) = (x, y) with 0 < y<= S(x). The sampling process takes expected time linear in the number of non-zero weights in S, independent of the weights themselves.
We discuss and develop the implementation of our sampling schemes, reducing the requisite computation and randomness substantially. We additionally discuss a variety of settings in which weighted sampling with highly divergent weights is useful, for example TF-IDF style weighting for determining similarity of web pages, and privacy-preserving auctions.
© 2008 Microsoft Corporation. All rights reserved.