Share on Facebook Tweet on Twitter Share on LinkedIn Share by email
Consistent Weighted Sampling

Mark Manasse, Frank McSherry, and Kunal Talwar

Abstract

ABSTRACT
We describe an efficient procedure for sampling representatives from a weighted set such that for any weightings S and T, the probability that the two choose the same sample is equal to the Jaccard similarity between them:

Pr[sample(S) = sample(T)] = sumx min(S(x), T(x)) / sumx max(S(x), T(x))

where sample(S) = (x, y) with 0 < y<= S(x). The sampling process takes expected time linear in the number of non-zero weights in S, independent of the weights themselves.

We discuss and develop the implementation of our sampling schemes, reducing the requisite computation and randomness substantially. We additionally discuss a variety of settings in which weighted sampling with highly divergent weights is useful, for example TF-IDF style weighting for determining similarity of web pages, and privacy-preserving auctions.

Details

Publication typeTechReport
NumberMSR-TR-2010-73
PublisherMicrosoft Research
> Publications > Consistent Weighted Sampling