Finding similar things quickly in large collections
It is often of interest, given a large collection of things, to quickly determine if many of the things are effectively the same. Consider, for example, the problem of identifying music from a Napster-like service: it may be of interest to RIAA to ascertain which of the songs are copies of songs in their catalog. It might be of use to a search engine for the web to identify near-duplicate pages, to reduce the clutter on a results page, or to reduce the number of pages in the full-text index. It might be of interest to an intelligence agency when scanning a crowd at a public event, when trying to identify which members of the crowd are likely to be suspected malefactors.
Note that in all of these, we are interested not in bit-for-bit identity (which is easy to determine using hashing), but in effective identity, despite variations in the source due to alternative ripping or compression, minor editing, or differences due to aging or imprecise measurements. This leaves us with the problem of finding collections of highly-similar things in a large collection, or of identifying the most similar things in a large collection to a given thing.
In general, we assume that there is a good algorithm for determining whether two things are effectively the same. We will assume that this algorithm works by feature-extraction on each of the two things, and then by comparing features.
Let us first consider the case where features are precise: each feature can be represented as a string, and two features match if and only if the strings are identical. Moreover, we assume that concordance of features tells us something interesting and useful. More precisely, if D1 and D2 are documents, let F1 and F2 be their sets of features. Define Sim(D1, D2) = ||F1 \union F2|| ÷ ||F1 \intersection F2|| to be the similarity of D1 and D2; that is, the number of common features divided by the total number of features in the two things. We assume that feature-extraction has done something useful, so that the similarity of two essentially-equivalent things is a number close to one, but that for most pairs of dissimilar things, the similarity is a number close to zero. We can then rephrase our problems as equivalent to finding the pairs for which Sim(D, E) is large, or, fixing D, finding the set of E such that Sim(D, E) is large.
Computing these exactly is expensive: we assume that the number of features extracted from each thing is potentially quite large (tens or hundreds of thousands, say), and that the number of things is very large (billions, probably). Pairwise comparison won't be efficient enough. Enumerating all pairs of things sharing one feature is, we assume, prohibitively expensive (some features might be commonplace).
So, let's cheat: we'll use randomness to help us sample the feature sets, and reduce everything to a more-nearly manageable size.
Suppose we had a mechanism for selecting one feature fi from each Fi, such that Prob(fi = fj) = Sim(Di, Dj). If we ran this selection mechanism over all of our things, we would have point estimators for the similarity of things. If we ran independent selection mechanisms 100 times, we could get an estimate of the percentage of similarity of two things by counting matches in the vectors of selections.
So let's do that; pick some number r of selectors, and, for each thing Di, compute fi1, ..., fir, using each selector once on Di. At the cost of a little additional preprocessing, we've reduced the data storage for each thing to a constant, and reduced comparison of sets to matching terms in vectors.
But we can do better: we're only interested in high-degree similarity, and we can take advantage of simple combinatorics to make things even easier. If p=Sim(D, E), then each term in the vectors for D and E match with probability p. But then the probability of matching k terms in a row is pk. If p is on the order of 0.95, then p14 is around 0.5; if p is closer to 0.9, p14 is around 0.2. We can then compress our vectors by hashing non-overlapping runs of k items to single integers chosen from a large enough space that the probability of collisions in the hash values is negligible, while reducing our storage needs by a factor of k. If we have s groups of length k, the probability of one or more groups matching is 1-(1-pk)s; the probability of two or more groups is 1-(1-pk)s-s(1-pk)s-1pk. In practice (see the graphs in our TechFest slides), this can be quite effective at identifying only things which are at least 0.75 similar, and not missing much which is 0.95 or more similar. In so doing, the vector length per thing is reduced to six elements, of which we hope to find two or more matches.
Applying a little more trivial combinatorics, we can simplify the matching process to a handful of hash table lookups per thing, by hashing all the pairs of the s reduced values above; if s is 6, s Choose 2 is only 15, so we can find items which match at least 2 runs with just 15 lookups in hash tables.
Discharging the supposition
All this said, we still haven't explained how to discharge the supposition about the selectors. Let's first consider a scheme which works, but which isn't implementable. Pick a random one-to-one function from the space of features to some well-ordered set; that is, pick a totally ordered set which is larger than the set of all features. Consider one-to-one functions from the space of features to the well-ordered set. From this function space, pick an element uniformly at random. Look at the image of the feature set under this function. By well-orderedness, there is a smallest element of the image. Let the chosen feature by the pre-image of that smallest element.
This works because all functions are equally probable. Any element of a set is as likely to be mapped to the smallest element, and, when choosing from two sets, the smallest element is uniformly chosen from the union.
Making this practical requires a few trade-offs: to pick uniformly, it's convenient to make the image set a finite set of integers. If the feature set is unbounded, it's hard to get a one-to-one function to a finite set. Using a well-selected hash function (Rabin fingerprints, for instance), we can choose a set which is large enough that the probability of collisions across our set of things is vanishingly small.
Next, picking a completely random function from the set is unimplementable; instead, we choose a smaller, easily parameterized set of functions to choose from, and prove that that's good enough to get arbitrarily close to the right probability. In practice, we use a combination of linear congruential permutations with Rabin fingerprints.
The foregoing was enough to deal with precise features; we next briefly address working with imprecise features.
We assume that features can be placed into a metric space. For each type of feature we generate, we consider quantized neighborhoods of the feature such that close features give rise to neighborhoods with large overlap, while dissimilar features result in neighborhoods with small overlap. This requires choosing both the radius of the neighborhood in the metric space, as well as the density of the quantized set of features.
We can make some features more important than others by repeating features enough times to bias the definition of similarity, by replacing feature f with features (f,1), (f,2), ..., (f, n). As long as a given feature is replicated the same number of times in each thing in which it occurs, we can tailor our definition of similarity to better match the semantic ideas as to what aspects of a thing are important.
We worked with Alta Vista to reduce the size of the index by discarding near-duplicate pages, and to reduce the number of spam pages submitted to the index. The technique was to define feature-extraction for HTML documents (ignore case, discard most markup other than the URLs in HREFs and in IMGs, treat strings of six consecutive words after the foregoing normalization as a feature), and then apply the techniques described above for precise features, with k=14 and s=6. Using these techniques, we were able to efficiently reduce the number of fully-indexed pages by a third, with no significant reduction in the useful pages.
We called our feature-extraction mechanism shingling, because the overlapping strings of words look (to us) like shingles on a roof. Correspondingly, we named the technique related to the choice of k supershingling, and that of s megashingling.
A key step in the process is to select samples uniformly. As described above, this requires assigning a random number to each input feature for each sample which is to be chosen. That's a lot of randomness.
In 1997, we improved on this by instead choosing a short random string, and then only producing a full length string of randomness for those features which were contenders to have the smallest random number. Choosing the prefix to be an eight-bit slice of a 64-bit random number, we can improve the efficiency almost eightfold: almost, because one 256th of the input features are likely to receive the same smallest lead 8 bits, requring 64 more bits to be chosen; this adds an expected 1/4 bit per input feature per sample. Choosing only 4 bits instead of 8 balances the effort at an initial 4 bits plus an expected 4 continuation bits at the cost of creating really long candidate lists.
More recently, Kunal Talwar, Frank McSherry and I have been working on reducing the requisite randomness. Suppose, for example, that we choose an expected 2 bits, where the value taken as many zeroes as are produced, followed by the first one bit. In expectation, this consumes two bits to produce a sample. Nonetheless, there will be few collisions, because with 2k input features, only a few will be fortunate enough to start with roughly k zeroes; three is an upper bound on the expected number of maximal length initial strings of zeroes. By producing a long string of bits, we can segment it at the ones, producing the samples for many selection functions in parallel at an expected cost of two bits, with an expected bounded number of ties independent of input length.
But we can do still better: suppose that we had a way to do what we just said, but the initial bit in each sequence was far more likely to be a one than a zero. In that case, we could reduce the expected length from 2 towards 1. But biasing the bits in this way is hard; fortunately, we can accomplish this using unbiased bits by combining probabilities. Suppose we had such a biased bit with bias one in 100. In that case, if we were selecting 100 samples. roughly one time in e (the base of the natural logarithm) all 100 samples would start with a 1. When this doesn't happen, we need to figure out which sample starts with a zero and produce more unbiased bits until the first one to break ties and figure out if any other samples start with a zero, but this can all be done relatively inexpensively if the number of features is large compared to the rejection probability. When that isn't true, we might end up in a situation where no feature was selected for a given sample, and then we need to reprocess our input the 2-bit way. But by carefully choosing the bias to work well with the expected size of our inputs, we can typically drive the selection cost down to about a third of a bit of randomness per input feature per sample.
Notes, references, and prior work
Note: this work was done back while I was at Compaq's Systems Research Center, by myself and others. You can see an old page here, or go look at some papers:
Andrei Broder. On the resemblance and containment of documents, In Compression and Complexity of Sequences (SEQUENCES'97), pages 21-29. IEEE Computer Society, 1998. (PDF), (Postscript), (Copyright 1997 IEEE).
Andrei Broder, Steve Glassman, Mark Manasse, and Geoffrey Zweig. Syntactic clustering of the Web. In Proceedings of the 6th International World Wide Web Conference, pages 391-404, April 1997, also appeared as SRC Technical Note 1997-015. (HTML).
Code implementing shingling is available from Shingleprinting code for estimating document similarity on the Microsoft Research code-posting site.