# Finding similar things quickly in large collections

## Mark Manasse

MSR Silicon Valley

### Motivation

It is often of interest, given a large collection of *things, *to quickly determine if many of the things are effectively the same. Consider, for example, the problem of identifying music from a Napster-like service: it may be of interest to RIAA to ascertain which of the songs are copies of songs in their catalog. It might be of use to a search engine for the web to identify near-duplicate pages, to reduce the clutter on a results page, or to reduce the number of pages in the full-text index. It might be of interest to an intelligence agency when scanning a crowd at a public event, when trying to identify which members of the crowd are likely to be suspected malefactors.

Note that in all of these, we are interested not in bit-for-bit identity (which is easy to determine using hashing), but in effective identity, despite variations in the source due to alternative ripping or compression, minor editing, or differences due to aging or imprecise measurements. This leaves us with the problem of finding collections of highly-similar things in a large collection, or of identifying the most similar things in a large collection to a given thing.

In general, we assume that there is a good algorithm for determining whether two things are effectively the same. We will assume that this algorithm works by *feature-extraction* on each of the two things, and then by comparing features.

### Precise features

Let us first consider the case where features are precise: each feature can be represented as a string, and two features match if and only if the strings are identical. Moreover, we assume that concordance of features tells us something interesting and useful. More precisely, if *D*_{1} and *D*_{2} are documents, let *F*_{1} and *F*_{2} be their sets of features. Define Sim(*D*_{1}, *D*_{2}) = ||*F*_{1} \union *F*_{2}|| ÷ ||*F*_{1} \intersection *F*_{2}|| to be the similarity of *D*_{1} and *D*_{2}; that is, the number of common features divided by the total number of features in the two things. We assume that feature-extraction has done something useful, so that the similarity of two essentially-equivalent things is a number close to one, but that for most pairs of dissimilar things, the similarity is a number close to zero. We can then rephrase our problems as equivalent to finding the pairs for which Sim(*D*, *E*) is large, or, fixing *D*, finding the set of *E* such that Sim(*D*, *E*) is large.

Computing these exactly is fairly expensive: we assume that the number of features extracted from each thing is potentially quite large (tens or hundreds of thousands, say), and that the number of things is very large (dozens of billions, probably). Pairwise comparison won't be efficient enough, no matter how fast the individual comparisons are; even enumerating all pairs of things sharing one feature may well be, we assume, prohibitively expensive (some features might be commonplace).

So, let's cheat: we'll use randomness to help us sample the feature sets, and reduce everything to a more-nearly manageable size. Once we've found strong candidates, we can examine those few surviving pairs more exactly, if we need to.

A supposition

Suppose we had a mechanism for selecting one feature *f** _{i}* from each

*D*

*, such that Prob(*

_{i}*f*

*=*

_{i}*f*

*) = Sim(*

_{j}*D*

*,*

_{i}*D*

*). If we ran this selection mechanism over all of our things, we would have point estimators for the similarity of things. If we ran independent selection mechanisms 100 times, we could get an estimate of the percentage of similarity of two things by counting matches in the vectors of selections.*

_{j}So let's do that; pick some number r of selectors, random but fixed for all time, and, for each thing *D** _{i}*, compute

*f*

_{i}^{1}, ...,

*f*

_{i}*, using each selector once on*

^{r}*D*

*. At the cost of a little additional preprocessing, we've reduced the data storage for each thing to a constant, and reduced comparison of sets to matching terms in vectors.*

_{i}But we can do better: we're only interested in high-degree similarity, and we can take advantage of simple combinatorics to make things even easier. If *p*=Sim(*D*, *E*), then each term in the vectors for *D* and *E* match with probability *p*. But then the probability of matching *k* terms in a row is *p** ^{k}*. If

*p*is on the order of 0.95, then

*p*

^{14}is around 0.5; if

*p*is closer to 0.9,

*p*

^{14}is around 0.2. We can then compress our vectors by hashing non-overlapping runs of

*k*items to single integers chosen from a large enough space that the probability of collisions in the hash values is negligible, while reducing our storage needs by a factor of

*k*. If we have

*s*groups of length

*k*, the probability of one or more groups matching is 1-(1-

*p*

*)*

^{k}*; the probability of two or more groups is 1-(1-*

^{s}*p*

*)*

^{k}*-*

^{s}*s*(1-

*p*

*)*

^{k}^{s-1}

*p*

*k*. In practice (see the graphs in our TechFest slides), this can be quite effective at identifying only things which are at least 0.75 similar, and not missing much which is 0.95 or more similar. In so doing, the vector length per thing is reduced to six elements, of which we hope to find two or more matches.

Applying a little more trivial combinatorics, we can simplify the matching process to a handful of hash table lookups per thing, by hashing all the pairs of the *s* reduced values above; if *s* is 6, *s* Choose 2 is only 15, so we can find items which match at least 2 runs with just 15 lookups in hash tables.

For Bing, we shortened the 14 down to about 5, and lengthened the 2 to 4, yielding roughly the same quality, but allowing for a further space reduction.

Discharging the supposition

All this said, we still haven't explained how to discharge the supposition about the selectors. Let's first consider a scheme which works, but which isn't implementable. Pick a random one-to-one function from the space of features to some well-ordered set; for example, pick a totally ordered set which is larger than the (finite) set of all features. Consider one-to-one functions from the space of features to the well-ordered set. From this function space, pick an element uniformly at random. Look at the image of the feature set under this function. By well-orderedness (or finiteness, if that's easier to think about), there is a unique smallest element of the image. Select the chosen feature to be the pre-image of that smallest element.

This works because all functions are equally probable. Any element of a set is equally likely to be mapped to the smallest element, and, when choosing from two sets, the smallest element is uniformly chosen from the union.

Making this practical requires a few trade-offs: to pick uniformly, it's convenient to make the image set a finite set of integers. If the feature set is unbounded, it's hard to get a one-to-one function to a finite set. Using a well-selected hash function (Rabin fingerprints, for instance), we can choose a set which is large enough that the probability of collisions across our set of things is vanishingly small.

Next, picking a completely random function from the set is unimplementable; instead, we choose a smaller, easily parameterized set of functions to choose from, and prove that that's good enough to get arbitrarily close to the right probability. In practice, we use a combination of linear congruential permutations with Rabin fingerprints.

### Imprecise features

The foregoing was enough to deal with precise features; we next briefly address working with imprecise features.

We assume that features can be placed into a metric space. For each type of feature we generate, we consider quantized neighborhoods of the feature such that close features give rise to neighborhoods with large overlap, while dissimilar features result in neighborhoods with small overlap. This requires choosing both the radius of the neighborhood in the metric space, as well as the density of the quantized set of features.

### Important features

We can make some features more important than others by repeating features enough times to bias the definition of similarity, by replacing feature *f* with features (*f*,1), (*f*,2), ..., (*f*, *n*). As long as a given feature is replicated the same number of times in each thing in which it occurs, we can tailor our definition of similarity to better match the semantic ideas as to what aspects of a thing are important.

Post 2007, we now know ways to efficiently allow the weights to be arbitrary non-negative reals and still make random selections preserving uniformity.

### An application

We worked with Alta Vista to reduce the size of the index by discarding near-duplicate pages, and to reduce the number of spam pages submitted to the index. The technique was to define feature-extraction for HTML documents (ignore case, discard most markup other than the URLs in HREFs and in IMGs, treat strings of six consecutive words after the foregoing normalization as a feature), and then apply the techniques described above for precise features, with *k*=14 and *s*=6. Using these techniques, we were able to efficiently reduce the number of fully-indexed pages by a third, with no significant reduction in the useful pages.

We called our feature-extraction mechanism *shingling*, because the overlapping strings of words look (to us) like shingles on a roof. Correspondingly, we named the technique related to the choice of *k **supershingling*, and that of *s **megashingling*.

### Improving sampling

A key step in the process is to select samples uniformly. As described above, this requires assigning a random number to each input feature for each sample which is to be chosen. That's a lot of randomness.

In 1997, we improved on this by instead choosing a short random string, and then only producing a full length string of randomness for those features which were contenders to have the smallest random number. Choosing the prefix to be an eight-bit slice of a 64-bit random number, we can improve the efficiency almost eightfold: almost, because one 256th of the input features are likely to receive the same smallest lead 8 bits, requring 64 more bits to be chosen; this adds an expected 1/4 bit per input feature per sample. Choosing only 4 bits instead of 8 balances the effort at an initial 4 bits plus an expected 4 continuation bits at the cost of creating really long candidate lists.

More recently, Kunal Talwar, Frank McSherry and I have been working on reducing the requisite randomness. Suppose, for example, that we choose an expected 2 bits, where the value taken as many zeroes as are produced, followed by the first one bit. In expectation, this consumes two bits to produce a sample. Nonetheless, there will be few collisions, because with 2^{k} input features, only a few will be fortunate enough to start with roughly k zeroes; three is an upper bound on the expected number of maximal length initial strings of zeroes. By producing a long string of bits, we can segment it at the ones, producing the samples for many selection functions in parallel at an expected cost of two bits, with an expected bounded number of ties independent of input length.

But we can do still better: suppose that we had a way to do what we just said, but the initial bit in each sequence was far more likely to be a one than a zero. In that case, we could reduce the expected length from 2 towards 1. But biasing the bits in this way is hard; fortunately, we can accomplish this using unbiased bits by combining probabilities. Suppose we had such a biased bit with bias one in 100. In that case, if we were selecting 100 samples. roughly one time in e (the base of the natural logarithm) all 100 samples would start with a 1. When this doesn't happen, we need to figure out which sample starts with a zero and produce more unbiased bits until the first one to break ties and figure out if any other samples start with a zero, but this can all be done relatively inexpensively if the number of features is large compared to the rejection probability. When that isn't true, we might end up in a situation where no feature was selected for a given sample, and then we need to reprocess our input the 2-bit way. But by carefully choosing the bias to work well with the expected size of our inputs, we can typically drive the selection cost down to about a third of a bit of randomness per input feature per sample.

### Ping Li's fast uniform sampling

In 2012, Ping Li, at the time a Cornell professor, came up with a huge performance improvement in uniform sampling. In 1997, we had already noted that rather than compute roughly 100 independent samples, using different random functions, we could take the 100 smallest images under a single hash function. This works, but makes the arithmetic more complicated, because to compare two sets we need to redo the 100 smallest from the union. Fortunately, the 100 smallest from the union are contained in the union of the 100 smallest from each, so we don't need to go back to the original documents to compute this. Unfortunately, we can no longer work with a vector of samples and compare only things in the same position; we have to work a bit harder.

Ping realized that if, instead, we create 100 bins, and pick a random function mapping a feature to a random bin and a random value, then we can make do with a single random function evaluation. Some of the bins may end up empty with short (<2000 feature) documents, but empty != non-empty for comparing vectors, but empty vs. empty doesn't count at all. The proof of approximation still works. Ping and Christian Koenig also began the consideration of reduced storage for selected features: pick a winner per bin as above, but now, for each winner, pick a pseudo random 2 bits keyed off of the winning feature. With 128 slots, say, 32 (1/4) will match due to luck even for completely divergent inputs, but (m-32)/96 is still and unbiased estimate of the similarity when the 2-bit vectors match in m positions, when this is non-negative, and our storage drops from 100 x 64 or 96 bits down to 128 x 2 bits, saving a factor of 30 or so in space for roughly equivalent predictive power.

### Notes, references, and prior work

Note: this work was done back while I was at Compaq's Systems Research Center, by myself and others. You can see an old page here, or go look at some papers:

Andrei Broder. **On the resemblance and containment of documents**, In *Compression and Complexity of Sequences (SEQUENCES'97)*, pages 21-29. IEEE Computer Society, 1998. (PDF), (Postscript), (Copyright 1997 IEEE).

Andrei Broder, Steve Glassman, Mark Manasse, and Geoffrey Zweig. **Syntactic clustering of the Web**. In *Proceedings of the 6th International World Wide Web Conference*, pages 391-404, April 1997, also appeared as SRC Technical Note 1997-015. (HTML).

### Pointers

Code implementing shingling is available from Shingleprinting code for estimating document similarity on the Microsoft Research code-posting site.