Fingerprints for Highly Similar Streams
- Yoram Bachrach ,
- Ely Porat
Information and Computation |
We propose an approach for approximating the Jaccard similarity of two streams, for domains where this similarity is known to be high. Our method is based on a reduction from Jaccard similarity to F_2 norm estimation, for which there exists a sketch that is efficient in terms of both size and compute time, which we augment by a sampling technique. Our approach offers an improvement in the fingerprint size that is quadratic in the degree of similarity between the streams. Further, computing our fingerprint can be done in time O(1) per element in the stream.
NOTICE: this is the author's version of a work that was accepted for publication. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published at http://www.elsevier.com/.