Ping Li and Arnd Christian König
Efficient (approximate) computation of set similarity in very large datasets is a common task with many applications in information retrieval and data management. One common approach for this task is minwise hashing. This paper describes b-bit minwise hashing, which can provide an order of magnitude improvements in storage requirements and computational overhead over the original scheme in practice.
We give both theoretical characterizations of the performance of the new algorithm as well as a practical evaluation on large real-life datasets and show that these match very closely. Moreover, we provide a detailed comparison with other important alternative techniques proposed for estimating set similarities. Our technique yields a very simple algorithm and can be realized with only minor modifications to the original minwise hashing scheme.
In Communications of the ACM
Ping Li, Arnd Christian König, and Wenhao Gui. b-Bit Minwise Hashing for Estimating Three-Way Similarities, 6 December 2010.
Ping Li and Arnd Christian König. b-Bit Minwise Hashing, Association for Computing Machinery, Inc., 26 April 2010.