b-Bit Minwise Hashing

Ping Li and Arnd Christian König


This paper establishes the theoretical framework of b-bit minwise hashing. The original minwise hashing method has become a standard technique for estimating set similarity (e.g., resemblance) with applications in information retrieval, data management, computational advertising, etc.

By only storing b bits of each hashed value (e.g., b = 1 or 2), we gain substantial advantages in terms of storage space. We prove the basic theoretical results and provide an unbiased estimator of the resemblance for any b. We demonstrate that, even in the least favorable scenario, using b = 1 may reduce the storage space by at least a factor of 21.3 (or 10.7) compared to b = 64 (or b = 32), if one is interested in resemblance > 0.5. Our theoretical results are validated using a proprietary collection of 10^6 news articles and a public dataset of 300,000 articles.
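The idea of keeping only b bits per hashed value can be illustrated with a minimal Python sketch. It is not the paper's implementation. The function names (`hash64`, `bbit_signature`, `estimate_resemblance`, `storage_factor`) are hypothetical, salted BLAKE2b hashes stand in for random permutations, and the bias correction assumes that both sets are small relative to the universe, so that the lowest b bits of two different minima agree with probability about 2^-b. The general estimator in the paper also accounts for the relative set sizes, which this sketch ignores.

```python
import hashlib


def hash64(item: str, seed: int) -> int:
    """64-bit hash of item; the seed selects one of k hash functions.
    Assumption: a salted BLAKE2b hash stands in for a random permutation."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def bbit_signature(items, k: int, b: int):
    """For each of k hash functions, keep only the lowest b bits of the minimum hash."""
    mask = (1 << b) - 1
    return [min(hash64(x, j) for x in items) & mask for j in range(k)]


def estimate_resemblance(sig1, sig2, b: int) -> float:
    """Sparse-set approximation: E[match rate] = c + (1 - c) * R with c = 2^-b,
    so the observed match rate is corrected by subtracting c and rescaling."""
    if len(sig1) != len(sig2) or not sig1:
        raise ValueError("signatures must be non-empty and of equal length")
    p_hat = sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)
    c = 2.0 ** -b
    return (p_hat - c) / (1.0 - c)


def storage_factor(b: int, resemblance: float, full_bits: int = 64) -> float:
    """Storage needed by full_bits-bit minwise hashing divided by storage needed
    by b-bit hashing at equal estimator variance (sparse-set assumption)."""
    c = 2.0 ** -b
    p = c + (1.0 - c) * resemblance
    var_full = resemblance * (1.0 - resemblance)   # per-sample variance, full hashes
    var_bbit = p * (1.0 - p) / (1.0 - c) ** 2      # per-sample variance, b-bit estimator
    return (full_bits * var_full) / (b * var_bbit)
```

Under these assumptions, `storage_factor(1, 0.5, 64)` evaluates to 64/3, about 21.3, and `storage_factor(1, 0.5, 32)` to 32/3, about 10.7, which agrees with the factors quoted above: at resemblance 0.5, 1-bit hashing needs three times as many samples, but each sample is 64 (or 32) times smaller.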


Publication type: Inproceedings
Published in: Nineteenth International World Wide Web Conference (WWW 2010)
Publisher: Association for Computing Machinery, Inc.

Newer versions

Ping Li, Arnd Christian König, and Wenhao Gui. b-Bit Minwise Hashing for Estimating Three-Way Similarities, 6 December 2010.

Ping Li and Arnd Christian König. Theory and Applications of b-Bit Minwise Hashing, Communications of the ACM, ACM, August 2011.
