Efficient Exact Set-Similarity Joins

Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB 2006 |

Published by Very Large Data Bases Endowment Inc.

Given two input collections of sets, a set-similarity join (SSJoin) identifies all pairs of sets, one from each collection, that have high similarity. Recent work has identified SSJoin as a useful primitive operator in data cleaning. In this paper, we propose new algorithms for SSJoin. Our algorithms have two important features: They are exact, i.e., they always produce the correct answer, and they carry precise performance guarantees. We believe our algorithms are the first to have both features; previous algorithms with performance guarantees are only probabilistically approximate. We demonstrate the effectiveness of our algorithms using a thorough experimental evaluation over real-life and synthetic data sets.