Detecting Duplicate Web Documents using Clickthrough Data

The web contains many duplicate and near-duplicate documents. Given that user satisfaction is negatively affected by redundant information in search results, a significant amount of research has been devoted to developing duplicate detection algorithms. However, most such algorithms rely solely on document content to detect duplication, ignoring the fact that a primary goal of duplicate detection is to identify documents that contain redundant information with respect to a particular user query. Similarly, although query-dependent result diversification algorithms compute a query-dependent ranking, they tend to do so on the basis of a query-independent content similarity score.

In this paper, we bridge the gap between query-dependent redundancy and query-independent duplication by showing how user click behavior following a query provides evidence about the relative novelty of web documents. While most previous work on interpreting user clicks on search results has assumed that they reflect just result relevance, we show that clicks also provide information about duplication between web documents since users consider search results in the context of previously seen documents. Moreover, we find that duplication explains a substantial amount of presentation bias observed in clicking behavior. We identify three distinct types of redundancy that commonly occur on the web and show how click data can be used to detect these different types.

PDF file

In  ACM International Conference on Web Search and Data Mining (WSDM)

Publisher  Association for Computing Machinery, Inc.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 2011, ACM


> Publications > Detecting Duplicate Web Documents using Clickthrough Data