A Query-Dependent Duplicate Detection Approach for Large Scale Search Engines

  • Shaozhi Ye
  • Ruihua Song
  • Ji-Rong Wen
  • Wei-Ying Ma

6th Asia-Pacific Web Conference, APWeb 2004, Hangzhou, China, April 14-17, 2004

Published by Springer Berlin Heidelberg

Publication

Duplicated Web pages greatly hurt the perceived relevance of a search engine. Existing methods for detecting duplicated Web pages fall into two categories: offline and online. Offline methods aim to detect all duplicates in a large set of Web pages, but none of the reported methods can process more than 30 million Web pages, which is about 1% of the pages indexed by today's commercial search engines. Online methods, by contrast, remove duplicated pages from the search results at run time. Although the number of pages to be processed is smaller, these methods can substantially increase the response time of a search engine. Our experiments on real query logs show a significant difference between popular and unpopular queries in terms of query number and duplicate distribution. Based on this observation, we propose a hybrid query-dependent duplicate detection method that combines the advantages of offline and online methods. This hybrid method provides a solution for duplicate detection that is both effective and scalable.
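
The abstract does not spell out how the hybrid method divides work between its offline and online parts. As a rough illustration only, the sketch below assumes one plausible arrangement: duplicates for popular queries are computed ahead of time and stored per query, while all other queries are de-duplicated at run time. The similarity measure (Jaccard overlap of word trigrams), the 0.9 threshold, and every function name here are assumptions made for the example, not details taken from the paper.

```python
from typing import Dict, List, Set, Tuple

Result = Tuple[str, str]  # (doc_id, page text); a hypothetical result format


def shingles(text: str, k: int = 3) -> Set[tuple]:
    """Return the set of word k-grams of a text (assumed similarity feature)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[tuple], b: Set[tuple]) -> float:
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe_online(results: List[Result], threshold: float = 0.9) -> List[str]:
    """Online path: keep a result only if it is not a near-copy of one already kept."""
    kept_ids: List[str] = []
    kept_shingles: List[Set[tuple]] = []
    for doc_id, text in results:
        s = shingles(text)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept_ids.append(doc_id)
            kept_shingles.append(s)
    return kept_ids


def build_offline_table(popular: Dict[str, List[Result]],
                        threshold: float = 0.9) -> Dict[str, Set[str]]:
    """Offline path: for each popular query, precompute the doc ids to suppress."""
    table: Dict[str, Set[str]] = {}
    for query, results in popular.items():
        kept = set(dedupe_online(results, threshold))
        table[query] = {doc_id for doc_id, _ in results if doc_id not in kept}
    return table


def hybrid_dedupe(query: str, results: List[Result],
                  offline_table: Dict[str, Set[str]],
                  threshold: float = 0.9) -> List[str]:
    """Use the precomputed duplicate set when available, else fall back to online."""
    if query in offline_table:
        duplicates = offline_table[query]
        return [doc_id for doc_id, _ in results if doc_id not in duplicates]
    return dedupe_online(results, threshold)
```

Under these assumptions, a popular query costs only a set lookup per result at run time, and the pairwise comparison cost is paid only for queries outside the precomputed table.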