Duplicate News Story Detection Revisited

  • Omar Alonso
  • Dennis Fetterly
  • Mark Manasse

MSR-TR-2013-60

In this paper, we investigate near-duplicate detection, focusing on the detection of evolving news stories. These stories often consist primarily of syndicated information, with local replacement of headlines and captions and the addition of locally relevant content. By detecting near-duplicates, we can offer users only those stories whose content is materially different from previously viewed versions of the story. We expand on previous work that improves the performance of near-duplicate document detection by weighting the phrases in a sliding window based on the within-document term frequency of the terms in that window and the inverse document frequency of those phrases. We experiment on a subset of a publicly available web collection that consists solely of documents from news web sites. News articles are particularly challenging due to the prevalence of syndicated articles, where very similar articles are run with different headlines and surrounded by different HTML markup and site templates. We evaluate these algorithmic weightings against human judgments of similarity. We find that our techniques outperform the state of the art with statistical significance and are more discriminating when faced with a diverse collection of documents.
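As a rough illustration of the weighting idea described in the abstract, the sketch below scores each sliding-window phrase by the term frequency of its terms and the inverse document frequency of the phrase, then compares two documents with a weighted Jaccard measure. This is not the paper's algorithm: the window size, the combination rule (mean term frequency times a smoothed IDF), the weighted Jaccard comparison, and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shingles(tokens, k=3):
    """Return every window of k consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def document_frequencies(corpus, k=3):
    """Count, for each shingle, the number of documents that contain it."""
    df = Counter()
    for tokens in corpus:
        df.update(set(shingles(tokens, k)))
    return df


def shingle_weights(tokens, doc_freq, n_docs, k=3):
    """Weight each shingle by (mean TF of its terms) * (smoothed IDF of the shingle).

    The combination rule is an assumption made for this sketch.
    """
    tf = Counter(tokens)
    weights = {}
    for sh in shingles(tokens, k):
        mean_tf = sum(tf[t] for t in sh) / k
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(sh, 0))) + 1.0
        weights[sh] = mean_tf * idf
    return weights


def weighted_jaccard(wa, wb):
    """Sum of per-shingle minimum weights divided by sum of maximum weights."""
    keys = set(wa) | set(wb)
    num = sum(min(wa.get(s, 0.0), wb.get(s, 0.0)) for s in keys)
    den = sum(max(wa.get(s, 0.0), wb.get(s, 0.0)) for s in keys)
    return num / den if den else 0.0


# Toy corpus (invented): two versions of one story and an unrelated story.
story_a = "the mayor opened the new bridge on monday morning".split()
story_b = "local news the mayor opened the new bridge on monday".split()
other = "stocks fell sharply in early trading on wall street".split()

corpus = [story_a, story_b, other]
df = document_frequencies(corpus)
w_a, w_b, w_o = (shingle_weights(d, df, len(corpus)) for d in corpus)

print("a vs b:", weighted_jaccard(w_a, w_b))
print("a vs other:", weighted_jaccard(w_a, w_o))
```

In this toy setup the two versions of the bridge story share most of their three-word windows and so score well above the unrelated pair, which shares none. A pair would be flagged as near-duplicate when the score exceeds a chosen threshold; the threshold itself is another assumption that would need tuning against labeled pairs.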

Publication Downloads

ClueWeb 09 Labeled Near-Duplicate News Articles

August 28, 2013

This data release is a companion to the paper Duplicate News Story Detection Revisited by Omar Alonso, Dennis Fetterly, and Mark Manasse, published at the Ninth Asia Information Retrieval Societies Conference (AIRS 2013) in December 2013. The package provides approximately 5.5 million document identifiers for a subset of the ClueWeb’09 “Category A English” documents that are likely to be from news sources. The package also contains two sets of human-generated labels. The first set is the authors' assessment of 456 pairs of documents as near-duplicate, non-duplicate, containment, near-duplicate irrelevant, or non-duplicate irrelevant. The second set consists of 710 labels obtained via a crowdsourcing system, in which pairs of articles are labeled as near-duplicate or non-duplicate. Finally, the data release contains the experimental design templates used for the crowdsourced assessments.