Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage

  • Zhe Wang
  • Jim Gemmell

MSR-TR-2006-30

As lifetime personal storage becomes a reality, we find that it is increasingly difficult to search and navigate the content one accumulates. One of the most striking problems is the duplicates and near-duplicates that clutter search and navigation. We investigated different techniques for eliminating duplicate and near-duplicate objects in the MyLifeBits personal storage system. Our results show that near-duplicate detection is effective on personal content such as email, documents, and visited web pages. In one experiment, duplicate and near-duplicate detection reduced the number of documents a user must consider by 21% and the number of web pages by 43%.