The aim of the PageTurner project is to measure how fast web pages turn over. We crawled the same set of 150 million URLs eleven times over eleven weeks, saving a feature vector for every downloaded page. We then condensed the data into a change vector for each URL. We are currently evaluating the condensed data.
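As a rough illustration of the condensing step described above, the sketch below assumes that each crawl stores a fixed-length feature vector per page (here a tuple of integers) and that one change-vector entry is the number of feature positions that differ between two consecutive crawls. The data layout, the name `change_vector`, and the use of `None` for a failed download are assumptions made for this example, not the project's actual format.

```python
from typing import List, Optional, Sequence

# Hypothetical representation: one fixed-length tuple of integer features per crawl.
FeatureVector = Sequence[int]


def change_vector(
    snapshots: Sequence[Optional[FeatureVector]],
) -> List[Optional[int]]:
    """Condense the successive feature vectors of one URL into a change vector.

    Entry i counts the feature positions at which crawl i and crawl i+1 differ.
    It is None when either download is missing (an assumed convention).
    """
    changes: List[Optional[int]] = []
    for prev, curr in zip(snapshots, snapshots[1:]):
        if prev is None or curr is None:
            changes.append(None)
            continue
        if len(prev) != len(curr):
            raise ValueError("feature vectors must have the same length")
        changes.append(sum(a != b for a, b in zip(prev, curr)))
    return changes


# Five crawls of one URL; the fourth download failed.
weekly = [(3, 7, 9, 1), (3, 7, 9, 1), (3, 8, 9, 2), None, (5, 8, 9, 2)]
print(change_vector(weekly))  # [0, 2, None, None]
```

Under these assumptions, n crawls of a URL yield a change vector with n - 1 entries, which is far smaller than the stored feature vectors themselves.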
Search engines have a vested interest in understanding how the web evolves over time. Knowing how often and how much pages change will inform their crawl scheduling policy. Moreover, if past changes are predictive of future changes, change history can be used to prioritize page downloads.
Previous studies have shown that about a third of all pages on the web are near-duplicates of other pages, and that a fair number of sites are mirrored in their entirety. Search engines therefore also have a vested interest in understanding whether pages that are duplicates of one another change in concert, or whether sites are copied at some point in time and then evolve independently. If the former is true, then search engines can invest fewer resources in monitoring replicated content.
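To make the notion of a near-duplicate concrete, here is a minimal sketch using word shingles and Jaccard similarity, a common way to compare documents. The text above does not say which similarity measure the cited studies used, so the shingle size `k`, the `threshold` value, and all function names are assumptions chosen for illustration.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (contiguous word sequences) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short documents yield a single shingle, or none if the text is empty.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(doc_a: str, doc_b: str, k: int = 3, threshold: float = 0.9) -> bool:
    """Treat two documents as near-duplicates if their similarity meets the threshold."""
    return jaccard(shingles(doc_a, k), shingles(doc_b, k)) >= threshold


page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox jumps over the lazy cat"
print(jaccard(shingles(page_a), shingles(page_b)))  # 0.75
```

With a measure of this kind, two pages would count as changing in concert if they remain above the threshold relative to each other from one crawl to the next.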
In order to shed light on these questions, we performed a series of large-scale web crawls that tracked the evolution of a set of 150 million web pages over the span of eleven weeks. We found that fewer web pages change significantly than was previously believed; that past changes to a page are highly predictive of future changes; and that clusters of replicated pages (as well as mirrored sites) are extremely stable.
- Dennis Fetterly, Mark Manasse, Marc Najork, and Janet Wiener. A Large-Scale Study of the Evolution of Web Pages. In Proceedings of the 12th International World Wide Web Conference, Budapest, Hungary, May 2003. [HTML, PS, PDF]
- Dennis Fetterly, Mark Manasse, and Marc Najork. On the Evolution of Clusters of Near-Duplicate Web Pages. In Proceedings of the 1st Latin American Web Congress, Santiago, Chile, November 2003. [PS, PDF] (Copyright IEEE 2003)
- Dennis Fetterly, Mark Manasse, Marc Najork, and Janet Wiener. A Large-Scale Study of the Evolution of Web Pages. In Software: Practice & Experience, 34(2):213-237, February 2004. [Abstract, draft] (Copyright John Wiley & Sons 2004)
- Dennis Fetterly, Mark Manasse, and Marc Najork. On the Evolution of Clusters of Near-Duplicate Web Pages. Journal of Web Engineering, 2(4):228-246, October 2004. [Abstract]
- A note on computing document similarity
Janet Wiener (HP Labs)