Studies of Web Evolution
Search engines have a vested interest in understanding how the web evolves over time. Knowing how often and how much pages change informs their crawl scheduling policy. Moreover, if past changes are predictive of future changes, change history can be used to prioritize page downloads. Previous studies have shown that about a third of all pages on the web are near-duplicates of other pages, and that a fair number of sites are mirrored in their entirety. Search engines therefore also have an interest in understanding whether pages that are duplicates of one another change in concert, or whether sites are copied at some point in time and then evolve independently. If the former is true, search engines can invest fewer resources in monitoring replicated content.
To shed light on these questions, we performed a series of large-scale web crawls that tracked the evolution of a set of 150 million web pages over the span of eleven weeks. We found that fewer web pages change significantly than was previously believed; that past changes to a page are highly predictive of future changes; and that clusters of replicated pages (as well as mirrored sites) are extremely stable.
Further Reading
Project Members
Dennis Fetterly
Collaborators
Janet Wiener (HP Labs)