Mira Dontcheva, Steven M. Drucker, David Salesin, and Michael F. Cohen
We present an analysis of the prevalence and nature of structural changes in websites. We study the evolution of some 12,000 webpages from 20 different websites over a period of five months. The websites cover a wide spectrum in both type of content and volume of traffic. We find that the structure of webpages from lower-volume sites changes very little, while webpages from high-volume sites change mostly in minor ways. Some of the high-volume sites go through drastic structural changes, but only on the order of once every couple of months. We discuss the implications of these observed changes for the design of structure-based extraction algorithms and for how such algorithms can evolve over time. Our analysis leads us to conclude that structural extraction algorithms can play an important role in future applications for aggregating and summarizing Web content.
Institution: University of Washington