Utility-Directed Policies for Continual Caching and Crawling

We consider an environment where distributed data sources are updated autonomously, with no notification of such updates. Our goal is to optimize the processes that keep local copies or other representations of the data fresh, maximizing expected value given the usage of the data and the costs of accessing stale data. For example, modern search engines rely on incremental web crawlers that maintain local copies of web pages. To keep the data fresh, the crawlers periodically poll previously downloaded pages. Freshness is trivially guaranteed by revisiting all pages often; in practice, however, resources constrain how many pages can be visited per time step, forcing the system to prioritize. We define freshness formally and seek revisitation policies that ensure good freshness for the pages visited by users over time, taking into account the different attributes of pages: some pages seldom change and thus should not be visited frequently, while others may change a lot but are not very popular among users. Our algorithm is optimal and can be computed efficiently. We experiment with real data and show improvements over existing policies. Beyond crawling and indexing, applications of the algorithm include maintaining a proxy cache and formulating maintenance schedules for machines that are subject to breakdowns or quality degradation over time.
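To make the prioritization idea concrete, the sketch below shows a greedy, utility-directed revisitation scheduler; it is an illustration of the general approach, not the report's algorithm. It assumes a Poisson change model per page, and the `Page` fields, the popularity-weighted scoring rule, and the example URLs are all illustrative assumptions.

```python
from dataclasses import dataclass
from math import exp

@dataclass
class Page:
    url: str
    change_rate: float   # estimated updates per day (Poisson rate); assumed known
    popularity: float    # share of user requests hitting this page
    age_days: float      # time since the crawler last fetched the page

def staleness_probability(page: Page) -> float:
    # Under a Poisson change model, the local copy is stale with
    # probability 1 - exp(-lambda * t), where t is the time since last visit.
    return 1.0 - exp(-page.change_rate * page.age_days)

def pick_revisits(pages: list[Page], budget: int) -> list[Page]:
    # Greedy utility-directed choice: revisit the pages whose refresh
    # recovers the most popularity-weighted expected staleness.
    scored = sorted(pages,
                    key=lambda p: p.popularity * staleness_probability(p),
                    reverse=True)
    return scored[:budget]

# Example: a rarely changing page and an unpopular fast-changing page
# both lose to a popular, frequently changing one.
pages = [
    Page("http://example.com/static", change_rate=0.01, popularity=0.50, age_days=3),
    Page("http://example.com/news",   change_rate=2.00, popularity=0.40, age_days=1),
    Page("http://example.com/niche",  change_rate=5.00, popularity=0.01, age_days=1),
]
print([p.url for p in pick_revisits(pages, budget=1)])  # -> the /news page
```

Note that this greedy score captures the trade-off described above (change rate versus popularity) but ignores the long-run scheduling dynamics; the report's optimal policy accounts for freshness over time under the crawl budget.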

Details

Type: TechReport
Number: MSR-TR-2009-42