Jiuming Huang, Haixun Wang, Ariel Fuxman, and Yan Jia
Crawling and information extraction are perhaps among the most extensively studied topics of the web age. Although much progress has been made in this area, the process remains page-centric and computation-intensive, and more often than not it relies on manually crafted templates. However, the way web sites organize their data has changed significantly since the early days of the web, and many web sites, especially e-commerce sites, now support exploratory search, which presents data to the user through an implicit query interface. In this paper, we conduct a comprehensive survey of exploratory web sites, and we show that traditional page-centric crawling is extremely wasteful and that the data it collects is seriously incomplete. We propose a query-centric view of web data and an automatic crawling framework that crawls web sites by queries instead of by pages. In essence, we make queries first-class citizens in modeling web data. This allows us to target the right content in crawling, which not only makes crawling more efficient but also enables us to collect data that is hidden from page views and hence unreachable by traditional crawling methods. We conduct extensive experiments to demonstrate the advantages of the new model and the performance of our new crawling method.
Publisher: Microsoft Technical Report