Crawling Deep-web Entity Pages

  • ,
  • Dong Xin ,
  • Venkatesh Ganti ,
  • Sriram Rajaraman

Proceedings of International Conference on Web Search and Data Mining (WSDM)

Deep-web crawling is concerned with the problem of surfacing content
hidden behind search interfaces on the Web. While many
deep-web sites maintain document-oriented textual content (e.g.,
Wikipedia, PubMed, and Twitter), which has traditionally been the
focus of the deep-web literature, we observe that a significant portion
of deep-web sites, including almost all online shopping sites,
curate structured entities as opposed to text documents. Although
crawling such entity-oriented content is clearly useful for a variety
of purposes, existing crawling techniques optimized for document-oriented
content are not well suited for entity-oriented sites. In this
work, we describe a prototype system we have built that specializes
in crawling entity-oriented deep-web sites. We propose techniques
tailored to tackle important subproblems including query generation,
empty-page filtering, and URL deduplication in the specific
context of entity-oriented deep-web sites. We evaluate these techniques
experimentally and show that they are effective.