Automatic Extraction of Top-k Lists from the Web

Zhixian Zhang, Kenny Zhu, Haixun Wang, and Hongsong Li

Abstract

This paper is concerned with information extraction from top-k web pages, which are web pages that describe the top k instances of a topic of general interest. Examples include “the 10 tallest buildings in the world” and “the 50 hits of 2010 you don’t want to miss”. Compared to other structured information on the web (including web tables), information in top-k lists is larger and richer, of higher quality, and generally more interesting. Top-k lists are therefore highly valuable. For example, they can help enrich open-domain knowledge bases (to support applications such as search or fact answering). In this paper, we present an efficient method that extracts top-k lists from web pages with high performance. Specifically, we extract more than 1.7 million top-k lists from a web corpus of 1.6 billion pages with 92.0% precision and 72.3% recall.

Details

Publication type: Inproceedings
Published in: ICDE
Publisher: International Conference on Data Engineering