A System for Extracting Top-K Lists from the Web (demo)

Zhixian Zhang, Kenny Zhu, and Haixun Wang


List data is an important source of structured data on the web. This paper is concerned with “top-k” pages, which are web pages that describe a list of k instances of a particular topic or concept. Examples include “the 10 tallest persons in the world” and “the 50 hits of 2010 you don’t want to miss”. Compared with ordinary web lists, “top-k” lists contain richer information and are easier to understand. Therefore, extracting such lists can help enrich existing knowledge bases about general concepts, or act as a preprocessing step that produces facts for a fact-answering engine. We present an efficient system that extracts the target lists from web pages with high accuracy. We have used the system to process up to 160 million web pages, or 1/10 of a high-frequency web snapshot from Bing, and obtained over 140,000 lists with 90.4% precision.
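To make the notion of a “top-k” page concrete, the following is a minimal sketch of how a page title could be screened for a candidate k. It is not the system described in the paper: the regular expression, the function name `guess_k`, and the rule that k must be at least 2 are all illustrative assumptions.

```python
import re
from typing import Optional

# Hypothetical heuristic (an assumption, not the authors' method):
# "top" or "the", then an integer of up to four digits, then at least one more word.
_TOP_K_TITLE = re.compile(r"\b(?:top|the)\s+(\d{1,4})\b\s+\w+", re.IGNORECASE)


def guess_k(title: str) -> Optional[int]:
    """Return a candidate k if the title looks like a top-k title, else None."""
    match = _TOP_K_TITLE.search(title)
    if match is None:
        return None
    k = int(match.group(1))
    # A list of fewer than two items is not treated as a top-k list here.
    return k if k >= 2 else None


if __name__ == "__main__":
    for title in (
        "the 10 tallest persons in the world",
        "The 50 hits of 2010 you don't want to miss",
        "Contact us",
    ):
        print(title, "->", guess_k(title))
```

A title match alone is only a weak signal, since numbers such as years can also follow “the”. Confirming that a page really is a top-k page would still require checking that its body contains a list with about k items.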


Publication type: Inproceedings
Published in: SIGKDD
