Understanding Tables on the Web
The World Wide Web contains a wealth of information. Unfortunately, most of this information can be understood only by humans, not by machines. A key challenge is thus to make such information accessible to and processable by machines.
In this paper, we focus on mining information from structured data that reside within Web documents. The particular structured data we are concerned with are tables. We choose tables for two reasons. First, there are billions of tables on the Web, and many of them contain valuable information. Second, tables are already well structured and relatively easy to understand, whereas converting free text into structured data using natural language processing techniques is costly and time-consuming for a large Web corpus. Our work focuses on detecting these tables, understanding their content, and using the information and knowledge obtained to support important applications such as search.
Our starting point is Probase, a rich, general-purpose taxonomy whose content is harvested automatically from the Web and from search log data. We use the taxonomy to help us interpret and understand tables. We then use the content we have understood to enrich the taxonomy, which, in turn, enables us to understand more tables. In addition, we build a semantic search engine over tables to demonstrate how much valuable information can be found in Web tables, and how structured data can empower information retrieval on the Web.
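The mutual reinforcement between taxonomy and tables described above can be illustrated with a minimal sketch. It is not the method used in this work: the taxonomy is reduced to a hypothetical dictionary from concepts to known instances, each table is reduced to a single column of cell strings, and the names `infer_concept`, `bootstrap`, and `min_hits` are assumptions introduced only for illustration.

```python
from typing import Dict, List, Optional, Set

# Toy stand-in for a taxonomy: concept -> known instances (an assumption,
# not the actual Probase data model).
Taxonomy = Dict[str, Set[str]]


def infer_concept(column: List[str], taxonomy: Taxonomy, min_hits: int = 2) -> Optional[str]:
    """Return the concept covering the most cells, or None if support is too weak."""
    best, best_hits = None, 0
    for concept, instances in taxonomy.items():
        hits = sum(1 for cell in column if cell in instances)
        if hits > best_hits:
            best, best_hits = concept, hits
    return best if best_hits >= min_hits else None


def bootstrap(columns: List[List[str]], taxonomy: Taxonomy, min_hits: int = 2) -> Dict[int, str]:
    """Alternate between labeling columns and enriching the taxonomy until nothing changes."""
    labels: Dict[int, str] = {}
    changed = True
    while changed:
        changed = False
        for i, column in enumerate(columns):
            if i in labels:
                continue  # already understood in an earlier pass
            concept = infer_concept(column, taxonomy, min_hits)
            if concept is None:
                continue  # not enough evidence yet; retry after enrichment
            labels[i] = concept
            taxonomy[concept].update(column)  # enrichment: add the column's cells
            changed = True
    return labels


taxonomy = {"country": {"China", "India", "Brazil"}, "company": {"Microsoft", "IBM"}}
columns = [
    ["Peru", "Chile", "Kenya"],           # no known instances at first
    ["China", "India", "Peru", "Chile"],  # two known countries
]
labels = bootstrap(columns, taxonomy)
```

In this toy run the second column is understood first; its new instances then supply enough evidence to understand the first column on the next pass. A realistic system would need to score evidence probabilistically and guard against enriching the taxonomy with wrong instances, which this sketch does not attempt.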