The goal of WebQ is to develop systems and tools that associate semantics with data on the web, so that the data can be used by machines. To do this, we either use external knowledge bases to facilitate understanding of web pages (e.g., tables on the web), or we derive the intrinsic semantics from how data is organized and presented on the web.
- The Web contains a wealth of information, and a key challenge is to make this information machine-processable. Because natural language understanding at web scale remains difficult and costly at present, we focus our attention on understanding well-structured HTML tables on the Web. Our work focuses on detecting these tables, understanding their content, and using the obtained information and knowledge to support important applications such as search. Our starting point is a rich, general-purpose taxonomy called Probase, whose content is harvested automatically from the Web and search log data. We use the taxonomy to help us interpret and understand tables. We then use the content we understand to enrich the taxonomy, which, in turn, enables us to understand more tables. We report large-scale experimental results that demonstrate the feasibility of this approach, and we build a semantic search engine over tables to demonstrate how structured data can empower information retrieval on the Web.
- Information extraction from the Web is of growing importance. Objects on the Web are often associated with many attributes that describe them. It is essential to extract these attributes and map them to their corresponding objects. However, much attribute information about an object is hidden in dynamic user interaction and does not appear on the Web page that describes the object. Existing information extraction approaches focus on getting information from the object Web page only, which means a lot of attribute information is lost. In this work, we study dynamic user interaction on exploratory search Web sites and propose a novel link-based approach to discover attributes and map them to objects. We build an exploratory search model for exploratory Web sites, and, based on the model, we propose algorithms for identifying and clustering related Web pages and for mining the relationships among them. Using the unsupervised method in our approach, we are able to discover hidden attributes not explicitly shown on object Web pages.
- Jingjing Wang, Bin Shao, Haixun Wang, and Kenny Zhu, Understanding Tables on the Web, no. MSR-TR-2011-29, March 2011.
- Jiuming Huang, Haixun Wang, Ariel Fuxman, and Yan Jia, Toward Query-centric Web Modeling and Crawling, no. MSR-TR-2011-27, March 2011.
- Jiuming Huang, Haixun Wang, Yan Jia, and Ariel Fuxman, Link-based Hidden Attribute Discovery for Objects on Web, in International Conference on Extending Database Technology (EDBT), March 2011.
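
As a rough illustration of the table-understanding loop described above (interpret a table with the taxonomy, then feed what was understood back into it), the sketch below infers the concept of a table column from a toy isA taxonomy and adds unseen cells to that concept. Everything here is an assumption made for illustration: the `taxonomy` dictionary, the functions `infer_column_concept` and `enrich`, and the `min_coverage` threshold are hypothetical and do not reflect Probase's actual data, interface, or the algorithm used in the reports listed above.

```python
from collections import Counter

# Toy isA taxonomy: concept -> known instances. Hypothetical stand-in data,
# not Probase's actual content or interface.
taxonomy = {
    "country": {"china", "france", "brazil"},
    "city": {"paris", "beijing", "seattle"},
}


def infer_column_concept(cells, taxonomy):
    """Return (concept, coverage) for the concept that matches the most cells."""
    normalized = [c.strip().lower() for c in cells]
    votes = Counter()
    for concept, instances in taxonomy.items():
        votes[concept] = sum(1 for c in normalized if c in instances)
    if not votes:
        return None, 0.0  # empty taxonomy: nothing to match against
    concept, hits = votes.most_common(1)[0]
    if hits == 0:
        return None, 0.0  # no cell is a known instance of any concept
    return concept, hits / len(normalized)


def enrich(cells, taxonomy, min_coverage=0.6):
    """Add unseen cells to the inferred concept when coverage is high enough.

    min_coverage is an assumed threshold, chosen only for this example.
    """
    concept, coverage = infer_column_concept(cells, taxonomy)
    if concept is None or coverage < min_coverage:
        return set()
    new_instances = {c.strip().lower() for c in cells} - taxonomy[concept]
    taxonomy[concept] |= new_instances
    return new_instances


column = ["China", "France", "Brazil", "Kenya"]
concept, coverage = infer_column_concept(column, taxonomy)
added = enrich(column, taxonomy)
print(concept, coverage, added)
```

Here three of the four cells are already known instances of `country`, so the column is labeled with that concept and the remaining cell is added to it. A real system would need weighted evidence and safeguards against propagating errors; this sketch uses a simple majority count only to keep the idea visible.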