Web Data Extraction and Search

Established: February 9, 2013

The goal of this project is to extract structured data from the web (such as HTML tables, lists, and spreadsheets) and make it accessible and searchable on Bing and Office 365.

Some of the technical challenges:

  • Table classification and understanding: The vast majority of HTML tables are used for formatting/layout purposes; they do not contain any useful content. How do we automatically filter out such tables? Furthermore, there are various types of tables, such as relational tables (each row corresponds to a different entity and each column corresponds to a different attribute) and attribute-value tables (each row corresponds to a different attribute, e.g., tables on dpreview.com). How do we automatically distinguish these tables from each other? For relational tables, there is typically a column (or a set of columns) that contains the subject entities. How do we identify this column or set of columns? For attribute-value tables, how do we identify the subject entity?
  • Query classification: For the Bing table answer feature, we want to show a table only if the intent of the query is a table (or part of a table), not simply because a table with a great match is available. How do we identify such queries?
  • Table matching and ranking: For the Bing table answer, how do we identify the best table or part of a table (if one exists) for a query with table intent? In Excel table search, how do we rank the tables in response to a keyword search query?
  • New modes of search: Keyword search may not be the only way to search for structured information. In a spreadsheet setting, other modes of search are possible, such as the entity augmentation and attribute discovery operations proposed in the InfoGather/InfoGather+ papers.
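
To make the table classification questions above more concrete, here is a minimal Python sketch of the kind of surface heuristics one might start from. It is not the method used in this project: the function names (`classify_table`, `subject_column`, `is_numeric`), the thresholds, and the assumption that a table arrives as a list of rows of cell strings are all hypothetical and chosen only for illustration.

```python
from typing import List, Optional

Table = List[List[str]]  # assumed input: rows of cell strings


def is_numeric(value: str) -> bool:
    """True if the cell parses as a number (thousands separators allowed)."""
    try:
        float(value.replace(",", ""))
        return True
    except ValueError:
        return False


def classify_table(rows: Table) -> str:
    """Crude guess: 'layout', 'attribute-value', or 'relational'."""
    rows = [r for r in rows if any(c.strip() for c in r)]  # drop empty rows
    if len(rows) < 2:
        return "layout"
    widths = {len(r) for r in rows}
    if len(widths) != 1 or widths == {1}:
        return "layout"  # ragged or single-column tables rarely hold data
    n_cols = next(iter(widths))
    cells = [c for r in rows for c in r]
    if sum(len(c) for c in cells) / len(cells) > 80:
        return "layout"  # long prose cells suggest page layout (assumed cutoff)
    if n_cols == 2:
        labels = [r[0].strip() for r in rows]
        # Two columns with distinct, non-numeric labels look like attribute-value.
        if len(set(labels)) == len(labels) and not any(map(is_numeric, labels)):
            return "attribute-value"
    return "relational"


def subject_column(rows: Table, has_header: bool = True) -> Optional[int]:
    """Leftmost column whose values are mostly non-numeric and mostly distinct."""
    body = rows[1:] if has_header else rows
    if not body:
        return None
    for j in range(len(body[0])):
        col = [r[j].strip() for r in body]
        non_numeric = sum(not is_numeric(v) for v in col) / len(col)
        distinct = len(set(col)) / len(col)
        if non_numeric >= 0.8 and distinct >= 0.9:  # assumed thresholds
            return j
    return None
```

Heuristics like these fail quickly in practice. For example, a two-column relational table with a header row would be mislabeled as attribute-value by this sketch, which is part of why the questions in the list above are posed as research challenges.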

Impact

Our web data research has had tremendous impact on several Microsoft products and services over the years.

Past interns: Mohamed Yakout, Chi Wang, Meihui Zhang, Mohan Yang

People

Chi Wang

Principal Researcher

Kaushik Chakrabarti

Senior Researcher

Surajit Chaudhuri

Technical Fellow, Data Platforms and Analytics