Both enterprise information workers (IWs) and consumers rely on structured data to make business or personal decisions. Often, they do not have knowledge of the sources of the relevant data. They need to search for relevant structured data; this is quite difficult today. The goal is to make it easy to search, consume and combine datasets, both within the enterprise and on the web.
Web Table Search
There is a lot of structured data on the surface web (as HTML tables, HTML lists, spreadsheets and CSV files). This data can be of immense benefit to both IWs and consumers. However, these datasets are hard to search, hard to consume in Excel (which is the playground of data for IWs) and hard to combine with other datasets. One can use Bing or Google to search for them, but those engines are not built for data search, and they do not help with consuming the results in Excel or combining them with other datasets.
In this project, we extract more than 200 million tables from the web, index them and allow users to search them and consume them from Excel. Some of the technical challenges we addressed are:
- Table classification: A lot of HTML tables are used for navigational or layout purposes; they do not contain any useful content for IWs. How do we automatically filter out such tables? Furthermore, there are various types of tables, such as relational tables (where each row corresponds to a different entity and each column corresponds to a different attribute) and attribute-value tables (which describe a single entity, with each row corresponding to a different attribute, e.g., Infobox tables). How do we automatically distinguish these types from each other?
- Table understanding: For relational tables, there is typically a column (or a set of columns) that contains the subject entities. How do we identify this column or set of columns? For attribute-value tables, how do we identify the single subject entity? How do we identify the names of the attributes and the corresponding values for both types of tables?
- Table ranking: How do we rank the tables in response to a keyword search query?
- Scalability: How do we scalably extract hundreds of millions of tables from billions of web pages? How do we serve these hundreds of millions of tables?
We worked closely with the SQL Server and Excel groups and integrated this technology into Excel (as the "Online Search" feature in Power Query). It has been available as an Excel Add-In since February 2013 (you can easily install it and play with it). In July 2013, Microsoft announced that it would be part of Office 365. We continue to work to improve the quality of table search, expand the content sources and invent new ways to search and explore structured data.
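As a rough illustration of the table classification and table understanding challenges listed above, here is a minimal Python sketch. It is not the method used in this project: the function names (`classify_table`, `find_subject_column`, `is_number`), the heuristics and the sample tables are all assumptions made up for this example, and a real system would need far richer features.

```python
from typing import List

# A table is a list of rows; each row is a list of cell strings.
# Assumption for this sketch: the first row of a relational table is its header.
Table = List[List[str]]


def is_number(text: str) -> bool:
    """Return True if the cell parses as a number (commas are ignored)."""
    try:
        float(text.replace(",", ""))
        return True
    except ValueError:
        return False


def classify_table(table: Table) -> str:
    """Toy heuristic: label a table 'layout', 'attribute-value' or 'relational'."""
    rows = [row for row in table if any(cell.strip() for cell in row)]
    if len(rows) < 2:
        return "layout"
    widths = {len(row) for row in rows}
    # Ragged rows or a single column suggest the table only arranges the page.
    if len(widths) != 1 or widths == {1}:
        return "layout"
    # Exactly two columns is treated as (attribute, value) pairs about one entity.
    # This is a crude assumption: a two-column relational table is mislabeled.
    if widths == {2}:
        return "attribute-value"
    return "relational"


def find_subject_column(table: Table) -> int:
    """Toy heuristic: the leftmost non-numeric column with the most distinct values."""
    body = table[1:]  # skip the assumed header row
    best_index, best_distinct = 0, -1
    for col in range(len(table[0])):
        values = [row[col].strip() for row in body]
        if all(is_number(value) for value in values):
            continue  # purely numeric columns are unlikely to name entities
        distinct = len(set(values))
        if distinct > best_distinct:  # strict '>' keeps the leftmost on ties
            best_index, best_distinct = col, distinct
    return best_index


# Hypothetical sample tables with illustrative values only.
countries = [
    ["Country", "Capital", "Population (millions)"],
    ["France", "Paris", "68"],
    ["Japan", "Tokyo", "125"],
    ["Kenya", "Nairobi", "54"],
]
infobox = [["Capital", "Paris"], ["Currency", "Euro"], ["Language", "French"]]
navigation = [["Home | About | Contact"]]

if __name__ == "__main__":
    for name, table in [("countries", countries), ("infobox", infobox), ("navigation", navigation)]:
        print(name, "->", classify_table(table))
    print("subject column of countries:", find_subject_column(countries))
```

On these samples the sketch labels `countries` as relational, `infobox` as attribute-value and `navigation` as layout, and picks column 0 ("Country") as the subject column. Hand-written rules like these break quickly on real web pages, which is part of why classification and understanding at web scale are listed as challenges.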
Keyword search on databases
Keyword search is an easy way for business users to search enterprise databases without any knowledge of the schema. Our group built the DBXplorer system in 2002, one of the first systems of its kind. This work was awarded the ICDE 2012 Influential Paper Award. We continue to investigate new ways to search structured data within the enterprise.
Publications
- Mohan Yang, Bolin Ding, Surajit Chaudhuri, and Kaushik Chakrabarti, Finding Patterns in a Knowledge Base using Keywords to Compose Table Answers, VLDB – Very Large Data Bases, August 2015.
- Yanyan Shen, Kaushik Chakrabarti, Surajit Chaudhuri, Bolin Ding, and Lev Novik, Discovering Queries based on Example Tuples, in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2014), ACM – Association for Computing Machinery, June 2014.
- Meihui Zhang and Kaushik Chakrabarti, InfoGather+: Semantic Matching and Annotation of Numeric and Time-Varying Attributes in Web Tables, ACM SIGMOD, June 2013.
- Mohamed Yakout, Kris Ganjam, Kaushik Chakrabarti, and Surajit Chaudhuri, InfoGather: Entity Augmentation and Attribute Discovery By Holistic Matching with Web Tables, in ACM SIGMOD Conference, 2012.
- Sanjay Agrawal, Surajit Chaudhuri, and Gautam Das, DBXplorer: Enabling Keyword Search over Relational Databases, in ACM SIGMOD 2002, Association for Computing Machinery, Inc., 2002.