This project pursues research on data exploration, identifying techniques that offer flexible ways to query, browse, and aggregate data. One of our goals is to support approximate matches and ranked search in the database context. We would also like to enable data browsing and querying services for XML that can interoperate across text, structured, and semi-structured (e.g., mail messages) data. In addition, we investigate efficient approximate query processing techniques for answering ad-hoc aggregate queries (e.g., decision support or OLAP queries).
Goal
Keyword search over web and enterprise documents is a very popular mechanism for finding relevant information. In both enterprise and web scenarios, document collections coexist with large structured databases. Therefore, keyword search over structured databases, particularly in collections involving both structured and unstructured documents, is an important problem. In the data exploration project, we explore the algorithmic and systems issues arising out of the goal of searching and analyzing document collections and structured databases together. We want to enable two broad keyword search scenarios.
First, we want to identify structured database objects or entities relevant to a query, even if the query keywords are not present in the entity name or description columns. Identifying entities in a database (e.g., products) for queries in which not all query keywords match those in an entity name or description is an important and challenging problem. For example, we may want to return relevant digital cameras for a user query such as [fast action digital camera], which seeks digital cameras suitable for taking good pictures of fast-moving objects. This functionality is very useful for improving vertical search engines as well as for enhancing web search (or, more generally, document search). We are studying the algorithmic and systems issues arising out of this goal. Many of the techniques we develop for achieving it are also applicable to improving individual components of a web search engine, such as query classification.
Second, we want to enable efficient ranked keyword search on logical entities (obtained by joining multiple relations) in databases without materializing them. We are studying the algorithmic and systems issues arising out of this goal in the context of full-text search in database systems and in the context of enterprise search engines. A related systems problem that we study is the efficient processing of keyword queries in IR engines.
Once we have determined a relevant set of structured entities from one or multiple 'vertical' databases for a given search query, we then need to integrate these with 'regular' web search results. State-of-the-art web search engines typically show content from a variety of sources for many queries; because the space available on the result page is limited, the engine must select which content types to display.
In this context, we have studied the problems of (a) selecting an appropriate vertical (database) from which to display content, (b) predicting the click-through rates for such content, and (c) designing specialized index structures for matching advertisements to search queries.
If you have questions about this project, please contact the Data Exploration research team (dmx@microsoft.com).
Publications
- Ping Li, Anshumali Shrivastava, and Arnd Christian König, GPU-Based Minwise Hashing, in 21st International World Wide Web Conference, Association for Computing Machinery, Inc., 16 April 2012
- Mohamed Yakout, Kris Ganjam, Kaushik Chakrabarti, and Surajit Chaudhuri, InfoGather: Entity Augmentation and Attribute Discovery By Holistic Matching with Web Tables, in ACM SIGMOD Conference, 2012
- Chi Wang, Kaushik Chakrabarti, Tao Cheng, and Surajit Chaudhuri, Targeted Disambiguation of Ad-hoc, Homogeneous Sets of Named Entities, in World Wide Web Conference, 2012
- Kaushik Chakrabarti, Surajit Chaudhuri, Tao Cheng, and Dong Xin, A Framework for Robust Discovery of Entity Synonyms, in SIGKDD, 2012
- Ping Li, Anshumali Shrivastava, Joshua Moore, and Arnd Christian König, Hashing Algorithms for Large-Scale Learning, in Twenty-Fifth Annual Conference on Neural Information Processing Systems (NIPS), Neural Information Processing Foundation, 12 December 2011
- Ping Li, Anshumali Shrivastava, Joshua Moore, and Arnd Christian König, b-Bit Minwise Hashing for Large-Scale Learning, in Big Learning 2011: NIPS 2011 Workshop on Algorithms, Systems, and Tools for Learning at Scale, Neural Information Processing Foundation, December 2011
- Ping Li and Arnd Christian König, Theory and Applications of b-Bit Minwise Hashing, in Communications of the ACM, ACM, August 2011
- Klaus Berberich, Arnd Christian König, Dimitrios Lymberopoulos, and Peixiang Zhao, Improving Local Search Ranking through External Logs, in 34th Annual ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2011), ACM, July 2011
- Senjuti Basu Roy and Kaushik Chakrabarti, Location-Aware Type Ahead Search on Spatial Databases: Semantics and Efficiency, in ACM SIGMOD Conference, June 2011
- Bahman Bahmani, Kaushik Chakrabarti, and Dong Xin, Fast Personalized PageRank on MapReduce, in ACM SIGMOD Conference, June 2011
- Fei Wang, Chenhao Tan, Ping Li, and Arnd Christian König, Efficient Document Clustering via Online Nonnegative Matrix Factorizations, in Eleventh SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, 28 April 2011
- Kaushik Chakrabarti, Surajit Chaudhuri, and Venkatesh Ganti, Interval-Based Pruning for Top-k Processing over Compressed Lists, in ICDE Conference, IEEE, April 2011
- Bolin Ding and Arnd Christian König, Fast Set Intersection in Memory, in 37th International Conference on Very Large Databases (VLDB), Very Large Data Bases Endowment Inc., 26 January 2011
- Kaushik Chakrabarti, Surajit Chaudhuri, Tao Cheng, and Dong Xin, Automatically Tagging Entities with Descriptive Phrases, in WWW (Poster paper), 2011
- Fei Wang, Ping Li, and Arnd Christian König, Learning a Bi-Stochastic Data Similarity Matrix, in The 10th International Conference on Data Mining (ICDM), IEEE, 14 December 2010
- Ping Li, Arnd Christian König, and Wenhao Gui, b-Bit Minwise Hashing for Estimating Three-Way Similarities, in Twenty-Fourth Annual Conference on Neural Information Processing Systems (NIPS), 6 December 2010
- Sanjay Agrawal, Kaushik Chakrabarti, Surajit Chaudhuri, Venkatesh Ganti, Arnd Christian König, and Dong Xin, Query Portals: Dynamically Generating Portals for Entity-Oriented Web Queries, in International Conference on Management of Data (SIGMOD 2010), Association for Computing Machinery, Inc., 6 June 2010
- Ping Li and Arnd Christian König, b-Bit Minwise Hashing, in Nineteenth International World Wide Web Conference (WWW 2010), Association for Computing Machinery, Inc., 26 April 2010
- Venkatesh Ganti, Arnd Christian König, and Xiao Li, Precomputing Search Features for Fast and Accurate Query Classification, in Third ACM International Conference on Web Search and Data Mining (WSDM 2010), Association for Computing Machinery, Inc., 4 February 2010
- Arnd Christian König, Michael Gamon, and Qiang Wu, Click-Through Prediction for News Queries, in 32nd Annual ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009), Association for Computing Machinery, Inc., July 2009
- Sanjay Agrawal, Kaushik Chakrabarti, Surajit Chaudhuri, Venkatesh Ganti, Arnd Christian König, and Dong Xin, Exploiting Web Search Engines to Search Structured Information, in 18th International World Wide Web Conference (WWW 2009), Association for Computing Machinery, Inc., April 2009
- Surajit Chaudhuri, Venkatesh Ganti, and Dong Xin, Exploiting Web Search To Generate Synonyms For Entities, in 18th International World Wide Web Conference, Association for Computing Machinery, Inc., April 2009
- Sanjay Agrawal, Kaushik Chakrabarti, Surajit Chaudhuri, Venkatesh Ganti, Arnd Christian König, and Dong Xin, Query Portals: Dynamically Generating Portals for Web, in 18th International World Wide Web Conference (WWW 2009), Association for Computing Machinery, Inc., April 2009
- Arnd Christian König, Kenneth Church, and Martin Markov, A Data Structure for Sponsored Search, in 24th International Conference on Data Engineering (ICDE), IEEE Computer Society, 29 March 2009
- Venkatesh Ganti, Arnd Christian König, and Rares Vernica, Entity Categorization over Large Document Collections, in 14th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD 2008), Association for Computing Machinery, Inc., 24 August 2008
