Information Retrieval and Mining

The mission of the Information Retrieval and Mining group is to develop advanced technologies to help users quickly, easily, and accurately access information.

Overview

We aim to develop fundamental technologies for general web search and enterprise search. Our main technology areas include machine learning, information retrieval, data mining, and natural language processing. We partner with Microsoft Live Search and SharePoint Search. Currently, we are working on five projects: Learning to Rank, Search Result Ranking, Data Selection in Search, Search Log Mining, and Next Generation Enterprise Search.

Learning to Rank

Learning to rank is the task of automatically constructing a ranking model (function) from training data, such that the model can sort objects (e.g., documents) according to their degree of relevance, preference, or importance as defined in a specific application. We have developed a number of new algorithms for learning to rank, including the listwise algorithms ListNet and ListMLE, the direct optimization method AdaRank, and a global ranking method based on continuous conditional random fields (Continuous CRF). We have also conducted theoretical analysis of listwise learning algorithms. Our group built the LETOR benchmark dataset from TREC data and released it to the research community. We are also actively involved in learning-to-rank activities in the research community.
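As a rough illustration of the listwise idea, the sketch below computes a top-one cross-entropy loss in the spirit of ListNet for a single query and then sorts documents by model score. It is a minimal example and not the group's implementation: the function names, the use of a softmax over raw relevance labels, and the toy scores are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into a top-one probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_top_one_loss(model_scores, relevance_labels):
    """Cross entropy between the label-induced and model-induced
    top-one distributions for the documents of one query."""
    p_true = softmax(relevance_labels)
    p_model = softmax(model_scores)
    return -sum(pt * math.log(pm) for pt, pm in zip(p_true, p_model))


def rank(doc_ids, model_scores):
    """Sort documents by descending model score."""
    pairs = sorted(zip(doc_ids, model_scores), key=lambda pair: -pair[1])
    return [doc_id for doc_id, _ in pairs]


# Hypothetical toy data: three documents with graded relevance labels.
labels = [2.0, 1.0, 0.0]
good_scores = [3.0, 2.0, 1.0]  # agrees with the label order
bad_scores = [1.0, 2.0, 3.0]   # reverses the label order
loss_good = listnet_top_one_loss(good_scores, labels)
loss_bad = listnet_top_one_loss(bad_scores, labels)
```

A training procedure would adjust the parameters of the scoring function to reduce this loss summed over all training queries; here the scores are fixed, so the example only shows that a score list agreeing with the labels incurs a lower loss than one that reverses them.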

Search Result Ranking

Ranking is the central issue in search: given a query, we aim to rank the retrieved documents by their relevance and importance. We are working on search result ranking from several aspects: query understanding, document understanding, and query-document matching. For example, we have developed a method for query refinement using a conditional random field model, a method for discovering the topics of hypertext documents, and a model for calculating the relevance between a query and a document based on metric distance learning. This matching model has been transferred to Microsoft SharePoint and is going to be used in its next release.

Data Selection in Search

Billions of web pages are now available on the web. Calculating the importance of web pages and selecting the most important (high-quality) ones when crawling, indexing, and ranking is a critical issue in web search. The importance of web pages is usually calculated using web graph data. We have developed an experimental platform for conducting mining and learning experiments on large-scale graph data. The platform is built on Microsoft’s distributed computing infrastructure and is efficient, flexible, and easy to use. We are also developing new algorithms for page importance calculation on top of the platform. One of these algorithms is BrowseRank, which creates user browsing graph data from users’ behavioral data, defines a continuous-time Markov model, and calculates page importance using the model. The related paper received the Best Student Paper Award at SIGIR 2008.
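The following sketch shows one way the continuous-time Markov idea can be turned into page scores: estimate transition probabilities and mean staying times from browsing sessions, compute the stationary distribution of the embedded discrete chain by power iteration, and weight it by the mean staying time. It is a simplified illustration, not the published BrowseRank algorithm; the session format, the `damping` reset parameter, and the uniform handling of session-ending pages are assumptions.

```python
from collections import defaultdict


def page_importance(sessions, damping=0.85, iterations=100):
    """sessions: list of sessions, each a list of (page, dwell_seconds)
    pairs in visit order. Returns a dict of page -> importance score."""
    trans = defaultdict(lambda: defaultdict(float))  # transition counts
    dwell_sum = defaultdict(float)
    dwell_cnt = defaultdict(int)
    pages = set()
    for session in sessions:
        for i, (page, dwell) in enumerate(session):
            pages.add(page)
            dwell_sum[page] += dwell
            dwell_cnt[page] += 1
            if i + 1 < len(session):
                trans[page][session[i + 1][0]] += 1.0

    pages = sorted(pages)
    n = len(pages)
    pi = {p: 1.0 / n for p in pages}
    # Power iteration on the embedded (discrete-time) chain.
    for _ in range(iterations):
        nxt = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            out = trans.get(p)
            total = sum(out.values()) if out else 0.0
            if total == 0.0:
                # Page with no observed outgoing transition:
                # spread its mass uniformly (an assumption).
                for q in pages:
                    nxt[q] += damping * pi[p] / n
            else:
                for q, count in out.items():
                    nxt[q] += damping * pi[p] * count / total
        pi = nxt

    # Weight by mean staying time to approximate the stationary
    # distribution of the continuous-time process, then normalize.
    weighted = {p: pi[p] * dwell_sum[p] / dwell_cnt[p] for p in pages}
    z = sum(weighted.values())
    return {p: w / z for p, w in weighted.items()}


# Hypothetical browsing sessions: (page, seconds spent on the page).
sessions = [
    [("a", 10), ("b", 60)],
    [("c", 5), ("b", 30)],
    [("a", 5), ("c", 5)],
]
scores = page_importance(sessions)
```

In this toy data, page `b` is both visited most often as a destination and held longest, so it receives the highest score; a page reached often but abandoned quickly would be penalized by its short staying time.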

Search Log Mining

The log data of a search engine can be used to analyze users’ search behavior and to develop search technologies that improve users’ search experience. We are developing a search log mining platform to enable researchers and engineers to conduct data mining on user search behavior data, including search session data and click-through data. The platform is based on Microsoft’s distributed computing infrastructure and is efficient and easy to use. We are also conducting research on search log mining on top of the platform. For example, we have proposed a "context-aware query suggestion" method; the related paper received the Best Application Paper Award at KDD 2008.
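To make the notion of context-aware suggestion concrete, here is a deliberately simplified sketch that counts which query follows a given sequence of preceding queries in session logs and backs off to a shorter context when the full one is unseen. It is not the method from the KDD 2008 paper; the function names, the `max_context` parameter, and the toy sessions are hypothetical.

```python
from collections import Counter, defaultdict


def build_suggestion_index(sessions, max_context=2):
    """Map each context (tuple of up to max_context preceding queries)
    to a Counter of the queries that followed it in the logs."""
    index = defaultdict(Counter)
    for queries in sessions:
        for i in range(1, len(queries)):
            for k in range(1, max_context + 1):
                if i - k < 0:
                    break
                context = tuple(queries[i - k:i])
                index[context][queries[i]] += 1
    return index


def suggest(index, recent_queries, top_n=3, max_context=2):
    """Suggest next queries using the longest matching context,
    backing off to shorter contexts when needed."""
    for k in range(min(max_context, len(recent_queries)), 0, -1):
        context = tuple(recent_queries[-k:])
        if context in index:
            return [q for q, _ in index[context].most_common(top_n)]
    return []


# Hypothetical session data: each session is an ordered list of queries.
sessions = [
    ["jaguar", "jaguar price", "jaguar dealer"],
    ["jaguar", "jaguar habitat"],
    ["big cats", "jaguar", "jaguar habitat"],
]
index = build_suggestion_index(sessions)
after_big_cats = suggest(index, ["big cats", "jaguar"], top_n=1)
after_price = suggest(index, ["jaguar", "jaguar price"], top_n=1)
```

The same current query ("jaguar" or a refinement of it) leads to different suggestions depending on what the user searched for earlier in the session, which is the point of using context.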

Next Generation Enterprise Search

We work with the SharePoint Search group to jointly incubate next-generation enterprise search technologies. We propose a new approach to enterprise information management and search, in which we organize information in advance, using information extraction and machine learning technologies, and then provide this information to the user during a search. Our technologies include document metadata extraction, expert/expertise mining, and definition and FAQ extraction. We have developed a prototype system and deployed it within Microsoft; some of the technologies we developed have been transferred to SharePoint. Recently, we extended the system to enterprise social computing.

Recent Publications 

  • Tao Qin, Tie-Yan Liu, Xu-Dong Zhang, De-Sheng Wang, Hang Li, Global Ranking Using Continuous Conditional Random Fields, Advances in Neural Information Processing Systems 21, 2009, 1281-1288.
  • Congkai Sun, Bin Gao, Zhenfu Cao, and Hang Li, HTM: A Topic Model for Hypertexts, Proc. of EMNLP 2008, 514-522.
  • Huanhuan Cao, Daxin Jiang, Jian Pei, Qi He, Zhen Liao, Enhong Chen, Hang Li, Context-Aware Query Suggestion by Mining Click-Through and Session Data, Proc. of KDD 2008, 875-883. SIGKDD’08 Best Application Paper Award
  • Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, Hang Li, Listwise Approach to Learning to Rank: Theory and Algorithm, Proc. of ICML 2008, 1192-1199.
  • Yanyan Lan, Tie-Yan Liu, Tao Qin, Zhiming Ma, Hang Li, Query Level Stability and Generalization in Learning to Rank, Proc. of ICML 2008, 512-519.
  • Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, Hang Li, BrowseRank: Letting Users Vote for Page Importance, Proc. of SIGIR 2008, 451-458. SIGIR’08 Best Student Paper Award
  • Xiubo Geng, Tie-Yan Liu, Tao Qin, Andrew Arnold, Hang Li, Heung-Yeung Shum, Query Dependent Ranking with K Nearest Neighbor, Proc. of SIGIR 2008, 115-122.
  • Jun Xu, Tie-Yan Liu, Min Lu, Hang Li, Wei-Ying Ma, Directly Optimizing Evaluation Measures in Learning to Rank, Proc. of SIGIR 2008, 107-114.
  • Jiafeng Guo, Gu Xu, Hang Li, Xueqi Cheng, A Unified and Discriminative Model for Query Refinement, Proc. of SIGIR 2008, 379-386.
  • Rong Jin, Hamed Valizadegan, Hang Li, Ranking Refinement and Its Application to Information Retrieval, Proc. of WWW 2008, 397-406.
  • Tao Qin, Tie-Yan Liu, Xu-Dong Zhang, De-Sheng Wang, Wen-Ying Xiong, Hang Li, Learning to Rank Relational Objects and Its Application to Web Search, Proc. of WWW 2008, 407-416.
  • Gu Xu, Hang Li, Wei-Ying Ma, Fora: Leveraging the Power of Internet Communities for Question Answering, Proc. of WWW 2008 Workshop on QA Web.
  • Xiaonan Ji, Gu Xu, James Bailey, and Hang Li, Mining, Ranking, and Using Acronym Patterns, Proc. of APWeb 2008, 371-382.
  • Tao Qin, Xu-Dong Zhang, Ming-Feng Tsai, De-Sheng Wang, Tie-Yan Liu, Hang Li, Query-level Loss Functions for Information Retrieval, Information Processing and Management, 44, 838-855, 2008.