Overview
The goal of the Web Search & Data Mining Group of Microsoft Research Asia is to drive the next generation of Web search by leveraging data mining, machine learning, and knowledge discovery techniques for information analysis, organization, retrieval, and visualization. In addition, in contrast with current Web search methods that essentially do document-level ranking and retrieval, the Web Search & Data Mining Group has created search at the object level to bring increased knowledge and intelligence to users.
A Glimpse at Several Core Innovations:
Structuralizing the Web
The biggest challenge facing both users and search engines over the next several decades is the continued, unstructured growth of the Internet. Search functions that can effectively and efficiently dig machine-understandable information and knowledge layers out of unorganized and unstructured Web data will therefore be the key to delivering relevant search results. To meet this challenge, the group is exploring technologies, namely Web information extraction, deep Web mining, and Web structure mining, that can automatically classify structures and extract objects from the Web. The information and knowledge gathered using these new techniques greatly improve the performance of current Web search and facilitate the creation of more sophisticated next-generation search technologies.
Vertical Search
Today's conventional search engines can be described as page-level search engines whose main function is to rank web pages according to their relevance to a given query. Driving the future of the search industry are functions that delve deeper into vertical domains to bring knowledge and intelligence to query results. At Microsoft Research Asia, the Web Search & Data Mining Group is addressing the greatest challenges faced by vertical search, including large-scale web classification, object-level information extraction, object identification and integration, and object relationship mining and ranking. The results of these efforts are leading to more advanced search engines that deliver intelligence and insight in search results.
Large-scale, Experimental Web Search Platform
The Web Search & Data Mining Group is creating a large-scale search platform to efficiently store, parse, index and search billions of Web pages and other types of documents. The search platform is flexible enough to allow for testing of various state-of-the-art search techniques created at the lab for use in new technologies.
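As a rough illustration of the parse, index, and search steps mentioned above, the following is a minimal in-memory inverted index. It is only a sketch under simplifying assumptions and does not describe the group's platform: the function names (tokenize, build_index, search) and the three sample documents are hypothetical, and a Web-scale system would add distributed storage, ranking, and compression.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Parse text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, for illustration only.
documents = {
    1: "Web search and data mining",
    2: "Mobile search on small devices",
    3: "Mining the deep Web",
}
index = build_index(documents)
matches = search(index, "web mining")
```

Here matches holds the ids of the documents that contain both query terms. Ranking the matched documents is a separate step that this sketch leaves out.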
Mobile Search
The explosive growth of new computing devices, such as handheld computers, Windows Mobile-based PocketPCs, and SmartPhones, is driving demand for greater and more efficient information access. These devices, which leverage the power of the Web and allow greater access to information than ever before, are still not capable of performing at the level of a desktop PC. At Microsoft Research Asia, the Web Search & Data Mining Group is inventing new technologies to improve the mobile search and browsing experience and deliver the capabilities of a PC to users of these new devices. Project initiatives include developing innovative presentation schemes and user interfaces to facilitate search and browsing tasks on mobile devices and developing context-aware search technologies to address the special information needs of mobile users.
Multimedia Search
The Web Search & Data Mining Group is conducting research into new technologies that index multimedia content, such as images, videos, and audio. Through content analysis and advanced visualization techniques, the group is extending today's conventional text-based search engines to include multimedia content, and thus delivering more intelligent search results to users. For example, the group recently developed a new multimedia news reader that mines large archival news databases and presents text, map information, images, and background music within a unique user interface. It provides readers with a more efficient news search engine and a more enjoyable reading experience.
Web Data Management Group
The Web is described as a large-scale, unstructured, heterogeneous, and hidden information source, which poses challenges to the management of Web data. The mission of the Web Data Management (WDM) Group is to develop systems and algorithms to address these challenges. In principle, we adopt a “data + infrastructure + tools” methodology to make Web data management as effective as a database system, and as flexible as an information retrieval system.
WebStudio – Building Infrastructure for Web Data Management
WebStudio is an infrastructure that provides large-scale Web data management and processing capabilities. It offers an integrated development environment (IDE) for quickly prototyping and conducting experiments at Web scale. It is also a Web data management system that allows users to easily store, access, and manipulate Web data. Based on WebStudio, we are also exploring the possibility of building a new search engine with a data-centric architecture.
Object-level Search
We are exploring a new paradigm to enable web search at the object level, extracting and integrating web information for various types of objects. We rank these objects in terms of their relevance and popularity in answering user queries. The core technologies of object-level search have been implemented in several working systems: Libra Academic Search (http://libra.msra.cn), Windows Live Product Search (http://products.live.com), and an object relationship search engine called Guanxi.
Deep Web Search
A large portion of Web data resides in databases hidden behind the interfaces of many websites. We are working on technologies to acquire, extract, and integrate data from these Web databases to improve the coverage and quality of current search engines.
Web Search Evaluation
Web search is different from traditional information retrieval in many aspects, which demands new methodologies to measure its effectiveness. We are working on new ways to measure user-perceived relevance and diagnose specific issues in search engines.
Information Retrieval and Mining Group
The Information Retrieval and Mining Group’s research goal is to develop advanced technologies to help users accurately, quickly, and easily find information. Currently, the group is working on three projects: algorithms for improving web search, enterprise search, and community search. All are based on the technologies of machine learning, information retrieval, data mining, and natural language processing. The following research areas are being intensively investigated: search relevance and learning to rank, link analysis and web graph mining, anti-spam and adversarial information retrieval, document information extraction, and search log data mining.
Learning to Rank and Search Relevance
Ranking is a central problem in many information retrieval applications, particularly search, and learning to rank is considered a promising approach for addressing it. The group is working on new methodologies in this area to improve ranking accuracy in these applications. Related research focuses on the invention of new learning models, new learning algorithms, and new criteria for learning to rank, as well as theoretical analyses and empirical evaluations.
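To make the idea of learning to rank concrete, here is a minimal sketch of one common formulation, the pairwise approach: a linear scoring function is trained so that, within each query, a more relevant document scores higher than a less relevant one. This is a generic textbook-style illustration, not one of the group's models. The feature vectors (assumed here to be a text-match score and a link-popularity score), the relevance labels, and all function names are hypothetical.

```python
def score(weights, features):
    """Linear scoring function: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise_ranker(queries, epochs=20, learning_rate=0.1):
    """Learn weights from pairs of documents with different relevance labels.

    queries: one list per query of (feature_vector, relevance_label) tuples.
    For each pair where document i is more relevant than document j, the
    weights are nudged until score(i) exceeds score(j) by a margin of 1.
    """
    dim = len(queries[0][0][0])
    weights = [0.0] * dim
    for _ in range(epochs):
        for docs in queries:
            for feats_i, label_i in docs:
                for feats_j, label_j in docs:
                    if label_i <= label_j:
                        continue  # only use pairs where i should rank above j
                    diff = [a - b for a, b in zip(feats_i, feats_j)]
                    if score(weights, diff) < 1.0:
                        weights = [w + learning_rate * d
                                   for w, d in zip(weights, diff)]
    return weights


# Hypothetical training data: [text_match, link_popularity] and a label.
toy_queries = [
    [([0.9, 0.2], 2), ([0.5, 0.6], 1), ([0.1, 0.9], 0)],
    [([0.8, 0.1], 1), ([0.2, 0.3], 0)],
]
weights = train_pairwise_ranker(toy_queries)

# Rank the documents of the first query by learned score, best first.
first_query = toy_queries[0]
ranked = sorted(range(len(first_query)),
                key=lambda i: score(weights, first_query[i][0]),
                reverse=True)
```

In this toy data the relevance labels follow the first feature, so the learned weights favor it and ranked lists the documents of the first query in label order. Other formulations (pointwise or listwise) use different loss criteria but keep the same overall structure of fitting a scoring function to labeled query-document data.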
Link Analysis and Web Data Mining
The Web contains billions of interconnected web pages and can thus be viewed as a large-scale graph with web pages as nodes and links as edges. While link analysis and web graph mining are useful for many Web applications, the biggest challenge is how to effectively and efficiently process large-scale web graph data. The group is currently developing a distributed platform for graph and matrix computation with features such as distributed graph data storage, incremental graph data indexing, parallel graph computation, job scheduling, and fault tolerance. The platform should enable new research and innovations in link analysis and web graph mining.
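The graph view described above can be illustrated with a small link analysis computation. The sketch below runs PageRank, a well-known link analysis algorithm chosen here purely as an example and not named in the text above, on a tiny hand-made graph held in memory. The page names, the damping factor of 0.85, and the iteration count are assumptions for illustration; a platform of the kind described would distribute this computation across many machines.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to its score; scores sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a baseline share from the random-jump term.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank to all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: nodes are pages, edges are hyperlinks.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_web)
```

In this toy graph, page c collects links from three other pages and ends up with the highest score, while page d, which no page links to, ends up with the lowest.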
Document Information Extraction
Document metadata is useful for various kinds of document processing, including search, browsing, and filtering. Ideally, metadata is defined by the document authors and is then used by various systems. However, people seldom define document metadata in a systematic way, so automatically extracting metadata from documents is an important research issue. The group is employing machine learning approaches to conduct automatic metadata extraction from documents. Metadata fields include title, author, and key terms; document types include Office, PDF, and HTML.
Selected Publications
- Patrick Baudisch, Xing Xie, Chong Wang, Wei-Ying Ma, Collapse-to-Zoom: Viewing Web Pages on Small Screen Devices by Interactively Removing Irrelevant Content, 17th Annual ACM Symposium on User Interface Software and Technology (UIST 2004), TechNote, Santa Fe, NM, Oct. 2004.
- Xin Zheng, Deng Cai, Xiaofei He, Wei-Ying Ma and Xueyin Lin, Locality Preserving Clustering for Image Database, 12th ACM International Conference on Multimedia, New York City, USA, Oct. 2004.
- Deng Cai, Xiaofei He, Zhiwei Li, Wei-Ying Ma and Ji-Rong Wen, Hierarchical Clustering of WWW Image Search Results Using Visual, Textual and Link Analysis, 12th ACM International Conference on Multimedia, New York City, USA, Oct. 2004.
- Xiaofei He, Wei-Ying Ma, Hong-Jiang Zhang, Learning an Image Manifold for Retrieval, 12th ACM International Conference on Multimedia, New York City, USA, Oct. 2004.
- Xin-Jing Wang, Wei-Ying Ma, Gui-Rong Xue, and Xing Li, Multi-Model Similarity Propagation and its Application for Web Image Retrieval, 12th ACM International Conference on Multimedia, New York City, USA, Oct. 2004.
- Jun Yan, Benyu Zhang, Shuicheng Yan, Zheng Chen, Weiguo Fan, Wensi Xi, Qiang Yang, Wei-Ying Ma, and Qiansheng Cheng, IMMC: Incremental Maximum Margin Criterion, 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, USA, Aug. 2004.
- Jiying Wang, Ji-Rong Wen, Fred Lochovsky and Wei-Ying Ma, Instance-based Schema Matching for Web Databases by Domain-specific Query Probing, The 30th International Conference on Very Large Data Bases (VLDB 2004), Toronto, Ontario, Canada, Aug. 2004.
- Dou Shen, Zheng Chen, Hua-Jun Zeng, Benyu Zhang, Qiang Yang, Wei-Ying Ma, Yuchang Lu, Web-page Classification through Summarization, The 27th Annual International ACM SIGIR Conference (SIGIR 2004), July 2004.
- Hua-Jun Zeng, Qi-Cai He, Zheng Chen, Wei-Ying Ma, Learning To Cluster Search Results, The 27th Annual International ACM SIGIR Conference (SIGIR 2004), July 2004.
- Ji-Rong Wen, Ni Lao and Wei-Ying Ma, Probabilistic Model for Contextual Retrieval, The 27th Annual International ACM SIGIR Conference (SIGIR 2004), July 2004.
- Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma, Block-based Web Search, The 27th Annual International ACM SIGIR Conference (SIGIR 2004), July 2004.
- Deng Cai, Xiaofei He, Ji-Rong Wen and Wei-Ying Ma, Block-Level Link Analysis, The 27th Annual International ACM SIGIR Conference (SIGIR 2004), July 2004.
- Xiaofei He, Deng Cai, Haifeng Liu and Wei-Ying Ma, Locality Preserving Indexing for Document Representation, The 27th Annual International ACM SIGIR Conference (SIGIR 2004), July 2004.
- Deng Cai, Xiaofei He, Wei-Ying Ma, Ji-Rong Wen and Hong-Jiang Zhang, Organizing WWW Images Based on the Analysis of Page Layout and Web Link Structure, 2004 IEEE International Conference on Multimedia and Expo, Taipei, Jun. 2004.
- Xing Xie, Wei-Ying Ma, Hong-Jiang Zhang, Maximizing Information Throughput for Multimedia Browsing on Small Displays, 2004 IEEE International Conference on Multimedia and Expo, Taipei, Jun. 2004.
- Yusuo Hu, Xing Xie, Zonghai Chen, Wei-Ying Ma, Attention Model Based Progressive Image Transmission, 2004 IEEE International Conference on Multimedia and Expo, Taipei, Jun. 2004.
- Xin-Jing Wang, Wei-Ying Ma, and Xing Li, Data-Driven Approach for Bridging the Cognitive Gap in Image Retrieval, 2004 IEEE International Conference on Multimedia and Expo, Taipei, Jun. 2004.
- Ying Liu, Xiaofang Zhou, Wei-Ying Ma, Extracting Texture Features from Arbitrary-shaped Regions for Image Retrieval, 2004 IEEE International Conference on Multimedia and Expo, Taipei, Jun. 2004.
- Ruihua Song, Haifeng Liu, Ji-Rong Wen and Wei-Ying Ma, Learning Block Importance Models for Web Pages, The Thirteenth World Wide Web Conference (WWW 2004), 203-211, New York, May 2004.
- Wensi Xi, Benyu Zhang, Yizhou Lu, Zheng Chen, Shuicheng Yan, Huajun Zeng, Wei-Ying Ma, and Edward A. Fox, Link Fusion: A Unified Link Analysis Framework for Multi-Type Interrelated Data Objects, The Thirteenth World Wide Web Conference (WWW 2004), 203-211, New York, May 2004.



