Based in Beijing, China, the Machine Learning Group at Microsoft Research Asia focuses on machine learning research, knowledge discovery from large-scale data, and innovative systems powered by machine intelligence. With broad research efforts in areas such as statistical learning, pattern recognition, text mining, optimization, information retrieval, and recommendation, we are currently exploring practical technologies to enable large-scale knowledge acquisition, to model user intention, and to optimize the ecosystem that involves users, rich clients, and various online services. The Machine Learning Group is managed by senior researcher Zheng Chen.
Current Projects and Research Areas
Kable - Knowledge Table
Kable aims to extract structured knowledge from semi-structured and unstructured Web sites. It represents the extracted knowledge as a table in which each row stands for a domain entity and each column stands for an attribute, so that the knowledge can be easily used by Web applications such as search task simplification and attribute-based filtering of search results. Currently, Kable focuses on several domains, such as “Movie”, “Company”, “Hotel”, “Book”, and “Mobile Application”. Kable has three major ongoing sub-projects:
- Kable – APEX. Here APEX stands for Auto Production of EXtractors. Kable APEX aims to automatically discover domain-specific sites and extract structured knowledge from the semi-structured and free-text Web. APEX also models the extracted structured knowledge to support Web applications such as knowledge-based Q&A and entity page indexes.
- Kable.Com. Here “Com” stands for Kable in the Company domain. We study knowledge extraction and modeling in depth in this specific domain. We not only extract and model general entity knowledge in this domain, but also propose learning solutions for extracting and modeling knowledge unique to its entities, to allow information navigation.
- Kable Revenue. Here REVENUE stands for REleVancE aNd User Experience. We aim to leverage Web knowledge to improve search and ads relevance. At the same time, we develop novel knowledge-based online user experiences.
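The table formulation described above (one row per domain entity, one column per attribute) can be illustrated with a minimal sketch. This is not Kable's implementation: the `KnowledgeTable` class, its `add_entity` and `filter_by` methods, and the sample movie rows are all hypothetical, chosen only to show how attribute-based filtering follows from the row/column layout.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class KnowledgeTable:
    """Hypothetical table: one row per entity, one column per attribute."""
    domain: str
    columns: List[str]
    rows: List[Dict[str, Any]] = field(default_factory=list)

    def add_entity(self, **attributes: Any) -> None:
        # Reject attributes that are not columns of this table.
        unknown = set(attributes) - set(self.columns)
        if unknown:
            raise ValueError(f"unknown attributes: {sorted(unknown)}")
        # Attributes that were not extracted are stored as None.
        self.rows.append({c: attributes.get(c) for c in self.columns})

    def filter_by(self, **criteria: Any) -> List[Dict[str, Any]]:
        # Keep rows whose attribute values match every criterion exactly.
        return [
            row for row in self.rows
            if all(row.get(k) == v for k, v in criteria.items())
        ]


# Placeholder data for the "Movie" domain (invented for illustration).
movies = KnowledgeTable("Movie", ["title", "director", "year"])
movies.add_entity(title="Example Movie A", director="Director X", year=2010)
movies.add_entity(title="Example Movie B", director="Director Y", year=2010)
movies.add_entity(title="Example Movie C", director="Director X", year=2012)

by_director = movies.filter_by(director="Director X")
```

Because every entity shares the same columns, filtering search results by an attribute reduces to a simple predicate over rows. A production system would also need to handle missing or conflicting extracted values, which this sketch ignores.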
Collaborative Modeling for Recommendation
Existing recommendation systems mainly work within a specific domain, e.g., recommending movies or music to users. In reality, users may use different services and interact with various types of objects. In this project, we focus on collaborative modeling research that brings semantics into the recommendation engine. The research problems of interest include:
- Unified learning framework to incorporate explicit concepts and implicit topics
- Modeling structured knowledge and unstructured/heterogeneous data sources
- Understanding interrelated entities and services across domains
- Learning with constraints to optimize both observations and model generalization
We use cross-domain recommendation and paid search as applications to test our research.
Context Aware Intent Engine
The goal of this project is to simplify user task completion by delivering a user-centric experience. We aim to build an intelligent system (intent engine) with the following capabilities: 1) understand users’ intent; 2) connect users with relevant services or applications to complete tasks; 3) guide the interactions between user and system with machine intelligence. Our ongoing efforts are related to the following research problems:
We conduct this research on different devices (smartphone, slate, PC) and with various applications, including but not limited to information retrieval, recommendation, and personal assistant systems.
Archived Projects and Research Areas
Large Scale Machine Learning Platform
The goal of this project is to provide a set of machine learning algorithms that meet the requirements of research work and applications, typically those with very large-scale data or features, or those applicable across multiple markets or domains. The platform provides, among other things, classification, clustering, time series analysis, SVD, kernel distance functions, and statistical analysis.
Behavioral Targeting (BT) attempts to deliver the most relevant advertisements to the most interested audiences, and it is playing an increasingly important role in the online advertising market. Behavioral targeting research faces a set of challenges: user representation and modeling, user segmentation, and targeted ad delivery. We have multiple sub-projects for behavioral targeting research. We started with the "Self Service Behavioral Targeting" project. The most recently released product from our BT research is "Intent based Behavioral Targeting". Our ongoing project, Ad Selection, is carried out with the display ads team.
Categorized Search is a solution for organizing search results by bringing categorization concepts into search products. Our focus is to scale up the whole solution, including identifying popular galleries, mapping queries to galleries, creating intent profiles for galleries, and associating search result pages with intent profiles. We have implemented a tool for organizing queries and user search intents, which is a must-have for implementing this search experience. We use various kinds of data sources, including search logs contributed by search engine users, Web pages provided by website editors, and knowledge bases such as Wikipedia and Web directories organized by volunteers. Both processes are very effective and require little human interaction, while the step of mapping result pages to intent profiles is fully automatic. We will also share our thoughts on how to use our large-scale machine learning toolkit to help scale up the solution, as well as our ideas on how to evaluate a Categorized Search system.
Grassroots users play important roles in today’s Web. They communicate intensively through channels such as online communities, blogs, and instant messengers. These users also contribute content to the Web, for example opinion data, which captures the knowledge of grassroots users, is large in scale, and is updated very frequently. To organize and use this data well, we collect, store, and organize user opinion data. By analyzing and mining opinion data, we try to understand the opinions expressed by grassroots users as well as their requirements, which helps other Web users make purchase decisions and directs manufacturers to improve their products and services. Unlike previous research that focuses on the analysis of social networks, this project focuses on analyzing textual opinion data.
- Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen, Knowledge Graph and Text Jointly Embedding, in The 2014 Conference on Empirical Methods in Natural Language Processing, ACL – Association for Computational Linguistics, October 2014
- Lihong Li, Rémi Munos, and Csaba Szepesvari, On Minimax Optimal Offline Policy Evaluation, no. MSR-TR-2014-124, 15 September 2014
- Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen, Knowledge Graph Embedding by Translating on Hyperplanes, AAAI - Association for the Advancement of Artificial Intelligence, July 2014
- Chengtao Li, Jianwen Zhang, and Zheng Chen, Structured Output Learning with Candidate Labels for Local Parts, in Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD 2013), Springer, September 2013
- Xingxing Zhang, Jianwen Zhang, Junyu Zeng, Jun Yan, Zheng Chen, and Zhifang Sui, Towards Accurate Distant Supervision for Relational Facts Extraction, in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, August 2013
- Yan Xu, Yining Wang, Jian-Tao Sun, Jianwen Zhang, Junichi Tsujii, and Eric Chang, Building Large Collections of Chinese and English Medical Terms from Semi-Structured and Encyclopedia Websites, in PLOS ONE, PLoS, 9 July 2013
- Suin Kim, Jianwen Zhang, Zheng Chen, Alice Oh, and Shixia Liu, A Hierarchical Aspect-Sentiment Model for Online Reviews, in Proceedings of The Twenty-Seventh AAAI Conference on Artificial Intelligence (AAAI-13), AAAI, July 2013
- Shipra Agrawal and Navin Goyal, Further optimal regret bounds for Thompson Sampling, in Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS), April 2013
- Chengtao Li, Jianwen Zhang, Jian-Tao Sun, and Zheng Chen, Sentiment Topic Model with Decomposed Prior, in SIAM International Conference on Data Mining (SDM'13), Society for Industrial and Applied Mathematics, 2013