The Machine Learning Group at Microsoft Research Asia focuses on research and innovation in algorithms and technologies for discovering knowledge from large-scale data. With the continuing growth of information on the Web and of user activity online, it is critical for service providers to understand what information users access and what they do with it, so that developers can better meet user needs.
Group Mission
The Machine Learning Group at Microsoft Research Asia focuses on research and innovation in machine learning algorithms and technologies for discovering knowledge from large-scale data. With the continued growth of information on the Web and of user activity online, it is critical for service providers to understand what information users access and what they do with it in order to better meet users' needs. Our research includes statistical learning, knowledge discovery, pattern recognition, text mining, optimization, game theory, and information retrieval with large-scale and diverse data, such as textual data, Web log data, and software usage data. Online advertising and other digital marketing areas serve as test beds for applying these technologies.
Highlight Projects
Large-scale Machine Learning Toolkit (SIGMA)
The goal of this project is to provide a set of machine learning toolkits that meet the requirements of research and applications, typically those involving large-scale data or features, or those applicable across multiple markets or domains. The toolkit includes, but is not limited to, classification, clustering, time-series analysis, SVD, kernel distance functions, and statistical analysis.
Opinion Search
Grassroots users play an important role in today’s Web. They communicate intensively through channels such as online communities, blogs, and instant messengers. These users also contribute content to the Web, for example opinion data, which captures the knowledge of grassroots users, is large in scale, and is updated very frequently. To make good use of this data, we collect, store, and organize user opinion data. By analyzing and mining it, we try to understand the opinions grassroots users express as well as their requirements, which helps other Web users make purchase decisions and guides manufacturers in improving their products and services. Unlike previous research that focuses on the analysis of social networks, this project focuses on analyzing textual opinion data.
Categorized Search
Categorized Search is one solution for organizing search results by bringing categorization concepts into search products. Our focus is to scale up the whole solution, including identifying popular galleries, mapping queries to galleries, creating intent profiles for galleries, and associating search result pages with intent profiles. We have implemented a tool for organizing queries and user search intents, which is a must-have for implementing this search experience. We use various kinds of data sources, including search logs contributed by search engine users, Web pages provided by website editors, and knowledge bases such as Wikipedia and Web directories organized by volunteers. Both processes are very effective and require little human interaction, while the step of mapping result pages to intent profiles is fully automatic. We will also share our thoughts on how to use our large-scale machine learning toolkit to help scale up the solution, as well as our ideas on how to evaluate a Categorized Search system.
Self Service Behavioral Targeting
Behavioral Targeting (BT) attempts to deliver the most relevant advertisements to the most interested audiences, and it is playing an increasingly important role in the online advertising market. A key problem in BT is how to segment users according to their behaviors for targeted ad delivery. Traditional BT strategies generally predefine a fixed number of user categories and then classify users into these categories based on their search or browsing behaviors. However, predefined user categories or segments cannot adequately satisfy the varied requirements of different advertisers. In this work, we propose to let advertisers interact with our system to define their own preferred user segments. The self-service BT system provides three major functions: (1) advertisers can customize targeted user segments by keywords, demographics, etc.; (2) advertisers can expand user segments to a targeted scale; and (3) the system visualizes the targeted user segment for advertisers. Computationally, the key challenge of the self-service BT system is how to segment a large number of users according to their dynamically changing behaviors. We propose to adopt the Minwise Hashing algorithm within the MapReduce framework for user clustering, as it is very efficient at handling large-scale user data.
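To make the clustering idea concrete, the following is a minimal sketch of Minwise Hashing arranged as a map step and a reduce step. It is an illustration under stated assumptions, not the group's implementation: it runs in memory rather than on a MapReduce cluster, all function names and the signature length `NUM_HASHES` are hypothetical, each user is assumed to be represented by a set of behavior tokens (for example, query keywords), and using the full signature as the cluster key is a simplification.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 4  # hypothetical signature length; larger values give stricter clusters


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed (assumed hash family)."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes=NUM_HASHES):
    """For each seeded hash function, keep the minimum hash value over the token set."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def map_phase(user_behaviors):
    """Map step: emit (signature, user_id) for every user with at least one token."""
    for user_id, tokens in user_behaviors.items():
        if tokens:
            yield minhash_signature(tokens), user_id


def reduce_phase(pairs):
    """Reduce step: group the users that share the same signature key."""
    clusters = defaultdict(list)
    for key, user_id in pairs:
        clusters[key].append(user_id)
    return dict(clusters)


def cluster_users(user_behaviors):
    """Run the map and reduce steps in sequence on an in-memory dictionary."""
    return reduce_phase(map_phase(user_behaviors))
```

For a single hash function, two users receive the same minimum value with probability equal to the Jaccard similarity of their token sets, so users with similar behavior tend to share keys. In a real deployment the map and reduce functions would run as distributed jobs, and the signature would typically be split into bands so that users who are similar but not identical can still be grouped.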



