Abstracts

Internet Services Workshop 2008

Microsoft Research Asia – Tsinghua University



Wei-Ying MA
Principal Researcher/Research Area Manager
Microsoft Research Asia

Title
Rethink Search and Understand Its Strategic Position in Cloud Computing and Internet Economies

Abstract:
The competition in search has been driving a great deal of innovation and investment in next-generation Internet services that provide the computing platform for Internet economies on a global scale. Unlike traditional products, search needs research and continuous experimentation, and evaluating the effectiveness of newly invented algorithms often requires an infrastructure for “scale” experiments and data-intensive computing. Over the past few years, I have had the great fortune to work with many talented people at Microsoft on search. I would like to share some lessons we learned and also talk about why search continues to hold a strategic position in the future of the Web: the breakthroughs from search are also fueling innovation in cloud computing and computational economies.

Robbert van RENESSE
Principal Research Scientist, the Department of Computer Science
Cornell University

Title
Trials and Tribulations in Scaling Distributed Systems

Abstract:
In this presentation, Dr. Robbert van RENESSE will present various personal experiences with conceiving, designing, building, promoting, deploying, and maintaining a large distributed system. While simulations and limited deployments can often demonstrate that distributed systems scale to hundreds of thousands of compute nodes, real deployments often run into various kinds of practical limitations, not all of which are necessarily technical. He will argue for "one-size-fits-all" systems that can adapt to a wide variety of operating conditions and can be upgraded incrementally as needed. He will also argue that a theoretical foundation is necessary for a system to be both flexible and robust.

Rakesh AGRAWAL
Technical Fellow
Microsoft

Title
Humane Data Mining

Abstract:
Data mining has made tremendous strides in the last decade. It is time to take data mining to the next level of contributions, while continuing to innovate for the current mainstream market. We postulate that a fruitful future direction could be humane data mining: applications to benefit individuals. The potential applications include personal data mining (for example, personal health), enabling people to get a grip on their world (for example, dealing with the long tail of search), enabling people to become creative (for example, inventions arising from linking non-interacting scientific literature), enabling people to make contributions to society (for example, education collaboration networks), and data-driven science (for example, studying ecological disasters and brain disorders). Rooting our future work in these (and similar) applications will lead to new data mining abstractions, algorithms, and systems.

Yi-Min WANG
Director, ISRC–Redmond
Microsoft Research

Title
Adversarial Web Crawling with Strider Monkeys

Abstract:
Crawlers are the most fundamental component of any search engine because they define what information gets stored in the index. Traditional static crawlers treat each URL as a pointer to a Web page that contains static HTML content. Search-spammers have been exploiting the well-known fact that accessing a URL can actually trigger highly dynamic executions of an arbitrary number of Web programs from an arbitrary number of third-party Web sites, with arbitrarily complex code obfuscation. In this talk, I will describe the Strider Monkeys that “crawl Web programs” by executing all scripts and tracking all redirections. In particular, I will present the SearchMonkeys, which mimic interactive search-engine users, and HoneyMonkeys, which detect unauthorized software installations following successful executions of vulnerability-exploit scripts.

Hsin-Hsi CHEN
Professor, Department of Computer Science and Information Engineering
National Taiwan University

Title
Tag Prediction for Effective Social Media Retrieval

Abstract:
As Web 2.0 progresses, social annotation is becoming an increasingly popular way for Web users to manage interesting resources or URLs. Although there are many potential applications of social annotation, the requirement of sufficient annotations, especially critical for newly created Web resources, limits its applicability. In addition, freely chosen and open-ended tag naming decreases the usability of social annotation in resource recommendation and effective retrieval. In this paper, we propose a tag normalization algorithm to unify users’ annotations. We also explore some general phenomena in a social annotation system and propose a supervised tag prediction model that predicts the stabilized tag set of a resource from the feedback of a small number of user annotation records. The experiments show that a large portion of the stabilized tag set is predicted, and that it is feasible to reduce the requirement of sufficient user annotations in applications of social annotation.

Qiang YANG
Professor, Department of Computer Science and Engineering
Hong Kong University of Science and Technology

Title
Towards Personalized Query Classification

Abstract:
With the help of search engines, Web queries are becoming a major bridge between Web users and the services that search engines provide, such as advertisement and Web page search. Query classification (QC) is the task of classifying Web queries into topical categories. Since queries are usually short and ambiguous, the same query may belong to different categories. In this project, we develop a novel algorithm for personalized query classification (PQC) through user preference learning. User preferences are often hidden in feedback such as click-through data. Thus, we present a novel approach that learns from click-through log data and other information sources in an effort to enhance the performance of query classification. This work has applications in a variety of areas, including online advertising, better search result ranking, and user interfaces.

Kyu-Young WHANG
Professor
Korea Advanced Institute of Science and Technology

Title
Answering Top-k Queries Using a Partitioned-Layer Index

Abstract:
A top-k query returns the k tuples with the highest (or the lowest) scores from a relation. The score is computed by combining the values of one or more attributes. We focus on top-k queries having monotone linear score functions. Layer-ordering methods are a well-known class of methods that process top-k queries effectively. These methods construct a database as a single list of layers. Here, the i-th layer contains the objects that can be the top-i object. Thus, these methods answer top-k queries by reading at most k layers. Query performance, however, is poor when the number of objects in each layer (simply, the layer size) is large. In this project, we propose a new layer-ordering method, called the Partitioned-Layer Index (simply, the PL Index), that significantly improves query performance by reducing the layer size. The PL Index uses the notion of partitioning, which constructs a database as multiple sublayer lists instead of a single layer list, thereby reducing the layer size. The PL Index also uses the new notion of the convex skyline, which is a subset of the skyline, to construct a sublayer and further reduce the layer size. The query performance of the PL Index is insensitive to the weights of the attributes (called the preference vector) of the score function and is approximately linear in the value of k. The PL Index is capable of tuning query performance for the most frequently used value of k by controlling the number of sublayer lists. Experimental results using synthetic and real data sets show that the query performance of the PL Index significantly outperforms existing methods except for small values of k (say, k <= 9).
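
To make the query definition above concrete, here is a minimal Python sketch of what a top-k query with a monotone linear score function computes. It is a naive full scan, not the PL Index or any layer-ordering method; the names `linear_score`, `top_k`, and the sample relation are hypothetical and chosen only for illustration.

```python
import heapq
from typing import List, Sequence, Tuple

Row = Tuple[float, ...]


def linear_score(row: Row, weights: Sequence[float]) -> float:
    """Monotone linear score: weighted sum of attribute values (weights >= 0)."""
    return sum(w * v for w, v in zip(weights, row))


def top_k(relation: List[Row], weights: Sequence[float], k: int) -> List[Row]:
    """Return the k rows with the highest scores by scanning every row.

    This is the baseline that an index is meant to beat: it reads the
    whole relation regardless of k.
    """
    return heapq.nlargest(k, relation, key=lambda row: linear_score(row, weights))


# Hypothetical two-attribute relation and preference vector.
relation = [(1.0, 9.0), (5.0, 5.0), (9.0, 1.0), (2.0, 2.0)]
best_two = top_k(relation, weights=(0.8, 0.2), k=2)
```

The scan touches every tuple, whereas the abstract describes layer-ordering methods as reading at most k layers, which is why the size of each layer matters for query cost.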

Guirong XUE
Professor
Shanghai Jiao Tong University

Title
Deep Classification in Large-scale Web Hierarchies

Abstract:
Classifying Web documents into categories can help Web users quickly browse content by topic. Currently, most classification algorithms focus on categorizing Web documents into shallow categories such as the top two levels of the Open Directory Project. Such a classification method does not give the user more detailed topic-related information because the first two levels are often too coarse. However, classification on a large-scale hierarchy is known to be intractable for many target categories with relationships among them. In this talk, we present our research on categorizing Web documents into categories with a large-scale taxonomy. We will show the performance of our proposed algorithms on the Open Directory Project with more than 130,000 categories.

Dong XU
Doctor
Nanyang Technological University

Title
Near Duplicate Image Identification with Spatially Aligned Pyramid Matching

Abstract:
Dong XU introduces our recent work on Image Near Duplicate Identification using Spatially Aligned Pyramid Matching. The method robustly handles spatial shifts as well as scale changes. Images are divided into both overlapped and non-overlapped blocks over multiple levels. In the first matching stage, pairwise distances between blocks from the examined image pair are computed using SIFT features and Earth Mover’s Distance (EMD). In the second stage, multiple alignment hypotheses that consider piecewise spatial shifts and scale variation are postulated and resolved using integer-flow EMD. In extensive testing on the Columbia Near Duplicate Image Database and two new datasets, the method clearly outperforms existing methods. Dr. Xu also introduces the ongoing research projects in his research group.

Minlie HUANG
Doctor
Tsinghua University

Title
Mining Reviews for Product Comparison and Recommendation

Abstract:
As the Internet moves into the Web 2.0 era, more and more opinion and review information is posted via blogs, comment systems, and other Web mashups. People post their purchase experiences, usage experiences, and other related comments on digital products such as digital cameras, videos, cell phones, and more. Mining this opinion and review information will help users choose their favorite products, providers improve quality, and governments make policy. In this talk, we introduce our work on mining reviews for product comparison and recommendation. Based on the opinion and review contents, we compare digital products from the subjective perspective, the objective perspective, and the overall perspective, with different computation models. We use the product evolution tree to recommend suitable alternatives to users. We also show preliminary results and simple demos in the talk.

Seung-Jin CHOI
Associate Professor, Department of Computer Science
Pohang University of Science and Technology

Title
Weighted Nonnegative Matrix Factorization for Collaborative Prediction

Abstract:
Learning a fruitful representation from data plays a critical role in machine learning and data mining. Nonnegative matrix factorization (NMF) is a widely used method for low-rank approximation (LRA) of a nonnegative matrix (a matrix with only nonnegative entries), where nonnegativity constraints are imposed on the factor matrices in the decomposition. A large body of past work on NMF has focused on the case where the data matrix is complete. In practice, however, we often encounter an incomplete data matrix where some entries are missing (for example, a user-rating matrix). Weighted low-rank approximation (WLRA) has been studied to handle incomplete data matrices. However, there are only a few works on weighted nonnegative matrix factorization (WNMF). Existing WNMF methods are limited to a direct extension of NMF multiplicative updates, which are easy to implement but suffer from slow convergence. In this presentation, Seung-Jin Choi talks about relatively fast and scalable algorithms for WNMF, borrowed from well-studied optimization techniques: (1) alternating nonnegative least squares and (2) generalized expectation maximization. He demonstrates the useful behavior of WNMF in a collaborative prediction task.
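
As an illustration of the baseline the abstract mentions, the following Python sketch applies the commonly used weighted extension of NMF multiplicative updates to a small rating matrix with missing entries. It is not the speaker's faster algorithms (alternating nonnegative least squares or generalized EM); the function names, the 0/1 mask convention, the random initialization, and the sample ratings are all assumptions made for this example.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def hadamard(a, b):
    """Element-wise product."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def weighted_error(x, mask, u, v):
    """Squared reconstruction error over observed entries only."""
    approx = matmul(u, v)
    return sum(mask[i][j] * (x[i][j] - approx[i][j]) ** 2
               for i in range(len(x)) for j in range(len(x[0])))


def wnmf(x, mask, rank, iters=200, seed=0, eps=1e-9):
    """Weighted NMF by multiplicative updates (mask: 1 = observed, 0 = missing)."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    u = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    v = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    wx = hadamard(mask, x)
    for _ in range(iters):
        # U <- U * ((W.X) V^T) / ((W.(UV)) V^T)
        vt = transpose(v)
        num = matmul(wx, vt)
        den = matmul(hadamard(mask, matmul(u, v)), vt)
        u = [[u[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
        # V <- V * (U^T (W.X)) / (U^T (W.(UV)))
        ut = transpose(u)
        num = matmul(ut, wx)
        den = matmul(ut, hadamard(mask, matmul(u, v)))
        v = [[v[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
    return u, v


# Hypothetical user-by-item ratings; 0 marks a missing rating.
ratings = [[5, 3, 0], [4, 0, 1], [1, 1, 5], [0, 1, 4]]
mask = [[1 if r > 0 else 0 for r in row] for row in ratings]
u0, v0 = wnmf(ratings, mask, rank=2, iters=0)
u, v = wnmf(ratings, mask, rank=2, iters=200)
predicted = matmul(u, v)  # entries at masked positions serve as predictions
```

Because each update multiplies by a nonnegative ratio, the factors stay nonnegative without an explicit projection step. That simplicity is what makes this baseline easy to implement, at the cost of the slow convergence noted above.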

Katsumi TANAKA
Professor, Graduate School of Informatics
Kyoto University

Title
Multimedia Search and Information Credibility

Abstract:
As computers and computer networks become more common, a huge amount of information, such as that found in Web documents, has been accumulated and circulated. Such information gives people a framework for organizing their private and professional lives. However, in general, the quality control of Web content is insufficient due to low publishing barriers. As a result, there is much mistaken and unreliable information on the Web that can have detrimental effects on users. This calls for technology that would facilitate judging the trustworthiness of content and the accuracy of the information that users encounter on the Web. Such technology should be able to handle a wide range of tasks: extracting credible information related to a given topic, organizing this information, detecting its provenance, clarifying background, facts, and other related opinions and their distribution, and so on. In this talk, we introduce our research activities on multimedia search and information credibility. In particular, we report on two issues: how to improve conventional Web image search by using Web 2.0 content and how to analyze the credibility of multimedia content (images and video) on the Web.

Evelyne VIEGAS
Senior Research Program Manager
Microsoft Research

Title
Data Intelligence to drive Search Innovation

Abstract:
Evelyne presents the Data Intelligence initiative, which seeks ways to enable academic research with large-scale real-world data to drive innovation in search while preserving the privacy of users. She presents previous efforts and the roadmap for moving this effort forward, beyond search, with semantic computing technologies.

Xing XIE
Doctor, Lead Researcher
Microsoft Research Asia

Title
Moving Toward Next Generation Social Networks: A Location Based Direction

Abstract:
Social networking services have become extremely popular in recent years, especially among young people. However, they are still rooted in the virtual world. People need to sit behind a desktop computer to upload photos, write blogs, and communicate with friends. On the other hand, the development of wireless networks and location-sensing technologies has made it easier to track and share personal location information on the fly. By adding a location dimension, we can bring social networking back from the virtual world into real life and allow real-life experiences to be shared in a more convenient way. At Microsoft Research Asia, we are working on various technologies toward building location-based mobile social networks. In this talk, Xing Xie presents our recent work on understanding users in such networks, which is an essential task for providing personalized experiences and targeted advertisements. In particular, we studied GPS trajectory transportation mode categorization and co-located query pattern mining problems.

Hongtao CHEN
Doctor, Virtual Earth Solution Specialist, GCR
Microsoft

Title
Location Based Services and Virtual World

Abstract:
Location Based Services (LBS) are more than the Yellow Pages, locating a mobile terminal, or local search. They involve what, when, where, and who is interacting with the spatial information. A comparison is made between the physical and virtual worlds, and two kinds of virtual worlds are described: the mirrored virtual world and the imaginative virtual world. Microsoft’s contribution to LBS is also introduced: Microsoft Virtual Earth is one of the leading geo-info platforms. It provides rich imagery including high-resolution satellite images, oblique bird’s-eye views, 3D models, and other local search related features. Some demos are presented to demonstrate the powerful applications of Virtual Earth.

Xian-Sheng HUA
Doctor, Lead Researcher
Microsoft Research Asia

Title
Internet Multimedia Search and Mining at Microsoft Research Asia

Abstract:
With the explosion of video and image data available on the Web, multimedia search is becoming more and more important. Mining semantics and other useful information from this huge media dataset to facilitate related applications has also gained much attention from both academia and industry. In this talk, we discuss the trends in this area and introduce recent progress at Microsoft Research Asia.

Tie-Yan LIU
Doctor, Lead Researcher
Microsoft Research Asia

Title
BrowseRank: Letting Web Users Vote for Page Importance

Abstract:
In this talk, a new method for computing page importance, referred to as BrowseRank, is presented. The conventional approach to computing page importance is to exploit the link graph of the Web and to build a model based on that graph. For instance, PageRank is such an algorithm, which employs a discrete-time Markov process as the model. Unfortunately, the link graph might be incomplete and inaccurate with respect to data for determining page importance, because links can be easily added and deleted by Web content creators. In this work, we propose computing page importance by using a “user browsing graph” created from user behavior data. In this graph, vertices represent pages and directed edges represent transitions between pages in the users’ Web browsing history. Furthermore, the lengths of time that users spend on the pages are also included. The user browsing graph is more reliable than the link graph for inferring page importance. We further propose using a continuous-time Markov process on the user browsing graph as a model and computing the stationary probability distribution of the process as page importance. An efficient algorithm for this computation has also been devised. In this way, we can leverage hundreds of millions of users’ implicit voting on page importance. Experimental results show that BrowseRank indeed outperforms baseline methods such as PageRank and TrustRank in several tasks.
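
The idea of scoring pages by the stationary distribution of a continuous-time Markov process can be sketched as follows. This is an illustrative Python sketch, not the efficient algorithm mentioned in the abstract. It relies on a standard property of such processes: the stationary probability of a state is proportional to the stationary probability of the embedded jump chain multiplied by the mean holding time in that state. The function name, the uniform reset with a `damping` parameter, the treatment of pages without outgoing transitions, and the sample data are all assumptions made for this example.

```python
def browsing_importance(transitions, mean_stay, damping=0.85, iters=200):
    """Score pages from observed page-to-page transition counts and stay times.

    transitions: {(src, dst): count} taken from browsing histories
    mean_stay:   {page: mean time spent on the page}
    """
    pages = sorted(mean_stay)
    n = len(pages)
    out_total = {p: 0 for p in pages}
    for (src, _dst), count in transitions.items():
        out_total[src] += count

    # Stationary distribution of the embedded (jump) chain by power iteration.
    pi = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        nxt = {p: (1.0 - damping) / n for p in pages}  # assumed uniform reset
        for (src, dst), count in transitions.items():
            nxt[dst] += damping * pi[src] * count / out_total[src]
        # Pages with no outgoing transitions spread their mass uniformly.
        dangling = sum(pi[p] for p in pages if out_total[p] == 0)
        for p in pages:
            nxt[p] += damping * dangling / n
        pi = nxt

    # Weight by mean staying time and renormalize.
    weighted = {p: pi[p] * mean_stay[p] for p in pages}
    total = sum(weighted.values())
    return {p: w / total for p, w in weighted.items()}


# Hypothetical data: a symmetric 3-cycle, so only staying time separates pages.
transitions = {("a", "b"): 5, ("b", "c"): 5, ("c", "a"): 5}
mean_stay = {"a": 30.0, "b": 10.0, "c": 20.0}
importance = browsing_importance(transitions, mean_stay)
```

In this toy graph every page is visited equally often, so the page on which users stay longest receives the highest score. A model built only from the link structure would see these three pages as identical.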

Zheng CHEN

Title
Large-Scale Log Analysis and Mining

Abstract:
The volume of accumulated Web usage data (logs) is increasing dramatically with the rapid growth of Internet services. Analyzing and mining log data is one way to support data-driven decision making for online businesses. Because the scalability of analysis and mining algorithms is a big barrier to the development of log mining platforms, we built an open log analysis and mining platform, called LAMP, that enables researchers to easily develop and study log-mining algorithms on different kinds of log data. LAMP supports large-scale, daily updated log volumes, incremental analysis, and easy reporting. Web analytics, IME, and keyword technology (KT) are example applications built on top of LAMP.

Beryl PLIMMER
Doctor
University of Auckland

Title
Sharing Multiple Ink Annotations on the Web

Abstract:
Our vision for this project is a Web space where a document can be annotated with digital ink by any number of people. The annotations may have different purposes; for example, they may be feedback to the primary author for preparation of multi-authored documents in a corporate environment. Or they may be a teacher’s or students’ comments on course notes, a study group’s shared considerations of a text, or an extended family’s shared annotations on a family calendar. Potentially, they could contribute to social discourse where an initial document acts as the trigger for a discussion. Realizing this vision requires us to explore a number of technical and design issues.
The first challenge is simply to provide the functionality within a browser to create, store, retrieve, and share digital ink data. We are exploring two approaches to this: first, Silverlight is used within a page to support digital ink on specifically designed Web sites; second, as a more general approach, an Internet Explorer add-in provides the same functionality on any Web page. Once annotations are recorded, their correct spatial position must be maintained as the underlying document reflows as a result of changes. This is a well-documented research problem that is only partially solved. Finally, if a Web document has multiple annotations by different authors (say, study notes annotated by an entire class), there are interesting visualization, filtering, and search questions to be addressed. Successful deployment of this technology will be a major step towards enriching the online document experience.