*
Quick Links|Home|Worldwide
Microsoft*
Search for


Web Search and Data Mining - Silicon Valley

Overview

We are conducting numerous projects aimed at improving web search. Our projects range from developing core systems infrastructure, to developing novel algorithms and heuristics for ranking and classifying web pages, to study basic properties of the web at large, to mining query logs for temporal patterns.

Contributors

Steve Chien, Cynthia Dwork, Dennis Fetterly, Michael Isard, Mark Manasse, Frank McSherry, Marc Najork, Rina Panygrahy

Projects

Accelerated Link-based Ranking Computations

A class of query-independent link-based ranking algorithms operate on the entire web graph, or at least on the fraction that is known to a search engine. Moreover, many of these algorithms use iterative methods to compute a fixed point of the ranks of all known pages. Given the enormous size of typical search engine corpora, this computation can be computationally challenging. We have developed algorithms for greatly accelerating such computations, and have implemented them at scale. We are in the process of exploring further algorithmic and implementation optimizations.

Investigation of Blogs

Given the rising popularity of blogging, we are currently investigating the impact of blogs on the web at large. In particular, we are interested in the evolution of blogs and their interconnectedness.

Nocturnal

Nocturnal is an automated Messenger-based information sharing and collaboration web-search system. The vision behind the Nocturnal project is to harness the power of social links for sharing information such as recommendations and reviews. The key idea is to leverage the existing messaging network in order to bootstrap the system with natural, automatic social communities. The communities are formed by a user's buddies list, his buddies' buddies and their buddies, and so on. Each social circle shares much in common, without requiring users to subscribe to any new service, eliciting information about users' hobbies and social behavior, etc.

Query Log Mining

Query logs contain vast amounts of useful information: They indicate current trends in user interests, can be used to guide automatic spelling correction, and are a source of associated queries. We have mined query logs for temporal patterns, and have found that queries with similar temporal distributions tend to be semantically related.

Scalable Hyperlink Store

The Scalable Hyperlink Store is a specialized database for storing the graph induced by web pages and hyperlinks between them. It is designed to be highly scalable (i.e. capable of holding the entire graph induced by the MSN Search corpus) and to allow microsecond-range access to nodes and edges in that graph. Performance is achieved by maintaining a highly compressed representation of the graph in memory, while scalability is achieved by distributing the graph over a cluster of machines.

Studies of Web Evolution

We have studied the evolution of several aspects of the web at large. In particular, we have investigated how much web pages change over time, and how stable sets of near-identical web pages are over time. A solid understanding of these properties is beneficial to search engines, which typically refresh their index continuously and need to decide on recrawl policies that optimizes content freshness and index quality under bandwidth constraints.

Tie-aware Information Retrieval Performance Measures

We have studied the problem of measuring the retrieval performance of Information Retrieval Systems (such as web search engines) in the presence of tied scores. Tied scores arise commonly when one tries to assess the performance of a single discrete feature, such as in-link count, query-result click-through, or page visits. Standard definitions of most performance measures do not consider the possibility of ties. We have defined tie-aware variants of six common measures (precision, recall, F-measure, mean average precision, mean reciprocal rank, and normalized discounted cumulative gain) that conceptually average performance over all well-sorted permutations of a result-vector, and yet are about as efficient as the standard, tie-oblivious versions. You can download C# implementations of both tie-oblivious and tie-aware versions of these measures.

Web Crawling

Commercial search engines such as MSN Search assemble their corpus in a biased fashion, by attempting to maintain a fresh copy of high-quality pages. There are situations where the biased nature of such a corpus is undesirable, e.g. when studying statistical properties of portions of the web that search engines are biased against. To enable such studies, we have built a highly customizable, yet high performance research web crawler.

Web Spam Detection

Given the significant amount and fast growth of web-based commerce and the crucial role of search engines in directing monetizable traffic to web sites, it is not surprising that some web site operators try to improve their traffic by publishing web pages that are targeted at search engines, but are useless to human viewers. This practice is commonly known as web spam. Web spam is a nuisance for both web searchers and search engines. We are investigating heuristics for identifying spam web pages.

Microsoft Product Engagement

  • We have been closely collaborating with MSN Search in the design and implementation of their algorithmic search service, starting from the earliest product planning stages. We have contributed architectural designs, significant portions of production code, as well as prototypes, algorithms and consulting services. Ongoing collaborations include designing and building distributed systems infrastructure as well as exploring new algorithms for result ranking, web spam detection, keyword auctions, etc.

Selected Publications

©2008 Microsoft Corporation. All rights reserved. Terms of Use |Trademarks |Privacy Statement