Web Search and Data Mining — Silicon Valley

We are conducting numerous projects aimed at improving web search. Our projects range from developing core systems infrastructure, to developing novel algorithms and heuristics for ranking and classifying web pages, to studying basic properties of the web at large, to mining query logs for temporal patterns.

Projects

Corpus Selection
The Corpus Selection project is investigating the impact of various index selection methods on the measured quality of the resulting index. Papers describing this work appeared at SIGIR 2008, ECIR 2009, and SIGIR 2009.

Helix
In web search today, a user types a few keywords and gets back links to web pages consisting of unstructured data. This leaves a lot to be desired when structured data stores exist that could provide more relevant results for such queries. Our work aims to analyze the web query and extract structured semantics, map them to the corresponding structured data sources, and modify the web ranking functions to incorporate the results from the structured data. These techniques are general and applicable to diverse domains, such as shopping, movies, autos, and travel. We have established a collaboration with our search engine product team, and part of our work is already included in live search. A paper describing the architecture appeared at SIGMOD 2009.

Link-based Ranking Features
We have investigated the effectiveness of various link-based features for ranking web search results, using a large web graph with close to 3 billion nodes and close to 18 billion edges, and a sizeable test set consisting of more than 28,000 queries with partially labeled results. We compared the performance of query-independent link-based features such as web page in-degree and PageRank to that of query-dependent features such as HITS, SALSA and others. Papers describing our findings appeared at SIGIR 2007, CIKM 2007, WAW 2007, CIKM 2008, and WSDM 2009.

Privacy Integrated Queries (PINQ)
Privacy Integrated Queries is a LINQ-like API for computing on privacy-sensitive data sets, while providing guarantees of differential privacy for the underlying records. The research project is aimed at producing a simple, yet expressive language about which differential privacy properties can be efficiently reasoned and in which a rich collection of analyses can be programmed. A paper describing this work appeared at SIGMOD 2009.
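
To make the differential privacy guarantee concrete, here is a minimal Python sketch of one standard building block, the Laplace mechanism applied to a counting query. It is not PINQ's API (PINQ is LINQ-like and not shown on this page); the function name `noisy_count` and its parameters are hypothetical and chosen only for illustration.

```python
import random


def noisy_count(records, predicate, epsilon, rng=random):
    """Return an epsilon-differentially-private count of matching records.

    Hypothetical helper for illustration; not part of PINQ.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    # Adding or removing one record changes a count by at most 1, so
    # Laplace noise with scale 1/epsilon is enough. The difference of two
    # independent exponential draws with rate epsilon is Laplace(0, 1/epsilon).
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


# Illustrative data only: ages of five made-up individuals.
ages = [23, 35, 41, 58, 62]
print(noisy_count(ages, lambda age: age >= 40, epsilon=0.5))
```

A smaller `epsilon` gives stronger privacy and noisier answers. A full system would also need to track the privacy budget consumed across queries, which this sketch leaves out.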

Scalable Hyperlink Store
The Scalable Hyperlink Store is a specialized database for storing the graph induced by web pages and hyperlinks between them. It is designed to be highly scalable (i.e. capable of holding the entire graph induced by the Bing corpus) and to allow microsecond-range access to nodes and edges in that graph. Performance is achieved by maintaining a highly compressed representation of the graph in memory, while scalability is achieved by distributing the graph over a cluster of machines.  A paper describing the architecture appeared at HT 2009.
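
The page does not say which compression scheme the store uses. As a hedged illustration of how an adjacency list can be held compactly in memory, the sketch below uses a common technique, gap encoding of sorted node IDs followed by variable-length byte encoding. The function names and the integer node IDs are assumptions made for this example.

```python
def encode_adjacency(neighbors):
    """Gap-encode non-negative node IDs into variable-length bytes."""
    out = bytearray()
    previous = 0
    for node in sorted(set(neighbors)):
        gap = node - previous
        previous = node
        # Emit 7 bits per byte; a set high bit means more bytes follow.
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_adjacency(data):
    """Invert encode_adjacency, returning the sorted node IDs."""
    neighbors = []
    previous = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            previous += gap
            neighbors.append(previous)
            gap = 0
            shift = 0
    return neighbors


# Illustrative node IDs: links to nearby IDs produce small gaps.
links = [1000000, 1000003, 1000010, 1000200]
packed = encode_adjacency(links)
assert decode_adjacency(packed) == links
print(len(packed), "bytes for", len(links), "links")
```

Gap encoding pays off when linked pages have nearby IDs, for example when IDs are assigned in URL order so that pages on the same site are numbered close together.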

WISE: Large Scale Web Image Search
Our goal is to build a web-scale content based image retrieval system. We are addressing two major challenges by harnessing the distributed computing power in MSR-SVC: 1) large scale machine learning for image representation, and 2) efficient image indexing and query.  A paper describing this work appeared at CVPR 2009.

Inactive Projects

Accelerated Link-based Ranking Computations
A class of query-independent link-based ranking algorithms operates on the entire web graph, or at least on the fraction that is known to a search engine. Moreover, many of these algorithms use iterative methods to compute a fixed point of the ranks of all known pages. Given the enormous size of typical search engine corpora, this computation can be challenging. We have developed algorithms for greatly accelerating such computations, and have implemented them at scale. We are in the process of exploring further algorithmic and implementation optimizations. A paper describing this work appeared at WWW 2005.
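
For readers unfamiliar with the kind of fixed-point computation described here, the following Python sketch shows plain, unaccelerated PageRank by power iteration on a toy graph. It illustrates the baseline computation only, not the accelerated algorithms developed in this project; the function name, the damping factor of 0.85, and the convergence tolerance are assumptions.

```python
def pagerank(out_links, damping=0.85, tolerance=1e-10, max_iterations=200):
    """Compute PageRank by power iteration over a dict of out-link lists."""
    nodes = sorted(
        set(out_links) | {v for targets in out_links.values() for v in targets}
    )
    n = len(nodes)
    if n == 0:
        return {}
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iterations):
        # Rank held by pages without out-links is spread over all pages.
        dangling = sum(rank[u] for u in nodes if not out_links.get(u))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {u: base for u in nodes}
        for u, targets in out_links.items():
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
        change = sum(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if change < tolerance:
            break  # The ranks have reached an approximate fixed point.
    return rank


# Toy graph for illustration: page "c" is linked to by every other page.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))
```

Each iteration touches every edge once, which is why the cost becomes significant on graphs with billions of edges.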

Investigation of Blogs
Given the rising popularity of blogging, we are currently investigating the impact of blogs on the web at large. In particular, we are interested in the evolution of blogs and their interconnectedness.

Nocturnal
Nocturnal is a system that provides automated information sharing between Messenger users. The Nocturnal application scenario is a collaborative web search tool. A paper describing this work appeared at P2P 2007.

Query Log Mining
Query logs contain vast amounts of useful information: They indicate current trends in user interests, can be used to guide automatic spelling correction, and are a source of associated queries. We have mined query logs for temporal patterns, and have found that queries with similar temporal distributions tend to be semantically related.  A paper describing this work appeared at WWW 2005.
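
One way to compare the temporal distributions of two queries is to correlate their frequency time series. The sketch below computes the Pearson correlation coefficient over equal-length series of per-period query counts. The choice of this particular measure, the function name, and the sample counts are assumptions made for illustration.

```python
from math import sqrt


def temporal_correlation(series_a, series_b):
    """Pearson correlation between two equal-length query-count series."""
    if len(series_a) != len(series_b) or len(series_a) < 2:
        raise ValueError("series must have equal length of at least 2")
    n = len(series_a)
    mean_a = sum(series_a) / n
    mean_b = sum(series_b) / n
    covariance = sum((a - mean_a) * (b - mean_b) for a, b in zip(series_a, series_b))
    variance_a = sum((a - mean_a) ** 2 for a in series_a)
    variance_b = sum((b - mean_b) ** 2 for b in series_b)
    if variance_a == 0 or variance_b == 0:
        return 0.0  # A constant series carries no temporal signal.
    return covariance / sqrt(variance_a * variance_b)


# Made-up daily counts for two hypothetical queries that peak together.
query_one = [10, 12, 90, 85, 11, 9, 10]
query_two = [3, 4, 40, 38, 5, 3, 4]
print(temporal_correlation(query_one, query_two))
```

Queries whose counts rise and fall together score close to 1, and those pairs are the candidates for being semantically related.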

Studies of Web Evolution
We have studied the evolution of several aspects of the web at large. In particular, we have investigated how much web pages change over time, and how stable sets of near-identical web pages are over time. A solid understanding of these properties is beneficial to search engines, which typically refresh their index continuously and need to decide on recrawl policies that optimize content freshness and index quality under bandwidth constraints. Papers describing this work appeared at WWW 2003 and LAWEB 2003.

Tie-aware Information Retrieval Performance Measures
We have studied the problem of measuring the retrieval performance of Information Retrieval Systems (such as web search engines) in the presence of tied scores. Tied scores arise commonly when one tries to assess the performance of a single discrete feature, such as in-link count, query-result click-through, or page visits. Standard definitions of most performance measures do not consider the possibility of ties. We have defined tie-aware variants of six common measures (precision, recall, F-measure, mean average precision, mean reciprocal rank, and normalized discounted cumulative gain) that conceptually average performance over all well-sorted permutations of a result-vector, and yet are about as efficient as the standard, tie-oblivious versions. You can download C# implementations of both tie-oblivious and tie-aware versions of these measures.  A paper describing this work appeared at ECIR 2008.
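
As an illustration of the idea of averaging over all orderings consistent with the scores, here is a minimal Python sketch of tie-aware precision at rank k. It is not the downloadable C# implementation mentioned above; the function name and the input layout, a list of (score, is_relevant) pairs, are assumptions. The sketch relies on the fact that, under a uniformly random ordering within a tie group, each member is equally likely to fill any slot the group is given in the top k.

```python
from itertools import groupby


def tie_aware_precision_at_k(scored_results, k):
    """Expected precision at k over all orderings that respect the scores."""
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(scored_results, key=lambda pair: pair[0], reverse=True)
    expected_relevant = 0.0
    position = 0  # Number of results ranked strictly above the current group.
    for _, group in groupby(ranked, key=lambda pair: pair[0]):
        group = list(group)
        slots = min(len(group), k - position)
        if slots <= 0:
            break
        relevant = sum(1 for _, is_relevant in group if is_relevant)
        # Each of the group's slots holds a relevant result with
        # probability relevant / len(group).
        expected_relevant += slots * relevant / len(group)
        position += len(group)
    return expected_relevant / k


# Illustrative data: the two results with score 2 are tied.
results = [(3, True), (2, True), (2, False), (1, False)]
print(tie_aware_precision_at_k(results, 2))  # prints 0.75
```

With k = 2, the tied pair competes for the second slot: one ordering yields precision 1.0 and the other 0.5, so the average is 0.75. The computation makes a single pass over the sorted results, in line with the claim that tie-aware measures are about as efficient as the standard ones.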

Web Crawling
Commercial search engines such as Bing assemble their corpus in a biased fashion, by attempting to maintain a fresh copy of high-quality pages. There are situations where the biased nature of such a corpus is undesirable, e.g. when studying statistical properties of portions of the web that search engines are biased against. To enable such studies, we have built a highly customizable, yet high performance research web crawler.

Web Spam Detection
Given the significant amount and fast growth of web-based commerce and the crucial role of search engines in directing monetizable traffic to web sites, it is not surprising that some web site operators try to improve their traffic by publishing web pages that are targeted at search engines, but are useless to human viewers. This practice is commonly known as web spam. Web spam is a nuisance for both web searchers and search engines. We are investigating heuristics for identifying spam web pages.  Papers describing this work appeared at WebDB 2004, SIGIR 2005, and WWW 2006.

Microsoft Product Engagement

We have been closely collaborating with Bing in the design and implementation of their algorithmic search service, starting from the earliest product planning stages. We have contributed architectural designs, significant portions of production code, as well as prototypes, algorithms and consulting services. Ongoing collaborations include designing and building distributed systems infrastructure as well as exploring new algorithms for result ranking, web spam detection, keyword auctions, etc.

Contributors
Mark Manasse

Dennis Fetterly

Frank McSherry

Marc Najork

Rina Panigrahy