Web Search and Data Mining (Silicon Valley)

We are conducting numerous projects aimed at improving web search. Our projects range from developing core systems infrastructure, to developing novel algorithms and heuristics for ranking and classifying web pages, to study basic properties of the web at large, to mining query logs for temporal patterns.

Active Projects
  • Automatic Text Pop-Up for Web Images
    We create Bing homepage style text pop-ups automatically for most of the interesting images on the web
  • Entwining Structure into Web Search
    In web search today, a user types a few keywords and gets back links to web pages consisting of unstructured data. This leaves a lot to be desired for when there is structure data stores or the user includes some structural semantics in their query. With our work we aim to allow web results to include information from structured data sources ranging from fully relational databases, to flat tables and XML files to hidden data accessible only via web forms. Additionally, we aim to automaticall
  • Privacy Integrated Queries (PINQ)
    Privacy Integrated Queries is a LINQ-like API for computing on privacy-sensitive data sets, while providing guarantees of differential privacy for the underlying records. The research project is aimed at producing a simple, yet expressive language about which differential privacy properties can be efficiently reasoned and in which a rich collection of analyses can be programmed.
  • Social Algorithms at MSR-SVC
    This is an umbrella project for several related efforts at Microsoft Research Silicon Valley in the areas of social network analysis and human computation.
  • WISE: Large Scale Web Image Search and Exploration
    Our goal is to build a web-scale content based image retrieval system. We are addressing two major challenges by harnessing the distributed computing power in MSR-SVC: 1) large scale machine learning for image representation and , 2) efficient image indexing and query. The following are several ongoing projects within WISE.

Other Work

Accelerated Link-based Ranking Computations
A class of query-independent link-based ranking algorithms operate on the entire web graph, or at least on the fraction that is known to a search engine. Moreover, many of these algorithms use iterative methods to compute a fixed point of the ranks of all known pages. Given the enormous size of typical search engine corpora, this computation can be computationally challenging. We have developed algorithms for greatly accelerating such computations, and have implemented them at scale. We are in the process of exploring further algorithmic and implementation optimizations. A paper describing this work appeared at WWW 2005.

Investigation of Blogs
Given the rising popularity of blogging, we are currently investigating the impact of blogs on the web at large. In particular, we are interested in the evolution of blogs and their interconnectedness

Link-based Ranking Features
We have investigated the effectiveness of various link-based features for ranking web search results, using a large web graph with close to 3 billion nodes and close to 18 billion edges, and a sizeable test set consuisting of more than 28000 queries with partially labeled results. We compared the performance of query-independent link-based features such as web page in-degree and PageRank to that of query-dependent features such as HITS, SALSA and others. Papers describing our findings appeared at SIGIR 2007, CIKM 2007, WAW 2007, CIKM 2008 and WSDM 2009.

Query Log Mining
Query logs contain vast amounts of useful information: They indicate current trends in user interests, can be used to guide automatic spelling correction, and are a source of associated queries. We have mined query logs for temporal patterns, and have found that queries with similar temporal distributions tend to be semantically related. A paper describing this work appeared at WWW 2005.

Scalable Hyperlink Store
The Scalable Hyperlink Store is a specialized database for storing the graph induced by web pages and hyperlinks between them. It is designed to be highly scalable (i.e. capable of holding the entire graph induced by the Bing corpus) and to allow microsecond-range access to nodes and edges in that graph. Performance is achieved by maintaining a highly compressed representation of the graph in memory, while scalability is achieved by distributing the graph over a cluster of machines. A paper describing the architecture appeared at HT 2009.

Tie-aware Information Retrieval Performance Measures
We have studied the problem of measuring the retrieval performance of Information Retrieval Systems (such as web search engines) in the presence of tied scores. Tied scores arise commonly when one tries to assess the performance of a single discrete feature, such as in-link count, query-result click-through, or page visits. Standard definitions of most performance measures do not consider the possibility of ties. We have defined tie-aware variants of six common measures (precision, recall, F-measure, mean average precision, mean reciprocal rank, and normalized discounted cumulative gain) that conceptually average performance over all well-sorted permutations of a result-vector, and yet are about as efficient as the standard, tie-oblivious versions. You can download C# implementations of both tie-oblivious and tie-aware versions of these measures. A paper describing this work appeared at ECIR 2008.

Web Spam Detection
Given the significant amount and fast growth of web-based commerce and the crucial role of search engines in directing monetizable traffic to web sites, it is not surprising that some web site operators try to improve their traffic by publishing web pages that are targeted at search engines, but are useless to human viewers. This practice is commonly known as web spam. Web spam is a nuisance for both web searchers and search engines. We are investigating heuristics for identifying spam web pages. Papers describing this work appeared at WebDB 2004, SIGIR 2005, and WWW 2006.