Microsoft Research announced the twelve recipients of the Microsoft Live Labs: Accelerating Search in Academic Research 2006 awards, totaling $500,000 (USD) in funding. The objective of this award is to support Live Labs’ collaboration with the academic research community, with a focus on the Internet Search research area. Specifically, the award directly addresses the need for large-scale data by making additional real-world search data available to academia. In doing so, Microsoft seeks to further encourage academic research and innovation in search: by increasing the availability of relevant, large, and current data sets from MSN Search, it will support new data analysis and algorithm development in Internet Search.
We propose to use the query logs and click-through data to analyze and visualize the interaction between user behavior, the distribution of content, and search engine ranking. In particular, we will analyze the completeness of the information retrieved by the search engine user, an important factor when, for example, the query is health-related. We will extract the most common health-related queries and use both human experts and natural language processing to identify key facts located on the Web pages returned by the search engine. We will then correlate the search engine ranking with the completeness of information on each page. Our main goal is to develop a visualization tool that shows the distribution of information among the search results, the links between the results, and the user click-throughs. The visualization tool will both contribute to our understanding of information-seeking behavior and enable search engine developers and Web site designers to pinpoint the difficulties users have in finding comprehensive information.
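As a toy illustration of the kind of analysis proposed here, the sketch below correlates search-engine rank with a per-page “completeness” score (the fraction of key facts a page covers). The scores are hypothetical stand-ins for the human- and NLP-derived annotations described above.

```python
def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def rank_vs_completeness(completeness):
    """Correlate search-engine rank (1 = top result) with the fraction
    of key facts covered by each result page.  A strongly negative value
    would mean higher-ranked pages are also the more complete ones."""
    ranks = list(range(1, len(completeness) + 1))
    return pearson(ranks, completeness)
```

For example, `rank_vs_completeness([0.9, 0.7, 0.5])` is -1.0: completeness falls off perfectly as rank worsens.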
The flood of queries coming into a search engine represents a slice of the collective consciousness of Internet users. Events in this stream, when properly detected and aggregated, can be used to explain current happenings and generate leading indicators to predict future events. We are working on Vinegar*, a system capable of analyzing streams of search data to find correlations and causal inferences. Our goal is that Vinegar be able to accurately generate useful indicators in near real-time, through both automatic and manually-guided means. By analyzing search logs in conjunction with other temporal information (such as news events or blog posts), we hope to understand how query behavior is affected by external events and, conversely, how aggregate search behavior can be predictive of events and trends in other domains.
*The name Vinegar comes from the observation that months before SARS hit the world newspapers, and even before the disease was acknowledged by the larger Chinese medical community, the affected population of the Guangdong province in China began buying out supplies of white vinegar, a local folk remedy.
The goal of our proposed project is to dramatically improve the quality of complex search and aggregation tasks over text and semi-structured data by annotating and exploiting entities and relations. We will explore several means to this end. First, we wish to devise algorithms which, guided by query log analysis, will create and maintain catalogs of entities, attributes, and relations. Second, we plan to unify and extend existing information extraction and integration techniques for cross-site, cross-page annotations that combine links, layout, and text. Third, we plan to design practical, compact and efficient indexes that support queries combining keywords with structures in a knowledge base or ontology. Fourth, we want to invent scoring functions that span linear text, 2D layouts, and graphical knowledge bases, and that can be trained automatically through relevance feedback.
In recent years, the Web has rapidly deepened with the prevalence of databases online. While the “surface Web” links billions of static HTML pages, a far more significant amount of information is hidden in the “deep Web,” behind the query forms of searchable databases. Because the deep Web is largely invisible to current search engines, users’ search requests do not reach this uncharted territory. This proposal aims at opening up the deep Web by extending users’ Web search beyond scratching the surface Web (as currently covered) into the deep Web. We aim to provide a Deep Web Search System that directs users to online query forms as “dynamic links” into the deep Web, indicating not only where these “doors” are, but also what might lie behind them. We will develop this facility in the context of the overall MetaQuerier project.
Many queries, particularly “content-based” Web queries, contain terms that are difficult to match directly with documents. We believe that many of these important terms are in fact instances, examples, or more specific forms of query terms which we call “meta-terms.” Transforming queries using replacements or expansions for these terms can make a substantial difference to performance. In this research, we will use both the Microsoft query logs and the TREC GOV2 collection to develop techniques to discover meta-terms in queries and then mine related words from the Web. The meta-term dictionary developed using these techniques will then be used to carry out retrieval experiments and to test various approaches to query reformulation or transformation. Evaluation will be done with the query log and click-through data, and the TREC data will provide some solid baseline performance figures.
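A minimal sketch of the query-transformation step is shown below. The dictionary here is a hand-written stand-in; the real meta-term dictionary would be mined from the query logs and the Web as described above.

```python
# Hypothetical meta-term dictionary: each meta-term maps to example
# instances or more specific forms.  In the proposed research this
# would be mined automatically, not hand-written.
META_TERMS = {
    "browser": ["firefox", "opera", "internet explorer"],
    "painkiller": ["aspirin", "ibuprofen", "acetaminophen"],
}

def expand_query(query):
    """Replace each recognized meta-term with a disjunction of the
    term and its instances; leave other terms untouched."""
    expanded = []
    for term in query.lower().split():
        instances = META_TERMS.get(term)
        if instances:
            expanded.append("(" + " OR ".join([term] + instances) + ")")
        else:
            expanded.append(term)
    return " ".join(expanded)
```

A query such as “download browser” would thus be rewritten so that documents mentioning specific browsers, not just the word “browser,” can match; the retrieval experiments would then compare such transformed queries against the originals.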
The Web has become a battleground for control over search engine results. Search providers continually work to improve the quality of their product, while marketers strive for ever increasing visibility. Web link analysis is now well-targeted by search engine marketers, and so “web spam” has become increasingly visible in Web search. In this project, we incorporate a number of measures of trust and distrust to improve estimates of Web page and site authority, reducing or eliminating the effect of Web spam in the process.
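One well-known way to turn a small set of trusted seed pages into graph-wide trust estimates is a TrustRank-style propagation over the link graph. The sketch below is a generic illustration of that idea under assumed inputs, not this project’s actual measures of trust and distrust.

```python
def propagate_trust(links, trusted_seeds, damping=0.85, iters=50):
    """Propagate trust from a seed set over a link graph, in the spirit
    of TrustRank.  `links` maps each page to its list of outlinks;
    `trusted_seeds` is a set of hand-verified trustworthy pages."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    seed_score = 1.0 / len(trusted_seeds)
    trust = {p: (seed_score if p in trusted_seeds else 0.0) for p in pages}
    for _ in range(iters):
        # Teleportation mass goes only to the trusted seeds, so pages
        # unreachable from the seeds (e.g. spam farms) accumulate no trust.
        new = {p: (1 - damping) * (seed_score if p in trusted_seeds else 0.0)
               for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * trust[p] / len(outs)
                for q in outs:
                    new[q] += share
        trust = new
    return trust
```

On a toy graph where a spam cluster links only to itself, its pages end up with zero trust, while pages linked from the seed set score highly.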
Our aim is to model users, their relationships, and the information they seek, using the query logs provided by Microsoft Research Live Labs. We will use advanced methods from statistical machine learning, focusing particularly on fast approximate inference algorithms so that we can make efficient use of the vast data sets provided. Some of our specific aims include identifying trend-setters (users whose queries anticipate those of others), multi-task collaborative learning (leveraging other users to help personalized search), time series predictive modeling of click-through (predicting the next query and clicked page), and identifying clusters of users, of queries, and their network structure.
You might have bought something on eBay and left a short feedback posting summarizing your interaction with the seller, such as “Lightning fast delivery! Sloppy packaging, though.” Similarly, you might have visited Amazon and written a review for the latest digital camera you bought, such as “The picture quality is fantastic, but the shutter speed lags badly.” The Internet has facilitated many such information exchanges between buyers and sellers. For example, the exchange of news, personal viewpoints and opinions, product reviews, and purchase decisions are all being strengthened and extended in the context of electronic markets. What is the economic value of these comments? Increasingly, these information exchanges have a business impact that is reflected in one or more measurable economic variables (for example, product sales, price premiums, profits). The comment about “lightning fast delivery” can enhance a seller’s reputation and thus allow the seller to raise the price of listed items by a few cents without losing any sales. On the other hand, the feedback about “sloppy packaging” can have the opposite effect on a seller’s pricing power. Similarly, online reviews and conversations in blogs affect customers’ perception of the quality of different products, which in turn can affect total sales for those products. Given the high volume of transactions completed on Internet-based electronic markets, this can lead to a substantial change in firms’ profitability. This research studies the “economic value of text” in such online settings, focusing on three important and varied categories of information exchanges: reputation systems in electronic markets, product recommendations in online communities, and the impact of social media (search engines, wikis, and blogs) on sales.
This research program combines established techniques from economics with text-mining algorithms from computer science to measure the economic value of each text snippet and to understand how the textual content in these systems influences economic exchanges between agents in electronic markets.
The Internet has changed the way people look for information. Users now expect the answers to their questions to be available through a simple Web search. Web search engines are increasingly efficient at identifying the best sources for any given keyword query and are often able to identify the answer within those sources. Unfortunately, many Web sources are not trustworthy, because of erroneous, misleading, biased, or outdated information. In many cases, users are not satisfied with — or do not trust — the results from any single source and prefer checking several sources for corroborating evidence. The goal of this project is to provide an interface that aggregates query results from different sources in order to save users the hassle of individually checking query-related Web sites to corroborate answers. In addition to listing the possible query answers from different Web sites, the interface ranks the results based on the number, and importance, of the Web sources reporting them. The existence of several sources providing the same information is then viewed as corroborating evidence, increasing the quality of the corresponding information.
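The aggregation step can be sketched as follows; `source_weight` is a hypothetical stand-in for whatever importance measure (for example, a link-based authority score) is ultimately used, with unlisted sources defaulting to weight 1.0.

```python
def corroborated_answers(results, source_weight):
    """Rank candidate answers by corroborating evidence.

    results: list of (source, answer) pairs extracted from the Web.
    source_weight: dict mapping a source to its importance.
    Returns (answer, score) pairs, best-corroborated first."""
    scores = {}
    for source, answer in results:
        scores[answer] = scores.get(answer, 0.0) + source_weight.get(source, 1.0)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

If two ordinary sources and one authoritative source report “Paris” while a single ordinary source reports “Lyon,” “Paris” is ranked first with the higher corroboration score.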
Web retrieval systems will be more effective if they dynamically adapt to the user’s information need according to how other users have responded to the same documents when they were returned in response to the same or similar previous queries. Access to the Microsoft query log and click-through data will allow us to explore this conjecture. We plan to use the data to construct synthetic “user sessions” in which queries are combined with the matching click-throughs to establish a sequence of operations for presumed single-topic searches. We will then retrieve the clicked pages and judge them against our belief as to the nature of the underlying information need. It will then be possible to investigate the extent to which subsequent issuers of the same or similar queries could be given improved retrieval effectiveness, assuming a range of possible user models as indicated by the click-through information from earlier instances of that query. Finally, once we have built a model based on the synthetic “sessions” extracted from the Microsoft logs, we will carry out an experiment in which groups of users use (or do not use) an enhanced system that exploits previous click-through information to bias ranking orderings. Search in this experiment will be performed via the MSN Search Software Development Kit.
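A minimal sketch of the session-construction step, assuming log records of the form (user, timestamp, query, clicked_url) and an inactivity gap as the session boundary; the 30-minute threshold is a common but arbitrary choice, not one specified in the proposal.

```python
from collections import defaultdict

def build_sessions(log, gap=30 * 60):
    """Group log records (user, timestamp, query, clicked_url) into
    synthetic per-user sessions, splitting whenever more than `gap`
    seconds elapse between consecutive events."""
    by_user = defaultdict(list)
    for user, ts, query, url in sorted(log, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, query, url))
    sessions = []
    for user, events in by_user.items():
        current = [events[0]]
        for prev, ev in zip(events, events[1:]):
            if ev[0] - prev[0] > gap:
                sessions.append((user, current))
                current = []
            current.append(ev)
        sessions.append((user, current))
    return sessions
```

Each resulting session approximates a single-topic search episode whose clicked pages can then be retrieved and judged as described above.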
Social bookmark tools like del.icio.us are rapidly emerging on the Web. Unlike link-based search approaches à la PageRank, these systems provide personal recommendations based on input from similar users. This new paradigm will change the way we interact with the Web within the next few years; in particular, it will require corresponding search functionality. Furthermore, these systems are more responsive to emerging topics, which can thus be discovered and actively promoted earlier. We will therefore extend link-based search with social search, in order to provide enhanced functionality and multiple search paradigms for the Web.
Search accuracy is closely related to how precise and discriminative a user’s query is. Unfortunately, it is generally difficult for a user to know in advance whether a particular query will be effective, because of problems such as the ambiguity of many terms and possible mismatch between the terms used by document authors and by the user. As a result, a user often needs to refine a query iteratively, many times, before eventually reaching one that returns useful results — a process that is not only time consuming, but also often requires a great deal of knowledge about the topic. However, for various reasons, different people may look for similar information, and if some users have already gone through the process of refining queries about a topic, we should be able to exploit their experience to benefit other users who are searching for similar information, an approach we refer to as “collaborative search.” The goal of this project is to develop techniques to extract query refinement patterns from the query/click log data collected by a search engine to support collaborative search. The query/click log data, including users’ queries and viewed documents, contains much valuable knowledge about query refinement accumulated from many users and across all kinds of topics. We will apply statistical language models and text data mining techniques to elicit such knowledge and exploit it to refine a user’s query automatically or to enable a user to refine a query more effectively. The techniques to be developed would enable a Web search engine to improve its search performance automatically over time as more user information is collected.
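In its simplest form, the pattern-extraction step counts consecutive query pairs within user sessions. The sketch below is a deliberately naive baseline for the statistical language-modeling and text-mining techniques described above; the function names and session format are illustrative assumptions.

```python
from collections import Counter

def refinement_patterns(sessions):
    """Count consecutive query pairs (q_i -> q_{i+1}) observed within
    each session as candidate refinement patterns.  `sessions` is a
    list of query sequences, one per user session."""
    pairs = Counter()
    for queries in sessions:
        for a, b in zip(queries, queries[1:]):
            if a != b:
                pairs[(a, b)] += 1
    return pairs

def suggest_refinement(query, patterns):
    """Suggest the refinement most often observed after `query`."""
    candidates = [(count, b) for (a, b), count in patterns.items() if a == query]
    return max(candidates)[1] if candidates else None
```

If many users refined “jaguar” into “jaguar car,” that rewrite becomes the suggested refinement for later users issuing the same ambiguous query.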