North Carolina’s Research Triangle Park often plays host to scientific conferences. From April 26-30, the city of Raleigh and the Raleigh Convention Center will be the venue for the 2010 World Wide Web Conference (WWW2010), an event that brings together some of the most influential thinkers about the Web. Organized by the International World Wide Web Conferences Steering Committee, discussion topics for the conference have evolved since its inception in 1994 to reflect the Web’s progress and its impact on culture and technology.
One of the conference’s keynote addresses will be presented by danah boyd, researcher at Microsoft Research New England. She will discuss the shifting definitions of privacy in the public landscape and the complex, intertwined ways that privacy and publicity operate in social networks. boyd’s talk will provide a framework for understanding how privacy and publicity are changing and the implications for designers, developers, scholars, and participants.
During WebSci10, a collocated conference being held April 26-27 in Raleigh, Jennifer Chayes, managing director of Microsoft Research New England, will speak on dynamical random networks and how such network models have become increasingly appropriate for describing relationships in disciplines from economics to medicine.
The WWW theme this year is “openness,” a topic relevant to many areas of Web research. As the Web grows in reach and size, one of the constant challenges of maintaining an open environment is enabling people to find information in a friendly, efficient, timely manner.
“What is very exciting for us this year,” says Yi-Min Wang, director of the Internet Services Research Center (ISRC), “is that all six papers describe work that has contributed to the Microsoft Online Services division, most notably Bing Search. They are either in the current version of Bing or have been recognized as important longer-term work that will influence future development.”
The ISRC was created in 2007 as an applied research group within Microsoft Research, with a mandate to accelerate technology transfer from research to product groups. Collaboration with the ISRC has enabled the Bing team to apply research expertise and pursue higher-risk projects, while researchers gain access to engineering support and field data to test their algorithms.
“It is a great experience to work so closely with Microsoft Research,” says Zijian Zheng, development manager with Bing. “Soon after the researchers arrive at definitive conclusions, we have new solutions ready for deployment. It’s the ultimate example of agile R&D.”
Kuansan Wang, principal researcher from the ISRC, agrees.
“As search engines utilize increasingly sophisticated mathematical solutions,” Wang says, “we find the gap shrinking between research and product. Joint efforts are becoming commonplace and, in our view, necessary.”
One of the more user-visible contributions to the Bing search engine is all about friendlier results: Automatic Extraction of Clickable Structured Web Contents for Name Entity Queries, a paper written by Xiaoxin Yin, Wenzhao Tan, and Yi-Chin Tu of the ISRC, along with Xiao Li of Microsoft Research Redmond. Search engines deliver results as a list of “snippets,” which users read and click if the snippet’s text appears to represent a Web page that fulfills the intent of their query. How much more useful would it be if users could see results that provided more direct answers?
The challenge for the team was to extract highly relevant structured information in an automated, scalable manner. Their approach was to mine Web pages and logs of search trails to identify structured data and post-search user behavior. This enabled them to extract more structured data to help answer queries. As a result, instead of seeing just a list of Web pages that contain the name “Beethoven,” users can get a page of results topped by a link to the composer’s bio, photos, and a brief list of recordings.
Another project that contributed to more user-friendly, direct answers is Building Taxonomy of Web Search Intents for Name Entity Queries, by Yin of the ISRC and Sarthak Shah of the Bing team. They leveraged the fact that the majority of Web-search queries involve names of actors, musicians, celebrities, places, and businesses. Furthermore, users generally intend to search for categories of information, such as a celebrity’s biography, movies, albums, or live appearances. Yin and Shah wanted to devise an automatic way to recognize generic search intents for name-entity queries and accurately classify them. This would enable a search engine to deliver results organized by popular user intent.
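As a rough illustration of what a “generic search intent” might look like in query data, the following minimal Python sketch strips a known name from the front of each query and counts what remains. It is not the method from the paper: the function name intent_counts, the sample queries, and the prefix-only matching are all assumptions made for the example.

```python
from collections import Counter


def intent_counts(queries, entity_names):
    """Count the words that follow a known entity name in each query.

    Hypothetical simplification: only queries that START with a known
    name are considered, and the remainder is treated as the intent.
    """
    # Try longer names first so "new york city" wins over "new york".
    names = sorted((n.lower() for n in entity_names), key=len, reverse=True)
    counts = Counter()
    for query in queries:
        text = query.lower().strip()
        for name in names:
            if text.startswith(name + " "):
                counts[text[len(name):].strip()] += 1
                break
    return counts


# Illustrative data only, not taken from any real search log.
sample_queries = [
    "Beethoven biography",
    "Mozart biography",
    "Beethoven symphonies",
    "Madonna albums",
    "Madonna biography",
    "weather today",
]
sample_names = ["Beethoven", "Mozart", "Madonna"]
intents = intent_counts(sample_queries, sample_names)
top_intent = intents.most_common(1)[0]
```

In this toy data, “biography” is shared across three different names, which is the kind of name-independent intent the paragraph describes.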
The quest for better search data led to Large-Scale Bot Detection for Search Engines, jointly authored by Hongwen Kang of Carnegie Mellon University; Kuansan Wang; and David Soukal, Fritz Behr, and Zijian Zheng from the Bing team. To prevent skewed results, researchers must filter out queries submitted by automated Web bots before attempting to mine search logs. Auto-queries originate from any number of sources, from malicious users seeking vulnerabilities in servers to search-engine-optimization companies testing their latest schemes. Traditional approaches for identifying bot queries have been based on heuristic models that required manually annotated training data samples, an expensive, time-consuming effort that grows more unwieldy as search-log volumes increase. The team noticed that human users will click through a few URLs on a list, refine their search, or perform some other observable action. Bots, however, exhibit none of these behaviors. The researchers took advantage of this fact to devise a learning approach for distinguishing bot-generated Web-search traffic from that of real users, resulting in a more efficient method of generating training data.
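The behavioral signal in that observation can be sketched in a few lines of Python. This is a minimal illustration, not the authors’ learning approach: the Session fields, the label_session function, and the min_queries threshold are hypothetical, chosen only to show how sessions might be labeled automatically to seed training data.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical search-log session; field names are illustrative."""
    session_id: str
    queries: list = field(default_factory=list)  # query strings, in order
    clicks: int = 0                              # result clicks in session


def label_session(session, min_queries=20):
    """Return 'human', 'bot', or 'unknown' from simple behavioral rules.

    Assumed rules (not from the paper): any click suggests a human;
    many queries with no clicks at all suggests automation.
    """
    if session.clicks > 0:
        return "human"
    if len(session.queries) >= min_queries:
        return "bot"
    return "unknown"


# Illustrative sessions only.
sessions = [
    Session("a", ["beethoven", "beethoven symphony 9"], clicks=2),
    Session("b", [f"site:example.com page {i}" for i in range(50)], clicks=0),
    Session("c", ["weather"], clicks=0),
]
labels = {s.session_id: label_session(s) for s in sessions}
```

Sessions labeled "unknown" would be left out of the training set rather than guessed at, which keeps the automatically generated labels conservative.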
Optimal Rare Query Suggestion with Implicit User Feedback, which has longer-term implications for search techniques, is authored by Yang Song and Li-wei He, both of the ISRC. The paper offers a framework for more effective searches on rare queries. Because query logs for rare queries contain less click information than those for popular queries, it is much harder to provide relevant query suggestions to help users refine their searches. In the model described in the paper, the researchers use both clicked and skipped URLs. The information is combined to correlate rare queries with similar search intent in an optimal way, resulting in an approach that delivers better accuracy and performance than previous models for handling rare queries.
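To make the idea of combining clicked and skipped URLs concrete, here is a small Python sketch. It is an assumption-laden stand-in, not the paper’s model: it uses plain Jaccard overlap and a fixed, hypothetical weight alpha, whereas the paper describes an optimal combination.

```python
def jaccard(a, b):
    """Jaccard overlap of two URL collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def combined_similarity(q1, q2, alpha=0.7):
    """Blend click-based and skip-based overlap between two queries.

    alpha is a hypothetical hand-set weight used for illustration only.
    """
    click_sim = jaccard(q1["clicked"], q2["clicked"])
    skip_sim = jaccard(q1["skipped"], q2["skipped"])
    return alpha * click_sim + (1 - alpha) * skip_sim


# Illustrative feedback records only.
rare = {"clicked": ["a.com"], "skipped": ["b.com", "c.com"]}
candidates = {
    "x": {"clicked": ["a.com", "d.com"], "skipped": ["b.com", "c.com"]},
    "y": {"clicked": ["e.com"], "skipped": ["b.com"]},
}
scores = {name: combined_similarity(rare, c) for name, c in candidates.items()}
best = max(scores, key=scores.get)
```

The point of the sketch is that candidate "y" shares no clicks with the rare query, yet the skipped URLs still provide some evidence for comparing the two.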
While other ISRC papers at WWW2010 deal with specific search problems, Distributed Non-Negative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce addresses computational infrastructure. Co-authors Chao Liu, Hung-chih Yang, Jinliang Fan, Li-wei He, and Yi-Min Wang studied the problem of large-scale computations on distributed computer clusters, specifically decomposing the matrix data found on the Web, such as term-by-document matrices, which count occurrences of a given term within a document; user-by-tagged-image matrices on photo-sharing sites; or friend relationships on social networks. Non-negative matrix factorization (NMF) is a well-known technique for coping with matrix data, but the scale of matrices on the Web is staggering: Million-by-billion matrices are commonplace. The team presents a new, successful approach to scaling NMF over thousands of machines using Microsoft’s Structured Computations Optimized for Parallel Execution (SCOPE) technology, designed for cloud-scale services.
“Although we are presenting this technique in a Web context,” Liu says, “it can be packaged for other applications where NMF is widely used, such as medical genome/microarray analysis.”
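For readers unfamiliar with NMF, the following Python sketch factors a tiny term-by-document matrix V into non-negative factors W and H using the standard multiplicative update rules. It runs on a single machine and says nothing about how the team partitions the computation across a cluster; the helper names and the toy matrix are assumptions made for the example.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Approximate v (n x m) as w (n x rank) times h (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random start so the updates are well defined.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


def frobenius_error(v, w, h):
    """Frobenius norm of v minus the product w*h."""
    approx = matmul(w, h)
    return sum((v[i][j] - approx[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5


# Toy term-by-document counts: two topics, two documents each.
V = [[2, 2, 0, 0],
     [4, 4, 0, 0],
     [0, 0, 3, 3],
     [0, 0, 1, 1]]
W, H = nmf(V, rank=2)
error = frobenius_error(V, W, H)
```

Every matrix product here touches the whole of V, which is manageable for a 4-by-4 example and is exactly what becomes impractical at the million-by-billion scale the paper targets.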
Results of research from Exploring Web Scale Language Models for Search Query Processing demonstrate that close collaboration between researchers and the Bing team has proved extremely fruitful. Co-authors Jian Huang of Pennsylvania State University; Jianfeng Gao of Microsoft Research Redmond; Xiaolong Li and Kuansan Wang from the ISRC; and Jiangbo Miao and Behr from Bing investigated the differences between the language style of user queries and that of Web documents. Conventional wisdom holds that different parts of a Web document are composed in language styles that exhibit subtle differences. By creating large-scale language models based on Web documents indexed by Bing, the team was finally able to describe these differences in quantitative terms and to study the effectiveness of various query-processing algorithms. The language models have proved valuable in so many other research efforts that Microsoft Research decided to release them to the academic community.
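A toy example can show what it means to compare language styles quantitatively. The Python sketch below trains two small bigram models with add-one smoothing and scores the same query under each. It is only an illustration: the BigramModel class, the two tiny corpora, and the smoothing choice are assumptions, and nothing here reflects the actual Bing-scale models or their results.

```python
import math
from collections import Counter


class BigramModel:
    """Tiny add-one-smoothed bigram model over whitespace tokens."""

    def __init__(self, sentences):
        self.contexts = Counter()  # how often a token precedes another
        self.bigrams = Counter()   # counts of (previous, current) pairs
        vocab = set()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.lower().split()
            vocab.update(tokens[1:])
            self.contexts.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(vocab) + 1  # +1 slot for unseen words

    def log_prob(self, text):
        """Natural-log probability of text under the model."""
        tokens = ["<s>"] + text.lower().split()
        total = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            num = self.bigrams[(prev, word)] + 1
            den = self.contexts[prev] + self.vocab_size
            total += math.log(num / den)
        return total


# Illustrative corpora only: short title-like text vs. body-like text.
title_lm = BigramModel(["beethoven biography",
                        "beethoven symphony recordings"])
body_lm = BigramModel(["ludwig van beethoven was a german composer",
                       "he wrote nine symphonies"])
query = "beethoven biography"
title_score = title_lm.log_prob(query)
body_score = body_lm.log_prob(query)
```

With this hand-picked data the query scores higher under the title-like model, simply because the toy titles were written to resemble it; the value of Web-scale models is that such comparisons can be made on real text.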
This year’s WWW conference is extremely significant for the ISRC team, not only because its papers form part of a large Microsoft contribution, but also because Microsoft Research will be releasing its Web-scale language model to the academic community.
“We are very excited to announce the public beta of the Microsoft Web N-gram Services during WWW 2010,” says Kuansan Wang, who is managing the project. “We invite the whole community to take advantage of this resource for research in Web search, natural language processing, and related areas.”
Currently in beta to selected participants, the Web-scale language model’s algorithms, implementation, and petabytes of data will be made available as a service to the research community via a cloud-based platform. By accessing the model as a service, researchers can avoid the cost and logistics of having to implement the model themselves, can conduct research on real-world Web-scale data sets, and can take advantage of regular data updates for projects that benefit from dynamic data.
“Yes, the ISRC has an applied-research mandate,” Yi-Min Wang says. “We also have the same commitment as any other group within Microsoft Research, which is to advance the state of the art through sharing sound scientific research. I like to think we have succeeded in both areas.”
Technical papers for WWW2010 authored or co-authored by Microsoft Research. Boldfaced names denote Microsoft Research personnel.
Actively Predicting Diverse Search Intent from User Browsing Behaviors
Zhicong Cheng, Bin Gao, Tie-Yan Liu
A Pattern Tree-Based Approach to Learning URL Normalization Rules
Rui Cai, Lei Zhang
Automatic Extraction of Clickable Structured Web Contents for Name Entity Queries
Xiaoxin Yin, Wenzhao Tan, Xiao Li, Yi-Chin Tu
b-Bit Minwise Hashing
Ping Li, Christian Konig
Building Taxonomy of Web Search Intents for Name Entity Queries
Xiaoxin Yin, Sarthak Shah
Paul N. Bennett, Krysta Svore, Susan Dumais
Collaborative Location and Activity Recommendations with GPS History Data
Vincent W. Zheng, Yu Zheng, Xing Xie, Qiang Yang
Cross-Domain Sentiment Classification via Spectral Feature Alignment
Sinno Jialin Pan, Xiaochuan Ni, Jian-Tao Sun, Qiang Yang, Zheng Chen
Distributed Non-Negative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce
Chao Liu, Hung-chih Yang, Jinliang Fan, Li-wei He, Yi-Min Wang
Equip Tourists with Knowledge Mined from Travelogues
Qiang Hao, Rui Cai, Changhu Wang, Lei Zhang
Exploiting Social Context for Review Quality Prediction
Yue Lu, Panayiotis Tsaparas, Alex Ntoulas, Livia Polanyi
Exploring Web Scale Language Models for Search Query Processing
Jian Huang, Jiangbo Miao, Xiaolong Li, Jianfeng Gao, Kuansan Wang
Large-Scale Bot Detection for Search Engines
Hongwen Kang, Kuansan Wang, David Soukal, Fritz Behr, Zijian Zheng
Optimal Rare Query Suggestion With Implicit User Feedback
Yang Song, Li-wei He
Smart Caching for Web Browsers
Zhang Kaimin, Wang Lu, Pan Aimin, Bin Zhu
Statistical Models of Music-Listening Sessions in Social Media
Elena Zheleva, John Guiver, Eduarda Mendes Rodrigues, Natasa Milic-Frayling
Visualizing Differences in Web Search Algorithms Using the Expected Weighted Hoeffding Distance
Mingxuan Sun, Guy Lebanon, Kevyn Collins-Thompson