Workshop on December 12-13, 2008, Whistler, BC, Canada
I will discuss a relatively new problem of selecting the best content to display for a user visit on a site like Yahoo!, MSN, Digg, etc. In particular, I will consider scenarios where the content pool to select from is dynamic, with short article lifetimes. I will provide an in-depth discussion of the modeling challenges (and some of our solutions) that arise in this scenario, viz., (a) estimating click-through rates that exhibit both temporal and positional variations, (b) efficient explore/exploit schemes, and (c) personalization of content to individual users. Finally, I will end with a discussion of some open problems in this area. Throughout, data from a content module published regularly on the Yahoo! Front Page will be used for illustration.
Search engine companies are gathering treasure troves of user-generated data. It has already been shown that such data can be used to directly improve the user's online experience. I will discuss some ideas as to what online search and advertising might look like a few years hence, in light of the algorithms and data we have now. Moving from future to present, I will outline some recent work done by researchers in the Text Mining, Search and Navigation team at Microsoft Research; the work in TMSN touches many aspects of online search and advertising.
Social networking sites such as Orkut, MySpace, Hi5, and Facebook attract billions of visits a day, surpassing the page views of Web search. These social networking sites provide applications for individuals to establish communities, to upload and share documents/photos/videos, and to interact with other users. Take Orkut as an example. Orkut hosts millions of communities, with hundreds of communities created and tens of thousands of blogs/photos uploaded each hour. To assist users in finding relevant information, it is essential to provide effective collaborative filtering tools to perform recommendations such as friend, community, and ad matching. In this talk, I will first describe both the computational and the storage challenges that this information explosion poses for traditional collaborative filtering algorithms. To deal with huge social graphs that expand continuously, an effective algorithm should be designed to 1) run on thousands of parallel machines for sharing storage and speeding up computation, 2) perform incremental retraining and updates for attaining online performance, and 3) fuse information from multiple sources for alleviating information sparseness. In the second part of the talk, I will present algorithms we recently developed, including parallel spectral clustering, parallel FP-Growth, parallel combinational collaborative filtering, parallel LDA, and parallel Support Vector Machines.
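Requirements 1) and 2) both rest on the observation that many collaborative filtering statistics are additive across data shards. A minimal serial sketch of the shard-and-merge pattern (the function names and the co-occurrence statistic are illustrative placeholders; a real deployment distributes the shards across many machines):

```python
from collections import Counter
from itertools import combinations

def shard_counts(sessions):
    """Map step: item co-occurrence counts for one shard of sessions."""
    counts = Counter()
    for items in sessions:
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
    return counts

def merge(shard_results):
    """Reduce step: counts from shards simply add, so the same merge
    also folds in an incremental batch of newly arrived sessions."""
    total = Counter()
    for counts in shard_results:
        total.update(counts)
    return total

sessions = [["a", "b", "c"], ["a", "b"], ["b", "c"]]
full = shard_counts(sessions)
merged = merge([shard_counts(sessions[:2]), shard_counts(sessions[2:])])
```

Because the merge is just addition, retraining on new data touches only the new shard rather than the whole history.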
Wikis and blogs have become enormously successful media for collaborative information creation. Articles and posts accrue information through the asynchronous editing of users who arrive both seeking information and possibly able to contribute information. Most articles stabilize to high quality, trusted sources of information representing the collective wisdom of all the users who edited the article. We propose a model for information growth which relies on two main observations: (i) as an article's quality improves, it attracts visitors at a faster rate (a rich-get-richer phenomenon); and, simultaneously, (ii) the chances that a new visitor will improve the article drop (there is only so much that can be said about a particular topic). Our model is able to reproduce many features of the edit dynamics observed on Wikipedia and on blogs collected from LiveJournal; in particular, it captures the observed rise in the edit rate, followed by a 1/t decay.
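The interplay of the two observations can be sketched in a toy mean-field simulation (my own parameterization for illustration, not the talk's model): visits grow with accumulated quality, while the probability that a visit yields an edit decays with quality, so the edit rate first rises and then falls.

```python
import math

def simulate(steps=2000, v0=1.0, attract=0.5, k=5.0):
    """Mean-field sketch of the model's two forces (toy parameters):
    (i) visits grow with accumulated quality Q (rich get richer);
    (ii) the chance a visit improves the article decays as exp(-Q/k)."""
    Q = 0.0
    rates = []
    for _ in range(steps):
        visits = v0 + attract * Q       # better articles draw more visitors
        p_edit = math.exp(-Q / k)       # less and less is left to add
        rate = visits * p_edit          # expected edits in this step
        Q += rate
        rates.append(rate)
    return rates

rates = simulate()  # edit rate rises early, then decays toward zero
```

The rise-then-decay shape is robust to the toy parameters; matching the observed 1/t tail quantitatively is what the actual model in the talk addresses.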
Machine learning and the Web are a technology and an application area made for each other. The Web provides machine learning with an ever-growing stream of challenging problems, and massive data to go with them: search ranking, hypertext classification, information extraction, collaborative filtering, link prediction, ad targeting, social network modeling, etc. Conversely, seemingly just about every conceivable machine learning technique has been applied to the Web. Can we make sense of this vast jungle of techniques and applications? Instead of attempting an (impossible) exhaustive survey, I will instead try to distill a unified view of the field from our experience to date. By using the language of Markov logic networks - which has most of the statistical models used on the Web as special cases - and the state-of-the-art learning and inference algorithms for it, we will be able to cover a lot of ground in a short time, understand the fundamental structure of the problems and solutions, and see how to combine them into larger systems.
Jason Hartline: Machine Learning, Market Design, and Advertising
Given the complexity of preferences in markets such as keyword advertising, it is hard to believe that the de facto standard decentralized, local, greedy algorithm (advertisers bid for clicks on keywords) is anywhere close to being optimal for any reasonable objective (welfare, profit, etc.). In this talk we consider the market design problem from a global perspective. We make connections between machine learning theory and market design theory, where machine learning design problems closely mirror game-theoretic design problems. We reduce a general theoretical market design problem to a natural machine learning optimization problem. These theoretical results lead to a number of practical answers to advertising market design questions.
There are multiple sources of power available for forming and propelling automobiles; analogously, there are several sources of power for forming and propelling thoughts. Besides the neural ones you're most familiar with, and the Semantic Web ones that have received the lion's share of hype in recent years, there are some additional ones that we are tapping into with some success. These deep semantic representations and operations are able to produce useful, and in some cases even novel, conclusions requiring induction, abduction, and analogy, as well as deductive reasoning. I will illustrate this with case examples from recent Cyc applications, including terrorism scenario generation for intelligence analysts and ad hoc clinical trial question answering for medical researchers.
Web searches tend to be short and ambiguous. It is therefore not surprising that Web query disambiguation is an actively researched topic. However, most existing work relies on the existence of search engine log data in which each user's search activities are recorded over long periods of time. Such approaches may raise privacy concerns and may be difficult to implement for pragmatic reasons. In this work, we present an approach to Web query disambiguation that bases its predictions only on a short glimpse of user search activity, captured in a brief session of about 5--6 previous searches on average. Our method exploits the relations of the current search session in which the ambiguous query is issued to previous sessions in order to predict the user's intentions and is based on Markov logic. We present empirical results that demonstrate the effectiveness of our proposed approach on data collected from a commercial general-purpose search engine.
This talk describes the optimal (revenue maximizing) auction for sponsored search advertising. We show that a search engine's optimal reserve price is independent of the number of bidders. Using simulations, we consider the changes that result from a search engine's choice of reserve price and from changes in the number of participating advertisers.
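The independence claim mirrors the classical Myerson result: the optimal reserve is the zero of the bidder's virtual value, a condition in which the number of bidders never appears. A minimal sketch, assuming a regular value distribution with CDF F and density f (illustrative code, not the talk's derivation):

```python
import math

def optimal_reserve(F, f, lo, hi, iters=100):
    """Bisection on the virtual value phi(v) = v - (1 - F(v)) / f(v).
    The optimal reserve is the zero of phi; note that the number of
    bidders appears nowhere in this computation."""
    phi = lambda v: v - (1.0 - F(v)) / f(v)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if phi(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Uniform[0,1] values: phi(v) = 2v - 1, so the reserve is 0.5 for any n
r_uniform = optimal_reserve(lambda v: v, lambda v: 1.0, 0.0, 1.0)

# Exponential(1) values: phi(v) = v - 1, so the reserve is 1.0 for any n
r_exp = optimal_reserve(lambda v: 1.0 - math.exp(-v),
                        lambda v: math.exp(-v), 0.0, 10.0)
```

Changing the number of participating advertisers changes revenue, but not the reserve that maximizes it.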
We spend a lot of our time online using web search services, but even the leading search engines frequently fail to deliver relevant results for the vague queries that are commonplace among today's web searchers. Interestingly, when we look at the search patterns of like-minded searchers (perhaps friends or colleagues) we do find considerable overlap between their queries and result-selections. This motivates a more collaborative approach to web search, one in which the past search experiences of friends and colleagues can be used to usefully influence our new searches. In this talk we will describe how a novel combination of case-based reasoning, web search, and peer-to-peer networking can be used to develop a platform for personalized web search, which benefits from better-quality results and improved robustness against search spam, while offering an increased level of privacy to the individual user.
Today's applications, like search and social networks, although powerful, are limited by the shortcomings of their underlying data stores. These information stores are application-tuned, siloed, and not configured to deeply understand the content. In this talk, we discuss some ideas on evolving the underlying store into a potent knowledge base. We envision how an evolved web might look in the future: an intelligent knowledge store that integrates information with inference, and a platform that can unleash new killer applications with ease. We outline some practical challenges that have to be solved by the machine learning community to propel this evolution.
We present an online learning framework tailored towards real-time learning from observed user behavior in search engines and other information access systems. In particular, we only require pairwise comparisons, which have been shown to be reliably inferable from implicit feedback [4, 3]. We will present an algorithm with theoretical guarantees as well as simulation results.
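To make the pairwise-comparison setting concrete, here is a toy dueling-bandits-style sketch with made-up preference probabilities (an illustration of the feedback model only, not the algorithm with the guarantees mentioned above): the learner never sees absolute rewards, only noisy outcomes of duels, as when comparisons are inferred from clicks on interleaved result lists.

```python
import random

def dueling_bandit(pref, rounds=5000, seed=0):
    """Learn the best arm from noisy pairwise comparisons only.
    pref[i][j] is the probability that arm i beats arm j in a duel."""
    rng = random.Random(seed)
    k = len(pref)
    wins = [[0] * k for _ in range(k)]
    plays = [[0] * k for _ in range(k)]
    for _ in range(rounds):
        i, j = rng.randrange(k), rng.randrange(k)
        if i == j:
            continue
        if rng.random() < pref[i][j]:
            wins[i][j] += 1          # i won the duel
        else:
            wins[j][i] += 1          # j won the duel
        plays[i][j] += 1
        plays[j][i] += 1
    # empirical Borda score: average win rate against all rivals
    def score(i):
        rates = [wins[i][j] / plays[i][j]
                 for j in range(k) if j != i and plays[i][j] > 0]
        return sum(rates) / len(rates)
    return max(range(k), key=score)

# arm 0 beats each rival with probability 0.7 and should be identified
pref = [[0.5, 0.7, 0.7],
        [0.3, 0.5, 0.6],
        [0.3, 0.4, 0.5]]
best = dueling_bandit(pref)
```

A practical algorithm would of course choose which pairs to duel adaptively rather than uniformly at random; uniform sampling here just keeps the sketch short.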
In the analysis of current online ad auctions, essential parameters such as click-through rates are often assumed to be known. Disregarding the uncertainty that is present in reality leads to several serious problems. In the talk we will highlight two: (i) there is no principled exploration of new ads, and (ii) there is no incentive for advertisers to bid only on well-targeted keywords. In fact, there is an interesting opportunity for very poorly targeted advertisers to exploit this. We present a new auction that solves both problems. The key trick for this auction is that advertisers are requested to submit not only a bid, but also a belief over their own click-through rate.
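One way to see why eliciting a click-through-rate belief can help (an illustrative sketch, not the auction presented in the talk): rank ads by bid times reported CTR, and tie a proper scoring rule to the realized clicks, so that truthful CTR reports minimize an advertiser's expected penalty and inflated claims are unprofitable.

```python
def allocate(ads):
    """Rank ads by expected value per impression: bid * reported CTR."""
    return max(ads, key=lambda a: a["bid"] * a["belief"])

def brier_penalty(belief, clicked):
    """Quadratic (Brier) scoring rule applied to one impression."""
    outcome = 1.0 if clicked else 0.0
    return (belief - outcome) ** 2

def expected_penalty(true_ctr, belief):
    """Expected Brier penalty when the true CTR is true_ctr: as a
    proper scoring rule, it is minimized by reporting the truth."""
    return true_ctr * (belief - 1.0) ** 2 + (1.0 - true_ctr) * belief ** 2

# the lower bid with the higher reported CTR wins the impression
ads = [{"bid": 2.0, "belief": 0.1}, {"bid": 1.0, "belief": 0.3}]
winner = allocate(ads)
```

The reported beliefs also give the auctioneer a handle on exploration: ads whose reported CTRs are still poorly verified can be shown occasionally to test them against realized clicks.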