Tech Talks @ Search Labs

We organize invited talks by distinguished people from academia and industry, usually held on alternate Fridays. This page summarizes past talks and lists upcoming ones.

Contact: Abhimanyu Das




11/08/13  Blurring of the Boundary between Search and Recommendation  Ed H. Chi
10/18/13  Using Buffering for Resource-Efficient Storage and Aggregation  Hrishikesh Amur
12/10/12  The world inside words: information extraction and labeling in low resource languages through subword models  Robert Munro
09/28/12  SNAzzy --- Social Network Analysis for Telecom Business Intelligence  Amit Nanavati
09/21/12  Personal data in the mobile era  Stelios Paparizos
01/20/12  Technology is Changing Education (Finally)  Jeff Ullman
10/07/11  Recommender Systems: The Art and Science of Matching Items to Users  Deepak Agarwal
08/19/11  Guided Interaction: Rethinking the Query-Result Paradigm  Arnab Nandi
08/19/11  Structural Trend Analysis for Online Social Networks  Ceren Budak
08/05/11  Social Recommendation with Matrix Factorization  Irwin King
06/21/11  IBM Watson  Daniel Gruhl
06/03/11  Mining Billion-node Graphs  Christos Faloutsos
04/22/11  CrowdDB: Answering Queries with Crowdsourcing  Tim Kraska
03/18/11  Buy-it-Now or Take-a-Chance: A simple randomized auction mechanism  Elisa Celis
07/20/10  Entity Extraction for Query Understanding  Patrick Pantel
05/11/10  Addressing New Challenges in Data Stream Processing  Yanlei Diao
05/05/10  Searching the Annotated Web  Soumen Chakrabarti
04/16/10  Beyond Search? 5 steps to insight  Pankaj Mehra
04/09/10  Posterior Regularization  Kuzman Ganchev
01/22/10  User-Generated Content using Econometrics  Panagiotis G. Ipeirotis
10/09/09  Efficient and Robust Processing of top-k Relational Queries  Alkis Polyzotis
09/11/09  Agents and Cooperation in Social Tagging Systems  Markus Strohmaier
08/21/09  Bidding: A New Search Advertising Model Complementary to Keyword Bidding  Ali Dasdan
08/12/09  Optional Polya Tree and Bayesian Inference  Wing H. Wong
Why everyone should be their own database administrator, UI designer, and Web 2.0 site developer, and how they can  David Karger
07/24/09  Generating Bayes-Nash Equilibria to Design Autonomous Trading Agents  Ioannis Vetsikas
06/12/09  Privacy into the Netflix Prize Contenders  Ilya Mironov
06/05/09  Finding experts and initiators in social networks  Evimaria Terzi
05/15/09  SOFIE - A Self-Organizing Framework for Information Extraction  Fabian Suchanek
05/08/09  Learning to maximize expected ranking gain for information retrieval and collaborative filtering  Rich Zemel
01/23/09  Information Overload: Models and Methods to Live with it  Georgia Koutrika
11/12/08  Probabilistic Models for Holistic Scene Understanding  Daphne Koller
11/04/08  YAGO: A Core of Semantic Knowledge  Fabian M. Suchanek
10/02/08  Morpheus: A Deep Web Search Engine  Michael Stonebraker


Structured learning models for information extraction  Sunita Sarawagi
08/28/08 Structured Learning for Non-Smooth Ranking Losses  Soumen Chakrabarti
Context Analysis in Text Mining and Search  Qiaozhu Mei
On the Optimization of Evaluation Metrics  Tom Minka
08/08/08 OKKAM - A large-scale European Project for Entity-centric Information and Knowledge Management at Web Scale  Heiko Stoermer
07/11/08 Beyond Search: The Data Must Find the Data and the Relevance Must Find the User  Jeff Jonas
A Game-Theoretic Approach to Apprenticeship Learning  Umar Syed
06/20/08 A Learning Theory Approach to Non-Interactive Database Privacy  Aaron Roth
06/20/08 Local Graph Algorithms  Reid Andersen
Postponed  Making The Sky Searchable: Large Scale Astronomical Pattern Recognition  Sam Roweis
06/09/08 Longer Queries, Better Answers?  Bruce Croft
06/06/08 Entity Search: Search for what you want  Tao Cheng
05/27/08 Harvesting, Searching, and Ranking Knowledge from the Web  Gerhard Weikum
05/19/08 Building scalable and fault-tolerant enterprise search platforms  Robbert van Renesse
03/21/08 Towards robust reputation and recommendation systems  Ashish Goel 
02/26/08 Help is Out There: Online Community, Community Artifacts, and a New Way of Harnessing Knowledge  Joe Konstan 
02/05/08 Keyword Search on Graph-Structured Data  S.Sudarshan 
12/03/07 Global Access to Information: Research Issues in Data Mining and Text Mining  Raj Reddy 
10/23/07 Lessons in Engineering Self-Managed Networks  Bruce Maggs 
10/19/07 Similarity Search: A Web Perspective  Yuri Lifshits 
08/28/07 A Top View of Side Channel Attacks  Adi Shamir 
08/24/07 VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams  Chen Li 
08/06/07 Discovering and Organizing Hidden-Web Sources  Juliana Freire 
08/01/07 BlogScope: A system for the analysis of the Blogosphere  Nick Koudas 
07/26/07 Algorithmic Mechanism Design  Noam Nisan 
07/19/07 Ranking for keyword queries in entity-relationship graphs: Machine learning and performance issues  Soumen Chakrabarti 
06/19/07 Finding the Real Patterns  Geoff Webb 
06/15/07 From Forum Spam to Syndication-based Spam Infrastructure  Hao Chen
Navigation-Aided Retrieval  Chris Olston
05/18/07 Exploring Heterogeneous Databases and Web Data Sources with the OPM Multi-database Tools  Anthony Kosky
05/11/07 Towards Personalization and Service Platforms  Ramakrishnan Akella 
04/11/07 Learning to Analyze Sequences  Fernando Pereira 
03/27/07 Universal Access to Human Knowledge (Or Public Access to Digital Materials)  Brewster Kahle 
10/26/06 Sub-search: Searching at Sub-document Granularity  Sudarshan Murthy 
10/18/06 Algorithms for Discovering Bucket Orders from Data  Heikki Mannila 


Discovering Semantic Themes and Social Networks from Email Corpus  Kwong Yung


11/08/2013 (10:30am SVC-6/Titan)

Title: Blurring of the Boundary between Search and Recommendation

Speaker: Ed H. Chi, Research Scientist, Google


As search has become more social and personalized, search engines have also become more interactive. Interestingly, recommendation systems, by comparison, have always been social, personalized, and highly interactive, starting from the early Collaborative Filtering work led by my late advisor, John Riedl. But recommendation systems have also become increasingly search-like, offering capabilities that let users tune and direct recommendation results instantly.

As the two technologies evolve toward each other, the boundary between these two approaches to interactive information seeking is increasingly blurred. On the search side, some of this is driven by the merging of question-answering capabilities with search, led by systems like Google with Google Now and Apple with Siri that move search toward intelligent personal assistants. On the recommendation side, it is driven by the merging of techniques not just from keyword search but also from faceted search with user-based and item-based collaborative filtering and other, more proactive recommendation approaches.

This blurring has prompted critical re-thinking not just about how to architect the systems by merging and sharing backend components common to both types of systems, but also about how to structure the user interaction and experience. How can we make progress on these new research problems? I will illustrate my thoughts by looking at our experience in building (1) a social search engine using Delicious social tagging data, (2) a personalized newspaper and conversational recommender for Twitter, and (3) ways to use HCI experimental techniques, such as eyetracking, to understand social explanations for search and recommendation results.


Ed H. Chi is a Staff Research Scientist at Google, focusing on social interaction research relating to social search, recommendation, annotations, and analytics. Prior to Google, he was the Area Manager and a Principal Scientist at Palo Alto Research Center's Augmented Social Cognition Group, where he led the group in understanding how Web 2.0 and social computing systems help groups of people to remember, think, and reason. Ed completed his three degrees (B.S., M.S., and Ph.D.) in 6.5 years at the University of Minnesota, and has been doing research on user interface software systems since 1993. He has been featured and quoted in the press, including the Economist, Time Magazine, the LA Times, and the Associated Press.

10/18/2013 (10:30am SVC-6/Titan)

Title: Using Buffering for Resource-Efficient Storage and Aggregation

Speaker: Hrishikesh Amur, Georgia Tech


Recent years have witnessed rapid growth in the popularity of fast analytics systems, exemplifying a trend where queries that previously involved batch processing (e.g., running a MapReduce job) over massive amounts of data are increasingly expected to be answered in near real time with low latency. In this talk, I will address the problem that existing designs for various components in the software stack for data-intensive scalable computing (DISC) systems do not meet the requirements of fast analytics applications. I focus specifically on two problems:

1. Key-value storage: Fast analytics applications require that new data entering the system (e.g., new web-pages crawled, currently trending topics) be quickly made available to queries and analysis codes. Therefore, along with fast reads, storage systems must also support writes with high throughput, which current systems fail to do. In the first part of this work, we solve this problem by proposing a new key-value storage system -- called the WriteBuffer (WB) Tree -- that provides nearly 30x higher write performance and similar read performance compared to current high-performance systems.

2. GroupBy-Aggregate: Fast analytics systems require support for fast, incremental aggregation of data with low-latency access to results. Since existing techniques are memory-inefficient, in the second part of this talk, I will discuss a new data structure called the Compressed Buffer Tree (CBT) that implements memory-efficient in-memory aggregation.
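Both parts rely on the same buffering principle: batch incoming updates in a small in-memory buffer and merge them into the main store only when the buffer fills, amortizing the merge cost over many updates. A toy Python sketch of that principle (the class and parameter names are invented here; the actual WB-Tree and CBT are far more sophisticated, disk- and memory-optimized structures):

```python
# Toy illustration of buffered aggregation (not the actual WB-Tree or CBT):
# updates land in a small buffer and are merged into the aggregate store
# only when the buffer fills, so per-update work is amortized.
class BufferedAggregator:
    def __init__(self, buffer_size=4):
        self.buffer_size = buffer_size
        self.buffer = []   # pending (key, value) updates
        self.store = {}    # merged aggregates: key -> running sum

    def insert(self, key, value):
        self.buffer.append((key, value))
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        # merge all buffered updates into the store in one batch
        for key, value in self.buffer:
            self.store[key] = self.store.get(key, 0) + value
        self.buffer.clear()

    def lookup(self, key):
        self.flush()   # make pending updates visible before reading
        return self.store.get(key, 0)
```

A larger buffer means fewer, cheaper batch merges at the cost of more pending state; tuning that trade-off is, loosely, the knob these data structures turn.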

Bio: Hrishikesh Amur is a PhD Candidate in the School of Computer Science at Georgia Tech, advised by Dr. Karsten Schwan. His research interests broadly include operating systems and distributed systems. In particular, his recent work concerns systems and algorithmic techniques for resource efficiency in storage systems. He has enjoyed summer internships at Intel Labs Pittsburgh, the Parallel Data Lab at CMU, and IBM Research in Austin during his PhD. His work has most recently appeared at ACM SoCC and IEEE Cluster. He has accepted a full-time position at Google.

10/12/2012 (10:30am SVC-6/Titan)

Title: The world inside words: information extraction and labeling in low resource languages through subword models

Speaker: Robert Munro, Stanford


This talk will present recent and current work applying natural language processing to low-resource languages: those for which there is no wide variety of available digital tools and services like machine translation engines, online dictionaries, or encyclopedic articles. They form the fastest-growing segment of the Internet, and on the back of cell-phone technologies it is estimated that 5,000 languages are now spoken in the connected world.

Unlike in written English, the spelling of words varies greatly in most languages. For example, in a set of 600 medical text messages in the Chichewa language there are more than 50 spellings of the word odwala ('patient'), compared to just 2 in the English translations of the same messages ('patient' and the plural 'patients'). The variation comes from more or less phonetic spellings in non-standardized writing systems and from more extensive prefixing and suffixing possibilities.

The talk will be in three parts. First, the nature of subword variation will be discussed, finding that while variation is large in many languages it also follows predictable patterns, meaning that it can be modeled through unsupervised methods. Second, classification methods will be presented that can take advantage of the subword models to increase the accuracy of labeling systems, for example, classifying the medical text messages according to the types of illnesses mentioned. Finally, the talk will present novel research in cross-linguistic named entity recognition, taking advantage of the fact that named entities have the least variation of all words across languages.
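To make the subword-variation problem concrete, here is a minimal sketch that groups spelling variants of a word by Levenshtein edit distance. The variant spellings in the test below are invented for illustration, and the talk's unsupervised subword models are far richer than a fixed distance threshold:

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def group_variants(words, canonical, max_dist=2):
    # Keep words within max_dist edits of the canonical spelling.
    return [w for w in words if edit_distance(w, canonical) <= max_dist]
```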


Robert is a computational linguist working in communication technologies, especially in less-resourced languages. His background ranges from working for the UN Refugee Commission in Liberia and Mission 4636 in Haiti to Silicon Valley search engines and start-ups. He is currently a graduate fellow at Stanford University.

09/28/2012 (10:30am SVC-6/Titan)

Title: SNAzzy --- Social Network Analysis for Telecom Business Intelligence

Speaker: Amit Nanavati, IBM Research


SNAzzy analyzes the social networks induced by the calling patterns of telecom service providers' customers. It builds a social network of the telecom subscribers from the “who-calls-whom” CDR data. By analyzing the social network thus constructed, we can derive several useful insights about how customers interact with each other and what that implies. SNAzzy can find communities, identify targets for campaigns of socially influenced products, predict potential churners by analyzing churn behaviour, determine ideal acquisition targets from the competition, and assign importance scores to customers. This helps identify group targets for promotions and campaigns.

Bio: Amit is a Senior Researcher and barely manages Telecom Solutions Research at IBM Research, India. He has been working on Telecom social network analysis (SNAzzy) since its inception in 2006. He is also a “Spoken Web” evangelist, promoting the vision of a world-wide Spoken Web hosted in the Telecom network, which does not require an Internet connection or the ability to read and write. He is always interested in applying graph theory to various domains, and also dabbles with speech in mobile and pervasive environments. Along with Nitendra, he wrote a book, “Speech in Mobile and Pervasive Environments”, published by Wiley in 2012. Much to his surprise and that of his friends, he was named a Master Inventor at IBM Research in 2011.

09/21/2012 (10:30am SVC-6/Titan)

Title: Personal data in the mobile era

Speaker: Stelios Paparizos, MSR SVC/Search Labs

Abstract: Mobile devices enable new types of application scenarios not easily accessible on standard PCs. When it comes to personal data, we see the dawn of a new set of challenges and opportunities, both in providing utility via mining and in building an efficient, portable, and secure platform for storing and processing the personal data. In this talk, I will present example mobile application scenarios and discuss various design choices for an underlying system that can support them.

01/20/2012 (10:30am SVC-6/Titan)

Title: Technology is Changing Education (Finally)

Speaker: Jeff Ullman, Stanford


Gradiance is an automated homework system built on "root questions." This type of question allows students to work typical CS homework problems, be tested on their knowledge of the complete solution through randomized multiple-choice questions, receive help if they need it, and eventually master the material. We shall give some details of the system and also explain why it failed as a commercial venture (it has instead become a free service). Of greater interest are the efforts by several Stanford faculty to offer free on-line courses. The largest has 140,000 students. These courses are complete packages, involving on-line video that students watch according to a required schedule, quizzes (two courses use the Gradiance technology), and an on-line discussion group with social features, which seems to be working remarkably well.



10/07/2011 (10:30am SVC-6/Titan)

Title: Recommender Systems: The Art and Science of Matching Items to Users

Speaker: Deepak Agarwal, Researcher, Yahoo! Research


Algorithmically matching items to users in a given context is essential for the success and profitability of large-scale recommender systems in content optimization, computational advertising, search, e-commerce, movie recommendation, and many other areas. The objective is to maximize some utility of interest (e.g., total revenue, total engagement) over a long time horizon. This is a bandit problem, since there is positive utility in displaying items that may have low mean but high variance. A key challenge in such bandit problems is the curse of dimensionality. Bandit problems are also difficult when responses are observed with considerable delay (e.g., return visits, confirmation of a purchase). One approach is to optimize multiple competing objectives in the short term to achieve the best long-term performance. For instance, in serving content to users on a website, one may optimize some combination of clicks and downstream advertising revenue in the short term to maximize revenue in the long run. In this talk, I will discuss some of the technical challenges by focusing on a real-world recommender problem: content optimization on the Yahoo! front page. Several techniques described in the talk are fully deployed on Yahoo! sites and have led to more than 200% improvement in click rates on the Yahoo! front page.
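The explore/exploit tension described above is the classic multi-armed bandit setting. As a textbook illustration (not the models actually deployed at Yahoo!), the UCB1 rule picks the article whose observed click rate plus an uncertainty bonus is highest, so rarely shown articles still get explored:

```python
import math

def ucb1_choose(stats, t):
    """Pick an article ("arm") by the UCB1 rule.

    stats: article -> (times shown, clicks); t: total shows so far.
    Untried articles are shown first; afterwards, pick the article
    maximizing empirical click rate plus an exploration bonus that
    shrinks as the article accumulates shows.
    """
    for article, (shows, _) in stats.items():
        if shows == 0:
            return article
    return max(stats, key=lambda a: stats[a][1] / stats[a][0]
               + math.sqrt(2 * math.log(t) / stats[a][0]))
```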

This is joint work with Bee-Chung Chen, Raghu Ramakrishnan, Liang Zhang, Pradheep Elango, Rajiv Khanna, and Xuanhui Wang in Yahoo! Labs. We are deeply indebted to our partners in Yahoo! business, Yahoo! editorial, Yahoo! engineering, and Yahoo! products without whom the methods described would not be possible to deploy.


Deepak Agarwal is currently a principal research scientist at Yahoo! Research. Prior to joining Yahoo!, he was a member of the statistics department at AT&T Research. He obtained a Ph.D. in statistics from the University of Connecticut; his thesis advisor was Alan Gelfand. At AT&T, he worked on methods for mining massive graphs, statistical models for social network analysis, anomaly detection using a time-series approach, and computational approaches for scaling the spatial scan statistic to large data sets. His current research interests at Yahoo! include content optimization; large-scale regression for massive, sparse, and noisy data via "feature aggregation"; anomaly detection in high-dimensional spaces; explore/exploit; and statistical methods for social network analysis. Deepak won a best research paper award at the Joint Statistical Meetings 2001 for his thesis work, which studied deforestation patterns in Madagascar using a two-stage spatial regression model; the best applications paper award at SIAM Data Mining 2004 for his Bayesian modeling work on large sparse social networks via stochastic blockmodels; and, more recently, the best research paper award at KDD 2007 for work that proposed a general class of models for large sparse dyadic data. He regularly serves on the program committees of data mining conferences such as KDD and SDM and has organized several invited sessions at the Joint Statistical Meetings. He is an associate editor of the Journal of the American Statistical Association and of Applied Stochastic Models in Business and Industry.

08/19/2011 (10:30am SVC-6/Titan)

Title: Structural Trend Analysis for Online Social Networks

Speaker: Ceren Budak


The identification of popular and important topics discussed in social networks is crucial for a better understanding of societal concerns. It also helps users stay on top of trends without having to sift through vast amounts of shared information. Trend detection methods introduced so far have not used the network topology and thus cannot distinguish viral topics from topics diffused mostly through the news media. To address this gap, we propose two novel structural trend definitions, coordinated and uncoordinated trends, that use friendship information to identify topics discussed among clustered and distributed users, respectively. Our analyses and experiments show that structural trends are significantly different from traditional trends and provide new insights into the way people share information online. We also propose a sampling technique for structural trend detection and prove that the solution yields a gain in efficiency while staying within an acceptable error bound. Experiments performed on a Twitter data set of 41.7 million nodes and 417 million posts show that even with a sampling rate of 0.005, the average precision is 0.93 for coordinated trends and 1 for uncoordinated trends.
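A toy version of the coordinated-trend idea can make the definition concrete: score a topic by how many pairs of the users mentioning it are also friends, so topics spreading inside tight clusters score higher than topics broadcast by the media. This sketch (names invented) is quadratic in the number of mentioners, whereas the talk's contribution is a provably accurate sampling scheme that avoids exactly this kind of cost:

```python
from itertools import combinations

def coordinated_score(mentioners, friendships):
    """Count pairs of a topic's mentioners who are friends.

    mentioners: users who mentioned the topic.
    friendships: iterable of undirected (u, v) friendship edges.
    """
    edges = {frozenset(e) for e in friendships}
    return sum(1 for pair in combinations(mentioners, 2)
               if frozenset(pair) in edges)
```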


Ceren Budak is currently a PhD Candidate at the Department of Computer Science, University of California Santa Barbara. Her research interests lie in the area of information diffusion in social networks.

08/19/2011 (10:30am SVC-6/Titan)

Title: Guided Interaction: Rethinking the Query-Result Paradigm

Speaker: Arnab Nandi

Abstract: Decades of research, coupled with continuous increases in computing power, have enabled highly efficient execution of queries on large databases. For many databases, far more time is spent by users formulating queries than by the system evaluating them. It stands to reason that, looking at the overall query experience we provide users, we should pay attention to how we can assist users in the holistic process of obtaining the information they desire from the database, and not just the constituent activity of efficiently generating results given a complete precise query.

In this talk, we examine the conventional query-result paradigm employed by databases and demonstrate challenges encountered when following this paradigm. We recognize that the process of query specification itself is a major stumbling block. With current computational abilities, we are at a point where we can make use of the data in the database to aid in this process. To this end, we propose a new paradigm, guided interaction, to solve the noted challenges, by using interaction to guide the user through the query formulation, query execution and result examination processes.

There are significant engineering challenges to constructing the system we envision, but the technological building blocks to address these challenges exist today.

08/05/2011 (10:30am SVC-6/Titan)

Title: Social Recommendation with Matrix Factorization

Speaker: Irwin King, AT&T Labs Research and Department of Computer Science and Engineering, The Chinese University of Hong Kong

Abstract: Recently, Social Recommendation has emerged as a hot research topic as information generated on the Internet grows exponentially. Social Recommendation is a type of information filtering that attempts to suggest information (blogs, news, music, travel plans, web pages, images, tags, etc.) likely to interest users. In this talk, I will present some of the recent work we have developed for Social Recommendation based on a matrix factorization framework. I will introduce novel ways to use social ensembles, trust relations, tags, click-through rates, etc. to improve social recommender systems for a wide range of potential Internet applications.
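As a rough illustration of matrix factorization with a social term (the loss, regularizer, and hyperparameters below are illustrative placeholders, not the methods from the talk), this sketch learns user and item factors by SGD and additionally pulls each user's factors toward the mean of their friends' factors:

```python
import numpy as np

def factorize(ratings, friends, n_users, n_items, k=4,
              lr=0.01, reg=0.1, social=0.1, epochs=300, seed=0):
    """Toy matrix factorization with a social regularizer.

    ratings: list of (user, item, rating) triples.
    friends: dict user -> list of friend user ids.
    Returns user factors U (n_users x k) and item factors V (n_items x k).
    """
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - U[u] @ V[i]           # prediction error
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * U[u] - reg * V[i])
        # social term: pull each user toward the mean of their friends
        for u, fs in friends.items():
            U[u] += lr * social * (U[list(fs)].mean(axis=0) - U[u])
    return U, V
```

The social pull encodes the assumption that friends have similar tastes; trust relations or tags would enter as additional regularizers or features in the same framework.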

Bio: Irwin King's research interests include machine learning, social computing, web intelligence, and multimedia processing. In these areas he has over 220 technical publications in journals and conferences such as SIGIR, WSDM, KDD, NIPS, IJCAI, and AAAI, and he has contributed over 20 book chapters and edited volumes. He also has over 30 research and applied grants. One notable system he has developed is the VeriGuide System, which detects similar sentences and performs readability analysis of text-based documents in both English and Chinese. He has served as a reviewer and panel member for RGC Hong Kong, the Natural Sciences and Engineering Research Council of Canada (NSERC), the National Natural Science Foundation of China (NSFC), and the Academy of Finland. Irwin is an Associate Editor of the IEEE Transactions on Neural Networks (TNN), and a member of the Editorial Board and a Special Issue Editor of a number of international journals. He is a Professor in the Department of Computer Science and Engineering, The Chinese University of Hong Kong, currently on leave to be with AT&T Labs Research, San Francisco, and also a Visiting Professor at UC Berkeley. He received his B.Sc. degree in Engineering and Applied Science from the California Institute of Technology, Pasadena, and his M.Sc. and Ph.D. degrees in Computer Science from the University of Southern California, Los Angeles.

06/21/2011 (10:30am SVC-6/Titan)

Title: IBM Watson

Speaker: Daniel Gruhl, IBM

Abstract: The IBM "Watson" system proved that a new approach to question answering can be extremely successful in some domains (e.g., outperforming the world's two best Jeopardy! players at their own game). This talk will provide an overview of the system, discuss the approach and underlying technologies, and examine some of the challenges of bringing this technology into new domains.

Bio: Dr. Gruhl is a Research Staff Member in the Computer Science Department at the Almaden Research Center. He received his B.S. ('94), M.Eng. ('95), and Ph.D. ('00) degrees in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology. He subsequently joined IBM at the Almaden Research Center, where he has worked on large-scale text analytics systems. He has received Outstanding Technology Awards for both the WebFountain system and the UIMA standard for text analytics, and is an author or coauthor of at least a dozen patents and two dozen papers. Dr. Gruhl is a member of the IEEE and the ACM. Currently, his focus is on text analytics of the clinical record in health-care settings.

06/03/2011 (10:30am SVC-6/Titan)

Title: Mining Billion-node Graphs

Speaker: Christos Faloutsos (CMU)

Abstract: What do graphs look like? How do they evolve over time? How do we handle a graph with a billion nodes? We present a comprehensive list of static and temporal laws, and some recent observations on real graphs (e.g., "eigenSpokes"). For generators, we describe some recent ones that naturally match all of the known properties of real graphs. Finally, for tools, we present "oddBall" for discovering anomalies and patterns, as well as an overview of the PEGASUS system, which is designed for handling billion-node graphs and runs on top of Hadoop.
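For a flavor of the computations PEGASUS scales to billions of nodes, here is a single-machine toy analogue of connected components via repeated minimum-label propagation; PEGASUS expresses this kind of iteration as a generalized matrix-vector product (GIM-V) over Hadoop, but the per-step logic is the same:

```python
def connected_components(edges, nodes):
    """Label each node with the smallest node id in its component.

    Each pass, every node takes the minimum label among itself and
    its neighbors; iteration stops when no label changes.
    """
    label = {v: v for v in nodes}
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    changed = True
    while changed:
        changed = False
        for v in nodes:
            m = min([label[v]] + [label[u] for u in adj[v]])
            if m < label[v]:
                label[v] = m
                changed = True
    return label
```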

Bio: Christos Faloutsos is a Professor at Carnegie Mellon University. He has received the Presidential Young Investigator Award from the National Science Foundation (1989), the Research Contributions Award at ICDM 2006, the SIGKDD Innovations Award (2010), seventeen "best paper" awards (including two "test of time" awards), and four teaching awards. He has served as a member of the executive committee of SIGKDD; he is an ACM Fellow; and he has published over 200 refereed articles, 11 book chapters, and one monograph. He holds five patents and has given over 30 tutorials and over 10 invited distinguished lectures. His research interests include data mining for graphs and streams, fractals, database performance, and indexing for multimedia and bio-informatics data.

04/22/2011 (10:30am SVC-6/Titan)

Title: CrowdDB: Answering Queries with Crowdsourcing

Speaker: Tim Kraska, UC Berkeley

Abstract: Some queries cannot be answered by machines only. Processing such queries requires human input, e.g., for providing information that is missing from the database, for performing computationally difficult functions and for matching, ranking, or aggregating results based on fuzzy criteria. In this talk, I am going to describe the design and implementation of CrowdDB, a database system that uses human input via crowdsourcing to process queries that neither database systems nor search engines can answer adequately. CrowdDB uses SQL both as a language for posing complex queries and as a way to model data. While CrowdDB leverages many concepts from traditional database systems, there are also important differences. From a conceptual perspective, the traditional closed-world assumption for query processing no longer holds for human input, which is essentially unbounded. From an implementation perspective, CrowdDB uses an operator-based query engine but this engine is extended with special operators that generate User Interfaces in order to solicit, integrate and cleanse human input. Furthermore, performance and cost depend on a number of different factors including worker affinity, training, fatigue, motivation and location. Real-life experiments conducted using Amazon Mechanical Turk show that CrowdDB is indeed able to process queries that cannot be processed with traditional systems.

Bio: Tim Kraska is a PostDoc in the AMP Lab which is part of the Computer Science Division of UC Berkeley. Currently, his research focuses on the intersection between data management for analytics, machine learning, and crowd sourcing. Before joining UC Berkeley, Tim Kraska received his PhD from ETH Zurich, where he worked on transaction management and stream processing in the cloud as part of the Systems Group. He also holds a Master in Information Technology from the University of Sydney, Australia, as well as an MSc from the University of Muenster, Germany.

03/22/2011 (10:30am SVC-6/Titan)

Title: Buy-it-Now or Take-a-Chance: A simple randomized auction mechanism

Speaker: Elisa Celis, University of Washington

Abstract: We first consider bid data from Microsoft's AdECN platform, which allows advertisers to target their bids. We find that the valuations are irregular, making the commonly used Vickrey auction unsuitable for extracting revenue. In addition, we show that Myerson's optimal auction mechanism is unsuitable in practice.

As such, we present and discuss the benefits of a simple auction mechanism (BIN-TAC) that extends the second-price auction with reserve. This mechanism is particularly effective in private-value environments where the distribution of valuations is irregular, and it is truthful in expectation. We use the AdECN data in counterfactual experiments that suggest our BIN-TAC mechanism would improve revenue by 12% over an optimal second-price auction, and by 24% relative to the AdECN mechanism.
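A simplified sketch of the BIN-TAC idea (the allocation and payment rules below are placeholders for illustration; the actual mechanism's details differ): bidders willing to pay the buy-it-now price get the item at that price, and otherwise the item is raffled among the top d remaining bidders above the reserve:

```python
import random

def bin_tac(bids, buy_now, reserve, d=2, rng=random):
    """Toy buy-it-now-or-take-a-chance allocation.

    bids: list of (bidder, bid) pairs.
    Returns (winner, price); winner is None if no bid meets the reserve.
    """
    # Buy-it-now: anyone bidding at or above the BIN price gets it at
    # that price (ties broken at random).
    takers = [b for b in bids if b[1] >= buy_now]
    if takers:
        winner = rng.choice(takers)
        return winner[0], buy_now
    # Take-a-chance: raffle among the top d bidders above the reserve,
    # here (as a simplification) at the reserve price.
    eligible = sorted((b for b in bids if b[1] >= reserve),
                      key=lambda b: -b[1])[:d]
    if not eligible:
        return None, 0.0
    winner = rng.choice(eligible)
    return winner[0], reserve
```

The randomized take-a-chance stage is what lets the mechanism extract revenue from irregular valuation distributions: high-value bidders prefer the certain buy-it-now option over a lottery.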

Bio: Elisa is a third-year Ph.D. student at the University of Washington, advised by Anna Karlin and Yuval Peres. She is currently finishing her second MSR internship: first with Adam Kalai at MSR New England, and now here with Udi Wieder. She is interested in problems at the intersection of theoretical computer science and game theory, specifically with applications to social and economic networks. A random sample of her research interests includes random-turn games, social contagion models, Bayesian learning on networks, ad auctions and mechanism design, and local dynamics in exchange networks.

07/20/2010 (10:30am SVC-6/Titan)

Title: Entity Extraction for Query Understanding

Speaker: Patrick Pantel, Microsoft Research

Abstract: Entity lists are vital for semantically analyzing web queries. In this talk, we propose a general information extraction framework, showing large gains in entity extraction by combining state-of-the-art distributional and pattern-based extractors with a large set of features from a 600 million document webcrawl, one year of query logs, and a snapshot of Wikipedia. We explore the hypothesis that although distributional and pattern-based algorithms are complementary, they do not exhaust the semantic space; other sources of evidence can be leveraged to better combine them. A detailed analysis of feature correlations and interactions shows that query log and webcrawl features yield the highest gains, but easily accessible Wikipedia features also improve over current state-of-the-art systems. We further study the impact of editor-chosen seeds on extraction performance. We show that, in general, few seeds are needed to saturate a distributional model, and that expansion performance is very sensitive to seed composition, resulting in tremendous variance. We show that untrained editors are poor at choosing the right seeds, and we propose an algorithm for helping editors choose better ones.

Bio: Patrick Pantel is a Senior Researcher at Microsoft Research and a Research Assistant Professor at the USC Information Sciences Institute, where he conducts research in large-scale natural language processing, text mining, and knowledge acquisition. Before that, he served as a Senior Researcher and Senior Research Manager at Yahoo! Labs. In 2003, he received a Ph.D. in Computing Science from the University of Alberta in Edmonton, Canada.

05/11/2010 (10:30am SVC-6/Titan)

Title: Addressing New Challenges in Data Stream Processing

Speaker: Yanlei Diao, UMass-Amherst

Abstract: Data stream processing has found application in many areas including environmental monitoring, object tracking and monitoring, network monitoring, and business analytics. While the foundation for data stream processing has been developed in previous work, recent real-world deployments are raising a host of new challenges.

The first challenge that we address regards uncertain data streams, where data can be incomplete, imprecise, and even misleading. Feeding such data streams to existing stream systems produces results of unknown quality. In the main part of the talk, I present the design of a data stream system that captures data uncertainty from data collection to query processing to final result generation, with a focus on its data model and processing algorithms for complex relational operators. Other challenges to data stream systems include the need to extend the data model from set-based to sequence-based and the need to archive and index data streams to answer continuous queries. In the rest of the talk, I survey two other projects that address these challenges.

Bio: Yanlei Diao is an Assistant Professor at the Department of Computer Science, University of Massachusetts Amherst. Her research interests are in information architectures and data management systems, with a focus on data streams, uncertain data management, flash memory databases, and large-scale data analysis. She received her PhD in Computer Science from the University of California, Berkeley in 2005, her M.S. in Computer Science from the Hong Kong University of Science and Technology in 2000, and her B.S. in Computer Science from Fudan University in China in 1998.

Yanlei Diao is a recipient of the NSF Career Award and finalist for the Microsoft Research New Faculty Fellowship. She spoke at the Distinguished Faculty Lecture Series at the University of Texas at Austin in December 2005. Her PhD dissertation “Query Processing for Large-Scale XML Message Brokering” won the 2006 ACM-SIGMOD Dissertation Award Honorable Mention. She has served on the program committees for many international conferences and on the organizing committees for SIGMOD and DMSN. She is a main contributor to YFilter 1.0, a high-performance filtering system over XML message streams.

05/05/2010 (10:30am SVC-6/Titan)

Title: Searching the Annotated Web

Speaker: Soumen Chakrabarti, IIT Bombay, India

Abstract: Over 99% of queries to Web search engines contain a noun, often referring to an entity. Catalogs like WordNet and Wikipedia list millions of well-known entities. Bootstrapping techniques may help us expand that to hundreds of millions of people, millions of locations, books, songs, and other artifacts. The second decade of Web search will represent, index, query and rank in a fine-grained graph-structured setting where dozens to hundreds of tokens on each Web page may be linked to entities, which in turn have attributes as well as types, subclass and other relational linkages. We will review recent techniques for entity annotation for linear text and HTML tables, and efficient annotation indexing. We will propose a low-level language to query the annotated Web, and discuss some ranking problems associated with such queries.

Bio: Soumen Chakrabarti received his B.Tech in Computer Science from the Indian Institute of Technology, Kharagpur, in 1991 and his M.S. and Ph.D. in Computer Science from the University of California, Berkeley in 1992 and 1996. At Berkeley he worked on compilers and runtime systems for running scalable parallel scientific software on message passing multiprocessors. He was a Research Staff Member at IBM Almaden Research Center from 1996 to 1999, where he worked on the Clever Web search project and led the Focused Crawling project. In 1999 he joined the Department of Computer Science and Engineering at the Indian Institute of Technology, Bombay, where he has been an Associate Professor since 2003. In Spring 2004 he was a Visiting Associate Professor at Carnegie Mellon University. His current research interests include integrating, searching, and mining text and graph data models, exploiting types and relations in search, and dynamic personalization in graph-based retrieval and ranking models.

04/16/2010 (10:30am SVC-6/Titan)

Title: Beyond Search? 5 steps to insight

Speaker: Pankaj Mehra, HP Labs Russia

Abstract: Evidence points to why the blind application of Web search to enterprises produces undesirable outcomes. First, enterprises lag the Web in achieving richly connected information. Second, enterprise users expect both a higher specificity of meaning and a greater depth of modeling than consumers do. In this talk, I will show the first houses we have built on the semantic foundations for enterprise search that we laid in our previous work.

The talk will begin with an analysis of how people, processes and infrastructure are currently deployed in information-rich businesses in order to make sense of tens of petabytes of open and proprietary information. I will discuss why existing architectures—especially, their implied cost and delay structures—do not scale to the demands and opportunities thrown open by new economic and business models around information. I will then argue that in order to architect for exabytes and beyond, businesses need to make a switch to architecting their information services around an economy of plenty (away from architecting around the economy of dearth that gave us search).

In the process, we will turn on its head the very problem that motivates search technology, for instance asking and answering the fundamental question: Is it not already too late if you have to look for something? I will present architectures for delivery and sense-making. At the heart of these systems lies an engine that shortens the path from "Got it." to "Got it!", i.e., the all-important path from content to insight. We will show how to couple this engine with context-mapping technologies in order to move beyond searching for documents to having the right insights delivered into the right heads at the right time. The talk concludes with blueprints from information-heavy industries, such as financial and legal services, which I expect will become widely adopted in the near future.

Bio: Pankaj Mehra is Distinguished Technologist at HP Labs, and founding member of Labs Russia. At HP, he created the information service, and played a leading role in creating NonStop Advanced Architecture (2004), Persistent Memory (2002), and Integrated Archive Platform (2005). He has served as Chairman of InfiniBand Trade Association’s Management Working Group (2000) and the CTO of Intellifabric Inc (2001). Pankaj designed major cluster systems around Windows NT that (i) were awarded the Terabyte Sort trophy by Jim Gray (1998), and (ii) set TPC-C cluster performance records (1997). Pankaj is currently a visitor at Stanford University’s Logic Group, and has in the past been: a faculty visitor to IBM Research; visiting and regular faculty at the IIT Delhi CS&E Department; and Computer Scientist at NASA Ames. His thesis work at Illinois on automated learning of strategies was published as a book by World Scientific in 1995; an edited collection on the theory of neural networks, by IEEE in 1992; and an introductory book on storage, data and information systems, by HP Press in 2006. Pankaj guest edited IEEE Internet Computing magazine’s special issues on Global Deployment of Data Centers and Cloud Computing, and currently serves on the magazine’s editorial board.

04/09/2010 (10:30am SVC-6/Titan)

Title: Posterior Regularization

Speaker: Kuzman Ganchev, UPenn


Abstract: This talk will present Posterior Regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. We have used Posterior Regularization to encode a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.

Posterior Regularization separates model complexity from the complexity of the structural constraints we wish to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring that the desired constraints hold in expectation. I will present an efficient algorithm for learning with posterior regularization and illustrate its usefulness in two applications: part-of-speech induction and statistical word alignment.
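In symbols (as I recall the published formulation; the notation here is a sketch), the posterior-regularized objective augments the data log-likelihood with a KL projection of the model posterior onto a constraint set Q, defined by bounds b on the expectations of constraint features φ:

```latex
J(\theta) \;=\; \mathcal{L}(\theta) \;-\; \min_{q \in Q} \mathrm{KL}\bigl(q(\mathbf{z}) \,\|\, p_\theta(\mathbf{z}\mid\mathbf{x})\bigr),
\qquad
Q \;=\; \bigl\{\, q \;:\; \mathbb{E}_q[\boldsymbol{\phi}(\mathbf{x},\mathbf{z})] \le \mathbf{b} \,\bigr\}
```

Because the constraints act on posterior expectations rather than on the model parameters, inference in the unconstrained model is reused unchanged inside the projection step.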

Bio: Kuzman Ganchev is a final year PhD student at the University of Pennsylvania, under Fernando Pereira and Ben Taskar. His research focus is on machine learning with partial or indirect supervision, especially applied to natural language processing. His alma mater is Swarthmore College, and he has worked or interned at StreamSage Inc, TrialPay Inc and Bank of America.

01/22/2010 (10:30am SVC-6/Titan)

Title: User-Generated Content using Econometrics

Speaker: Panagiotis G. Ipeirotis, NYU

Abstract: Today, users post online reviews expressing their opinions for movies, restaurants, hotels, and many other products. They also evaluate merchants and react to news about political campaigns. Structuring and ranking these opinions in terms of importance and polarity is a difficult research problem. How can we infer the importance and polarity of the posted content? How can we structure and quantify the effect of the online opinions? Many existing approaches rely on human annotators to evaluate the polarity and strength of the opinions, a laborious and error-prone task. We take a different approach by considering the economic context in which an opinion is evaluated. We rely on the fact that the text in online systems influences the behavior of its readers, and that this effect can be observed using some easy-to-measure economic variables, such as revenues or product prices. Then, by reversing the logic, we infer the semantic orientation and the strength of an opinion by tracing the changes in the associated economic variable. In effect, we combine econometrics with text mining algorithms to identify the "economic value of text" and assign a "dollar value" to each opinion, quantifying sentiment effectively and without the need for manual effort. We present applications to reputation systems and online product reviews, and show how to improve search for hotels and products in general using econometric approaches.
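The "reversed logic" can be made concrete with a toy regression: regress an observed economic variable (here a hypothetical price premium) on the presence of opinion phrases, so that each coefficient becomes an estimated "dollar value" for its phrase. The data and phrase names below are invented for illustration:

```python
# Toy "economic value of text": least-squares fit of price premium
# on binary opinion-phrase features, via the normal equations.
def fit_dollar_values(X, y):
    """Solve (X^T X) w = X^T y by Gaussian elimination (tiny, dense)."""
    n = len(X[0])
    A = [[sum(X[r][i] * X[r][j] for r in range(len(X))) for j in range(n)]
         for i in range(n)]
    b = [sum(X[r][i] * y[r] for r in range(len(X))) for i in range(n)]
    for i in range(n):                      # forward elimination
        p = A[i][i]
        for j in range(i + 1, n):
            f = A[j][i] / p
            A[j] = [a - f * c for a, c in zip(A[j], A[i])]
            b[j] -= f * b[i]
    w = [0.0] * n
    for i in reversed(range(n)):            # back substitution
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]
    return w

# Columns: [intercept, "excellent service", "dirty room"]
X = [[1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 0, 0]]
y = [30.0, -20.0, 10.0, 0.0]               # observed price premiums ($)
print(fit_dollar_values(X, y))             # per-phrase dollar values
```

On this toy data the fit recovers +$30 for "excellent service" and -$20 for "dirty room", which is exactly the sign-and-strength inference the abstract describes, without any human polarity labels.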

Bio: Panos Ipeirotis is an Assistant Professor at the Department of Information, Operations, and Management Sciences at Leonard N. Stern School of Business of New York University. His area of expertise is databases and information retrieval, with an emphasis on management of textual data. His research interests include web searching, economic-aware text and web mining, data cleaning, and data integration. He received his Ph.D. degree in Computer Science from Columbia University in 2004 and a B.Sc. degree from the Computer Engineering and Informatics Department (CEID) of the University of Patras, Greece in 1999. He has received two Microsoft Live Labs Awards, two "Best Paper" awards (IEEE ICDE 2005, ACM SIGMOD 2006), two "Best Paper Runner Up" awards (JCDL 2002, ACM KDD 2008), and is also a recipient of a CAREER award from the National Science Foundation.

09/11/2009 (10:30am SVC-6/Titan)

Title: Agents and Cooperation in Social Tagging Systems

Speaker: Markus Strohmaier, Graz University of Technology, Austria

Abstract: Tags in social tagging systems contain what we can describe as “traces of user intent”, that is, traces of user goals, motivations and intentions expressed in different degrees of explicitness. These traces enable researchers not only to study the semantic structure or the content of the resources being tagged, but also to study the goals, motivations and intentions of the (human) agents who are using these systems.

In this talk, I will present preliminary and ongoing work that focuses on two problems related to user intent in social tagging systems: 1) the automatic detection of users’ motivation for tagging, and 2) cooperation in social tagging systems.

I will begin by discussing ways to automatically detect users' motivation for tagging, and talk about the relevance and implications of user motivation in tagging systems. Subsequently, I will present results that suggest that a) tagging motivation can be measured in a content- and language-independent way and b) tagging motivation has an influence on the usefulness of tags for certain tasks such as information retrieval.

Second, I will talk about my current work at PARC on studying cooperation in social tagging systems. In this ongoing effort, we aim to understand the mechanisms and interactions that yield cooperative behavior among (human) agents in social tagging systems.

Bio: Markus Strohmaier is an Assistant Professor at the Knowledge Management Institute at Graz University of Technology, Austria. From 03/2006 to 05/2007 he was a Post-Doctoral Research Fellow at the Department of Computer Science at the University of Toronto, funded by an Erwin Schrödinger Postdoctoral Fellowship. Currently, he is a visiting scientist at PARC's Augmented Social Cognition Group. Markus is the principal investigator of the national research grant "TransAgere", focusing on agents and goals on the social web (2007-2010). His research interests include user goals in the context of social media, search and knowledge management, focusing on the analysis and modeling of large-scale intentional web corpora and knowledge bases.

08/21/2009 (10:30am SVC-6/Titan)

Title: Output Bidding: A New Search Advertising Model Complementary to Keyword Bidding

Speaker: Ali Dasdan, Yahoo!

Abstract: We propose output bidding where advertisers bid to associate their advertisements with search results without changing or replacing any search result. I will talk about how output bidding complements the current search advertising model (keyword or input bidding). I will also discuss how output bidding can be implemented.

Bio: Ali Dasdan has been at Yahoo! since 2006. He manages a team doing all white-box metrics & monitoring as well as data/system analysis for web search. He has a PhD in Computer Science from UIUC. His home page has the most recent list of his research publications.

08/12/2009 (2pm SVC-6/Titan)

Title: Optional Polya Tree and Bayesian Inference

Speaker: Wing H Wong, Stanford University

Abstract: We introduce an extension of the Polya Tree approach for constructing distributions on the space of probability measures. By using optional stopping and optional choice of splitting variables, the construction gives rise to random measures that are absolutely continuous with piecewise smooth densities on partitions that can adapt to fit the data. The resulting "optional Polya tree" distribution has large support in total variation topology, and yields posterior distributions that are also optional Polya trees with computable parameter values. The theory covers both discrete and continuous multivariate distributions.

07/31/2009 (3pm  SVC-6/Titan)

Title: Why everyone should be their own database administrator, UI designer, and Web 2.0 site developer, and how they can.

Speaker: David Karger, MIT and Microsoft

Abstract: The Web democratized content creation, turning millions of people into information producers who enthusiastically create information that benefits millions of information consumers. But the vast majority of such end-user-authored content is blobs of text, images, or video. In contrast, large-scale content providers offer richly structured data and Web-2.0 interfaces to it that support powerful filtering, sorting, visualization and exploration, making that information much more usable and useful.

In this talk I will demonstrate that the authoring of richly structured information, and of rich interactive visualizations of that information, can be put within reach of those same end users who have contributed so enthusiastically to the text web. Our approach is to abandon heavyweight notions of database management and user interface programming, and instead develop tools that support casual editing and mashups of rich data and visualizations in web pages, wikis, and blogs.

Our tools support the same rote copy/paste/modify flow of content creation that has so effectively let users create content for the text web without really understanding what they are doing.

07/24/2009 (10:30am SVC -6/Titan)

Title: Generating Bayes-Nash Equilibria to Design Autonomous Trading Agents

Speaker: Ioannis Vetsikas, Southampton University

Abstract: Online auctions have become a popular method for business transactions. The variety of different auction rules, the restrictions in supply or demand, and the agents’ combinatorial preferences for the different commodities have led to the creation of a very complex multi-agent “environment”. Strategies designed for autonomous agents participating in such complex environments have often been ad hoc, based on the expert knowledge of a human specialist. In fact, this has been viewed as a necessary evil, since a theoretical analysis and subsequent generation of strategies for such complex games is intractable.

To put game theory back into the design of trading agents for complex environments, I have proposed to decompose the general problem faced by the agent into several components and restrict communication between them. This allows each component to be analyzed individually; some of the resulting subproblems are simple enough that equilibrium strategies can be generated for them and then used in the overall strategy of the autonomous agent. To evaluate this approach, I used this analysis to generate strategies for the International Trading Agent Competition (TAC). I computed novel Bayes-Nash equilibria for first-price single-unit auctions and mth-price multi-unit auctions, when the auction has a set of possible closing times, one of which is chosen randomly for the auction to end at. A strategy inspired by this approach was used by my agent WhiteBear, which participated in the competition. WhiteBear has won the competition several times and has achieved the best overall performance among all participating agents.

Furthermore, given that agents are in competition in this setting, I examine the behaviour of bidding agents that are in direct competition with the other participants in an auction. Thus the agents are not simply trying to maximize their own utility; rather, they wish to maximize a weighted difference of their own gain to that of their competitors. I derive symmetric Bayes-Nash equilibria for these agents in both m-th and (m+1)-th price sealed-bid auctions. Subsequently, I use these equilibria to examine the profits of different agents and show that aiming to beat the competition is more effective than pure self-interest in any competitive setting. Finally, I examine how the auctioneer’s revenue is affected and find that the weight that agents place on minimizing the opponents’ profit determines whether the m-th or the (m+1)-th price auction yields a higher revenue.

Bio: Ioannis Vetsikas received his Diploma (BSc with thesis) from NTU Athens, Greece and his MSc and PhD from Cornell University, USA. He is currently a senior research fellow at the IAM group of the ECS School at Southampton University. His research interests lie in market-based systems, intelligent (trading) agents, auctions and related optimization issues, both from a theoretical and experimental standpoint, as well as applications.

10/09/2009 (10:30 am SVC-6/Titan)

Title: Efficient and Robust Processing of top-k Relational Queries

Speaker: Alkis Polyzotis, UC Santa Cruz

Abstract: In the rank join problem, we are given a set of relations and a scoring function, and the goal is to return the join results with the top K scores. It is often the case in practice that the inputs may be accessed in ranked order and the scoring function is monotonic. These conditions allow for efficient algorithms that solve the rank join problem without reading all of the input.

This talk will present an overview of our recent work on the development of such rank join algorithms. We first demonstrate analytically that well known join algorithms do not stop reading their input as soon as possible, thus leading to bad performance. We find that it is NP-hard to overcome this weakness in the general case, but cases of limited query complexity are tractable. We show the latter with a rank join algorithm that infers provably tight bounds on the potential benefit of reading more input in order to stop as soon as possible. We also introduce a heuristic that enables the algorithm to deal gracefully with cases of high query complexity, where the computation of tight bounds is provably hard. Experimental results show that our algorithms outperform the state-of-the-art rank join algorithms by a significant margin, improving both I/O and total execution time.
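To fix ideas, here is a minimal rank join in the spirit of the classic HRJN baseline (a generic sketch, not the speaker's improved algorithm): read the two ranked inputs alternately, join each new tuple against the tuples seen so far, and stop once k results match or beat the score bound achievable from unread input:

```python
import heapq

def rank_join(left, right, k, score=lambda a, b: a[1] + b[1]):
    """left/right: lists of (key, score) sorted by score descending.
    Returns the top-k join results under a monotone combined score."""
    seen_l, seen_r, results = [], [], []
    i = j = 0
    while i < len(left) or j < len(right):
        # Alternate between inputs; fall back to whichever has tuples left.
        if (i <= j and i < len(left)) or j >= len(right):
            t = left[i]; i += 1; seen_l.append(t)
            matches = [(t, u) for u in seen_r if u[0] == t[0]]
        else:
            u = right[j]; j += 1; seen_r.append(u)
            matches = [(t, u) for t in seen_l if t[0] == u[0]]
        for t, u in matches:
            heapq.heappush(results, (-score(t, u), t[0]))
        # Upper bound on any join result involving an unread tuple.
        top_l, top_r = left[0][1], right[0][1]
        cur_l = left[i][1] if i < len(left) else float("-inf")
        cur_r = right[j][1] if j < len(right) else float("-inf")
        bound = max(score(("", top_l), ("", cur_r)),
                    score(("", cur_l), ("", top_r)))
        found = sorted(results)[:k]
        if len(found) >= k and -found[-1][0] >= bound:
            break  # nothing unread can beat the k-th best result
    return [(key, -neg) for neg, key in sorted(results)[:k]]

L = [("a", 9), ("b", 7), ("c", 5)]
R = [("b", 8), ("a", 6), ("c", 1)]
print(rank_join(L, R, k=1))
```

The weakness the talk analyzes is exactly the looseness of `bound` above: a tighter bound lets the algorithm stop reading earlier, and computing a provably tight one is where the query-complexity trade-off appears.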

Bio: Alkis Polyzotis is currently an assistant professor at UC Santa Cruz. His research focuses on database systems, and in particular on on-line database tuning, top-k queries, and P2P databases. He is the recipient of an NSF CAREER award in 2004 and of an IBM Faculty Award in 2005 and 2006. He received his PhD from the University of Wisconsin at Madison in 2003.

06/11/2009 (10:30 am SVC-6/Titan)

Title: Building Privacy into the Netflix Prize Contenders

Speaker: Ilya Mironov, Microsoft Research – SVC

Abstract: A recommender system based on collaborative filtering is a double-edged sword. By aggregating and processing preferences of multiple users it may provide personalized recommendations, serving an important business purpose. More disconcertingly, a recommender system is a potential source of leakage of private information shared by its users.

In the talk we will define differential privacy for recommender systems and briefly review leading approaches in the Netflix Prize competition. We will demonstrate how these approaches can be adapted to provide differential privacy, without significantly impairing their accuracy.
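The flavor of such an adaptation can be sketched with the standard Laplace mechanism: publish noisy per-item statistics rather than raw ones, so any single user's ratings have bounded influence. This is a generic illustration of the technique, not the KDD'09 construction itself:

```python
import math
import random

def laplace(scale):
    """Draw Laplace(0, scale) noise via the inverse CDF."""
    u = random.random() - 0.5
    sign = 1 if u >= 0 else -1
    return -scale * sign * math.log(1 - 2 * abs(u))

def private_item_means(ratings, epsilon, r_max=5.0):
    """ratings: {item: [user ratings]}. Assuming each user rates an item
    at most once, one user changes an item's sum by at most r_max and
    its count by 1; we noise both and divide (a common DP pattern)."""
    out = {}
    for item, rs in ratings.items():
        noisy_sum = sum(rs) + laplace(r_max / epsilon)
        noisy_cnt = max(1.0, len(rs) + laplace(1.0 / epsilon))
        # Clamp to the valid rating range before publishing.
        out[item] = min(r_max, max(0.0, noisy_sum / noisy_cnt))
    return out

random.seed(0)
print(private_item_means({"m1": [4, 5, 3, 4], "m2": [2, 1]}, epsilon=1.0))
```

The accuracy question the talk addresses is how much such noise degrades the downstream recommender, given that popular items have large counts and hence small relative noise.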

Joint work with Frank McSherry, MSR SVC, to appear in KDD'09.

Bio: Ilya Mironov is a researcher at Microsoft Research, Silicon Valley Campus. He works in the areas of cryptography, cryptanalysis, and privacy of statistical databases. Ilya received his PhD from Stanford in 2003, advised by Dan Boneh.


06/05/2009 (10:30 am SVC-6/Titan)

Title: Finding experts and initiators in social networks

Speaker: Evimaria Terzi, IBM Almaden

Abstract: The advent of online social networks and the need to exploit and analyze them have motivated lots of interesting research. In this talk, I will present two such problems and discuss the algorithmic challenges they pose.

In the first part of the talk, I will focus on the problem of expertise location in social networks: given a task that needs to be performed and a pool of individuals organized in a social network, the problem asks for the subset of individuals in the network that can perform the input task with the least overhead. Different definitions of overhead lead to different formulations of the problem. In this talk, I will show how these definitions are related to existing NP-hard combinatorial problems and describe efficient approximation algorithms for solving them. I will also give some indicative experimental results that show the efficacy of our framework.
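One natural heuristic for this setting can be sketched as a greedy cover (invented data; the talk's actual overhead definitions and algorithms may differ): repeatedly add the expert who covers the most uncovered skills, breaking ties by closeness to the team chosen so far:

```python
def greedy_team(task, experts, dist):
    """task: set of required skills; experts: {name: set of skills};
    dist: {(a, b): hops in the social network}. Greedily add the expert
    covering the most uncovered skills; break ties by average distance
    to the current team (a crude proxy for coordination overhead)."""
    def avg_dist(p, team):
        if not team:
            return 0.0
        return sum(dist.get((p, q), dist.get((q, p), 1))
                   for q in team) / len(team)

    team, uncovered = [], set(task)
    while uncovered:
        best = max((p for p in experts if p not in team),
                   key=lambda p: (len(experts[p] & uncovered),
                                  -avg_dist(p, team)))
        gained = experts[best] & uncovered
        if not gained:
            raise ValueError("task cannot be covered by this pool")
        team.append(best)
        uncovered -= gained
    return team

experts = {"ann": {"java", "ml"}, "bob": {"ml"}, "eve": {"ui", "java"}}
print(greedy_team({"java", "ml", "ui"}, experts,
                  {("ann", "eve"): 1, ("ann", "bob"): 2}))
```

Pure skill cover is the classic NP-hard set-cover problem, which is why greedy approximation is the natural tool; folding in the network distance is what distinguishes the formulations mentioned in the abstract.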

In the second part of the talk I will focus on a novel graph-reconstruction problem. Given a transaction database that shows which items were bought by different individuals, the task is to reconstruct the social network of individuals and find the initiating individuals for every product. I will give a formal definition of this problem and discuss algorithms for solving it.

I will also demonstrate the usefulness of our methods. These are joint works with T. Lappas, K. Liu and H. Mannila.

Bio: Evimaria Terzi has been a research scientist at IBM Almaden Research Center since June 2007. She obtained her PhD from the University of Helsinki (Finland) in January 2007, under the supervision of Prof. Heikki Mannila. Before that, she got her MSc from Purdue University (USA) and her BSc from the Aristotle University of Thessaloniki (Greece). Her research focuses on algorithmic aspects of data mining with applications in the analysis of sequential and graph data, ranking and clustering.

05/15/2009 (10:30 am SVC-6/Titan)

Title: SOFIE - A Self-Organizing Framework for Information Extraction

Speaker: Fabian Suchanek, Microsoft Research, Search Labs

Abstract: I will present SOFIE, a self-organizing framework for information extraction. SOFIE can extract facts from natural language documents, disambiguate the entities and relations, perform logical consistency checks against background knowledge, and add the facts to an ontology. SOFIE does all of this in a unified framework, casting all of the subtasks into a weighted maximum satisfiability problem. In this context, I will also give a quick overview of the knowledge representation used in today's ontologies (RDFS) and show parallels to our commerce search project.

Bio: Fabian studied Cognitive Science and received his Bachelor of Science in Osnabrueck/Germany. He did his Master of Science in Computer Science at the University of Saarbruecken/Germany. Fabian did his PhD at the Max-Planck Institute for Informatics under the supervision of Gerhard Weikum. He has worked in the areas of the Semantic Web, ontologies, natural language processing and information extraction. His main project was the YAGO ontology. Fabian recently joined the Search Labs of Microsoft Research in Silicon Valley, where he is working on the commerce section of Live Search.

05/08/09 (10:30am, SVC-6/Titan)

Title: Learning to maximize expected ranking gain for information retrieval and collaborative filtering

Speaker: Rich Zemel, University of Toronto

Abstract: Ranking a set of retrieved documents according to their relevance to a query is a popular problem in information retrieval (IR). Methods that learn ranking functions are difficult to optimize, as ranking performance is typically judged by metrics that are not smooth. In collaborative filtering (CF) the problem has a different flavor. Instead of queries we now have users, and the items to rank are sets of songs, books, movies etc., with relevance levels given by ratings provided by users. Even though the primary goal of CF is to recommend a small set of items to users (a ranking problem), almost all CF methods treat this as a rating classification problem. The learning algorithms are thus optimizing the wrong objective, and one that is actually harder. We propose a new approach to learning to rank, creating a probability distribution over rankings assigned to documents for a given query, which permits gradient ascent optimization of the expected value of the target performance measure. Experimental results on benchmark IR and CF data sets demonstrate the strong performance of our method relative to existing learning approaches to ranking. Joint work with Maksims Volkovs and Ben Marlin.
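The core trick — a smooth distribution over rankings whose expected metric admits gradient ascent — can be illustrated with a tiny Plackett-Luce model, enumerating all rankings of a three-item list. Enumeration is only feasible at toy sizes, and this generic sketch is not the speakers' actual construction:

```python
import math
from itertools import permutations

def pl_prob(order, scores):
    """Plackett-Luce probability of a ranking, given item scores."""
    p, remaining = 1.0, list(order)
    for item in order:
        z = sum(math.exp(scores[j]) for j in remaining)
        p *= math.exp(scores[item]) / z
        remaining.remove(item)
    return p

def ndcg(order, rel):
    """NDCG of one ranking against graded relevance labels."""
    dcg = sum((2 ** rel[i] - 1) / math.log2(r + 2)
              for r, i in enumerate(order))
    ideal = sorted(rel, reverse=True)
    idcg = sum((2 ** g - 1) / math.log2(r + 2)
               for r, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

def expected_ndcg(scores, rel):
    """Smooth surrogate: expectation of NDCG over random rankings.
    Differentiable in the scores, unlike NDCG of a single sort."""
    items = range(len(scores))
    return sum(pl_prob(o, scores) * ndcg(o, rel)
               for o in permutations(items))

rel = [2, 0, 1]                              # graded relevance labels
print(expected_ndcg([3.0, 0.0, 1.0], rel))   # mass on the relevant item
print(expected_ndcg([0.0, 3.0, 1.0], rel))   # mass on an irrelevant item
```

Because `expected_ndcg` is a smooth function of the scores, its gradient can drive learning even though NDCG itself is a step function of the ranking.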

Bio: Richard Zemel is a professor in the Computer Science Department at the University of Toronto. He is on sabbatical this year at Berkeley. He obtained a B.A. at Harvard University, and his Ph.D. in Computer Science from the University of Toronto. He was a postdoctoral fellow at the Salk Institute and at Carnegie Mellon University, and a faculty member at the University of Arizona. He has received several awards and honors, including a Young Investigator Award from the Office of Naval Research and three Dean's Excellence Awards at the University of Toronto, and was recently appointed a Fellow of the Canadian Institute for Advanced Research. His research interests include topics in machine learning, vision, and neural coding.

01/23/09 (10:30am, SVC-6/Titan)

Title: Information Overload: Models and Methods to Live with it

Speaker: Georgia Koutrika, Stanford University

Abstract: If knowledge is power, then it comes as no surprise that most of our community's research boils down to the fundamental question of how we can master data. Our problem is not that we lack data, but rather that we have too much of it. Information overload is an open problem, further complicated by the number and diverse needs of information seekers. The problem is not just finding the right information, but finding the right information for the right person. In this talk, I will present research efforts for supporting personalized search results and recommendations. I will discuss user preference modeling over structured data and methods for personalizing searches for a user given a profile that stores the user's preferences. Finally, I will present an approach that allows recommendations to be easily defined, personalized, and processed over structured data.

Bio: Georgia Koutrika is a postdoctoral researcher at the CS Department of Stanford University. She has a PhD in computer science, an MSc in advanced information systems, and a BSc in informatics from the University of Athens' Department of Informatics and Telecommunications in Greece. Georgia is interested in how people can find, visualize and interact with information and services, and she has worked on user behavior and preference modeling, personalization, information access and sharing, recommendations, spam, query processing and optimization. She is particularly intrigued by the potential and challenges of social sites, and a lot of her recent research is delivered through CourseRank, a social site for course planning developed at and used in Stanford. Georgia has received distinguished awards for her PhD, MSc, and undergraduate studies from the Greek Scholarship Foundation (I.K.Y.). She has also earned young researcher study fellowships from the Onassis Foundation and the European Network of Excellence in Digital Libraries (DELOS). She has served as a PC co-chair of three workshops (in conj. with UM’07, NLDB’08, and VLDB’08) and as a PC member of several conferences and workshops, including ICDE, WWW, EDBT, and ESWC.

11/04/08 (11:00am, SVC-6/Titan)

Speaker: Fabian M. Suchanek, Max-Planck Institute for Informatics

Title: YAGO: A Core of Semantic Knowledge

Abstract: This talk will present YAGO, a large ontology which currently contains more than 2 million entities and close to 20 million facts about them. The talk will explain how YAGO was constructed and how we represent YAGO's knowledge. It will also give an overview of our latest project, SOFIE, which uses logical reasoning to extract new information for YAGO from Web sources. The talk concludes with some examples of applications of YAGO, including semantic search and an analysis of social tagging.


Bio: Fabian studied Cognitive Science and received his BSc in Osnabrueck/Germany. He did his MSc in Computer Science at the University of Saarbruecken/Germany. Currently, Fabian is doing his PhD at the Max-Planck Institute for Informatics under the supervision of Gerhard Weikum. Fabian just handed in his dissertation. He works in the areas of the Semantic Web, ontologies, natural language processing and information extraction. His main project is the YAGO ontology. During his PhD, Fabian spent 3 months at Microsoft Research Cambridge/UK, where he worked with Milan Vojnovic on Social Tagging.

08/28/08 (2:00pm, SVC-6/Titan)

Sunita Sarawagi

Structured learning models for information extraction

Feature-based structured models provide a flexible and elegant framework for various information extraction (IE) tasks. These include label sequences for traditional IE, segmentation models for entity-level extractions, and skip chain models for collective labeling. I will present efficient algorithms for finding the highest scoring prediction in graphs that arise when collectively labeling word sequences that are linked via associative edges.

Another challenging problem in structured learning is formulating training objectives that provide good generalization without requiring complicated inference during training. We review existing max-margin formulations and propose a new formulation that is better suited for decomposable error functions that arise in typical extraction tasks.
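For the traditional label-sequence case mentioned above, the highest-scoring prediction can be found exactly with Viterbi dynamic programming; the sketch below is illustrative, with `score` a hypothetical stand-in for the model's combined transition and feature scores, not the talk's collective-labeling algorithm:

```python
def viterbi(n, states, score):
    # score(prev, cur, t): combined transition + feature score for labeling
    # position t with `cur`, given previous label `prev` (None at t = 0).
    best = {s: score(None, s, 0) for s in states}
    back = []
    for t in range(1, n):
        prev_best, best, ptr = best, {}, {}
        for s in states:
            p = max(prev_best, key=lambda q: prev_best[q] + score(q, s, t))
            best[s] = prev_best[p] + score(p, s, t)
            ptr[s] = p
        back.append(ptr)
    seq = [max(best, key=best.get)]
    for ptr in reversed(back):  # follow back-pointers to recover the path
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))
```

Collective labeling with associative edges between sequences breaks this chain structure, which is why the talk's specialized inference algorithms are needed.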


Sunita Sarawagi conducts research in the fields of databases, data mining, and machine learning. Her current research interests are information integration, graphical and structured models, and probabilistic databases. She is an associate professor at IIT Bombay; prior to that, she was a research staff member at IBM Almaden Research Center. She received her PhD in databases from the University of California at Berkeley and a bachelor's degree from IIT Kharagpur. She has numerous publications in databases and data mining and several patents. She serves on the boards of directors of ACM SIGKDD and the VLDB Foundation. She is program chair for the ACM SIGKDD 2008 conference and has served as a program committee member for the SIGMOD, VLDB, SIGKDD, ICDE, and ICML conferences. She is on the editorial boards of the ACM TODS, ACM TKDD, and Foundations and Trends in Machine Learning journals.

08/28/08 (11:00am, SVC-6/Mir)

Soumen Chakrabarti

Structured Learning for Non-Smooth Ranking Losses

Learning to rank from relevance judgment is an active research area. Itemwise score regression, pairwise preference satisfaction, and listwise structured learning are the major techniques in use. Listwise structured learning has been applied recently to optimize important non-decomposable ranking criteria like AUC (area under ROC curve) and MAP (mean average precision). We propose new, almost-linear-time algorithms to optimize for two other criteria widely used to evaluate search systems: MRR (mean reciprocal rank) and NDCG (normalized discounted cumulative gain) in the max-margin structured learning framework. We also demonstrate that, for different ranking criteria, one may need to use different feature maps. Search applications should not be optimized in favor of a single criterion, because they need to cater to a variety of queries. E.g., MRR is best for navigational queries, while NDCG is best for informational queries. A key contribution of this paper is to fold multiple ranking loss functions into a multi-criteria max-margin optimization. The result is a single, robust ranking model that is close to the best accuracy of learners trained on individual criteria. In fact, experiments over the popular LETOR and TREC data sets show that, contrary to conventional wisdom, a test criterion is often not best served by training with the same individual criterion.
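As a concrete reference, the two criteria the work optimizes can be computed as follows (a minimal sketch; function names are illustrative):

```python
import math

def mrr(ranked_relevance_lists):
    """Mean reciprocal rank: average over queries of 1/position
    of the first relevant result (0 if none is relevant)."""
    total = 0.0
    for rels in ranked_relevance_lists:
        for i, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / i
                break
    return total / len(ranked_relevance_lists)

def ndcg(gains, k=None):
    """Normalized discounted cumulative gain for one ranked list of
    graded relevance gains; 1.0 means the ranking is ideal."""
    if k is None:
        k = len(gains)
    dcg = sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))
    ideal = sorted(gains, reverse=True)
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

Neither function is smooth in the underlying scores (both depend on sort order only), which is what makes max-margin structured learning over these losses non-trivial.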


Soumen Chakrabarti is Associate Professor of Computer Science at IIT-Bombay. He received his PhD from the University of California, Berkeley, in 1996, and was a Research Staff Member at IBM Almaden from 1996 to 1999. During Spring 2004 he was Visiting Associate Professor in the School of Computing, Carnegie-Mellon University. He has published extensively at conferences like WWW, SIGIR, EMNLP/HLT, SIGKDD, VLDB, and SIGMOD, and has also served frequently as vice chair or program committee member. His paper on Focused Crawling received the Best Paper award at WWW 1999, and others have been invited to Scientific American, IEEE Computer, and VLDB Journal. As of 2006, his papers have over 1000 citations in CiteSeer. He has presented many tutorials at WWW, SIGMOD, VLDB and SIGKDD. These have led to a successful textbook, Mining the Web, published in 2002, with a second edition currently in progress. He has eight granted US patents. At IIT Bombay, he has obtained research funding from Microsoft Research, IBM Research, GE Research, University of California, Tata Consultancy Services, and NEC Research, and delivered working software systems in several cases. For more details please visit

08/15/08 (1:30pm, SVC-6/Titan)

Qiaozhu Mei

Context Analysis in Text Mining and Search

The explosive growth of information demands powerful text mining tools to help us digest information and discover hidden knowledge in text. Text data is often associated with various types of context information, such as the time stamp of a news article, the geographical location where a blog article is written, the user who submitted a query to a search engine, and a social network of people. Given a collection of text data with context information, we often would like to extract the topics or themes from the text and analyze how they vary across contexts. Such contextual patterns are very helpful for summarizing topics in text, as well as for providing a better targeted web search service. In this talk, we will present a general way of modeling and analyzing context in text with probabilistic models, and new algorithms for discovering and analyzing various contextual patterns, which we refer to as contextual text mining. The proposed models have broad applications in multiple domains: they help understand topic evolution, the spatiotemporal impact of events, and public opinion; detect topic-related social communities; and support personalization in web search. We will discuss how to design concrete contextual topic models for different types of context, how to label probabilistic topic models, and how to regularize probabilistic models with harmonic functions defined on contexts.


Qiaozhu Mei is a Ph.D. candidate in the Department of Computer Science at the University of Illinois at Urbana-Champaign. His research interests include text mining and information retrieval, especially context analysis in text mining and retrieval with probabilistic language models and topic models. He has published extensively in these areas, and has received the Best Student Paper Runner-Up Awards at ACM KDD 2006 and ACM KDD 2007. He is also one of the five recipients of the inaugural Yahoo! Ph.D. Student Fellowship.

Please refer to his homepage for more information:

08/15/08 (10:30am, SVC-6/Mir)

Tom Minka

On the Optimization of Evaluation Metrics

Suppose you want a search engine that performs well according to some IR effectiveness metric M, such as NDCG. You've collected some labeled training data to tune the ranker. What metric should you optimize on this training data? Should it also be M? This assumption is very common in IR, yet a growing body of experimental evidence says "no". In fact, it is well known in the machine learning literature that optimizing M on training data does not necessarily lead to the best M at test time. I will go through these arguments and explain what the correct training metric should be.

(Joint work with Stephen Robertson.)

This is based on my tech report "Empirical Risk Minimization is an incomplete inductive principle".


Tom Minka is a researcher in the Machine Learning and Perception Group at Microsoft Research Cambridge. His work focuses on probability propagation algorithms for graphical models. He is perhaps best known for developing the Expectation Propagation algorithm. He is also a co-inventor of the TrueSkill matchmaking algorithm used in Xbox Live. Previously, he was a Visiting Assistant Professor at Carnegie Mellon University. He completed his Ph.D. at the MIT Media Lab under the supervision of Rosalind Picard.

08/08/08 (1:30pm, SVC-6/Titan)

Heiko Stoermer (University of Trento, Italy)

OKKAM - A large-scale European Project for Entity-centric Information and Knowledge Management at Web Scale

Recent developments in the Web ecosphere reveal a common trend: keyword-based document search is becoming increasingly unsatisfactory in many domains and application scenarios. What can be observed are attempts to realize what we call "entity-centric" information and knowledge management, namely a change of paradigm in the management and access of information, away from a document-centric view, towards a more precise and more structured approach in which (real-world) entities are the pivots of IKM. This trend manifests itself from two directions: on the one hand, we have the efforts of the Semantic Web community and their quest for web-scale Knowledge Representation by the masses; on the other hand, there is an increasing number of vertical domains and applications that center their information around entities such as geographical locations (Geonames, Yahoo), biomedical objects (LSID), publications (DOI), people/social networks (FOAF, Facebook, Xing, ...), and many others. Making these "islands" of information usable in an integrated fashion is one of the main challenges that the Web is going to face in the near future.

In this talk I will argue that at the center of such a vision there is the need for good, re-usable entity identifiers as well as a scalable, open, and trustworthy backbone infrastructure for entity identifier management, which we call the Entity Name System (ENS). Only if we manage to foster global convergence on identifiers for entities will we be able to move the Web to a more structured and integrated source of knowledge that will allow for better, novel ways of web-scale IKM. From 2008 to 2010, the European Commission is funding the OKKAM large-scale Integrated Project, which is pursuing precisely this goal. During the seminar, I will introduce motivations and use cases, give an overview of the OKKAM project, and illustrate in more detail the current R&D challenges that we are addressing. For more details, see


Heiko Stoermer works as a post-doctoral researcher in the area of Knowledge Management and Information Systems at the University of Trento, Italy. After studies in Linguistics he moved to the area of Computer Science, in which he holds a German university degree. He gathered professional experience as a software consultant and project manager, and opted for an academic career in 2004 when he received a scholarship at the International Doctorate School in Trento. In 2008, he was awarded a PhD for his work on Identity and Reference on the Semantic Web. His research interests include information integration, semantic interoperability, and contextual knowledge representation. He has (co-)authored a number of scientific publications, has acted as a reviewer for major international conferences, and is a co-organizer of several workshops. Currently, he is serving as UNITN's R&D Manager for the European large-scale Integrated Project OKKAM.

07/11/08 (1:30pm, SVC-6/Titan)

Jeff Jonas (IBM)

Beyond Search: The Data Must Find the Data and the Relevance Must Find the User

Why "search" is second fiddle to "the data must find the data and the relevance must find the user."


Jeff Jonas is chief scientist of the Entity Analytic Solutions group and an IBM Distinguished Engineer. In these capacities, he is responsible for shaping the overall technical strategy of next generation identity analytics and the use of this new capability in the overall IBM technology strategy.

The IBM Entity Analytic Solutions group was formed based on technologies developed by Mr. Jonas as the founder and chief scientist of Systems Research & Development (SRD). SRD was acquired by IBM in January 2005.

Today, Mr. Jonas applies his real-world, hands-on experience in software design and development to drive technology innovations while delivering higher levels of privacy and civil liberties protections. By way of example, the most recent breakthrough developed by Mr. Jonas involves an innovative technique enabling advanced data correlation while using only irreversible cryptographic hashes. This new capability makes it possible for organizations to discover records of common interest (e.g., identities) without the transfer of any privacy-invading content. This privacy-enhancing technology, known as anonymous entity resolution, delivers extraordinary new levels of privacy protection while enabling technology to contribute to critical societal interests like clinical health care research, aviation safety, homeland security, fraud detection, and identity theft.

Jeff Jonas’s innovations have received coverage in such publications as The Wall Street Journal, Washington Post, Fortune, and Computerworld and have been featured on ABC Primetime with Peter Jennings, The Discovery Channel, The Learning Channel and MSNBC. Known for his dynamic presentational style, he is a popular speaker on technology, security and privacy and has spoken at events such as the Federal Convention on Emerging Technologies Forum on Homeland Security, National Security Agency’s INFOSEC Seminar Series, American Society for Industrial Security, Black Hat, PC Forum, Wharton Technology Conference, National Retail Federation Annual Fraud Conference and Computers, Freedom and Privacy Conference.

Mr. Jonas is a member of the Markle Foundation Task Force on National Security in the Information Age and actively contributes his insights on privacy, technology and homeland security to leading national think tanks, privacy advocacy groups, and policy research organizations, including the Center for Democracy and Technology, Heritage Foundation and the Office of the Secretary of Defense Highlands Forum. Mr. Jonas is a senior associate to the Center for Strategic and International Studies and is a member of the National Academies’ Committee on State Voter Registration Databases.

Jeff Jonas blogs at:

06/27/08 (1:30pm, SVC-6/Titan)

Umar Syed (Princeton University / Microsoft Research)

A Game-Theoretic Approach to Apprenticeship Learning

We study the problem of an apprentice learning to behave in an environment with an unknown reward function by observing the behavior of an expert. We follow on the work of Abbeel and Ng (2004) who considered a framework in which the true reward function is assumed to be a linear combination of a set of known and observable features. We give a new algorithm that, like theirs, is guaranteed to learn a policy that is nearly as good as the expert’s, given enough examples. However, unlike their algorithm, we show that ours may produce a policy that is substantially better than the expert’s. Moreover, our algorithm is computationally faster, is easier to implement, and can be applied even in the absence of an expert. The method is based on a game-theoretic view of the problem, which leads naturally to a direct application of the multiplicative-weights algorithm of Freund and Schapire (1999) for playing repeated matrix games. In addition to our formal presentation and analysis of the new algorithm, we provide an experimental demonstration of the algorithm on a toy video-game environment.
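The multiplicative-weights scheme for repeated matrix games that the approach builds on can be sketched as follows (a simplified illustration against a best-responding opponent, not the paper's full apprenticeship-learning algorithm; names and parameters are illustrative):

```python
import math

def multiplicative_weights(payoff, rounds=1000, eta=0.1):
    # payoff[i][j] in [0, 1] is the row player's payoff for row i against
    # column j; here the column player best-responds each round.
    n, m = len(payoff), len(payoff[0])
    w = [1.0] * n
    avg = 0.0
    for _ in range(rounds):
        total = sum(w)
        p = [wi / total for wi in w]
        # Opponent picks the column minimizing our expected payoff.
        j = min(range(m), key=lambda c: sum(p[i] * payoff[i][c] for i in range(n)))
        avg += sum(p[i] * payoff[i][j] for i in range(n))
        # Exponentially reward rows that score well against that column.
        for i in range(n):
            w[i] *= math.exp(eta * payoff[i][j])
    return avg / rounds
```

On matching pennies (value 0.5), the average payoff approaches the game value from below as the number of rounds grows, which is the no-regret guarantee the paper's analysis relies on.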

Joint work with Robert E. Schapire.


Umar Syed is a fourth-year Ph.D. student in the Computer Science department at Princeton University, advised by Rob Schapire. His research interests include reinforcement learning, online learning, and game theory.

06/20/08 (9:30am, SVC-6/Titan)

Aaron Roth (CMU / Microsoft Research)

A Learning Theory Approach to Non-Interactive Database Privacy

We consider the problem of privately releasing a dataset so that others may usefully learn from it. Previous work shows that for some classes of queries, any dataset useful for all queries in that class would allow an adversary to reconstruct almost the entire dataset. However, we show that (at least if one ignores computational constraints), it is possible to privately release a dataset while still being useful for all queries natural for learning. More specifically, we show that for any class of functions C, it is possible to privately release a data set that is simultaneously useful for every function in C, given a sample of size only polynomial in the VC-dimension of C. We note that the sample complexity necessary for learning guarantees and privacy is only slightly larger than the sample complexity needed for learning alone, and so "anything that can be reliably learned from a dataset can be reliably learned from a privacy-preserving dataset of only slightly larger size". We also give efficient data release mechanisms for some simple classes of functions, like axis-aligned rectangles in constant dimension, and large-margin halfspaces.

This is joint work with Avrim Blum and Katrina Ligett.


Aaron has just finished his second year as a PhD student at Carnegie Mellon, under the supervision of Avrim Blum. Currently, he is interning at MSR-SVC, working with Dr. Liad Blumrosen.

06/20/08 (9:30am, SVC-6/Titan)

Reid Andersen (Microsoft Research)

Local Graph Algorithms

Local graph algorithms perform small computations by adaptively exploring a small neighborhood of a graph near a specified target vertex. They are highly scalable, with running times that are essentially independent of the size of the graph. Local algorithms have been used to identify communities in web graphs, find link exchange sites, and extract high-value submarkets from advertiser-keyword spending graphs. In this talk, we describe a local algorithm for the following task: identifying the web pages that contribute the most to the PageRank score of a specified target page. We give bounds on the number of nodes that the algorithm must examine to approximately compute the top contributors. The ideal platform for implementing local algorithms is a link server that provides fast access to the link data of a set of adaptively selected pages. We describe an implementation of the PageRank contribution algorithm on LiveLink, a high-performance link server hosting a 20-billion node web graph snapshot.
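The local flavor of such a computation can be sketched as a "reverse push" over in-links: residual PageRank mass at the target is repeatedly pushed backwards to its in-neighbors, so only nodes near the target are ever touched. This is a simplified, assumption-laden sketch, not the talk's exact algorithm or bounds; the names and termination rule are illustrative:

```python
def approx_contributions(in_links, out_degree, target, alpha=0.15, eps=1e-4):
    # in_links[v]: pages linking to v; out_degree[u]: u's out-degree.
    # Returns estimated contributions of nearby pages to target's PageRank.
    p = {}                # settled contribution estimates
    r = {target: 1.0}     # residual mass still to be pushed backwards
    queue = [target]
    while queue:
        v = queue.pop()
        rv = r.pop(v, 0.0)
        if rv <= eps:     # too little mass to matter: leave it as residual
            if rv > 0:
                r[v] = rv
            continue
        p[v] = p.get(v, 0.0) + alpha * rv
        for u in in_links.get(v, []):
            r[u] = r.get(u, 0.0) + (1 - alpha) * rv / out_degree[u]
            if r[u] > eps and u not in queue:
                queue.append(u)
    return p
```

Because the loop only ever follows in-links of already-touched nodes, the work depends on the neighborhood of the target and the accuracy eps, not on the size of the whole graph.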


Reid Andersen is a postdoc in the theory group at MSR, Redmond. His current research focus is on local algorithms, algorithms for large bipartite graphs and web graphs, and graph partitioning and community identification.


Sam Roweis (University of Toronto)

Making The Sky Searchable: Large Scale Astronomical Pattern Recognition

Imagine you have a picture of the night sky and you want to know where your telescope was pointing when the picture was taken. Several extremely large digital astronomy catalogues are available which hold (among other data) the positions and magnitudes of billions of stars covering the sky. So you should, in principle, be able to search these sky catalogues and find where your image matches. The only catch is that the sky is really big. However, if you are clever you can design a statistical pattern recognition system that solves this problem on a single workstation in less than a second. In collaboration with astronomers at NYU, my group has built such a universal astronomy search engine. Roughly speaking, this engine takes as input any picture of the night sky, and returns as output the location (and orientation) on the sky at which the picture was taken. Current "calibration" systems can do this kind of alignment only if the user provides a very good initial guess at position and orientation, but our system works even when no initial guess is available. In order to achieve this, we have built a massive geometric hashing system, which stores millions of "micro-features" and matches them in an inverted index against features found in a new test image. This matching process very rapidly generates candidate locations on the sky which we can then cheaply verify or refute. As a demonstration of the power of the system we have re-solved, completely blindly, the astrometry of every field ever taken by the Sloan Digital Sky Survey, solving over 99.9% of the approximately 400,000 images correctly with zero false positive matches. We have also used it to solve for the locations of astronomical images from historical plate archives, amateur astronomer pictures and figures from old research papers. 
In the future, we envision our engine being used for real-time verification of telescope software, to recover position information from all legacy data of astronomical images, and for "collaborative observing" between amateur and professional astronomers around the globe.
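The invariance trick behind the "micro-features" can be sketched as follows: for a small asterism of stars, map the most widely separated pair to fixed reference points with a similarity transform, and record the remaining stars' coordinates in that frame. The resulting code is unchanged by translation, rotation, and scale, so it can key an inverted index. This is a simplified illustration, not the production system's exact quad code:

```python
from itertools import combinations

def quad_hash(stars):
    # Represent stars as complex numbers; pick the most widely separated pair.
    pts = [complex(x, y) for (x, y) in stars]
    a, b = max(combinations(pts, 2), key=lambda pair: abs(pair[0] - pair[1]))
    code = []
    for p in pts:
        if p == a or p == b:
            continue
        # Similarity transform sending a -> 0 and b -> 1: translation,
        # rotation, and uniform scale all cancel in this complex ratio.
        w = (p - a) / (b - a)
        code.extend((w.real, w.imag))
    return tuple(code)
```

Matching a test image then amounts to computing such codes for its star quads and looking them up (with tolerance) against the precomputed catalogue codes.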

BONUS: If you email me an image I'll try to solve it with our system and show the results live during the talk.

Joint work with Dustin Lang, Keir Mierle, Michael Blanton and David Hogg.

Project Website:


Sam Roweis is an Associate Professor in the Department of Computer Science at the University of Toronto. His research interests are in machine learning, data mining, and statistical signal processing. Roweis did his undergraduate degree at the University of Toronto in the Engineering Science program and earned his doctoral degree in 1999 from the California Institute of Technology working with John Hopfield. He did a postdoc with Geoff Hinton and Zoubin Ghahramani at the Gatsby Unit in London, and was a visiting faculty member at MIT in 2005. He has also worked at several industrial research labs including Google, Bell Labs, Whizbang! Labs and Microsoft. He is the holder of a Canada Research Chair in Statistical Machine Learning, a Sloan Research Fellowship, the winner of a Premier's Research Excellence Award, and a Scholar of the Canadian Institute for Advanced Research.

06/06/08 (1:30pm, SVC-6/Titan)

Tao Cheng (UIUC / Microsoft Research)

Entity Search: Search for what you want

As the Web has evolved into a data-rich repository, current search engines, with their standard "page view", are increasingly inadequate. While we often search for various data "entities" (e.g., phone number, paper PDF, date), today's engines only take us indirectly to pages. To support searching directly for information at this finer granularity, we propose the concept of entity search, a significant departure from traditional document retrieval. In particular, we focus on the core challenge of ranking entities, and develop EntityRank, a probabilistic model that integrates both local and global information in ranking. We evaluate our online prototype over a 2TB Web corpus, and show that EntityRank performs effectively. This presentation is based mainly on work published in CIDR 07, SIGMOD 07 (demo), and VLDB 07.


In addition to the presentation, a live demo over a large-scale dataset will be given.


Tao Cheng is a fourth-year PhD student at the University of Illinois at Urbana-Champaign. His research interests center on web search and mining. He works with Prof. Kevin Chang at UIUC on the WISDM project, proposing and studying a new paradigm of search called entity search: searching over the fine-granularity entities embedded in web pages. Tao is a recipient of the Yahoo! 2007 Key Technical Challenge Award. Currently, he is interning at the Search Labs at Microsoft, working with Dr. Stelios Paparizos.

05/27/08 (10:30am, SVC-6/Titan)

Gerhard Weikum (Max-Planck Institute for Informatics, Germany)

Harvesting, Searching, and Ranking Knowledge from the Web

There is a trend to advance the functionality of search engines to a more expressive semantic level. This is enabled by employing large-scale information extraction of entities and relationships from semistructured as well as natural-language Web sources. In addition, harnessing Semantic-Web-style ontologies and reaching into Deep-Web sources can contribute towards a grand vision of turning the Web into a comprehensive knowledge base that can be efficiently searched with high precision.

This talk presents ongoing research at the Max-Planck Institute for Informatics towards this objective, centered around the YAGO knowledge base and the NAGA search engine. YAGO is a large collection of entities and relational facts that are harvested from Wikipedia and WordNet with high accuracy and reconciled into a consistent RDF-style "semantic" graph. NAGA provides graph-template-based search over this data, with powerful ranking capabilities based on a statistical language model for graphs. Advanced queries and the need for ranking approximate matches pose efficiency and scalability challenges that are addressed by algorithmic and indexing techniques.

This is joint work with Georgiana Ifrim, Gjergji Kasneci, Maya Ramanath, and Fabian Suchanek.

Short Bio:

Gerhard Weikum is a Research Director at the Max-Planck Institute for Informatics (MPII) in Saarbruecken, Germany, where he is leading the department on databases and information systems. He is also the spokesperson of the International Max-Planck Research School (IMPRS) for Computer Science. Earlier he held positions at Saarland University in Saarbruecken, Germany, at ETH Zurich, Switzerland, and at MCC in Austin, Texas, and he was a visiting senior researcher at Microsoft Research in Redmond, Washington. He received his diploma and doctoral degrees from the University of Darmstadt, Germany. He was elected an ACM Fellow in 2005. He has also received the CIDR 2005 Timeless Idea Award, the VLDB 10-Year Award (2002), and the ACM SIGMOD Conference 1998 Best Paper Award (for a joint paper with David Lomet).


03/21/08 (1:30pm, SVC-6/Titan)

Ashish Goel (Stanford University)

Towards robust reputation and recommendation systems

Reputation and recommendation systems play an immense role in the success of the web. For example, search engines not only find but also rank information, and sites such as Netflix and Amazon don't just distribute movies and books but also make relevant recommendations. Both reputation and recommendation systems suffer from the problems of malicious spam, bad-mouthing of competitors, self-promotion via proxies, and inadequate feedback. We present algorithmic and incentive-based approaches to making these systems robust. In particular, we present a ranking algorithm and an incentive mechanism that provides an arbitrage opportunity to users whenever a low-quality entity is ranked above a high-quality entity.

Social networks need to be robust not just to traditional security attacks but also to social misbehavior such as spam, bad-mouthing, and free-loading. On a more philosophical level, we make the case that algorithmic techniques alone will not provide social robustness, and we will also need incentives and other social primitives.


Ashish Goel is an Associate Professor of Management Science and Engineering and (by courtesy) Computer Science at Stanford University, and a member of Stanford's Institute for Computational and Mathematical Engineering. He received his PhD in Computer Science from Stanford in 1999, and was an Assistant Professor of Computer Science at the University of Southern California from 1999 to 2002. His research interests lie in the design, analysis, and applications of algorithms. Professor Goel is a recipient of an Alfred P. Sloan faculty fellowship (2004-06), a Terman faculty fellowship from Stanford, and an NSF Career Award (2002-07).

Professor Goel is currently also the 3COM faculty development scholar in Stanford's School of Engineering.



02/26/08 (10:30am, SVC-6/Titan)

Joe Konstan

Help is Out There: Online Community, Community Artifacts, and a New Way of Harnessing Knowledge

Have a question? Go to Wikipedia. Or browse a discipline-specific archive. Or even just post it online and hope people will help. This talk discusses several ongoing studies that are part of CommunityLab, a collaboration to understand how to design online communities that elicit participation. From them, we can draw lessons on how social psychology, economics, and other sciences of human behavior can be harnessed to understand and then design effective online communities.


Joseph A. Konstan is Professor of Computer Science and Engineering at the University of Minnesota. His research addresses a variety of human-computer interaction issues, including personalization (particularly through recommender systems), eliciting on-line participation, and designing computer systems to improve public health. He is probably best known for his work in collaborative filtering recommenders (the GroupLens project), and for his work in online HIV prevention. He is co-author of Word of Mouse: The Marketing Power of Collaborative Filtering, a book that reviews three dozen good and poor examples of personalization in research and deployed systems.

Dr. Konstan received his Ph.D. from the University of California, Berkeley in 1993. He is an ACM Distinguished Scientist. Previously, he has served as an ACM Distinguished Lecturer and an IEEE Distinguished Visitor. Dr. Konstan served from 2003-2006 as President of ACM SIGCHI, the 4500-member Special Interest Group on Human-Computer Interaction. He currently serves as Chair of ACM’s Special Interest Group Governing Board, and as a member of ACM’s Executive Committee. Dr. Konstan is an active consultant who has worked for more than 15 companies on issues related to human-computer interaction, personalization, and general software issues. He has traveled and lectured extensively, giving over 200 talks in more than 25 countries worldwide.

02/05/08 (10:30am, SVC-6/Titan)

S. Sudarshan (IIT Bombay)
Keyword Search on Graph-Structured Data

A variety of types of data, such as relational, XML and HTML data can be naturally represented using a graph model of data, with entities as nodes and relationships as edges. Graph representations are also a natural choice for representing information extracted from unstructured data, and for representing information integrated from heterogeneous sources of data.

Keyword search is a natural way to retrieve information from graph representations of data, especially in the common situation where the graphs do not have a well-defined schema. Unlike text or Web search, there is no natural notion of a document, and information about a single conceptual entity may be split across multiple nodes. Answers to queries are therefore usually modeled as trees that connect nodes matching the keywords, with each answer tree having an associated score. The goal of keyword search then is to find answers with the highest scores. A number of systems, including the BANKS system developed at IIT Bombay, are based on such a model for answering keyword queries.

In this talk we first outline background material on keyword querying on graph data, including models for ranking of answer trees. We then focus on search algorithms for finding top-ranked answers. We outline key issues in finding top-ranked answers, and present algorithms that address the problem in the context of in-memory graphs. We then consider the problem of search on external memory graphs, and briefly outline our on-going work on external memory graph search based on a multi-granular graph representation.
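In the spirit of the backward expanding search used by systems like BANKS, the idea of finding answer trees can be sketched very simply: run a reverse BFS from the nodes matching each keyword, and treat any node reached from every keyword group as an answer root. This is a much-simplified illustration (unit edge weights, score = sum of root-to-keyword distances; all names are illustrative), not the talk's actual algorithms:

```python
from collections import deque

def keyword_answers(edges, node_keywords, query, k=3):
    # edges: directed (u, v) pairs; node_keywords: node -> set of keywords.
    rev = {}
    for u, v in edges:
        rev.setdefault(v, []).append(u)   # reverse adjacency: who points to v
    dists = []                            # one distance map per keyword
    for kw in query:
        seeds = [n for n, kws in node_keywords.items() if kw in kws]
        d = {n: 0 for n in seeds}
        q = deque(seeds)
        while q:                          # BFS backwards along in-edges
            v = q.popleft()
            for u in rev.get(v, []):
                if u not in d:
                    d[u] = d[v] + 1
                    q.append(u)
        dists.append(d)
    roots = set(dists[0])
    for d in dists[1:]:
        roots &= set(d)                   # roots reaching every keyword
    scored = sorted((sum(d[r] for d in dists), r) for r in roots)
    return scored[:k]
```

Real systems additionally weight edges, prioritize frontier expansion by distance, and emit answers incrementally so that top-ranked trees appear before the search completes.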


S. Sudarshan received his Ph.D. from the University of Wisconsin-Madison in 1992. He was a Member of Technical Staff in the database research group at AT&T Bell Laboratories from 1992 to 1995, and he has been at the Indian Institute of Technology (IIT), Bombay since 1995. He spent a year on sabbatical at Microsoft Research, USA in 2004-05. Sudarshan's research interests center on database systems, and his current research projects include keyword querying on databases, processing and optimization of complex queries, and database support for securing and testing applications. Sudarshan is a co-author of the widely used textbook, Database System Concepts, 5th Ed., by Silberschatz, Korth and Sudarshan.

10/19/07 (1:30pm, SVC-5/2200 Neptune)

Yuri Lifshits

Similarity Search: A Web Perspective

Similarity search is the problem of preprocessing a database of N objects in such a way that, given a query object, one can efficiently determine its nearest neighbors in the database.

Similarity search is closely connected to many algorithmic problems on the web. In this talk we will focus on personalized news aggregation (searching for the news articles most similar to a user's profile of interests) and behavioral targeting (searching for the most relevant user for displaying a given advertisement). Other web applications include near-duplicate detection, finding related pages, classification/filtering tasks like detecting spam pages, spelling correction, sense disambiguation, computing co-occurrence similarity, social network analysis (e.g., suggesting new friends), and recommendation systems.

We describe features that make web applications somewhat different from previously studied models. Thus we re-examine the formalization and the classical algorithms for similarity search. This leads us to new algorithms (we present two of them) and numerous open problems in the field.
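As context for the formal problem above, a brute-force nearest-neighbor baseline can be sketched in a few lines. The cosine-similarity measure and the tiny vectors here are illustrative choices, not the talk's specific formalization; the point of the algorithms Lifshits discusses is precisely to preprocess the database so that this O(N)-per-query scan can be avoided.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def nearest_neighbors(database, query, k=3):
    """Return the k database vectors most similar to the query.

    Brute force: one similarity computation per database object.
    Similarity-search data structures aim to answer the same query
    after preprocessing, without touching every object.
    """
    scored = sorted(database,
                    key=lambda v: cosine_similarity(query, v),
                    reverse=True)
    return scored[:k]
```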


Yury Lifshits obtained his PhD degree from the Steklov Institute of Mathematics at St. Petersburg (2007). He is currently a postdoc at Caltech. He won two gold medals at the International Mathematical Olympiad, received the Best Paper Award in the Application Track of CSR'07, and received the Yandex Teaching Award (2006) for his course "Algorithms for Internet".


08/24/07 (1:30pm, SVC-5/2200 Neptune)

Chen Li

VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams

Many applications have an emerging need to answer approximate string queries efficiently. Such a query can ask for strings that are similar to a given string, such as "names similar to Schwarzenegger" and "telephone numbers similar to 412-0964," where "similar to" uses a predefined, domain-specific function to specify the similarity between strings, such as edit distance. We will report our recent results on solving related problems. We will focus on a new technique, called VGRAM. It improves those algorithms that use fixed-length grams, which are substrings of a string used as signatures to identify similar strings. The main idea of VGRAM is to judiciously choose high-quality grams of variable lengths from a collection of strings to support queries on the collection. We give a full specification of this technique, including how to select high-quality grams from the collection, how to generate variable-length grams for a string based on the preselected grams, and the relationship between the similarity of the gram sets of two strings and their edit distance. A primary advantage of the technique is that it can be adopted by a plethora of approximate string algorithms without the need to modify them substantially. We present extensive experiments on real data sets to evaluate the technique, and show significant performance improvements on three existing algorithms. Related results have appeared in recent VLDB papers.
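VGRAM itself is specified in the cited VLDB papers; as background, the fixed-length q-gram approach it improves upon can be sketched as follows (the function names and the q=2 setting are illustrative). Strings within a small edit distance must share many grams, so gram overlap serves as a cheap filter before an exact edit-distance computation.

```python
def qgrams(s, q=2):
    """Set of fixed-length q-grams (substrings of length q) of a string."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def gram_overlap(s1, s2, q=2):
    """Number of shared grams. A low overlap lets us skip the
    (more expensive) exact edit-distance check."""
    return len(qgrams(s1, q) & qgrams(s2, q))

def edit_distance(s1, s2):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (c1 != c2)))   # substitution
        prev = cur
    return prev[-1]
```

VGRAM replaces the fixed q with grams of variable length chosen from the collection itself, which tightens the filtering bound without changing the surrounding algorithms.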


Chen Li is an associate professor in the Department of Computer Science at the University of California, Irvine. He received his Ph.D. degree in Computer Science from Stanford University in 2001, and his M.S. and B.S. in Computer Science from Tsinghua University, China, in 1996 and 1994, respectively. He received a National Science Foundation CAREER Award in 2003. He is currently a part-time Visiting Research Scientist at Google. His research interests are in the fields of database and information systems, including data cleansing, data integration and sharing, data warehousing, and data privacy.

08/06/07 (1:30pm, SVC-5/2200 Neptune)

Juliana Freire

Discovering and Organizing Hidden-Web Sources

The hidden Web consists of sites whose contents typically reside in databases and are only exposed on demand, as users fill out and submit forms. As the volume of information in the hidden Web grows, there is an increased need for techniques and tools that allow users and applications to uncover and leverage this information. Several applications attempt to make the hidden information more easily accessible, including metasearchers, hidden-Web crawlers, online database directories, and Web information integration systems. Since, for a particular domain of interest, there are many hidden-Web sources with varied coverage that need to be integrated or searched, a key requirement for these applications is the ability to locate relevant sources.

At the University of Utah, we are developing a scalable infrastructure that automates, to a large extent, the process of discovering and organizing hidden-Web sources. In this talk, I will describe two important components of this infrastructure: an adaptive Web crawler that automatically learns patterns of promising links and adapts its focus as the crawl progresses, substantially improving harvest rates (i.e., the number of relevant forms retrieved per page crawled); and a classifier ensemble that combines different learning techniques to accurately identify forms as belonging to a given domain.

Joint work with Luciano Barbosa. This work has been partially funded by the National Science Foundation and a University of Utah Seed Grant.


Juliana Freire joined the faculty of the School of Computing at the University of Utah in July 2005. Before that, she was a member of technical staff at the Database Systems Research Department at Bell Laboratories (Lucent Technologies) and an Assistant Professor at OGI/OHSU. Juliana's research has focused on extending traditional database technology and developing techniques to address new data management problems introduced by the Web and scientific applications. She is an active member of the database community, having co-authored over 60 technical papers and holding 4 U.S. patents. She has participated as a program committee member in over 40 events. She is a vice-chair of WWW2008; served as vice-chair for ICDE2007 and WWW2005; and co-chaired WebDB2003. Her research has been funded by grants from the National Science Foundation, Department of Energy and the University of Utah.

08/01/07 (1:30pm, SVC-5/2400 Gulf)

Nick Koudas

BlogScope: A system for the analysis of the Blogosphere

BlogScope is a system for the analysis of the Blogosphere currently under development at the University of Toronto. BlogScope is an information discovery and text analysis system that offers a set of unique features, including spatio-temporal analysis of blogs, flexible navigation of the Blogosphere through information bursts, keyword correlations and burst synopses, as well as enhanced ranking functions for improved query answer relevance. I will describe the system in its current state and our ongoing work in progress, and detail our future plans and research agenda. I will also describe SPIDER, a flexible string-matching system, outline its functionality, and touch on the principles of the underlying technology.

If time permits I will show demos of both systems.


Nick Koudas is a professor in the Department of Computer Science at the University of Toronto. Prior to joining UofT he was a principal member of technical staff at AT&T Research and an adjunct professor at Columbia University. He works on web data management, data analysis, performance and algorithms.

07/19/07 (3:00pm, SVC-5/2400 Gulf)

Soumen Chakrabarti

Ranking for keyword queries in entity-relationship graphs: Machine learning and performance issues

In statistical machine learning, graphs are increasingly used as a powerful common representation to model complex interactions between random variables. In contrast, in the Web community, the predominant graph is the Web graph, and the predominant function served by the graph is PageRank-style ranking. The integration of Web, database and IR technologies promises a far more ambitious agenda of representing, searching and mining richer data---XML, tables, social networks, blogs---in the form of unified and integrated entity-relationship (E-R) graphs with typed nodes and edges, where nodes and edges can also have text and structured attributes. We are researching two facets of E-R graph search.

First, how to automate the design of ranking functions in E-R graph search, and how to extend relevance feedback from the IR world to the E-R graph world? In PageRank, the link graph fixes the parameters of a Markovian walk. Instead, here we are interested in an inverse problem: given partial orders between nodes that the PageRank vector should satisfy, learn the random walk parameters so that the full PageRank vector completes the partial order and generalizes to unseen preferences. Obviously, this can be done only if we assume some form of smoothness or regularity in the preferences. We will discuss three formulations involving graph Laplacians and network circulations, give algorithms and quote some generalization guarantees.

Second, having learnt the Markov walk parameters, we must answer queries over large E-R graphs within interactive time. Unlike static PageRank computation, this is nontrivial if parts of the graph are materialized only at query time, which is inevitable if words and attribute values are transient nodes in the graph. What sort of preprocessed indices can we build to support dynamic graph-conductance queries at interactive speed?

Can we set up a graceful tradeoff between preprocessing time, index space, query speed, and result accuracy? Can we harness query log statistics to optimize our indices, as is common in relational databases? We will give top-k graph conductance ranking algorithms that satisfy all the above properties.
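To ground the ranking machinery under discussion, here is a minimal power-iteration sketch of personalized PageRank over an adjacency-list graph. The uniform edge weights assumed here are exactly what the talk's learned walk parameters would replace, and the graph, restart distribution and parameter values are illustrative.

```python
def personalized_pagerank(adj, restart, alpha=0.15, iters=50):
    """Power iteration for personalized PageRank.

    adj[u] lists the out-neighbors of node u; restart is the teleport
    distribution. In query-time E-R graph ranking, mass would restart
    at the query's word/attribute nodes rather than uniformly.
    """
    nodes = list(adj)
    pr = dict(restart)
    for _ in range(iters):
        # Each step: teleport with probability alpha, otherwise spread
        # the current mass of u uniformly over its out-edges.
        nxt = {u: alpha * restart.get(u, 0.0) for u in nodes}
        for u in nodes:
            out = adj[u]
            if not out:
                continue
            share = (1 - alpha) * pr.get(u, 0.0) / len(out)
            for v in out:
                nxt[v] = nxt.get(v, 0.0) + share
        pr = nxt
    return pr
```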

(Joint work with Alekh Agarwal, Manish Gupta, Amit Pathak and Kriti Puniyani. Supported partly by IBM Research and Microsoft Research.)


06/19/07 (1:30pm, SVC-5/2200 Neptune)

Geoff Webb

Finding the Real Patterns

Pattern discovery is one of the fundamental tasks in data mining. Pattern discovery typically explores a massive space of potential patterns to identify those that satisfy some user-specified criteria. This process entails a huge risk (in many cases a near certainty) that many patterns will be false discoveries. These are patterns that satisfy the specified criteria with respect to the sample data but do not satisfy those criteria with respect to the population from which those data are drawn. This talk discusses the problem of false discoveries, and presents techniques for avoiding them.


Geoff Webb holds a research chair in the Faculty of Information Technology at Monash University. Prior to Monash he held appointments at Griffith University and then Deakin University where he received a personal chair. His primary research areas are machine learning, data mining, and user modelling. He is widely known for his contribution to the debate about the application of Occam's razor in machine learning and for the development of numerous algorithms and techniques for machine learning, data mining and user modelling. His commercial data mining software, Magnum Opus, is marketed internationally by Rulequest Research. He is editor-in-chief of the highest impact data mining journal, Data Mining and Knowledge Discovery and a member of the editorial boards of Machine Learning, ACM Transactions on Knowledge Discovery in Data, and User Modeling and User-Adapted Interaction.

06/15/07 (1:30pm, SVC-5/2200 Neptune)

Hao Chen

From Forum Spam to Syndication-based Spam Infrastructure

Forum spam has become a major means of search engine spam. To evaluate the impact of forum spam on search quality, we have conducted a comprehensive study from three perspectives: that of the search user, the spammer, and the forum hosting site. We examine spam blogs and spam comments in both legitimate and honey forums. We propose context-based analyses, consisting of redirection and cloaking analysis, to detect spam automatically and to overcome shortcomings of content-based analyses. Our study shows that forum spamming is a widespread problem. Spammed forums, powered by the most popular software, show up in the top 20 search results for all 189 popular keywords. On two blog sites, more than half (75% and 54% respectively) of the blogs are spam, and even on a major and reputedly well-maintained blog site, 8.1% of the blogs are spam. The observation on our honey forums confirms that most comment spam is meant to increase page rank rather than generate immediate traffic.

We have further observed that comment spam is often part of some syndication-based spam infrastructure, which connects spammers with advertisers using Web redirection. We propose a five-layer, double-funnel model for describing end-to-end redirection spam, present a methodology for analyzing the layers, and identify prominent domains on each layer using two sets of commercial keywords -- one targeting spammers and the other targeting advertisers. The methodology and findings are useful for search engines to strengthen their ranking algorithms against spam, for legitimate website owners to locate and remove spam doorway pages, and for legitimate advertisers to identify unscrupulous syndicators who serve ads on spam pages.

This is joint work with Yi-Min Wang and Ming Ma at Microsoft Research, Redmond, and Yuan Niu and Francis Hsu at UC Davis.


Hao Chen is an assistant professor at the University of California, Davis. He received his Ph.D. in Computer Science from the University of California, Berkeley in 2004. He has a broad range of interests in computer security, including Web security, cellular network security, firewall security, and software security. He received the National Science Foundation CAREER award in 2007.

05/25/07 (1:30pm, SVC-5/2200 Neptune)

Chris Olston

Navigation-Aided Retrieval

Users searching for information in hypermedia environments often perform *querying* followed by manual *navigation*. Yet, the conventional text/hypertext retrieval paradigm does not explicitly take post-query navigation into account. This talk proposes a new retrieval paradigm, called *navigation-aided retrieval* (NAR), which treats both querying and navigation as first-class activities. In the NAR paradigm, querying is seen as a means to identify starting points for navigation, and navigation is guided based on information supplied in the query. NAR is a generalization of the probabilistic information retrieval model.


Christopher Olston is a senior research scientist at Yahoo! Research, after a stint as assistant professor at Carnegie Mellon University from 2003 to 2005. His research interests include data management and web search. Olston received his Ph.D. in 2003 from Stanford University, which is a really nice place to rollerblade.

05/18/07 (1:30pm, SVC-5/2200 Neptune)

Anthony Kosky

Exploring Heterogeneous Databases and Web Data Sources with the OPM Multi-database Tools

The OPM Multi-database Query System was developed in the late 90's at Lawrence Berkeley National Lab, and later at Gene Logic, with collaboration from researchers at Glaxo Smith-Kline, as an extension of the Object Protocol Model (OPM) set of tools. The system was targeted primarily at the bioinformatics domain, and has been used internally within Gene Logic, for various projects within LBNL, and as part of the TI-Net project at Glaxo Smith-Kline.

The OPM tools provided the basis for a federated database system in which the component data sources could be traditional relational databases, web data sources, XML or flat-file databases, or external applications such as homology-search engines (BLAST). Data sources maintained independent schemas, but were viewed through a common data model (OPM) and supported a uniform query interface. A Java-applet-based graphical user interface allowed users to navigate between data sources and dynamically create web query forms. The OPM Multi-database Query System also included novel features not found in other federated database systems, such as inter-database links for navigating between databases and support for application-specific data types (ASDTs), as well as features, such as configurable query set specifications for web data sources, that have only recently appeared in the latest generation of mediated systems.

In this talk I will be providing an overview of the OPM multi-database tools, focusing on some of the novel and innovative features, and comparing the approach taken to the current generation of tools and approaches for federated database systems.


Anthony Kosky received his PhD in Computer Science from the University of Pennsylvania in 1995. His thesis, entitled Transforming Databases with Recursive Data Structures, looked at the interplay between data integration, database transformations and database constraints.

Subsequently he joined the Data Management Research and Development Group at Lawrence Berkeley National Laboratory in order to work on the OPM project to develop database development and query tools based on an object-relational mapping. In 1997 he joined Gene Logic Inc., together with other members of the OPM group, in order to commercialize the OPM tools for the bioinformatics and life sciences sectors. While at Gene Logic, he became involved in the GeneExpress and GX Connect projects to provide data management, exploration and analysis tools for large databases of gene expression and related genomics and clinical data.

Most recently he has been working for Axiope Inc., a spin-out of the University of Edinburgh, developing XML-based data management systems for bio-medical applications.

05/11/07 (1:30pm, SVC-5/2200 Neptune)

Ramakrishnan Akella

Towards Personalization and Service Platforms

We describe elements of our research as we consider developing Service Platforms for Collaborative Search. Our goal is to enable the more effective functioning of users who are looking at both web and internal document collections, with the intent to provide service to a set of demanding users with complex queries. We have designed user-interactive models for information retrieval and document clustering. Two specific sets of results we have obtained concern:

• Active Learning: We have incorporated a combination of diversity, retrieval engine scores and relevance feedback into an active learning algorithm. This algorithm significantly outperforms existing algorithms by up to 15% on TREC HARD data.

• Semi-supervised clustering: We have developed a probabilistic model that incorporates must-link and must-not-link constraints on documents in clusters. Building on this model and on Bayesian regularization, we develop a new semi-supervised clustering algorithm that outperforms those in the literature by 15% on the 20 Newsgroups data set.

We will also touch upon ongoing and proposed research:

• More principled models which build upon our current active learning model, including relevance feedback and subspace projection methods.

• Collaborative search, to share learning across users by analyzing the co-occurrence between user clusters, query clusters and web-page clusters.

• Social network analysis in enterprise search, discussing how interactions among customers and engineers can be used to improve search performance.

(Joint work with Zuobing Xu, Seinjuti Chakraborty, and Jyotsna Gangwar)

10/26/06 (3:00pm, SVC-5/2200 Neptune)

Sudarshan Murthy (Portland State University)

Sub-search: Searching at Sub-document Granularity

Search-engine users are often interested in finding relevant sub-documents (that is, document fragments), but search engines return document lists without much indication of why a returned document is relevant, and which sub-documents are relevant. Consequently, users perform intra-document searches to find relevant sub-documents on their own, and judge their relevance. For example, a user may employ the find function in a browser to examine the occurrences of a query term in an HTML document. This situation exists because current search systems essentially treat the document as both the unit of transfer and the unit of comprehension or consultation. Sub-search (that is, sub-document search) makes the sub-document a unit of comprehension or consultation.

Two improvements to current search systems can facilitate sub-search: First, enhance search results to provide insight into sub-documents that make a document relevant. Second, allow users to navigate directly from an enhanced search result to a sub-document of interest. In both cases, sub-documents should not be limited to those defined by document authors. Also, the improvements should work for any type of document. These improvements allow users to judge the relevance of sub-documents without opening documents, and reduce user-effort to view sub-documents in their original context.

We present one approach to sub-search based on the notion of superimposed information, and demonstrate a prototype implementation. We also outline alternative approaches, and introduce some ideas on aggregating and ranking sub-documents.

10/18/06 (2:00pm, SVC-5/2200 Neptune)

Heikki Mannila (Helsinki)

Algorithms for Discovering Bucket Orders from Data

We consider bucket orders, i.e., total orders with ties. They can be used to capture the essential order information without overfitting the data: they form a useful concept class between total orders and arbitrary partial orders. We address the question of finding a bucket order for a set of items, given pairwise precedence information between the items. We describe simple and efficient algorithms for finding good bucket orders. Several of the algorithms have a provable approximation guarantee, and they scale well to large datasets. We provide experimental results on artificial and real data that show the usefulness of bucket orders and demonstrate the accuracy and efficiency of the algorithms.
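A simple pivot-style algorithm in the spirit of the ones described can be sketched as follows; this sketch, including the tie threshold beta and its default value, is illustrative rather than the talk's exact method. Items whose pairwise precedence against a randomly chosen pivot is close to 1/2 are declared tied with it (same bucket), and the remaining items recurse on either side.

```python
import random

def bucket_pivot(items, prec, beta=0.25, rng=random.Random(0)):
    """Recursive pivot sketch for discovering a bucket order.

    prec[(a, b)] is the observed fraction of times a precedes b.
    Returns a list of buckets (sets), ordered from first to last.
    """
    if not items:
        return []
    pivot = rng.choice(items)
    left, bucket, right = [], [pivot], []
    for x in items:
        if x == pivot:
            continue
        p = prec.get((x, pivot), 0.5)
        if p > 0.5 + beta:          # x clearly precedes the pivot
            left.append(x)
        elif p < 0.5 - beta:        # pivot clearly precedes x
            right.append(x)
        else:                       # no clear order: tie with pivot
            bucket.append(x)
    return (bucket_pivot(left, prec, beta, rng)
            + [set(bucket)]
            + bucket_pivot(right, prec, beta, rng))
```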

09/26/06 (1:00pm, SVC-5/2200 Neptune)

Kwong Yung

Discovering Semantic Themes and Social Networks from Email Corpus

The Enron corpus of email messages is analyzed using several novel extensions of latent Dirichlet allocation, a Bayesian model of document generation. The body of an email message is viewed as the usual bag of words. In addition to the textual body, however, the author and the recipients of the email message also provide important clues about its content. Incorporating the social network underlying the email corpus not only improves discovery of semantic themes in the corpus but also identifies the roles of actors in the network. Leveraging both the semantic content and the social network underlying an email corpus makes information retrieval far more powerful. The standard LDA model is extended in two directions. In one direction, the author-recipient-topic model of McCallum 2005 is cast in a two-dimensional framework of horizontal and vertical segmentation of topics. In the other direction, the hierarchical model of Blei 2004 is relaxed by allowing multiple paths for each document. Combining these two threads of improvements on LDA results in advances on both theoretical and practical fronts.

01/23/09 (10:30am, SVC-6/Titan)

Title: Information Overload: Models and Methods to Live with it

Speaker: Georgia Koutrika, Stanford University

Abstract: If knowledge is power, then it comes as no surprise that most of our community's research boils down to the fundamental question of how we can master data. Our problem is not that we lack data but rather that we have too much of it. Information overload is an open problem further complicated by the number and diverse needs of information seekers. The problem is not just finding the right information but finding the right information for the right person. In this talk, I will present research efforts for supporting personalized search results and recommendations. I will discuss user preference modeling over structured data and methods for personalizing searches for a user given a profile that stores the user's preferences. Finally, I will present an approach that allows recommendations to be easily defined, personalized, and processed over structured data.

Bio: Georgia Koutrika is a postdoctoral researcher at the CS Department of Stanford University in the USA. She holds a PhD in computer science, an MSc in advanced information systems, and a BSc in informatics from the University of Athens' Department of Informatics and Telecommunications in Greece. Georgia is interested in how people can find, visualize and interact with information and services, and she has worked on user behavior and preference modeling, personalization, information access and sharing, recommendations, spam, query processing and optimization. She is particularly intrigued by the potential and challenges of social sites, and a lot of her recent research is delivered through CourseRank, a social site for course planning developed at and used in Stanford. Georgia has received distinguished awards for her PhD, MSc, and undergrad studies from the Greek Scholarship Foundation (I.K.Y.). She has also earned young researcher study fellowships from the Onassis Foundation and the European Network of Excellence in Digital Libraries (DELOS). She has served as a PC co-chair of three workshops (in conj. with UM'07, NLDB'08, and VLDB'08) and as a PC member in several conferences and workshops, including ICDE, WWW, EDBT, and ESWC.

08/19/2011 (10:30am SVC-6/Titan)

Title: Guided Interaction: Rethinking the Query-Result Paradigm

Speaker: Arnab Nandi

Abstract: Decades of research, coupled with continuous increases in computing power, have enabled highly efficient execution of queries on large databases. For many databases, far more time is spent by users formulating queries than by the system evaluating them. It stands to reason that, looking at the overall query experience we provide users, we should pay attention to how we can assist users in the holistic process of obtaining the information they desire from the database, and not just the constituent activity of efficiently generating results given a complete precise query.

In this talk, we examine the conventional query-result paradigm employed by databases and demonstrate challenges encountered when following this paradigm. We recognize that the process of query specification itself is a major stumbling block. With current computational abilities, we are at a point where we can make use of the data in the database to aid in this process. To this end, we propose a new paradigm, guided interaction, to solve the noted challenges, by using interaction to guide the user through the query formulation, query execution and result examination processes.

There are significant engineering challenges to constructing the system we envision, and the technological building blocks to address these challenges exist today.

