We organize invited talks by distinguished people in academia and industry. The talks are usually held on alternate Fridays. This page provides the summary of past talks as well as upcoming talks.
Contact: Anitha Kannan
09/11/2009 (10:30am SVC-6/Titan)
Title: Agents and Cooperation in Social Tagging Systems
Speaker: Markus Strohmaier, Graz University of Technology, Austria
Abstract: Tags in social tagging systems contain what we can describe as “traces of user intent”, that is traces of user goals, motivations and intentions expressed in different degrees of explicitness. These traces enable researchers not only to study the semantic structure or the content of resources being tagged, but enables us also to study the goals, motivations and intentions of the (human) agents who are using these systems.
In this talk, I will present preliminary and ongoing work that focuses on two problems related to user intent in social tagging systems: 1) the automatic detection of users’ motivation for tagging 2) cooperation in social tagging systems.
I will begin by discussing ways to automatically detect users' motivation for tagging, and talk about the relevance and implications of user motivation in tagging systems. Subsequently, I will present results that suggest that a) tagging motivation can be measured in a content- and language-independent way and b) tagging motivation has an influence on the usefulness of tags for certain tasks such as information retrieval.
Second, I will talk about my current work at PARC on studying cooperation in social tagging systems. In this ongoing effort, we aim to understand the mechanisms and interactions that yield cooperative behavior among (human) agents in social tagging systems.
Bio: Markus Strohmaier is an Ass. Prof. at the Knowledge Management Institute at Graz University of Technology, Austria. From 03/2006 to 05/2007 he was a Post-Doctoral Research Fellow at the Department of Computer Science at University of Toronto, funded by an Erwin Schrödinger Postdoctoral Fellowship. Currently, he is a visiting scientist at PARC's Augmented Social Cognition Group. Markus is principal investigator of the national research grant "TransAgere" focusing on agents and goals on the social web (2007-2010). His research interests include user goals in the context of social media, search and knowledge management, focusing on the analysis and modeling of large-scale intentional web corpora and knowledge bases.
08/21/2009 (10:30am SVC-6/Titan)
Title: Output Bidding: A New Search Advertising Model Complementary to Keyword Bidding
Speaker: Ali Dasdan, Yahoo!
Abstract: We propose output bidding where advertisers bid to associate their advertisements with search results without changing or replacing any search result. I will talk about how output bidding complements the current search advertising model (keyword or input bidding). I will also discuss how output bidding can be implemented.
Bio: Ali Dasdan has been at Yahoo! since 2006. He manages a team doing all white-box metrics & monitoring as well as data/system analysis for web search. He has a PhD in Computer Science from UIUC. His home page at http://www.dasdan.net/ali/ has the most recent list of his research publications.
08/12/2009 (2pm SVC-6/Titan)
Title: Optional Polya Tree and Bayesian Inference
Speaker: Wing H Wong, Stanford University
Abstract: We introduce an extension of the Polya Tree approach for constructing distributions on the space of probability measures. By using optional stopping and optional choice of splitting variables, the construction gives rise to random measures that are absolutely continuous with piecewise smooth densities on partitions that can adapt to fit the data. The resulting "optional Polya tree" distribution has large support in total variation topology, and yields posterior distributions that are also optional Polya trees with computable parameter values. The theory covers both discrete and continuous multivariate distributions.
07/31/2009 (3pm SVC-6/Titan)
Title: Why everyone should be their own database administrator, UI designer, and Web 2.0 site developer, and how they can.
Title: David Karger, MIT and Microsoft
Abstract: The Web democratized content creation, turning millions of people into information producers who enthusiastically create information that benefits millions of information consumers. But the vast majority of such end-user-authored content is blobs of text, images, or video. In contrast, large-scale content providers offer richly structured data and Web-2.0 interfaces to it that support powerful filtering, sorting, visualization and exploration, making that information much more usable and useful.
In this talk I will demonstrate that the authoring of richly structured information, and of rich interactive visualizations of that information, can be put within reach of those same end users who have contributed so enthusiastically to the text web. Our approach is to abandon heavyweight notions of database management and user interface programming, and instead develop tools that support casual editing and mashups of rich data and visualizations in web pages, wikis, and blogs.
Our tools support the same rote copy/paste/modify flow of content creation that has so effectively let users create content for the text web without really understanding what they are doing.
07/24/2009 (10:30am SVC -6/Titan)
Title: Generating Bayes-Nash Equilibria to Design Autonomous Trading Agents
Speaker: Ioannis Vetsikas, Southampton University
Abstract: Online auctions have become a popular method for business transactions. The variety of different auction rules, the restrictions in supply or demand, and the agents’ combinatorial preferences for the different commodities, have led to the creation of a very complex multi-agent “environment”. Strategies designed for autonomous agents participating in such complex environments have often been ad-hoc based on the expert knowledge of a human specialist. In fact, this has been viewed as a necessary evil. This can be explained by the fact that a theoretical analysis and subsequent generation of strategies for such complex games is intractable.
To put game theory back into the design of trading agents for complex environments, I have proposed to decompose the general problem faced by the agent into several components, and restrict communication between them. This allows analyzing each component individually; some of the resulting sub problems are simple enough that equilibrium strategies can
be generated for these and then used in the overall strategy of the autonomous agent. To evaluate this approach, I used this analysis to generate strategies for the International Trading Agent Competition (TAC). I computed novel Bayes-Nash equilibria for first-price single-unit auctions and mth-price multi-unit auctions, when the auction has a set of possible closing times, one of which is chosen randomly for the auction to end at. A strategy inspired by this approach was used by my agent WhiteBear who participated in the competition. WhiteBear has won the competition a few times and has achieved the best overall performance in the competition among all participating agents.
Furthermore, in view of the fact that agents are in competition in this setting, I examine the behaviour of bidding agents that are in direct competition with the other participants in an auction setting. Thus the agents are not simply trying to maximize their own utility, rather they wish to maximize a weighted difference of their own gain to that of their competitors. I derive a symmetric Bayes-Nash equilibria for these agents in both m-th and (m+1)-th price sealed bid auctions. Subsequently, I use these equilibria to examine the profits of different agents and show that aiming to beat the competition is more effective than pure self interest in any competitive setting. Finally, I examine how the auctioneer’s revenue is affected and find that the weight that agents place in minimizing the opponents’ profit determines whether the m-th or the (m + 1)-th price auction yields a higher revenue.
Short bio: Ioannis Vetsikas received his Diploma (BSc with thesis) from NTU Athens, Greece and his MSc and PhD from Cornell University, USA. He is currently a senior research fellow at the IAM group of ECS School at Southampton University. His research interests lie in market-based systems, intelligent (trading) agents, auctions and related optimization issues, both from a theoretical and experimental standpoint, as well as applications.
06/26/2009 (10:30 am SVC-6/Titan)
Title: Efficient and Robust Processing of top-k Relational Queries.
Speaker: Alkis Polyzotis, UC Santa Cruz
Abstract: In the rank join problem, we are given a set of relations and a scoring function, and the goal is to return the join results with the top K scores. It is often the case in practice that the inputs may be accessed in ranked order and the scoring function is monotonic. These conditions allow for efficient algorithms that solve the rank join problem without reading all of the input.
This talk will present an overview of our recent work on the development of such rank join algorithms. We first demonstrate analytically that well known join algorithms do not stop reading their input as soon as possible, thus leading to bad performance. We find that it is NP-hard to overcome this weakness in the general case, but cases of limited query complexity are tractable. We show the latter with a rank join algorithm that infers provably tight bounds on the potential benefit of reading more input in order to stop as soon as possible. We also introduce a heuristic that enables the algorithm to deal gracefully with cases of high query complexity, where the computation of tight bounds is provably hard. Experimental results show that our algorithms outperform the state-of-the-art rank join algorithms by a significant margin, improving both I/O and total execution time.
Bio: Alkis Polyzotis is currently an assistant professor at UC Santa Cruz. His research focuses on database systems, and in particular on on-line database tuning, top-k queries, and P2P databases. He is the recipient of an NSF CAREER award in 2004 and of an IBM Faculty Award in 2005 and 2006. He received his PhD from the University of Wisconsin at Madison in 2003.
06/11/2009 (10:30 am SVC-6/Titan)
Title: Privacy into the Netflix Prize Contenders
Speaker: Ilya Mironov, Microsoft Research –SVC
Abstract: A recommender system based on collaborative filtering is a double-edged sword. By aggregating and processing preferences of multiple users it may provide personalized recommendations, serving an important business purpose. More disconcertingly, a recommender system is a potential source of leakage of private information shared by its users.
In the talk we will define differential privacy for recommender systems and briefly review leading approaches in the Netflix Prize competition. We will demonstrate how these approaches can be adapted to provide differential privacy, without significantly impairing their accuracy.
Joint work with Frank McSherry, MSR SVC, to appear in KDD'09.
Bio: Ilya Mironov is a researcher at Microsoft Research, Silicon Valley Campus. He works in the areas of cryptography, cryptanalysis, and privacy of statistical databases. Ilya received his PhD from Stanford in 2003, advised by Dan Boneh.
06/05/2009 (10:30 am SVC-6/Titan)
Title: Finding experts and initiators in social networks
Speaker: Evimaria Terzi, IBM Almaden
Abstract: The advent of online social networks and the need to exploit and analyze them has motivated lots of interesting research. In this talk, I will discuss two such problems and discuss the algorithmic challenges they pose.
In the first part of the talk, I will focus on the problem of expertise location in social networks: given a task that needs be performed and a pool of individuals organized in a social network, the problem asks for the subset of individuals in the network that can perform the input task with the least overhead. Different definitions of overhead lead to different formulations of the problem. In this talk, I will show how these definitions are related to existing NP-hard combinatorial problems and describe efficient approximation algorithms for solving them. I will also give some indicative experimental results that show the efficacy of our framework.
In the second part of the talk I will focus on a novel graph-reconstruction problem. Given a transaction database that shows which items were bought by different individuals, the task is to reconstruct the social network of individuals and find the initiating individuals for every product. I will give a formal definition of this problem and discuss algorithms for solving it.
I will also demonstrate the usefulness of our methods. These are joint works with T. Lappas, K. Liu and H. Mannila.
Bio: Evimaria Terzi has been a research scientist at IBM Almaden Research Center since June 2007. She obtained her PhD from the University of Helsinki (Finland) in January 2007, under the supervision of prof. Heikki Mannila. Before that, she got her MSc from Purdue University (USA) and her BSc from Aristotle University of Thessaloniki (Greece). Her research focuses on algorithmic aspects of data mining with applications in the analysis of sequential and graph data, ranking and clustering.
05/15/2009 (10:30 am SVC-6/Titan)
Title: SOFIE - A Self-Organizing Framework for Information Extraction
Speaker: Fabian Suchanek, Microsoft Research, Search Labs
Abstract: I will present SOFIE, a self-organizing framework for information extraction. SOFIE can extract facts from natural language documents, disambiguate the entities and relations, perform logical consistence checks with background knowledge and add the facts to an ontology. SOFIE does all of this in a unified framework, casting all of the subtasks into a weighted maximum satisfiability problem. In this context, I will also give a quick overview of the knowledge representation used in today's ontologies (RDFS) and show parallels to our commerce search Project.
Bio: Fabian studied Cognitive Science and received his Bachelor of Science in Osnabrueck/Germany. He did his Master of Science in Computer Science at the University of Saarbruecken/Germany. Fabian did his PhD at the Max-Planck Institute for Informatics under the supervision of Gerhard Weikum. He has worked in the areas of the Semantic Web, ontologies, natural language processing and information extraction. His main project was the YAGO ontology. Fabian recently joined the Search Labs of Microsoft Research in Silicon Valley, where he is working on the commerce section of Live search.
05/08/09 (10:30am, SVC-6/Titan)
Title: Learning to maximize expected ranking gain for information retrieval and collaborative filtering
Speaker: Rich Zemel, University of Toronto
Abstract: Ranking a set of retrieved documents according to their relevance to a query is a popular problem in information retrieval (IR). Methods that learn ranking functions are difficult to optimize, as ranking performance is typically judged by metrics that are not smooth. In collaborative filtering (CF) the problem has a different flavor. Instead of queries we now have users, and the items to rank are sets of songs, books, movies etc., with relevance levels given by ratings provided by users. Even though the primary goal of CF is to recommend a small set of items to users (a ranking problem), almost all CF methods treat this as a rating classification problem. The learning algorithms are thus optimizing the wrong objective, and one that is actually harder. We propose a new approach to learning to rank, creating a probability distribution over rankings assigned to documents for a given query, which permits gradient ascent optimization of the expected value of the target performance measure. Experimental results on benchmark IR and CF data sets demonstrate the strong performance of our method relative to existing learning approaches to ranking. Joint work with Maksims Volkovs and Ben Marlin.
Bio: Richard Zemel is a professor in the Computer Science Department at the University of Toronto. He is on sabbatical this year at Berkeley. He obtained a B.A. at Harvard University, and his Ph.D. in Computer Science from the University of Toronto. He was a postdoctoral fellow at the Salk Institute and at Carnegie Mellon University, and a faculty member at the University of Arizona. He has received several awards and honors, including a Young Investigator Award from the Office of Naval Research and three Dean's Excellence Awards at the University of Toronto, and was recently appointed a Fellow of the Canadian Institute for Advanced Research. His research interests include topics in machine learning, vision, and neural coding.
01/23/09 (10:30am, SVC-6/Titan)
Title: Information Overload: Models and Methods to Live with it
Speaker:Georgia Koutrika, Stanford University
Abstract: If knowledge is power then it comes to no surprise why most of our community research boils down to the fundamental question of how we can master data. Our problem is not that we lack data rather that we have TOO much of it. Information overload is an open problem further complicated by the number and diverse needs of information seekers. The problem is not just finding the right information but finding the right information for the right person. In this talk, I will present research efforts for supporting personalized search results and recommendations. I will discuss user preference modeling over structured data and methods for personalizing searches for a user given a profile that stores the user's preferences. Finally, I will present an approach that allows recommendations to be easily defined, personalized, and processed over structured data.
Bio: Georgia Koutrika is a postdoctoral researcher at the CS Department of Stanford University in USA. She has a PhD on computer science, a MSc in advanced information systems, and a BSc in informatics from the University of Athens' Department of Informatics and Telecommunications in Greece. Georgia is interested in how people can find, visualize and interact with information and services and she has worked on user behavior and preference modeling, personalization, information access and sharing, recommendations, spam, query processing and optimization. She is particularly intrigued by the potential and challenges of social sites and a lot of her recent research is delivered through CourseRank, a social site for course planning developed at and used in Stanford. Georgia has received distinguished awards for her PhD, MSc, and undergrad studies from the Greek Scholarship Foundation (I.K.Y.). She has also earned young researcher study fellowships from the Onassis Foundation and the European Network of Excellence in Digital Libraries (DELOS). She has served as a PC co-chair of three workshops (in conj. with UM’07, NLDB’08, and VLDB’08) and as a PC member in several conferences and workshops, including ICDE, WWW, EDBT, and ESWC.
11/04/08 (11:00am, SVC-6/Titan)
Fabian M. Suchanek
YAGO: A Core of Semantic Knowledge
This talk will present YAGO. YAGO is a large ontology, which currently contains more than 2 million entities and close to 20 million facts about them. The talk will explain how YAGO was constructed and how we represent YAGO's knowledge. It will also give an overview of our latest project, SOFIE, which uses logical reasoning to extract new information for YAGO from Web sources. The talk concludes with some examples for applications of YAGO, including semantic search and an analysis of social tagging.
Bio:
Fabian studied Cognitive Science and received his BSc in Osnabrueck/Germany. He did his MSc in Computer Science at the University of Saarbruecken/Germany. Currently, Fabian is doing his PhD at the Max-Planck Institute for Informatics under the supervision of Gerhard Weikum. Fabian just handed in his dissertation. He works in the areas of the Semantic Web, ontologies, natural language processing and information extraction. His main project is the YAGO ontology. During his PhD, Fabian spent 3 months at Microsoft Research Cambridge/UK, where he worked with Milan Vojnovic on Social Tagging.
08/28/08 (2:00pm, SVC-6/Titan)
Sunita Sarawagi
Structured learning models for information extraction
Feature-based structured models provide a flexible and elegant framework for various information extraction (IE) tasks. These include label sequences for traditional IE, segmentation models for entity-level extractions, and skip chain models for collective labeling. I will present efficient algorithms for finding the highest scoring prediction in graphs that arise when collectively labeling word sequences that are linked via associative edges.
Another challenging problem in structured learning is formulating training objectives that provide good generalization without requiring complicated inference during training. We review existing max-margin formulations and propose a new formulation that is better suited for decomposable error functions that arise in typical extraction tasks.
Bio:
Sunita Sarawagi researches in the fields of databases, data mining, and machine learning. Her current research interests are information integration, graphical and structured models, and probabilistic databases. She is associate professor at IIT Bombay. Prior to that she was a research staff member at IBM Almaden Research Center. She got her PhD in databases from the University of California at Berkeley and a bachelors degree from IIT Kharagpur. She has several publications in databases and data mining and several patents. She serves on the board of directors of ACM SIGKDD and VLDB foundation. She is program chair for the ACM SIGKDD 2008 conference and has served as program committee member for SIGMOD, VLDB, SIGKDD, ICDE, and ICML conferences. She is on the editorial board of the ACM TODS, ACM TKDD, and FnT for machine learning journals.
08/28/08 (11:00am, SVC-6/Mir)
Soumen Chakrabarti
Structured Learning for Non-Smooth Ranking Losses
Learning to rank from relevance judgment is an active research area. Itemwise score regression, pairwise preference satisfaction, and listwise structured learning are the major techniques in use. Listwise structured learning has been applied recently to optimize important non-decomposable ranking criteria like AUC (area under ROC curve) and MAP (mean average precision). We propose new, almost-linear-time algorithms to optimize for two other criteria widely used to evaluate search systems: MRR (mean reciprocal rank) and NDCG (normalized discounted cumulative gain) in the max-margin structured learning framework. We also demonstrate that, for different ranking criteria, one may need to use different feature maps. Search applications should not be optimized in favor of a single criterion, because they need to cater to a variety of queries. E.g., MRR is best for navigational queries, while NDCG is best for informational queries. A key contribution of this paper is to fold multiple ranking loss functions into a multi-criteria max-margin optimization. The result is a single, robust ranking model that is close to the best accuracy of learners trained on individual criteria. In fact, experiments over the popular LETOR and TREC data sets show that, contrary to conventional wisdom, a test criterion is often not best served by training with the same individual criterion.
Bio:
Soumen Chakrabarti is Associate Professor of Computer Science at IIT-Bombay. He got a PhD from the University of California, Berkeley, in 1996, and was a Research Staff Member at IBM Almaden from 1996 to 1999. During Spring 2004 he was Visiting Associate Professor in the School of Computing, Carnegie-Mellon University. He has published extensively at conferences like WWW, SIGIR, EMNLP/HLT, SIGKDD, VLDB, and SIGMOD and also served frequently as vice chair or program committee member. His paper on Focused Crawling got the Best Paper award at WWW 1999, and others have been invited to Scientific American, IEEE Computer and VLDB Journal. As of 2006, his papers have over 1000 citations in CiteSeer. He has presented many tutorials at WWW, SIGMOD, VLDB and SIGKDD. These have led to a successful textbook, Mining the Web (http://www.cse.iitb.ac.in/~soumen/mining-the-web/), published in 2002, with a second edition currently in progress. He has eight granted US patents. At IIT Bombay, he has obtained research funding from Microsoft Research, IBM Research, GE Research, University of California, Tata Consultancy Services, and NEC Research, and delivered working software systems in several cases. For more details please visit http://www.cse.iitb.ac.in/~soumen/.
08/15/08 (1:30pm, SVC-6/Titan)
Qiaozhu Mei
Context Analysis in Text Mining and Search
The explosive growth of information demands powerful text mining tools to help us digest information and discover hidden knowledge in text. Text data is often associated with various types of context information, such as the time stamp of a news article, the geographical location where a blog article is written, the user who submitted a query to a search engine, and a social network of people. Given a collection of text data with context information, we often would like to extract the topics or themes from text and analyze their variational patterns over context. Such contextual patterns are very helpful for summarizing topics in text, as well as for providing a better targeted web search service. In this talk, we will present a general way of modeling and analyzing context in text with probabilistic models, and new algorithms for discovering and analyzing various contextual patterns, which we refer to as contextual text mining. The proposed models have broad applications in multiple domains to help understand topic evolutions, spatiotemporal impact of events, public opinions, detect topic related social communities, as well as utilize personalization in web search. We will discuss how to design concrete contextual topic models for different types of context, how to label probabilistic topic models, and how to regularize probabilistic models with harmonic functions defined on contexts.
Bio:
Qiaozhu Mei is a Ph.D. candidate of Department of Computer Science at the University of Illinois at Urbana-Champaign. His research interests include text mining and information retrieval, especially context analysis in text mining and retrieval with probabilistic language models and topic models. He has published extensively in these areas, and has received the Best Student Paper Runner-Up Awards of ACM KDD 2006 and ACM KDD 2007. He is also one of the five recipients of the inaugural Yahoo! Ph.D. Student Fellowship.
Please refer to his homepage for more information:
08/15/08 (10:30am, SVC-6/Mir)
Tom Minka
On the Optimization of Evaluation Metrics
Suppose you want a search engine that performs well according to some IR effectiveness metric M, such as NDCG. You've collected some labeled training data to tune the ranker. What metric should you optimize on this training data? Should it also be M? This assumption is very common in IR, yet a growing body of experimental evidence says "no". In fact, it is well known in the machine learning literature that optimizing M on training data does not necessarily lead to the best M at test time. I will go through these arguments and explain what the correct training metric should be.
(Joint work with Stephen Robertson.)
This is based onmy tech report "Empirical Risk Minimization is an incomplete inductive principle".
Bio:
Tom Minka is a researcher in the Machine Learning and Perception Group at Microsoft Research Cambridge. His work focuses on probability propagation algorithms for graphical models. He is perhaps best known for developing the Expectation Propagation algorithm. He is also a co-inventor of the TrueSkill matchmaking algorithm used in Xbox Live. Previously, he was a Visiting Assistant Professor at Carnegie Mellon University. He completed his Ph.D. at the MIT Media Lab under the supervision of Rosalind Picard.
08/08/08 (1:30pm, SVC-6/Titan)
Heiko Stoermer (University of Trento, Italy)
OKKAM - A large-scale European Project for Entity-centric Information and Knowledge Management at Web Scale
Recent developments in the Web ecosphere reveal a common trend: keyword-based document search is becoming increasingly unsatisfactory in many domains and application scenarios. What can be observed are attempts to realize what we call "entity-centric" information and knowledge management, namely a change of paradigm in the management and access of information, away from a document-centric view, towards more precise and more structured approach in which (real-world) entities are the pivots of IKM. This trend manifests itself from two directions: on the one hand, we have the efforts of the Semantic Web community and their quest for web-scale Knowledge Representation by the masses; on the other hand, there is an increasing number of vertical domains and applications that center their information around entities such as geographical locations (Geonames, Yahoo), biomedical objects (LSID), publications (DOI), people/social networks (FOAF, Facebook, Xing, ...), and many others. Making these information "islands" usable in an integrated fashion is one of the main challenges that the Web is going to face in the near future.
In this talk I will argue that at the center of such a vision there is the need for good, re-usable entity identifiers as well as a scalable, open, and trustworthy backbone infrastructure for entity identifier management, which we call the Entity Name System (ENS). Only if we manage to foster global convergence on identifiers for entities will we be able to move the Web to a more structured and integrated source of knowledge that will allow for better, novel ways of web-scale IKM. From 2008 to 2010, the European Commission is funding the OKKAM large-scale Integrated Project which is pursuing precisely this goal. During the seminar, I will introduce motivations and use cases, give an overview of the OKKAM project, as well as illustrate in more details the current R&D challenges that we are addressing. For more details, see http://fp7.okkam.org.
Bio:
Heiko Stoermer works as a post-doctoral researcher in the area of Knowledge Management and Information Systems at the University of Trento, Italy. After studies in Linguistics he moved to the area of Computer Science, where he holds a German university degree. He gathered professional experience as a software consultant and project manager, and opted for an academic career in 2004 when he received a scholarship at the International Doctorate School in Trento. In 2008, he was awarded with a PhD for his work on Identity and Reference on the Semantic Web. His research interests include information integration, semantic interoperability and contextual knowledge representation. He has (co-) authored a number of scientific publications, acted as a reviewer for important international conferences and is co-organizer of several workshops. Currently, he is serving as UNITN's R&D Manager for the European large-scale Integrated Project OKKAM.
07/11/08 (1:30pm, SVC-6/Titan)
Jeff Jonas (IBM)
Beyond Search: The Data Must Find the Data and the Relevance Must Find the User
Why "search" is second fiddle to "the data must find the data and the relevance must find the user."
Bio:
Jeff Jonas is chief scientist of the Entity Analytic Solutions group and an IBM Distinguished Engineer. In these capacities, he is responsible for shaping the overall technical strategy of next generation identity analytics and the use of this new capability in the overall IBM technology strategy.
The IBM Entity Analytic Solutions group was formed based on technologies developed by Mr. Jonas as the founder and chief scientist of Systems Research & Development (SRD). SRD was acquired by IBM in January 2005.
Today, Mr. Jonas applies his real world and hands on experience in software design and development to drive technology innovations while delivering higher levels of privacy and civil liberties protections. By way of example, the most recent breakthrough developed by Mr. Jonas involves an innovative technique enabling advanced data correlation while only using irreversible cryptographic hashes. This new capability makes it possible for organizations to discover records of common interest (e.g., identities) without the transfer of any privacy invading content. This privacy-enhancing technology known as anonymous entity resolution delivers extraordinary new levels of privacy protection while enabling technology to contribute to critical societal interests like clinical health care research, aviation safety, homeland security, fraud detection and identity theft.
Jeff Jonas’s innovations have received coverage in such publications as The Wall Street Journal, Washington Post, Fortune, and Computerworld and have been featured on ABC Primetime with Peter Jennings, The Discovery Channel, The Learning Channel and MSNBC. Known for his dynamic presentational style, he is a popular speaker on technology, security and privacy and has spoken at events such as the Federal Convention on Emerging Technologies Forum on Homeland Security, National Security Agency’s INFOSEC Seminar Series, American Society for Industrial Security, Black Hat, PC Forum, Wharton Technology Conference, National Retail Federation Annual Fraud Conference and Computers, Freedom and Privacy Conference.
Mr. Jonas is a member of the Markle Foundation Task Force on National Security in the Information Age and actively contributes his insights on privacy, technology and homeland security to leading national think tanks, privacy advocacy groups, and policy research organizations, including the Center for Democracy and Technology, Heritage Foundation and the Office of the Secretary of Defense Highlands Forum. Mr. Jonas is a senior associate to the Center for Strategic and International Studies and is a member of the National Academies’ Committee on State Voter Registration Databases.
Jeff Jonas blogs at: http://jeffjonas.typepad.com
06/27/08 (1:30pm, SVC-6/Titan)
Umar Syed (Princeton University / Microsoft Research)
A Game-Theoretic Approach to Apprenticeship Learning
We study the problem of an apprentice learning to behave in an environment with an unknown reward function by observing the behavior of an expert. We follow on the work of Abbeel and Ng (2004) who considered a framework in which the true reward function is assumed to be a linear combination of a set of known and observable features. We give a new algorithm that, like theirs, is guaranteed to learn a policy that is nearly as good as the expert’s, given enough examples. However, unlike their algorithm, we show that ours may produce a policy that is substantially better than the expert’s. Moreover, our algorithm is computationally faster, is easier to implement, and can be applied even in the absence of an expert. The method is based on a game-theoretic view of the problem, which leads naturally to a direct application of the multiplicative-weights algorithm of Freund and Schapire (1999) for playing repeated matrix games. In addition to our formal presentation and analysis of the new algorithm, we provide an experimental demonstration of the algorithm on a toy video-game environment.
Joint work with Robert E. Schapire.
Bio:
Umar Syed is a fourth year Ph. D. student in the Computer Science department at Princeton University, advised by Rob Schapire. His research interests include reinforcement learning, online learning and game theory.
06/20/08 (9:30am, SVC-6/Titan)
Aaron Roth (CMU / Microsoft Research)
A Learning Theory Approach to Non-Interactive Database Privacy
We consider the problem of privately releasing a dataset so that others may usefully learn from it. Previous work shows that for some classes of queries, any dataset useful for all queries in that class would allow an adversary to reconstruct almost the entire dataset. However, we show that (at least if one ignores computational constraints), it is possible to privately release a dataset while still being useful for all queries natural for learning. More specifically, we show that for any class of functions C, it is possible to privately release a data set that is simultaneously useful for every function in C, given a sample of size only polynomial in the VC-dimension of C. We note that the sample complexity necessary for learning guarantees and privacy is only slightly larger than the sample complexity needed for learning alone, and so “anything that can be reliably learned from a dataset can be reliably learned from a privacy-preserving dataset of only slightly larger size”. We also give efficient data release mechanisms for some simple classes of functions, like axis-aligned rectangles of constant degree, and large margin half spaces.
This is joint work with Avrim Blum and Katrina Ligett.
Bio:
Aaron has just finished his second year as a PhD student at Carnegie Mellon, under the supervision of Avrim Blum. Currently, he is interning at MSR-SVC, working with Dr. Liad Blumrosen.
06/20/08 (9:30am, SVC-6/Titan)
Reid Andersen (Microsoft Research)
Local Graph Algorithms
Local graph algorithms perform small computations by adaptively exploring a small neighborhood of a graph near a specified target vertex. They are highly scalable, with running times that are essentially independent of the size of the graph. Local algorithms have been used to identify communities in web graphs, find link exchange sites, and extract high-value submarkets from advertiser-keyword spending graphs. In this talk, we describe a local algorithm for the following task: identifying the web pages that contribute the most to the PageRank score of a specified target page. We give bounds on the number of nodes that the algorithm must examine to approximately compute the top contributors. The ideal platform for implementing local algorithms is a link server that provides fast access to the link data of a set of adaptively selected pages. We describe an implementation of the PageRank contribution algorithm on LiveLink, a high-performance link server hosting a 20-billion node web graph snapshot.
Bio:
Reid Andersen is a postdoc in the theory group at MSR, Redmond. His current research focus is on local algorithms, algorithms for large bipartite graphs and web graphs, and graph partitioning and community identification.
Postponed
Sam Roweis (University of Toronto)
Making The Sky Searchable: Large Scale Astronomical Pattern Recognition
Imagine you have a picture of the night sky and you want to know where your telescope was pointing when the picture was taken. Several extremely large digital astronomy catalogues are available which hold (among other data) the positions and magnitudes of billions of stars covering the sky. So you should, in principle, be able to search these sky catalogues and find where your image matches. The only catch is that the sky is really big. However, if you are clever you can design a statistical pattern recognition system that solves this problem on a single workstation in less than a second. In collaboration with astronomers at NYU, my group has built such a universal astronomy search engine. Roughly speaking, this engine takes as input any picture of the night sky, and returns as output the location (and orientation) on the sky at which the picture was taken. Current "calibration" systems can do this kind of alignment only if the user provides a very good initial guess at position and orientation, but our system works even when no initial guess is available. In order to achieve this, we have built a massive geometric hashing system, which stores millions of "micro-features" and matches them in an inverted index against features found in a new test image. This matching process very rapidly generates candidate locations on the sky which we can then cheaply verify or refute. As a demonstration of the power of the system we have re-solved, completely blindly, the astrometry of every field ever taken by the Sloan Digital Sky Survey, solving over 99.9% of the approximately 400,000 images correctly with zero false positive matches. We have also used it to solve for the locations of astronomical images from historical plate archives, amateur astronomer pictures and figures from old research papers. In the future, we envision our engine being used for real-time verification of telescope software, to recover position information from all legacy data of astronomical images, and for "collaborative observing" between amateur and professional astronomers around the globe.
BONUS: If you email me an image I'll try to solve it with our system and show the results live during the talk.
Joint work with Dustin Lang, Keir Mierle, Michael Blanton and David Hogg.
Project Website: http://astrometry.net
Bio:
Sam Roweis is an Associate Professor in the Department of Computer Science at the University of Toronto. His research interests are in machine learning, data mining, and statistical signal processing. Roweis did his undergraduate degree at the University of Toronto in the Engineering Science program and earned his doctoral degree in 1999 from the California Institute of Technology working with John Hopfield. He did a postdoc with Geoff Hinton and Zoubin Ghahramani at the Gatsby Unit in London, and was a visiting faculty member at MIT in 2005. He has also worked at several industrial research labs including Google, Bell Labs, Whizbang! Labs and Microsoft. He is the holder of a Canada Research Chair in Statistical Machine Learning, a Sloan Research Fellowship, the winner of a Premier's Research Excellence Award, and a Scholar of the Canadian Institute for Advanced Research.
06/06/08 (1:30pm, SVC-6/Titan)
Tao Cheng (UIUC / Microsoft Research)
Entity Search: Search for what you what
As the Web has evolved into a data-rich repository, with the standard "page view", current search engines are increasingly inadequate. While we often search for various data "entities" (e.g., phone number, paper PDF, date), today's engines only take us indirectly to pages. Towards searching directly for finding information of finer granularity, we propose the concept of entity search, a significant departure from traditional document retrieval. In particular, we focus on the core challenge of ranking entities, and develop EntityRank, a probabilistic model that integrates both local and global information in ranking. We evaluate our online prototype over a 2TB Web corpus, and show that EntityRank performs effectively. This presentation is mainly about the works published in CIDR 07, SIGMOD 07 (demo), and VLDB 07.
Highlight:
In addition to the presentation, a LIVE and INTERESTING demo over large-scale dataset will be given.
Bio:
Tao Cheng is a fourth year PhD student from the University of Illinois at Urbana-Champaign. His research interests are mainly about web search and mining. He works with Prof. Kevin Chang at UIUC on the WISDM project, proposing and studying a new paradigm of search called entity search – to search over the fine granularity entities embedded in web pages. Tao is a recipient of Yahoo 2007 Key Technical Challenge Award. Currently, he is interning at the Search Labs at Microsoft, working together with Dr. Stelios Paparizos.
05/27/08 (10:30am, SVC-6/Titan)
Gerhard Weikum (Max-Planck Institute for Informatics, Germany)
Harvesting, Searching, and Ranking Knowledge from the Web
There is a trend to advance the functionality of search engines to a more expressive semantic level. This is enabled by employing large-scale information extraction of entities and relationships from semistructured as well as natural-language Web sources. In addition, harnessing Semantic-Web-style ontologies and reaching into Deep-Web sources can contribute towards a grand vision of turning the Web into a comprehensive knowledge base that can be efficiently searched with high precision.
This talk presents ongoing research at the Max-Planck Institute for Informatics towards this objective, centered around the YAGO knowledge base and the NAGA search engine. YAGO is a large collection of entities and relational facts that are harvested from Wikipedia and WordNet with high accuracy and reconciled into a consistent RDF-style "semantic" graph. NAGA provides graph-template-based search over this data, with powerful ranking capabilities based on a statistical language model for graphs. Advanced queries and the need for ranking approximate matches pose efficiency and scalability challenges that are addressed by algorithmic and indexing techniques.
This is joint work with Georgiana Ifrim, Gjergji Kasneci, Maya Ramanath, and Fabian Suchanek.
Short Bio:
Gerhard Weikum is a Research Director at the Max-Planck Institute for Informatics (MPII) in Saarbruecken, Germany, where he is leading the department on databases and information systems. He is also the spokesperson of the International Max-Planck Research School (IMPRS) for Computer Science. Earlier he held positions at Saarland University in Saarbruecken, Germany, at ETH Zurich, Switzerland, at MCC in Austin, Texas, and he was a visiting senior researcher at Microsoft Research in Redmond, Washington. He received his diploma and doctoral degrees from the University of Darmstadt, Germany. He was elected as an ACM Fellow in 2005. He has also received CIDR 2005 Timeless Idea Award, VLDB 10-Year Award (2002), and ACM SIGMOD Conference 1998 Best Paper Award (for joint paper with David Lomet).
Homepage: http://www.mpi-inf.mpg.de/~weikum/
03/21/08 (1:30pm, SVC-6/Titan)
Ashish Goel (Stanford University)
Towards robust reputation and recommendation systems
Reputation and recommendation systems play an immense role in the success of the web. For example, search engines not just find but also rank information. And sites such as Netflix and Amazon don't just distribute movies/books but also make relevant recommendations. Both reputation and recommendation systems suffer from the problems of malicious spam, bad-mouthing of competitors, self-promotion via proxies, and inadequate feedback. We present algorithmic and incentive based approaches to making these systems robust. In particular, we present a ranking algorithm and an incentive mechanism that provides an arbitrage opportunity to users whenever a low quality entity is ranked above a high quality entity.
Social networks need to be robust not just to traditional security attacks but also to social misbehavior such as spam, bad-mouthing, and free-loading. On a more philosophical level, we make the case that algorithmic techniques alone will not provide social robustness, and we will also need incentives and other social primitives.
Bio:
Ashish Goel is an Associate Professor of Management Science and Engineering and (by courtesy) Computer Science at Stanford University, and a member of Stanford's Institute for Computational and Mathematical Engineering. He received his PhD in Computer Science from Stanford in 1999, and was an Assistant Professor of Computer Science at the University of Southern California from 1999 to 2002. His research interests lie in the design, analysis, and applications of algorithms. Professor Goel is a recipient of an Alfred P. Sloan faculty fellowship (2004-06), a Terman faculty fellowship from Stanford, and an NSF Career Award (2002-07).
Professor Goel is currently also the 3COM faculty development scholar in Stanford's School of Engineering.
02/26/08 (10:30am, SVC-6/Titan)
Joe Konstan
Help is Out There: Online Community, Community Artifacts, and a New Way of Harnessing Knowledge
Have a question? Go to wikipedia. Or browse a discipline-specific archive. Or even just post it online and hope people will help. This talk discusses several ongoing studies that are part of CommunityLab--a collaboration to understand how to design online communities to elicit participation. From them, we can draw lessons on how social psychology, economics, and other sciences of human behavior can be harnessed to understand and then design effective online communities.
Bio:
Joseph A. Konstan is Professor of Computer Science and Engineering at the University of Minnesota. His research addresses a variety of human-computer interaction issues, including personalization (particularly through recommender systems), eliciting on-line participation, and designing computer systems to improve public health. He is probably best known for his work in collaborative filtering recommenders (the GroupLens project), and for his work in online HIV prevention. He is co-author of Word of Mouse: The Marketing Power of Collaborative Filtering, a book that reviews three dozen good and poor examples of personalization in research and deployed systems.
Dr. Konstan received his Ph.D. from the University of California, Berkeley in 1993. He is an ACM Distinguished Scientist. Previously, he has served as an ACM Distinguished Lecturer and an IEEE Distinguished Visitor. Dr. Konstan served from 2003-2006 as President of ACM SIGCHI, the 4500-member Special Interest Group on Human-Computer Interaction. He currently serves as Chair of ACM’s Special Interest Group Governing Board, and as a member of ACM’s Executive Committee. Dr. Konstan is an active consultant who has worked for more than 15 companies on issues related to human-computer interaction, personalization, and general software issues. He has traveled and lectured extensively, giving over 200 talks in more than 25 countries worldwide.
02/05/08 (10:30am, SVC-6/Titan)
S.Sudarshan
Keyword Search on Graph-Structured Data
A variety of types of data, such as relational, XML and HTML data can be naturally represented using a graph model of data, with entities as nodes and relationships as edges. Graph representations are also a natural choice for representing information extracted from unstructured data, and for representing information integrated from heterogeneous sources of data.
Keyword search is a natural way to retrieve information from graph representations of data, especially in the common situation where the graphs do not have a well-defined schema. Unlike text or Web search, there is no natural notion of a document, and information about a single conceptual entity may be split across multiple nodes. Answers to queries are therefore usually modeled as trees that connect nodes matching the keywords, with each answer tree having an associated score. The goal of keyword search then is to find answers with the highest scores. A number of systems, including the BANKS system developed at IIT Bombay, are based on such a model for answering keyword queries.
In this talk we first outline background material on keyword querying on graph data, including models for ranking of answer trees. We then focus on search algorithms for finding top-ranked answers. We outline key issues in finding top-ranked answers, and present algorithms that address the problem in the context of in-memory graphs. We then consider the problem of search on external memory graphs, and briefly outline our on-going work on external memory graph search based on a multi-granular graph representation.
Bio:
S. Sudarshan received the Ph.D. from the Univ. of Wisconsin, Madison in 1992. He was a Member of the Technical Staff in the database research group at AT&T Bell Laboratories, from 1992 to 1995, and he has been at the Indian Institute of Technology (IIT), Bombay since 1995. He spent a year on sabbatical at Microsoft Research, USA in 2004-05. Sudarshan's research interests center on database systems, and his current research projects include keyword querying on databases, processing and optimization of complex queries, and database support for securing and testing applications. Sudarshan is a co-author of the widely used textbook, Database System Concepts, 5th Ed., by Silberschatz, Korth and Sudarshan.
10/19/07 (1:30pm, SVC-5/2200 Neptune)
Yuri Lifshits
Similarity Search: A Web Perspective
Similarity search is the problem of preprocessing a database of N objects in such a way that given a query object, one can effectively determine its nearest neighbors in database.
Similarity search is closely connected to many algorithmic problems in the web. In this talk we will focus on personalized news aggregation (searching for news articles that are most similar to the user's profile of interests) and behavioral targeting (searching for the most relevant user for displaying a given advertisement). Other web-applications include near-duplicate detection, (2) finding related pages, classiffcation/fitering tasks like detecting spam pages, spelling correction, sense disambiguation, computing co-occurrence similarity, social network analysis (e.g. suggesting new friends), recommendation systems.
We describe features that make web applications somewhat different from previously studied models. Thus we re-examine the formalization and the classical algorithms for similarity search. This leads us to new algorithms (we present two of them) and numerous open problems in the field.
Bio:
Yury Lifshits obtained his PhD degree from Steklov Institute of Mathematics at St.Petersburg (2007). He is currently a postdoc at Caltech. He won two gold medals at International Mathematical Olympiads, received The Best Paper Award in Application Track of CSR'07 and The Yandex Teaching Award (2006) for his course "Algorithms for Internet".
Yury Lifshits Homepage: http://yury.name
The Homepage of Nearest Neighbors (maintained by Yury): http://simsearch.yury.name
08/24/07 (1:30pm, SVC-5/2200 Neptune)
Chen Li
VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams
Many applications have the emerging need to answer approximate string queries efficiently. Such a query can ask for strings that are similar to a given string, such as "names similar to Schwarzenegger" and "telephone numbers similar to 412-0964," where "similar to" uses a predefined, domain-specific function to specify the similarity between strings, such as edit distance. We will report our recent results on solving related problems. We will focus on a new technique, called VGRAM. It improves those algorithms that use fixed-length grams, which are substrings of a string used as signatures to identify similar strings. The main idea of VGRAM is to judiciously choose high-quality grams of variable lengths from a collection of strings to support queries on the collection. We give a full specification of this technique, including how to select high-quality grams from the collection, how to generate variable-length grams for a string based on the preselected grams, and what is the relationship between the similarity of the gram sets of two strings and their edit distance. A primary advantage of the technique is that it can be adopted by a plethora of approximate string algorithms without the need to modify them substantially. We present our extensive experiments on real data sets to evaluate the technique, and show the significant performance improvements on three existing algorithms. Related results have appeared in recent VLDB papers.
Bio:
Chen Li is an associate professor in the Department of Computer Science at the University of California, Irvine. He received his Ph.D. degree in Computer Science from Stanford University in 2001, and his M.S. and B.S. in Computer Science from Tsinghua University, China, in 1996 and 1994, respectively. He received a National Science Foundation CAREER Award in 2003. He is currently a part-time Visiting Research Scientist at Google. His research interests are in the fields of database and information systems, including data cleansing, data integration and sharing, data warehousing, and data privacy.
08/06/07 (1:30pm, SVC-5/2200 Neptune)
Juliana Fereire
Discovering and Organizing Hidden-Web Sources
The hidden Web consists of sites whose contents typically reside in databases and are only exposed on demand, as users fill out and submit forms. As the volume of information in the hidden Web grows, there is an increased need for techniques and tools that allow users and applications to uncover and leverage this information. Several applications attempt to make the hidden information more easily accessible, including metasearchers, hidden-Web crawlers, online database directories, and Web information integration systems. Since, for a particular domain of interest, there are many hidden-Web sources with varied coverage that need to be integrated or searched, a key requirement for these applications is the ability to locate relevant sources.
At the University of Utah, we are developing a a scalable infrastructure that automates, to a large extent, the process of discovering and organizing hidden-Web sources. In this talk, I will describe two important components of this infrastructure: an adaptive Web crawler that automatically learns patterns of promising links and adapts its focus as the crawl progresses substantially improving harvest rates, i.e., the number of relevant forms retrieved per pages crawled; and a classifier ensemble that combines different learning techniques to accurately identify forms as belonging to a given domain.
Joint work with Luciano Barbosa. This work has been partially funded by the National Science Foundation and a University of Utah Seed Grant.
Bio:
Juliana Freire joined the faculty of the School of Computing at the University of Utah in July 2005. Before, she was member of technical staff at the Database Systems Research Department at Bell Laboratories (Lucent Technologies) and an Assistant Professor at OGI/OHSU. Juliana's research has focused on extending traditional database technology and developing techniques to address new data management problems introduced by the Web and scientific applications. She is an active member of the database community, having co-authored over 60 technical papers and holding 4 U.S. patents. She has participated as a program committee member in over 40 events. She is a vice-chair of WWW2008; served as vice-chair for ICDE2007 and WWW2005; and co-chaired WebDB2003. Her research has been funded by grants from the National Science Foundation, Department of Energy and the University of Utah.
08/01/07 (1:30pm, SVC-5/2400 Gulf)
Nick Koudas
BlogScope: A system for the analysis of the Blogosphere
BlogScope (www.blogscope.net), is a system for the analysis of the Blogosphere currently under development at the University of Toronto. BlogScope is an information discovery and text analysis system that offers a set of unique features. Such features include, spatio-temporal analysis of blogs, flexible navigation of the Blogosphere through information bursts, keyword correlations and burst synopsis, as well as enhanced ranking functions for improved query answer relevance. I will describe the system in its current state, our ongoing work in progress and detail our future plans and research agenda. I will also describe SPIDER a flexible string matching system, outline its functionality and touch on the principles of the underlying technology.
If time permits I will show demos of both systems.
Bio:
Nick Koudas is a professor at the department of Computer Science at the University of Toronto. Prior to joining UofT he was a principal member of technical staff at ATT Research and an adjunct professor at Columbia University. He works on web data management, data analysis, performance and algorithms.
07/19/07 (3:00pm, SVC-5/2400 Gulf)
Soumen Chakrabarti
Ranking for keyword queries in entity-relationship graphs: Machine learning and performance issues
This talk In statistical machine learning, graphs are increasingly used as a powerful common representation to model complex interactions between random variables. In contrast, in the Web community, the predominant graph is the Web graph, and the predominant function served by the graph is PageRank-style ranking. The integration of Web, database and IR technologies promises a far more ambitious agenda of representing, searching and mining richer data---XML, tables, social networks, blogs---in the form of unified and integrated entity-relationship (E-R) graphs with typed nodes and edges, where nodes and edges can also have text and structured attributes. We are researching two facets of E-R graph search.
First, how to automate the design of ranking functions in E-R graph search, and how to extend relevance feedback from the IR world to the E-R graph world? In PageRank, the link graph fixes the parameters of a Markovian walk. Instead, here we are interested in an inverse problem: given partial orders between nodes to be satisfied by the PageRank vector, learn the random walk parameters so that the full PageRank vector would complete the partial order to generalize to unseen preferences. Obviously, this can be done only if we assume some form of smoothness or regularity in the preferences. We will discuss three formulations involving graph Laplacians and network circulations, give algorithms and quote some generalization guarantees.
Second, having learnt the Markov walk parameters, we must answer queries over large E-R graphs within interactive time. Unlike static PageRanking, this is nontrivial if parts of the graph are materialized only at query time, which is inevitable if words and attribute values are transient nodes in the graph. What sort of preprocessed indices can we build to assist interactive speed dynamic graph conductance queries?
Can we set up a graceful tradeoff between preprocessing time, index space, query speed, and result accuracy? Can we harness query log statistics to optimize our indices, as is common in relational databases? We will give top-k graph conductance ranking algorithms that satisfy all the above properties.
(Joint work with Alekh Agarwal, Manish Gupta, Amit Pathak and Kriti Puniyani. Supported partly by IBM Research and Microsoft Research.)
06/19/07 (1:30pm, SVC-5/2200 Neptune)
Geoff Webb
Finding the Real Patterns
Pattern discovery is one of the fundamental tasks in data mining. Pattern discovery typically explores a massive space of potential patterns to identify those that satisfy some user-specified criteria. This process entails a huge risk (in many cases a near certainty) that many patterns will be false discoveries. These are patterns that satisfy the specified criteria with respect to the sample data but do not satisfy those criteria with respect to the population from which those data are drawn. This talk discusses the problem of false discoveries, and presents techniques for avoiding them.
Bio:
Geoff Webb holds a research chair in the Faculty of Information Technology at Monash University. Prior to Monash he held appointments at Griffith University and then Deakin University where he received a personal chair. His primary research areas are machine learning, data mining, and user modelling. He is widely known for his contribution to the debate about the application of Occam's razor in machine learning and for the development of numerous algorithms and techniques for machine learning, data mining and user modelling. His commercial data mining software, Magnum Opus, is marketed internationally by Rulequest Research. He is editor-in-chief of the highest impact data mining journal, Data Mining and Knowledge Discovery and a member of the editorial boards of Machine Learning, ACM Transactions on Knowledge Discovery in Data, and User Modeling and User-Adapted Interaction.
06/15/07 (1:30pm, SVC-5/2200 Neptune)
Hao Chen
From Forum Spam to Syndication-based Spam Infrastructure
Forum spam has become a major means of search engine spam. To evaluate the impact of forum spam on search quality, we have conducted a comprehensive study from three perspectives: that of the search user, the spammer, and the forum hosting site. We examine spam blogs and spam comments in both legitimate and honey forums. We propose context-based analyses, consisting of redirection and cloaking analysis, to detect spam automatically and to overcome shortcomings of content-based analyses. Our study shows that forum spamming is a widespread problem. Spammed forums, powered by the most popular software, show up in the top 20 search results for all the 189 popular keywords. On two blog sites, more than half (75% and 54% respectively) of the blogs are spam, and even on a major and reputably well maintained blog site, 8.1% of the blogs are spam. The observation on our honey forums confirms that most comment spam is meant to increase page rank rather than generate immediate traffic.
We have further observed that comment spam is often part of some syndication-based spam infrastructure, which connects spammers with advertisers using Web redirection. We propose a five-layer, double-funnel model for describing end-to-end redirection spam, present a methodology for analyzing the layers, and identify prominent domains on each layer using two sets of commercial keywords -- one targeting spammers and the other targeting advertisers. The methodology and findings are useful for search engines to strengthen their ranking algorithms against spam, for legitimate website owners to locate and remove spam doorway pages, and for legitimate advertisers to identify unscrupulous syndicators who serve ads on spam pages.
This is a joint work with Yi-Min Wang and Ming Ma at Microsoft Research, Redmond, and Yuan Niu and Francis Hsu at UC Davis.
Bio:
Hao Chen is an assistant professor at the University of California, Davis. He received his Ph.D. in Computer Science from the University of California, Berkeley in 2004. He has a broad range of interests in computer security, including Web security, cellular network security, firewall security, and software security. He received the National Science Foundation CAREER award in 2007.
05/25/07 (1:30pm, SVC-5/2200 Neptune)
Chris Olston
Navigation-Aided Retrieval
Users searching for information in hypermedia environments often perform *querying* followed by manual *navigation*. Yet, the conventional text/hypertext retrieval paradigm does not explicitly take post-query navigation into account. This talk proposes a new retrieval paradigm, called *navigation-aided retrieval* (NAR), which treats both querying and navigation as first-class activities. In the NAR paradigm, querying is seen as a means to identify starting points for navigation, and navigation is guided based on information supplied in the query. NAR is a generalization of the probabilistic information retrieval model.
Bio:
Christopher Olston is a senior research scientist at Yahoo! Research, after a stint as assistant professor at Carnegie Mellon University from 2003 to 2005. His research interests include data management and web search. Olston received his Ph.D. in 2003 from Stanford University, which is a really nice place to rollerblade.
05/18/07 (1:30pm, SVC-5/2200 Neptune)
Anthony Kosky
Exploring Heterogeneous Data bases and Web Data Sources with the OPM Multi-database Tools
The OPM Multi-database Query System was developed in the late 90's at Lawrence Berkeley National Lab, and later at Gene Logic, with collaboration from researchers at Glaxo Smith-Kline, as an extension of the Object Protocol Model (OPM) set of tools. The system was targeted at the primarily at the bioinformatics domain, and has been used internally within Gene Logic, for various projects within LBNL and as part of the TI-Net project at Glaxo Smith-Kline.
The OPM tools provided a basis for a federated database system in which the component data sources could be traditional relational databases, web data sources, XML or flat-file databases or external applications such as homology-search engines (BLAST). Data sources maintained independent schemas, but were viewed through a common data model (OPM) and supported a uniform query interface. A Java-applet based graphical user interface allowed users to navigate between data sources and dynamically create web query forms. The OPM Multi-database Query System included some additional novel features, such as the use of inter-database links for navigating between databases, and support for application specific data types (ASDTs), not found in other federated database systems, and other features, such as configurable query set specifications for web data sources, which have only recently appeared in the latest generation of mediated systems.
In this talk I will be providing an overview of the OPM multi-database tools, focusing on some of the novel and innovative features, and comparing the approach taken to the current generation of tools and approaches for federated database systems.
Bio:
Anthony Kosky received his PhD in Computer Science from the University of Pennsylvania in 1995. His thesis was entitled Transformating Databases with Recursive Data Structures, and looked at the interplay between data integration, database transformations and database constraints.
Subsequently he joined the Data Management Research and Development Group at Lawrence Berkeley National Laboratory in order to work on the OPM project to develop database development and query tools based on an object-relational mapping. In 1997 he joined Gene Logic Inc., together with other members of the OPM group, in order to commercialize the OPM tools for the bioinformatics and life sciences sectors. While at Gene Logic, he became involved in the GeneExpress and GX Connect projects to provide data management, exploration and analysis tools for large databases of gene expression and related genomics and clinical data.
Most recently he has been working for Axiope Inc., a spin-out of the University of Edinburgh, developing XML-based data management systems for bio-medical applications.
05/11/07 (1:30pm, SVC-5/2200 Neptune)
Ramakrishnan Akella
Towards Personalization and Service Platforms
We describe elements of our research as we consider developing Service Platforms for Collaborative Search. Our goal is to enable the more effective functioning of users who are looking at both web and internal document collections, with the intent to provide service to a set of demanding users with complex queries. We have designed user interactive models for information retrieval and document clustering. Two specific set of results we have obtained concern:
• Active Learning: We have incorporated a combination of diversity, retrieval engine scores and relevance feedback into an active learning algorithm. This algorithm significantly outperforms existing algorithms by up to 15% on TREC HARD data.
• Semi-supervised clustering: We have developed a more probabilistic model to incorporate must and must-not link constraints on documents in clusters. We develop a new semi-supervised clustering algorithm which builds on this, and Bayesian regularization. This algorithm outperforms those in the literature by 15% again on 20 News group data set.
We will also touch upon ongoing and proposed research:
• More principled models which build upon our current active learning model, including relevance feedback and subspace projection methods.
• Collaborative search, to share learning across users by analyzing the cooccurrence between user clusters, query clusters and web-page clusters.
• Social network analysis in enterprise search, discussing how interactions among customers and engineers can be used to improve search performance.
(Joint work with Zuobing Xu, Seinjuti Chakraborty, and Jyotsna Gangwar)
10/26/06 (3:00pm, SVC-5/2200 Neptune)
Sudarshan Murthy (Portland State University)
Sub-search: Searching at Sub-document Granularity
Search-engine users are often interested in finding relevant sub-documents (that is, document fragments), but search engines return document lists without much indication of why a returned document is relevant, and which sub-documents are relevant. Consequently, users perform intra-document searches to find relevant sub-documents on their own, and judge their relevance. For example, a user may employ the find function in a browser to examine the occurrences of a query term in an HTML document. This situation exists because current search systems essentially treat the document as both the unit of transfer and the unit of comprehension or consultation. Sub-search (that is, sub-document search) makes the sub-document a unit of comprehension or consultation.
Two improvements to current search systems can facilitate sub-search: First, enhance search results to provide insight into sub-documents that make a document relevant. Second, allow users to navigate directly from an enhanced search result to a sub-document of interest. In both cases, sub-documents should not be limited to those defined by document authors. Also, the improvements should work for any type of document. These improvements allow users to judge the relevance of sub-documents without opening documents, and reduce user-effort to view sub-documents in their original context.
We present one approach to sub-search based on the notion of superimposed information, and demonstrate a prototype implementation. We also outline alternative approaches, and introduce some ideas on aggregating and ranking sub-documents.
10/18/06 (2:00pm, SVC-5/2200 Neptune)
Heikki Mannila (Helsinki)
Algorithms for Discovering Bucket Orders from Data
We consider bucket orders, i.e., total orders with ties. They can be used to capture the essential order information without overfitting the data: they form a useful concept class between total orders and arbitrary partial orders. We address the question of finding a bucket order for a set of items, given pairwise precedence information between the items. We describe simple and efficient algorithms for finding good bucket orders. Several of the algorithms have a provable approximation guarantee, and they scale well to large datasets. We provide experimental results on artificial and a real data that show the usefulness of bucket orders and demonstrate the accuracy and efficiency of the algorithms.
09/26/06 (1:00pm, SVC-5/2200 Neptune)
Kwong Yung (IT.com)
Discovering Semantic Themes and Social Networks from Email Corpus
The Enron corpus of email messages is analyzed using several novel extensions of latent Dirichlet allocation, a Bayesian model of document generation. The body of an email message is viewed as the usual bag of words. In addition to the textual body, however, the author and the recipients of the email message also provide important clues about the content of the email message. Incorporating social network underlying the email corpus not only improves discovery of semantic themes in the corpus but also identifies roles of actors in the network. Leveraging both the semantic content and the social network underlying an email corpus makes information retrieval far more powerful. The standard LDA model is extended in two directions. In one direction, the author-recipient-topic model of McCallum 2005 is cast in a two-dimensional framework of horizontal and vertical segmentation of topics. In the other direction, the hierarchical model of Blei 2004 is relaxed by allowing multiple paths for each document. Combining these two threads of improvements on LDA results in advances on both theoretical and practical fronts.
01/23/09 (10:30am, SVC-6/Titan)
Title: Information Overload: Models and Methods to Live with it
Speaker:Georgia Koutrika, Stanford University
Abstract: If knowledge is power then it comes to no surprise why most of our community research boils down to the fundamental question of how we can master data. Our problem is not that we lack data rather that we have TOO much of it. Information overload is an open problem further complicated by the number and diverse needs of information seekers. The problem is not just finding the right information but finding the right information for the right person. In this talk, I will present research efforts for supporting personalized search results and recommendations. I will discuss user preference modeling over structured data and methods for personalizing searches for a user given a profile that stores the user's preferences. Finally, I will present an approach that allows recommendations to be easily defined, personalized, and processed over structured data.
Bio: Georgia Koutrika is a postdoctoral researcher at the CS Department of Stanford University in USA. She has a PhD on computer science, a MSc in advanced information systems, and a BSc in informatics from the University of Athens' Department of Informatics and Telecommunications in Greece. Georgia is interested in how people can find, visualize and interact with information and services and she has worked on user behavior and preference modeling, personalization, information access and sharing, recommendations, spam, query processing and optimization. She is particularly intrigued by the potential and challenges of social sites and a lot of her recent research is delivered through CourseRank, a social site for course planning developed at and used in Stanford. Georgia has received distinguished awards for her PhD, MSc, and undergrad studies from the Greek Scholarship Foundation (I.K.Y.). She has also earned young researcher study fellowships from the Onassis Foundation and the European Network of Excellence in Digital Libraries (DELOS). She has served as a PC co-chair of three workshops (in conj. with UM’07, NLDB’08, and VLDB’08) and as a PC member in several conferences and workshops, including ICDE, WWW, EDBT, and ESWC.



