*
Quick Links|Home|Worldwide
Microsoft*
Search for


Microsoft Research News & Highlights

Tech Talks

We organize invited talks by distinguished people in academia and industry. The talks are usually held on alternate Fridays. This page provides the summary of past talks as well as upcoming talks.

Contact: Krishnaram Kenthapadi

Date

Title

Speaker

08/28/08

Structured learning models for information extraction

Sunita Sarawagi

08/28/08

Structured Learning for Non-Smooth Ranking Losses

Soumen Chakrabarti

08/15/08

Context Analysis in Text Mining and Search

Qiaozhu Mei

08/15/08

On the Optimization of Evaluation Metrics

Tom Minka

08/08/08

OKKAM - A large-scale European Project for Entity-centric Information and Knowledge Management at Web Scale

Heiko Stoermer

07/11/08

Beyond Search: The Data Must Find the Data and the relevance Must Find the User

Jeff Jonas

06/27/08

A Game-Theoretic Approach to Apprenticeship Learning

Umar Syed

06/20/08

A Learning Theory Approach to Non-Interactive Database Privacy

Aaron Roth

06/20/08

Local Graph Algorithms

Reid Andersen

Postponed

Making The Sky Searchable: Large Scale Astronomical Pattern Recognition

Sam Roweis

06/09/08

Longer Queries, Better Answers?

Bruce Croft

06/06/08

Entity Search: Search for what you what

Tao Cheng

05/27/08

Harvesting, Searching, and Ranking Knowledge from the Web

Gerhard Weikum

05/19/08

Building scalable and fault-tolerant enterprise search platforms

Robbert Van Renesse

03/21/08

Towards robust reputation and recommendation systems

Ashish Goel

02/26/08

Help is Out There: Online Community, Community Artifacts, and a New Way of Harnessing Knowledge

Joe Konstan

02/05/08

Keyword Search on Graph-Structured Data

S.Sudarshan

12/03/07

Global Access to Information: Research Issues in Data Mining and Text Mining

Raj Reddy

10/23/07

Lessons in Engineering Self-Managed Networks

Bruce Maggs

10/19/07

Similarity Search: A Web Perspective

Yuri Lifshits

08/28/07

A Top View of Side Channel Attacks

Adi Shamir

08/24/07

VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams

Chen Li

08/06/07

Discovering and Organizing Hidden-Web Sources

Juliana Freire

08/01/07

BlogScope: A system for the analysis of the Blogosphere

Nick Koudas

07/26/07

Algorithmic Mechanism Design

Noam Nisan

07/19/07

Ranking for keyword queries in entity-relationship graphs: Machine learning and performance issues

Soumen Chakrabarti

06/19/07

Finding the Real Patterns

Geoff Webb

06/15/07

From Forum Spam to Syndication-based Spam Infrastructure

Hao Chen

05/25/07

Navigation-Aided Retrieval

Chris Olston

05/18/07

Exploring Heterogeneous Data bases and Web Data Sources with the OPM Multi-database Tools

Anthony Kosky

05/11/07

Towards Personalization and Service Platforms

Ramakrishnan Akella

04/11/07

Learning to Analyze Sequences

Fernando Pereira

03/27/08

Universal Access to Human Knowledge (Or Public Access to Digital Materials)

Brewster Kahle

10/26/06

Sub-search: Searching at Sub-document Granularity

Sudarshan Murthy

10/18/06

Algorithms for Discovering Bucket Orders from Data

Heikki Mannila

09/26/06

Discovering Semantic Themes and Social Networks from Email Corpus

Kwong Yung

08/28/08 (2:00pm, SVC-6/Titan)

Sunita Sarawagi

Structured learning models for information extraction

Feature-based structured models provide a flexible and elegant framework for various information extraction (IE) tasks. These include label sequences for traditional IE, segmentation models for entity-level extractions, and skip chain models for collective labeling. I will present efficient algorithms for finding the highest scoring prediction in graphs that arise when collectively labeling word sequences that are linked via associative edges.

Another challenging problem in structured learning is formulating training objectives that provide good generalization without requiring complicated inference during training. We review existing max-margin formulations and propose a new formulation that is better suited for decomposable error functions that arise in typical extraction tasks.

Bio:

Sunita Sarawagi researches in the fields of databases, data mining, and machine learning. Her current research interests are information integration, graphical and structured models, and probabilistic databases. She is associate professor at IIT Bombay. Prior to that she was a research staff member at IBM Almaden Research Center. She got her PhD in databases from the University of California at Berkeley and a bachelors degree from IIT Kharagpur. She has several publications in databases and data mining and several patents. She serves on the board of directors of ACM SIGKDD and VLDB foundation. She is program chair for the ACM SIGKDD 2008 conference and has served as program committee member for SIGMOD, VLDB, SIGKDD, ICDE, and ICML conferences. She is on the editorial board of the ACM TODS, ACM TKDD, and FnT for machine learning journals.

08/28/08 (11:00am, SVC-6/Mir)

Soumen Chakrabarti

Structured Learning for Non-Smooth Ranking Losses

Learning to rank from relevance judgment is an active research area. Itemwise score regression, pairwise preference satisfaction, and listwise structured learning are the major techniques in use. Listwise structured learning has been applied recently to optimize important non-decomposable ranking criteria like AUC (area under ROC curve) and MAP (mean average precision). We propose new, almost-linear-time algorithms to optimize for two other criteria widely used to evaluate search systems: MRR (mean reciprocal rank) and NDCG (normalized discounted cumulative gain) in the max-margin structured learning framework. We also demonstrate that, for different ranking criteria, one may need to use different feature maps. Search applications should not be optimized in favor of a single criterion, because they need to cater to a variety of queries. E.g., MRR is best for navigational queries, while NDCG is best for informational queries. A key contribution of this paper is to fold multiple ranking loss functions into a multi-criteria max-margin optimization. The result is a single, robust ranking model that is close to the best accuracy of learners trained on individual criteria. In fact, experiments over the popular LETOR and TREC data sets show that, contrary to conventional wisdom, a test criterion is often not best served by training with the same individual criterion.

Bio:

Soumen Chakrabarti is Associate Professor of Computer Science at IIT-Bombay. He got a PhD from the University of California, Berkeley, in 1996, and was a Research Staff Member at IBM Almaden from 1996 to 1999. During Spring 2004 he was Visiting Associate Professor in the School of Computing, Carnegie-Mellon University. He has published extensively at conferences like WWW, SIGIR, EMNLP/HLT, SIGKDD, VLDB, and SIGMOD and also served frequently as vice chair or program committee member. His paper on Focused Crawling got the Best Paper award at WWW 1999, and others have been invited to Scientific American, IEEE Computer and VLDB Journal. As of 2006, his papers have over 1000 citations in CiteSeer. He has presented many tutorials at WWW, SIGMOD, VLDB and SIGKDD. These have led to a successful textbook, Mining the Web (http://www.cse.iitb.ac.in/~soumen/mining-the-web/), published in 2002, with a second edition currently in progress. He has eight granted US patents. At IIT Bombay, he has obtained research funding from Microsoft Research, IBM Research, GE Research, University of California, Tata Consultancy Services, and NEC Research, and delivered working software systems in several cases. For more details please visit http://www.cse.iitb.ac.in/~soumen/.

08/15/08 (1:30pm, SVC-6/Titan)

Qiaozhu Mei

Context Analysis in Text Mining and Search

The explosive growth of information demands powerful text mining tools to help us digest information and discover hidden knowledge in text. Text data is often associated with various types of context information, such as the time stamp of a news article, the geographical location where a blog article is written, the user who submitted a query to a search engine, and a social network of people. Given a collection of text data with context information, we often would like to extract the topics or themes from text and analyze their variational patterns over context. Such contextual patterns are very helpful for summarizing topics in text, as well as for providing a better targeted web search service. In this talk, we will present a general way of modeling and analyzing context in text with probabilistic models, and new algorithms for discovering and analyzing various contextual patterns, which we refer to as contextual text mining. The proposed models have broad applications in multiple domains to help understand topic evolutions, spatiotemporal impact of events, public opinions, detect topic related social communities, as well as utilize personalization in web search. We will discuss how to design concrete contextual topic models for different types of context, how to label probabilistic topic models, and how to regularize probabilistic models with harmonic functions defined on contexts.

Bio:

Qiaozhu Mei is a Ph.D. candidate of Department of Computer Science at the University of Illinois at Urbana-Champaign. His research interests include text mining and information retrieval, especially context analysis in text mining and retrieval with probabilistic language models and topic models. He has published extensively in these areas, and has received the Best Student Paper Runner-Up Awards of ACM KDD 2006 and ACM KDD 2007. He is also one of the five recipients of the inaugural Yahoo! Ph.D. Student Fellowship.

Please refer to his homepage for more information:

http://sifaka.cs.uiuc.edu/~qmei2/

08/15/08 (10:30am, SVC-6/Mir)

Tom Minka

On the Optimization of Evaluation Metrics

Suppose you want a search engine that performs well according to some IR effectiveness metric M, such as NDCG.  You've collected some labeled training data to tune the ranker.  What metric should you optimize on this training data?  Should it also be M?  This assumption is very common in IR, yet a growing body of experimental evidence says "no".  In fact, it is well known in the machine learning literature that optimizing M on training data does not necessarily lead to the best M at test time.  I will go through these arguments and explain what the correct training metric should be.

(Joint work with Stephen Robertson.)

This is based on my tech report "Empirical Risk Minimization is an incomplete inductive principle".

Bio:

Tom Minka is a researcher in the Machine Learning and Perception Group at Microsoft Research Cambridge.  His work focuses on probability propagation algorithms for graphical models.  He is perhaps best known for developing the Expectation Propagation algorithm.  He is also a co-inventor of the TrueSkill matchmaking algorithm used in Xbox Live.  Previously, he was a Visiting Assistant Professor at Carnegie Mellon University.  He completed his Ph.D. at the MIT Media Lab under the supervision of Rosalind Picard.

08/08/08 (1:30pm, SVC-6/Titan)

Heiko Stoermer (University of Trento, Italy)

OKKAM - A large-scale European Project for Entity-centric Information and Knowledge Management at Web Scale

Recent developments in the Web ecosphere reveal a common trend: keyword-based document search is becoming increasingly unsatisfactory in many domains and application scenarios. What can be observed are attempts to realize what we call "entity-centric" information and knowledge management, namely a change of paradigm in the management and access of information, away from a document-centric view, towards more precise and more structured approach in which (real-world) entities are the pivots of IKM. This trend manifests itself from two directions: on the one hand, we have the efforts of the Semantic Web community and their quest for web-scale Knowledge Representation by the masses; on the other hand, there is an increasing number of vertical domains and applications that center their information around entities such as geographical locations (Geonames, Yahoo), biomedical objects (LSID), publications (DOI), people/social networks (FOAF, Facebook, Xing, ...), and many others. Making these information "islands" usable in an integrated fashion is one of the main challenges that the Web is going to face in the near future.

In this talk I will argue that at the center of such a vision there is the need for good, re-usable entity identifiers as well as a scalable, open, and trustworthy backbone infrastructure for entity identifier management, which we call the Entity Name System (ENS). Only if we manage to foster global convergence on identifiers for entities will we be able to move the Web to a more structured and integrated source of knowledge that will allow for better, novel ways of web-scale IKM. From 2008 to 2010, the European Commission is funding the OKKAM large-scale Integrated Project which is pursuing precisely this goal. During the seminar, I will introduce motivations and use cases, give an overview of the OKKAM project, as well as illustrate in more details the current R&D challenges that we are addressing. For more details, see http://fp7.okkam.org.

Bio:

Heiko Stoermer works as a post-doctoral researcher in the area of Knowledge Management and Information Systems at the University of Trento, Italy. After studies in Linguistics he moved to the area of Computer Science, where he holds a German university degree. He gathered professional experience as a software consultant and project manager, and opted for an academic career in 2004 when he received a scholarship at the International Doctorate School in Trento. In 2008, he was awarded with a PhD for his work on Identity and Reference on the Semantic Web. His research interests include information integration, semantic interoperability and contextual knowledge representation. He has (co-) authored a number of scientific publications, acted as a reviewer for important international conferences and is co-organizer of several workshops. Currently, he is serving as UNITN's R&D Manager for the European large-scale Integrated Project OKKAM.

07/11/08 (1:30pm, SVC-6/Titan)

Jeff Jonas (IBM)

Beyond Search: The Data Must Find the Data and the Relevance Must Find the User

Why "search" is second fiddle to "the data must find the data and the relevance must find the user."

Bio:

Jeff Jonas is chief scientist of the Entity Analytic Solutions group and an IBM Distinguished Engineer. In these capacities, he is responsible for shaping the overall technical strategy of next generation identity analytics and the use of this new capability in the overall IBM technology strategy.

The IBM Entity Analytic Solutions group was formed based on technologies developed by Mr. Jonas as the founder and chief scientist of Systems Research & Development (SRD). SRD was acquired by IBM in January 2005.

Today, Mr. Jonas applies his real world and hands on experience in software design and development to drive technology innovations while delivering higher levels of privacy and civil liberties protections. By way of example, the most recent breakthrough developed by Mr. Jonas involves an innovative technique enabling advanced data correlation while only using irreversible cryptographic hashes. This new capability makes it possible for organizations to discover records of common interest (e.g., identities) without the transfer of any privacy invading content. This privacy-enhancing technology known as anonymous entity resolution delivers extraordinary new levels of privacy protection while enabling technology to contribute to critical societal interests like clinical health care research, aviation safety, homeland security, fraud detection and identity theft.

Jeff Jonas’s innovations have received coverage in such publications as The Wall Street Journal, Washington Post, Fortune, and Computerworld and have been featured on ABC Primetime with Peter Jennings, The Discovery Channel, The Learning Channel and MSNBC. Known for his dynamic presentational style, he is a popular speaker on technology, security and privacy and has spoken at events such as the Federal Convention on Emerging Technologies Forum on Homeland Security, National Security Agency’s INFOSEC Seminar Series, American Society for Industrial Security, Black Hat, PC Forum, Wharton Technology Conference, National Retail Federation Annual Fraud Conference and Computers, Freedom and Privacy Conference.

Mr. Jonas is a member of the Markle Foundation Task Force on National Security in the Information Age and actively contributes his insights on privacy, technology and homeland security to leading national think tanks, privacy advocacy groups, and policy research organizations, including the Center for Democracy and Technology, Heritage Foundation and the Office of the Secretary of Defense Highlands Forum. Mr. Jonas is a senior associate to the Center for Strategic and International Studies and is a member of the National Academies’ Committee on State Voter Registration Databases.

Jeff Jonas blogs at: http://jeffjonas.typepad.com

06/27/08 (1:30pm, SVC-6/Titan)

Umar Syed (Princeton University / Microsoft Research)

A Game-Theoretic Approach to Apprenticeship Learning

We study the problem of an apprentice learning to behave in an environment with an unknown reward function by observing the behavior of an expert. We follow on the work of Abbeel and Ng (2004) who considered a framework in which the true reward function is assumed to be a linear combination of a set of known and observable features. We give a new algorithm that, like theirs, is guaranteed to learn a policy that is nearly as good as the expert’s, given enough examples. However, unlike their algorithm, we show that ours may produce a policy that is substantially better than the expert’s. Moreover, our algorithm is computationally faster, is easier to implement, and can be applied even in the absence of an expert. The method is based on a game-theoretic view of the problem, which leads naturally to a direct application of the multiplicative-weights algorithm of Freund and Schapire (1999) for playing repeated matrix games. In addition to our formal presentation and analysis of the new algorithm, we provide an experimental demonstration of the algorithm on a toy video-game environment.

Joint work with Robert E. Schapire.

Bio:

Umar Syed is a fourth year Ph. D. student in the Computer Science department at Princeton University, advised by Rob Schapire. His research interests include reinforcement learning, online learning and game theory.

06/20/08 (9:30am, SVC-6/Titan)

Aaron Roth (CMU / Microsoft Research)

A Learning Theory Approach to Non-Interactive Database Privacy

We consider the problem of privately releasing a dataset so that others may usefully learn from it. Previous work shows that for some classes of queries, any dataset useful for all queries in that class would allow an adversary to reconstruct almost the entire dataset. However, we show that (at least if one ignores computational constraints), it is possible to privately release a dataset while still being useful for all queries natural for learning. More specifically, we show that for any class of functions C, it is possible to privately release a data set that is simultaneously useful for every function in C, given a sample of size only polynomial in the VC-dimension of C. We note that the sample complexity necessary for learning guarantees and privacy is only slightly larger than the sample complexity needed for learning alone, and so “anything that can be reliably learned from a dataset can be reliably learned from a privacy-preserving dataset of only slightly larger size”. We also give efficient data release mechanisms for some simple classes of functions, like axis-aligned rectangles of constant degree, and large margin half spaces.

This is joint work with Avrim Blum and Katrina Ligett.

Bio:

Aaron has just finished his second year as a PhD student at Carnegie Mellon, under the supervision of Avrim Blum. Currently, he is interning at MSR-SVC, working with Dr. Liad Blumrosen.

06/20/08 (9:30am, SVC-6/Titan)

Reid Andersen (Microsoft Research)

Local Graph Algorithms

Local graph algorithms perform small computations by adaptively exploring a small neighborhood of a graph near a specified target vertex. They are highly scalable, with running times that are essentially independent of the size of the graph. Local algorithms have been used to identify communities in web graphs, find link exchange sites, and extract high-value submarkets from advertiser-keyword spending graphs. In this talk, we describe a local algorithm for the following task: identifying the web pages that contribute the most to the PageRank score of a specified target page. We give bounds on the number of nodes that the algorithm must examine to approximately compute the top contributors. The ideal platform for implementing local algorithms is a link server that provides fast access to the link data of a set of adaptively selected pages. We describe an implementation of the PageRank contribution algorithm on LiveLink, a high-performance link server hosting a 20-billion node web graph snapshot.

Bio:

Reid Andersen is a postdoc in the theory group at MSR, Redmond. His current research focus is on local algorithms, algorithms for large bipartite graphs and web graphs, and graph partitioning and community identification.

Postponed

Sam Roweis (University of Toronto)

Making The Sky Searchable: Large Scale Astronomical Pattern Recognition

Imagine you have a picture of the night sky and you want to know where your telescope was pointing when the picture was taken. Several extremely large digital astronomy catalogues are available which hold (among other data) the positions and magnitudes of billions of stars covering the sky. So you should, in principle, be able to search these sky catalogues and find where your image matches. The only catch is that the sky is really big. However, if you are clever you can design a statistical pattern recognition system that solves this problem on a single workstation in less than a second. In collaboration with astronomers at NYU, my group has built such a universal astronomy search engine. Roughly speaking, this engine takes as input any picture of the night sky, and returns as output the location (and orientation) on the sky at which the picture was taken. Current "calibration" systems can do this kind of alignment only if the user provides a very good initial guess at position and orientation, but our system works even when no initial guess is available. In order to achieve this, we have built a massive geometric hashing system, which stores millions of "micro-features" and matches them in an inverted index against features found in a new test image. This matching process very rapidly generates candidate locations on the sky which we can then cheaply verify or refute. As a demonstration of the power of the system we have re-solved, completely blindly, the astrometry of every field ever taken by the Sloan Digital Sky Survey, solving over 99.9% of the approximately 400,000 images correctly with zero false positive matches. We have also used it to solve for the locations of astronomical images from historical plate archives, amateur astronomer pictures and figures from old research papers. In the future, we envision our engine being used for real-time verification of telescope software, to recover position information from all legacy data of astronomical images, and for "collaborative observing" between amateur and professional astronomers around the globe.

BONUS: If you email me an image I'll try to solve it with our system and show the results live during the talk.

Joint work with Dustin Lang, Keir Mierle, Michael Blanton and David Hogg.

Project Website: http://astrometry.net

Bio:

Sam Roweis is an Associate Professor in the Department of Computer Science at the University of Toronto. His research interests are in machine learning, data mining, and statistical signal processing. Roweis did his undergraduate degree at the University of Toronto in the Engineering Science program and earned his doctoral degree in 1999 from the California Institute of Technology working with John Hopfield. He did a postdoc with Geoff Hinton and Zoubin Ghahramani at the Gatsby Unit in London, and was a visiting faculty member at MIT in 2005. He has also worked at several industrial research labs including Google, Bell Labs, Whizbang! Labs and Microsoft. He is the holder of a Canada Research Chair in Statistical Machine Learning, a Sloan Research Fellowship, the winner of a Premier's Research Excellence Award, and a Scholar of the Canadian Institute for Advanced Research.

http://www.cs.toronto.edu/~roweis/bio.html

06/06/08 (1:30pm, SVC-6/Titan)

Tao Cheng (UIUC / Microsoft Research)

Entity Search: Search for what you what

As the Web has evolved into a data-rich repository, with the standard "page view", current search engines are increasingly inadequate. While we often search for various data "entities" (e.g., phone number, paper PDF, date), today's engines only take us indirectly to pages. Towards searching directly for finding information of finer granularity, we propose the concept of entity search, a significant departure from traditional document retrieval. In particular, we focus on the core challenge of ranking entities, and develop EntityRank, a probabilistic model that integrates both local and global information in ranking. We evaluate our online prototype over a 2TB Web corpus, and show that EntityRank performs effectively. This presentation is mainly about the works published in CIDR 07, SIGMOD 07 (demo), and VLDB 07.

Highlight:

In addition to the presentation, a LIVE and INTERESTING demo over large-scale dataset will be given.

Bio:

Tao Cheng is a fourth year PhD student from the University of Illinois at Urbana-Champaign. His research interests are mainly about web search and mining. He works with Prof. Kevin Chang at UIUC on the WISDM project, proposing and studying a new paradigm of search called entity search – to search over the fine granularity entities embedded in web pages. Tao is a recipient of Yahoo 2007 Key Technical Challenge Award. Currently, he is interning at the Search Labs at Microsoft, working together with Dr. Stelios Paparizos.

05/27/08 (10:30am, SVC-6/Titan)

Gerhard Weikum (Max-Planck Institute for Informatics, Germany)

Harvesting, Searching, and Ranking Knowledge from the Web

There is a trend to advance the functionality of search engines to a more expressive semantic level. This is enabled by employing large-scale information extraction of entities and relationships from semistructured as well as natural-language Web sources. In addition, harnessing Semantic-Web-style ontologies and reaching into Deep-Web sources can contribute towards a grand vision of turning the Web into a comprehensive knowledge base that can be efficiently searched with high precision.

This talk presents ongoing research at the Max-Planck Institute for Informatics towards this objective, centered around the YAGO knowledge base and the NAGA search engine. YAGO is a large collection of entities and relational facts that are harvested from Wikipedia and WordNet with high accuracy and reconciled into a consistent RDF-style "semantic" graph. NAGA provides graph-template-based search over this data, with powerful ranking capabilities based on a statistical language model for graphs. Advanced queries and the need for ranking approximate matches pose efficiency and scalability challenges that are addressed by algorithmic and indexing techniques.

This is joint work with Georgiana Ifrim, Gjergji Kasneci, Maya Ramanath, and Fabian Suchanek.

Short Bio:

Gerhard Weikum is a Research Director at the Max-Planck Institute for Informatics (MPII) in Saarbruecken, Germany, where he is leading the department on databases and information systems. He is also the spokesperson of the International Max-Planck Research School (IMPRS) for Computer Science. Earlier he held positions at Saarland University in Saarbruecken, Germany, at ETH Zurich, Switzerland, at MCC in Austin, Texas, and he was a visiting senior researcher at Microsoft Research in Redmond, Washington. He received his diploma and doctoral degrees from the University of Darmstadt, Germany. He was elected as an ACM Fellow in 2005. He has also received CIDR 2005 Timeless Idea Award, VLDB 10-Year Award (2002), and ACM SIGMOD Conference 1998 Best Paper Award (for joint paper with David Lomet).

Homepage: http://www.mpi-inf.mpg.de/~weikum/

03/21/08 (1:30pm, SVC-6/Titan)

Ashish Goel (Stanford University)

Towards robust reputation and recommendation systems

Reputation and recommendation systems play an immense role in the success of the web. For example, search engines not just find but also rank information. And sites such as Netflix and Amazon don't just distribute movies/books but also make relevant recommendations. Both reputation and recommendation systems suffer from the problems of malicious spam, bad-mouthing of competitors, self-promotion via proxies, and inadequate feedback. We present algorithmic and incentive based approaches to making these systems robust. In particular, we present a ranking algorithm and an incentive mechanism that provides an arbitrage opportunity to users whenever a low quality entity is ranked above a high quality entity.

Social networks need to be robust not just to traditional security attacks but also to social misbehavior such as spam, bad-mouthing, and free-loading. On a more philosophical level, we make the case that algorithmic techniques alone will not provide social robustness, and we will also need incentives and other social primitives.

Bio:

Ashish Goel is an Associate Professor of Management Science and Engineering and (by courtesy) Computer Science at Stanford University, and a member of Stanford's Institute for Computational and Mathematical Engineering. He received his PhD in Computer Science from Stanford in 1999, and was an Assistant Professor of Computer Science at the University of Southern California from 1999 to 2002. His research interests lie in the design, analysis, and applications of algorithms. Professor Goel is a recipient of an Alfred P. Sloan faculty fellowship (2004-06), a Terman faculty fellowship from Stanford, and an NSF Career Award (2002-07).

Professor Goel is currently also the 3COM faculty development scholar in Stanford's School of Engineering.

Photo: http://www.stanford.edu/~ashishg/ashishg.jpg

URL: http://www.stanford.edu/~ashishg/

02/26/08 (10:30am, SVC-6/Titan)

Joe Konstan

Help is Out There: Online Community, Community Artifacts, and a New Way of Harnessing Knowledge

Have a question? Go to wikipedia. Or browse a discipline-specific archive. Or even just post it online and hope people will help. This talk discusses several ongoing studies that are part of CommunityLab--a collaboration to understand how to design online communities to elicit participation. From them, we can draw lessons on how social psychology, economics, and other sciences of human behavior can be harnessed to understand and then design effective online communities.

Bio:

Joseph A. Konstan is Professor of Computer Science and Engineering at the University of Minnesota. His research addresses a variety of human-computer interaction issues, including personalization (particularly through recommender systems), eliciting on-line participation, and designing computer systems to improve public health. He is probably best known for his work in collaborative filtering recommenders (the GroupLens project), and for his work in online HIV prevention. He is co-author of Word of Mouse: The Marketing Power of Collaborative Filtering, a book that reviews three dozen good and poor examples of personalization in research and deployed systems.

Dr. Konstan received his Ph.D. from the University of California, Berkeley in 1993. He is an ACM Distinguished Scientist. Previously, he has served as an ACM Distinguished Lecturer and an IEEE Distinguished Visitor. Dr. Konstan served from 2003-2006 as President of ACM SIGCHI, the 4500-member Special Interest Group on Human-Computer Interaction. He currently serves as Chair of ACM’s Special Interest Group Governing Board, and as a member of ACM’s Executive Committee. Dr. Konstan is an active consultant who has worked for more than 15 companies on issues related to human-computer interaction, personalization, and general software issues. He has traveled and lectured extensively, giving over 200 talks in more than 25 countries worldwide.

02/05/08 (10:30am, SVC-6/Titan)

S.Sudarshan

Keyword Search on Graph-Structured Data

A variety of types of data, such as relational, XML and HTML data can be naturally represented using a graph model of data, with entities as nodes and relationships as edges. Graph representations are also a natural choice for representing information extracted from unstructured data, and for representing information integrated from heterogeneous sources of data.

Keyword search is a natural way to retrieve information from graph representations of data, especially in the common situation where the graphs do not have a well-defined schema. Unlike text or Web search, there is no natural notion of a document, and information about a single conceptual entity may be split across multiple nodes. Answers to queries are therefore usually modeled as trees that connect nodes matching the keywords, with each answer tree having an associated score. The goal of keyword search then is to find answers with the highest scores. A number of systems, including the BANKS system developed at IIT Bombay, are based on such a model for answering keyword queries.

In this talk we first outline background material on keyword querying on graph data, including models for ranking of answer trees. We then focus on search algorithms for finding top-ranked answers. We outline key issues in finding top-ranked answers, and present algorithms that address the problem in the context of in-memory graphs. We then consider the problem of search on external memory graphs, and briefly outline our on-going work on external memory graph search based on a multi-granular graph representation.

Bio:

S. Sudarshan received the Ph.D. from the Univ. of Wisconsin, Madison in 1992. He was a Member of the Technical Staff in the database research group at AT&T Bell Laboratories, from 1992 to 1995, and he has been at the Indian Institute of Technology (IIT), Bombay since 1995. He spent a year on sabbatical at Microsoft Research, USA in 2004-05. Sudarshan's research interests center on database systems, and his current research projects include keyword querying on databases, processing and optimization of complex queries, and database support for securing and testing applications. Sudarshan is a co-author of the widely used textbook, Database System Concepts, 5th Ed., by Silberschatz, Korth and Sudarshan.

10/19/07 (1:30pm, SVC-5/2200 Neptune)

Yuri Lifshits

Similarity Search: A Web Perspective

Similarity search is the problem of preprocessing a database of N objects in such a way that given a query object, one can effectively determine its nearest neighbors in database.

Similarity search is closely connected to many algorithmic problems in the web. In this talk we will focus on personalized news aggregation (searching for news articles that are most similar to the user's profile of interests) and behavioral targeting (searching for the most relevant user for displaying a given advertisement). Other web-applications include near-duplicate detection, (2) finding related pages, classiffcation/fitering tasks like detecting spam pages, spelling correction, sense disambiguation, computing co-occurrence similarity, social network analysis (e.g. suggesting new friends), recommendation systems.

We describe features that make web applications somewhat different from previously studied models. Thus we re-examine the formalization and the classical algorithms for similarity search. This leads us to new algorithms (we present two of them) and numerous open problems in the field.

Bio:

Yury Lifshits obtained his PhD degree from Steklov Institute of Mathematics at St.Petersburg (2007). He is currently a postdoc at Caltech. He won two gold medals at International Mathematical Olympiads, received The Best Paper Award in Application Track of CSR'07 and The Yandex Teaching Award (2006) for his course "Algorithms for Internet".

Yury Lifshits Homepage: http://yury.name

The Homepage of Nearest Neighbors (maintained by Yury): http://simsearch.yury.name

08/24/07 (1:30pm, SVC-5/2200 Neptune)

Chen Li

VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams

Many applications have the emerging need to answer approximate string queries efficiently. Such a query can ask for strings that are similar to a given string, such as "names similar to Schwarzenegger" and "telephone numbers similar to 412-0964," where "similar to" uses a predefined, domain-specific function to specify the similarity between strings, such as edit distance. We will report our recent results on solving related problems. We will focus on a new technique, called VGRAM. It improves those algorithms that use fixed-length grams, which are substrings of a string used as signatures to identify similar strings. The main idea of VGRAM is to judiciously choose high-quality grams of variable lengths from a collection of strings to support queries on the collection. We give a full specification of this technique, including how to select high-quality grams from the collection, how to generate variable-length grams for a string based on the preselected grams, and what is the relationship between the similarity of the gram sets of two strings and their edit distance. A primary advantage of the technique is that it can be adopted by a plethora of approximate string algorithms without the need to modify them substantially. We present our extensive experiments on real data sets to evaluate the technique, and show the significant performance improvements on three existing algorithms. Related results have appeared in recent VLDB papers.

Bio:

Chen Li is an associate professor in the Department of Computer Science at the University of California, Irvine. He received his Ph.D. degree in Computer Science from Stanford University in 2001, and his M.S. and B.S. in Computer Science from Tsinghua University, China, in 1996 and 1994, respectively. He received a National Science Foundation CAREER Award in 2003. He is currently a part-time Visiting Research Scientist at Google. His research interests are in the fields of database and information systems, including data cleansing, data integration and sharing, data warehousing, and data privacy.

08/06/07 (1:30pm, SVC-5/2200 Neptune)

Juliana Fereire

Discovering and Organizing Hidden-Web Sources

The hidden Web consists of sites whose contents typically reside in databases and are only exposed on demand, as users fill out and submit forms. As the volume of information in the hidden Web grows, there is an increased need for techniques and tools that allow users and applications to uncover and leverage this information. Several applications attempt to make the hidden information more easily accessible, including metasearchers, hidden-Web crawlers, online database directories, and Web information integration systems. Since, for a particular domain of interest, there are many hidden-Web sources with varied coverage that need to be integrated or searched, a key requirement for these applications is the ability to locate relevant sources.

At the University of Utah, we are developing a a scalable infrastructure that automates, to a large extent, the process of discovering and organizing hidden-Web sources. In this talk, I will describe two important components of this infrastructure: an adaptive Web crawler that automatically learns patterns of promising links and adapts its focus as the crawl progresses substantially improving harvest rates, i.e., the number of relevant forms retrieved per pages crawled; and a classifier ensemble that combines different learning techniques to accurately identify forms as belonging to a given domain.

Joint work with Luciano Barbosa. This work has been partially funded by the National Science Foundation and a University of Utah Seed Grant.

Bio:

Juliana Freire joined the faculty of the School of Computing at the University of Utah in July 2005. Before, she was member of technical staff at the Database Systems Research Department at Bell Laboratories (Lucent Technologies) and an Assistant Professor at OGI/OHSU. Juliana's research has focused on extending traditional database technology and developing techniques to address new data management problems introduced by the Web and scientific applications. She is an active member of the database community, having co-authored over 60 technical papers and holding 4 U.S. patents. She has participated as a program committee member in over 40 events. She is a vice-chair of WWW2008; served as vice-chair for ICDE2007 and WWW2005; and co-chaired WebDB2003. Her research has been funded by grants from the National Science Foundation, Department of Energy and the University of Utah.

08/01/07 (1:30pm, SVC-5/2400 Gulf)

Nick Koudas

BlogScope: A system for the analysis of the Blogosphere

BlogScope (www.blogscope.net), is a system for the analysis of the Blogosphere currently under development at the University of Toronto. BlogScope is an information discovery and text analysis system that offers a set of unique features. Such features include, spatio-temporal analysis of blogs, flexible navigation of the Blogosphere through information bursts, keyword correlations and burst synopsis, as well as enhanced ranking functions for improved query answer relevance. I will describe the system in its current state, our ongoing work in progress and detail our future plans and research agenda. I will also describe SPIDER a flexible string matching system, outline its functionality and touch on the principles of the underlying technology.

If time permits I will show demos of both systems.

Bio:

Nick Koudas is a professor at the department of Computer Science at the University of Toronto. Prior to joining UofT he was a principal member of technical staff at ATT Research and an adjunct professor at Columbia University. He works on web data management, data analysis, performance and algorithms.

07/19/07 (3:00pm, SVC-5/2400 Gulf)

Soumen Chakrabarti

Ranking for keyword queries in entity-relationship graphs: Machine learning and performance issues

This talk In statistical machine learning, graphs are increasingly used as a powerful common representation to model complex interactions between random variables. In contrast, in the Web community, the predominant graph is the Web graph, and the predominant function served by the graph is PageRank-style ranking. The integration of Web, database and IR technologies promises a far more ambitious agenda of representing, searching and mining richer data---XML, tables, social networks, blogs---in the form of unified and integrated entity-relationship (E-R) graphs with typed nodes and edges, where nodes and edges can also have text and structured attributes. We are researching two facets of E-R graph search.

First, how to automate the design of ranking functions in E-R graph search, and how to extend relevance feedback from the IR world to the E-R graph world? In PageRank, the link graph fixes the parameters of a Markovian walk. Instead, here we are interested in an inverse problem: given partial orders between nodes to be satisfied by the PageRank vector, learn the random walk parameters so that the full PageRank vector would complete the partial order to generalize to unseen preferences. Obviously, this can be done only if we assume some form of smoothness or regularity in the preferences. We will discuss three formulations involving graph Laplacians and network circulations, give algorithms and quote some generalization guarantees.

Second, having learnt the Markov walk parameters, we must answer queries over large E-R graphs within interactive time. Unlike static PageRanking, this is nontrivial if parts of the graph are materialized only at query time, which is inevitable if words and attribute values are transient nodes in the graph. What sort of preprocessed indices can we build to assist interactive speed dynamic graph conductance queries?

Can we set up a graceful tradeoff between preprocessing time, index space, query speed, and result accuracy? Can we harness query log statistics to optimize our indices, as is common in relational databases? We will give top-k graph conductance ranking algorithms that satisfy all the above properties.

(Joint work with Alekh Agarwal, Manish Gupta, Amit Pathak and Kriti Puniyani. Supported partly by IBM Research and Microsoft Research.)

Bio: http://www.cse.iitb.ac.in/~soumen/main/bio.html

06/19/07 (1:30pm, SVC-5/2200 Neptune)

Geoff Webb

Finding the Real Patterns

Pattern discovery is one of the fundamental tasks in data mining. Pattern discovery typically explores a massive space of potential patterns to identify those that satisfy some user-specified criteria. This process entails a huge risk (in many cases a near certainty) that many patterns will be false discoveries. These are patterns that satisfy the specified criteria with respect to the sample data but do not satisfy those criteria with respect to the population from which those data are drawn. This talk discusses the problem of false discoveries, and presents techniques for avoiding them.

Bio:

Geoff Webb holds a research chair in the Faculty of Information Technology at Monash University. Prior to Monash he held appointments at Griffith University and then Deakin University where he received a personal chair. His primary research areas are machine learning, data mining, and user modelling. He is widely known for his contribution to the debate about the application of Occam's razor in machine learning and for the development of numerous algorithms and techniques for machine learning, data mining and user modelling. His commercial data mining software, Magnum Opus, is marketed internationally by Rulequest Research. He is editor-in-chief of the highest impact data mining journal, Data Mining and Knowledge Discovery and a member of the editorial boards of Machine Learning, ACM Transactions on Knowledge Discovery in Data, and User Modeling and User-Adapted Interaction.

06/15/07 (1:30pm, SVC-5/2200 Neptune)

Hao Chen

From Forum Spam to Syndication-based Spam Infrastructure

Forum spam has become a major means of search engine spam. To evaluate the impact of forum spam on search quality, we have conducted a comprehensive study from three perspectives: that of the search user, the spammer, and the forum hosting site. We examine spam blogs and spam comments in both legitimate and honey forums. We propose context-based analyses, consisting of redirection and cloaking analysis, to detect spam automatically and to overcome shortcomings of content-based analyses. Our study shows that forum spamming is a widespread problem. Spammed forums, powered by the most popular software, show up in the top 20 search results for all the 189 popular keywords. On two blog sites, more than half (75% and 54% respectively) of the blogs are spam, and even on a major and reputably well maintained blog site, 8.1% of the blogs are spam. The observation on our honey forums confirms that most comment spam is meant to increase page rank rather than generate immediate traffic.

We have further observed that comment spam is often part of some syndication-based spam infrastructure, which connects spammers with advertisers using Web redirection. We propose a five-layer, double-funnel model for describing end-to-end redirection spam, present a methodology for analyzing the layers, and identify prominent domains on each layer using two sets of commercial keywords -- one targeting spammers and the other targeting advertisers. The methodology and findings are useful for search engines to strengthen their ranking algorithms against spam, for legitimate website owners to locate and remove spam doorway pages, and for legitimate advertisers to identify unscrupulous syndicators who serve ads on spam pages.

This is a joint work with Yi-Min Wang and Ming Ma at Microsoft Research, Redmond, and Yuan Niu and Francis Hsu at UC Davis.

Bio:

Hao Chen is an assistant professor at the University of California, Davis. He received his Ph.D. in Computer Science from the University of California, Berkeley in 2004. He has a broad range of interests in computer security, including Web security, cellular network security, firewall security, and software security. He received the National Science Foundation CAREER award in 2007.

05/25/07 (1:30pm, SVC-5/2200 Neptune)

Chris Olston

Navigation-Aided Retrieval

Users searching for information in hypermedia environments often perform *querying* followed by manual *navigation*. Yet, the conventional text/hypertext retrieval paradigm does not explicitly take post-query navigation into account. This talk proposes a new retrieval paradigm, called *navigation-aided retrieval* (NAR), which treats both querying and navigation as first-class activities. In the NAR paradigm, querying is seen as a means to identify starting points for navigation, and navigation is guided based on information supplied in the query. NAR is a generalization of the probabilistic information retrieval model.

Bio:

Christopher Olston is a senior research scientist at Yahoo! Research, after a stint as assistant professor at Carnegie Mellon University from 2003 to 2005. His research interests include data management and web search. Olston received his Ph.D. in 2003 from Stanford University, which is a really nice place to rollerblade.

05/18/07 (1:30pm, SVC-5/2200 Neptune)

Anthony Kosky

Exploring Heterogeneous Data bases and Web Data Sources with the OPM Multi-database Tools

The OPM Multi-database Query System was developed in the late 90's at Lawrence Berkeley National Lab, and later at Gene Logic, with collaboration from researchers at Glaxo Smith-Kline, as an extension of the Object Protocol Model (OPM) set of tools. The system was targeted at the primarily at the bioinformatics domain, and has been used internally within Gene Logic, for various projects within LBNL and as part of the TI-Net project at Glaxo Smith-Kline.

The OPM tools provided a basis for a federated database system in which the component data sources could be traditional relational databases, web data sources, XML or flat-file databases or external applications such as homology-search engines (BLAST). Data sources maintained independent schemas, but were viewed through a common data model (OPM) and supported a uniform query interface. A Java-applet based graphical user interface allowed users to navigate between data sources and dynamically create web query forms. The OPM Multi-database Query System included some additional novel features, such as the use of inter-database links for navigating between databases, and support for application specific data types (ASDTs), not found in other federated database systems, and other features, such as configurable query set specifications for web data sources, which have only recently appeared in the latest generation of mediated systems.

In this talk I will be providing an overview of the OPM multi-database tools, focusing on some of the novel and innovative features, and comparing the approach taken to the current generation of tools and approaches for federated database systems.

Bio:

Anthony Kosky received his PhD in Computer Science from the University of Pennsylvania in 1995. His thesis was entitled Transformating Databases with Recursive Data Structures, and looked at the interplay between data integration, database transformations and database constraints.

Subsequently he joined the Data Management Research and Development Group at Lawrence Berkeley National Laboratory in order to work on the OPM project to develop database development and query tools based on an object-relational mapping. In 1997 he joined Gene Logic Inc., together with other members of the OPM group, in order to commercialize the OPM tools for the bioinformatics and life sciences sectors. While at Gene Logic, he became involved in the GeneExpress and GX Connect projects to provide data management, exploration and analysis tools for large databases of gene expression and related genomics and clinical data.

Most recently he has been working for Axiope Inc., a spin-out of the University of Edinburgh, developing XML-based data management systems for bio-medical applications.

05/11/07 (1:30pm, SVC-5/2200 Neptune)

Ramakrishnan Akella

Towards Personalization and Service Platforms

We describe elements of our research as we consider developing Service Platforms for Collaborative Search. Our goal is to enable the more effective functioning of users who are looking at both web and internal document collections, with the intent to provide service to a set of demanding users with complex queries. We have designed user interactive models for information retrieval and document clustering. Two specific set of results we have obtained concern:

• Active Learning: We have incorporated a combination of diversity, retrieval engine scores and relevance feedback into an active learning algorithm. This algorithm significantly outperforms existing algorithms by up to 15% on TREC HARD data.

• Semi-supervised clustering: We have developed a more probabilistic model to incorporate must and must-not link constraints on documents in clusters. We develop a new semi-supervised clustering algorithm which builds on this, and Bayesian regularization. This algorithm outperforms those in the literature by 15% again on 20 News group data set.

We will also touch upon ongoing and proposed research:

• More principled models which build upon our current active learning model, including relevance feedback and subspace projection methods.

• Collaborative search, to share learning across users by analyzing the cooccurrence between user clusters, query clusters and web-page clusters.

• Social network analysis in enterprise search, discussing how interactions among customers and engineers can be used to improve search performance.

(Joint work with Zuobing Xu, Seinjuti Chakraborty, and Jyotsna Gangwar)

10/26/06 (3:00pm, SVC-5/2200 Neptune)

Sudarshan Murthy (Portland State University)

Sub-search: Searching at Sub-document Granularity

Search-engine users are often interested in finding relevant sub-documents (that is, document fragments), but search engines return document lists without much indication of why a returned document is relevant, and which sub-documents are relevant. Consequently, users perform intra-document searches to find relevant sub-documents on their own, and judge their relevance. For example, a user may employ the find function in a browser to examine the occurrences of a query term in an HTML document. This situation exists because current search systems essentially treat the document as both the unit of transfer and the unit of comprehension or consultation. Sub-search (that is, sub-document search) makes the sub-document a unit of comprehension or consultation.

Two improvements to current search systems can facilitate sub-search: First, enhance search results to provide insight into sub-documents that make a document relevant. Second, allow users to navigate directly from an enhanced search result to a sub-document of interest. In both cases, sub-documents should not be limited to those defined by document authors. Also, the improvements should work for any type of document. These improvements allow users to judge the relevance of sub-documents without opening documents, and reduce user-effort to view sub-documents in their original context.

We present one approach to sub-search based on the notion of superimposed information, and demonstrate a prototype implementation. We also outline alternative approaches, and introduce some ideas on aggregating and ranking sub-documents.

10/18/06 (2:00pm, SVC-5/2200 Neptune)

Heikki Mannila (Helsinki)

Algorithms for Discovering Bucket Orders from Data

We consider bucket orders, i.e., total orders with ties. They can be used to capture the essential order information without overfitting the data: they form a useful concept class between total orders and arbitrary partial orders. We address the question of finding a bucket order for a set of items, given pairwise precedence information between the items. We describe simple and efficient algorithms for finding good bucket orders. Several of the algorithms have a provable approximation guarantee, and they scale well to large datasets. We provide experimental results on artificial and a real data that show the usefulness of bucket orders and demonstrate the accuracy and efficiency of the algorithms.

09/26/06 (1:00pm, SVC-5/2200 Neptune)

Kwong Yung (IT.com)

Discovering Semantic Themes and Social Networks from Email Corpus

The Enron corpus of email messages is analyzed using several novel extensions of latent Dirichlet allocation, a Bayesian model of document generation. The body of an email message is viewed as the usual bag of words. In addition to the textual body, however, the author and the recipients of the email message also provide important clues about the content of the email message. Incorporating social network underlying the email corpus not only improves discovery of semantic themes in the corpus but also identifies roles of actors in the network. Leveraging both the semantic content and the social network underlying an email corpus makes information retrieval far more powerful. The standard LDA model is extended in two directions. In one direction, the author-recipient-topic model of McCallum 2005 is cast in a two-dimensional framework of horizontal and vertical segmentation of topics. In the other direction, the hierarchical model of Blei 2004 is relaxed by allowing multiple paths for each document. Combining these two threads of improvements on LDA results in advances on both theoretical and practical fronts.



About Search Labs
 
Projects
 
Happenings
 

©2008 Microsoft Corporation. All rights reserved. Terms of Use |Trademarks |Privacy Statement