Data-driven Research at Web Scale

This session will bring together a group of leaders in information retrieval and language modeling to discuss challenges in information retrieval and how language modeling approaches may help address some of them. The focus is on the use of n-gram models to further research in areas such as document representation and content analysis (e.g., clustering, classification, information extraction), query analysis (e.g., query suggestion, query reformulation), retrieval models and ranking, and spelling, as well as on access to n-grams as an enabler of experimental design. Previous efforts to deliver n-grams to the research community took a data-release approach with a cutoff on n-gram counts, which obscures long-tail effects; the service-based approach makes it possible to study them. Moreover, previous efforts focused on the document body alone, whereas the Web N-gram service includes richer types of textual content that can engage researchers in new innovations. The Web N-gram service provides access to petabytes of data, up to two orders of magnitude more than currently available offerings. Finally, by providing regular data refreshes, the Web N-gram service can open up new research directions in fields where the lack of dynamic data has locked academic researchers into working with static and stale data sets.

Speaker Details

Evelyne Viegas is responsible for the Online Technologies and Web Cultures initiative in the External Research & Programs team at Microsoft Research in Redmond, WA, U.S. Prior to her present role, Evelyne worked as a Technical Lead and Program Manager at Microsoft, delivering Natural Language Processing components to projects for MSN, Office, and Windows. Before Microsoft, and after completing her Ph.D. in France, she worked as a Principal Investigator at the Computing Research Laboratory in New Mexico on an ontology-based Machine Translation project. She has edited the books “Computational Lexical Semantics” (Cambridge University Press) and “Breadth and Depth of Semantic Lexicons” (Kluwer Academic Press). Her current research interests include approaches and experiences to make the web more “intelligent” and safer, with a focus on finding information, whether sitting at the desktop or on the move.

Haym Hirsh is Professor of Computer Science at Rutgers University, and a Visiting Scholar at MIT’s Sloan School of Management and Center for Collective Intelligence. From 2006 to 2010 he served as Director of the Division of Information and Intelligent Systems at the National Science Foundation, and he has previously held visiting positions at Bar-Ilan University, CMU, NYU, and the University of Zurich. His research is on the foundations and applications of machine learning, data mining, and information retrieval. Haym received his BS degree from the Mathematics and Computer Science Departments at UCLA and his MS and PhD from the Computer Science Department at Stanford University.

I have an MSc in Mathematics and a PhD in linguistics, and I worked as a freelance translator in between. In the end, I came to work in computational linguistics. My research interests combine three domains: linguistics, computer science, and communication studies.

Probably the most interesting part of my recent research is the automatic acquisition of representative corpora from the Web and their analysis in terms of the distribution of domains and genres. Another recent development is an umbrella of projects aimed at finding translation equivalents for terminology and the general lexicon from such corpora.

Date:
Speakers:
Evelyne Viegas, Haym Hirsh, and Serge Sharoff
Affiliation:
MSR, Rutgers University, University of Leeds