Web N-gram Workshop

July 23, 2010 – Geneva, Switzerland

This workshop brought together leaders in information retrieval and language modeling to discuss open challenges in information retrieval and how language modeling approaches may help address them. We focused on the use of n-gram models to further research in areas such as document representation and content analysis, query analysis, retrieval models and ranking, and spelling, as well as on access to n-grams as an enabler of experimental design.

Workshop Aims

The aim of the workshop is to bring together a group of leaders in information retrieval and language modeling to discuss open challenges in information retrieval and how language modeling approaches may help address them. At the workshop, we will focus on the use of n-gram models to further research in areas such as document representation and content analysis (e.g., clustering, classification, information extraction), query analysis (e.g., query suggestion, query reformulation), retrieval models and ranking, and spelling, as well as on access to n-grams as an enabler of experimental design.

The research community often discusses the lack of large-scale datasets and benchmarks for running experiments. This workshop will address this issue by bringing together researchers who use the n-grams already made available by Yahoo and Google/LDC, along with a new Web N-gram service through which Microsoft Research, in partnership with Microsoft Bing, is providing the research community with access to petabytes of Web N-gram data via a cloud-based platform.

The Web N-gram services directly address the data need by enabling the community of researchers to create data benchmarks for repeatable experiments, and by enabling the research community to be at the forefront of inventions based on real-world, large-scale data.

The Microsoft Web N-gram services, currently in Beta, will be made available to participants upon request.

Previous efforts to deliver n-grams to the research community adopted a data-release approach with a cutoff on n-gram counts, which obscures long-tail effects; the service-based approach makes those effects available for further study. Moreover, previous efforts focused on the document body alone, whereas the Web N-gram service includes richer types of textual content that can engage researchers in new lines of work.

Another notable difference is the scale: the Web N-gram service provides access to petabytes of data, up to two orders of magnitude more than currently available offerings. Finally, by providing regular data refreshes, the Web N-gram service can open up new research directions in fields where the lack of dynamic data has locked academic researchers into working with static and stale data sets.

Topics

We are now requesting paper submissions for the Web N-gram Workshop.

We encourage researchers to use the Microsoft Web N-gram services to explore novel applications of language models (e.g., long-tail effects) and the use of these data to enhance the search experience (e.g., use of anchor text as a proxy for queries). We will also consider other applications such as machine translation and speech.

If you would like to use the Microsoft Web N-gram services in preparing your paper, send an e-mail message to webngram@microsoft.com to request access.

We also encourage research and experiments that use or compare different n-gram data sets, with the goal of creating, at the workshop, a useful n-gram baseline for the research community in terms of the attributes researchers need, such as size, access, content, and model types.

For more information, see Submissions.

Planned Activities

As part of the workshop, experimental results will be presented in talks (15 minutes per talk on average, plus 5 minutes for questions and answers) and in poster and/or demo sessions. In addition, there will be a panel discussion on providing access to data, with a focus on the needs of academia and on the challenges and opportunities for industry in providing such data.

Contact Information

For information, send an e-mail message to ngramwkp@microsoft.com.

Venue

The Web N-gram workshop is being held as part of SIGIR 2010.

The workshop will take place in Geneva, Switzerland. Further information on the venue can be found on the SIGIR 2010 Venue site.