A Challenge Set for Advancing Language Modeling

Geoffrey Zweig and Chris J.C. Burges

Abstract

In this paper, we describe a new, publicly available corpus intended to stimulate research into language modeling techniques that are sensitive to overall sentence coherence. The task uses the Scholastic Aptitude Test’s sentence completion format. The test set consists of 1040 sentences, each of which is missing a content word. The goal is to select the correct replacement from among five alternatives. In general, all of the options are syntactically valid and reasonable with respect to local N-gram statistics. The set was generated by using an N-gram language model to produce a long list of likely words, given the immediate context. These options were then hand-groomed to identify four decoys which are globally incoherent, yet syntactically correct. To ensure the right to public distribution, all the data is derived from out-of-copyright materials from Project Gutenberg. The test sentences were derived from five of Conan Doyle’s Sherlock Holmes novels, and we provide a large set of nineteenth- and early twentieth-century texts as training material.
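To make the task format concrete, the following is a minimal sketch of a local N-gram baseline: it fills the blank with each of the five options and picks the one whose completed sentence scores highest under an add-one smoothed bigram model. It is not the authors' method or data. The toy training sentences, the `_____` blank marker, and the function names `train_bigram`, `log_prob`, and `choose` are all assumptions made for illustration.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def log_prob(tokens, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a tokenized sentence."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    total = 0.0
    for prev, word in zip(padded, padded[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total

def choose(template, candidates, unigrams, bigrams, blank="_____"):
    """Return the candidate whose filled-in sentence scores highest."""
    def score(candidate):
        filled = [candidate if tok == blank else tok for tok in template]
        return log_prob(filled, unigrams, bigrams)
    return max(candidates, key=score)

# Toy, made-up training text (hypothetical; not the Project Gutenberg corpus).
corpus = [
    "the detective examined the room".split(),
    "the detective examined the letter".split(),
    "the doctor examined the patient".split(),
    "the detective smoked a pipe".split(),
]
unigrams, bigrams = train_bigram(corpus)

# One sentence with a missing word and five options, mirroring the task format.
question = "the detective _____ the room".split()
options = ["examined", "smoked", "patient", "letter", "pipe"]
answer = choose(question, options, unigrams, bigrams)
print(answer)
```

In this toy example the decoys are easy to reject from local statistics alone. In the corpus described above, the decoys were chosen to be plausible under local N-gram statistics, so a baseline of this kind would be expected to struggle, which is the motivation for models that capture sentence-level coherence.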

Details

Publication type: Inproceedings
Published in: Workshop on the Future of Language Modeling for HLT, NAACL-HLT 2012
Publisher: ACL/SIGPARSE