This download is provided for the purpose of the Speller Challenge. This is a development dataset based on the publicly available TREC queries (2008 Million Query Track). Queries are annotated by using the same guidelines and processes as in the creation of the Bing Test Dataset.
Note: By installing, copying, or otherwise using this software, you agree to be bound by the terms of its license. Read the license.
Using the dataset with the Speller Challenge
About Speller Challenge TREC Data
For the purpose of the Speller Challenge, a development dataset based on the publicly available TREC queries (2008 Million Query Track) was annotated using the same guidelines and processes as in the creation of the Bing Test Dataset.
The TREC Evaluation Dataset is constructed as follows.
- First, a set of queries is sampled from the 2008 Million Query Track dataset.
- Then, all the queries that are clearly URLs or email addresses are removed.
- The remaining queries are then normalized as follows: text is lower-cased; numbers are retained, but other non-alphabetic characters are removed; all punctuation marks (including hyphens, apostrophes, and underscores) are replaced with white space; and the remaining text is tokenized on white space.
- The spelling of each normalized query in the dataset is manually judged and corrected by up to 3 trained, independent experts. The guidelines encourage the experts to annotate multiple plausible spelling variations for each query so as to minimize any bias towards or against any typographic style. For example, we retain all common spelling variations such as “brittany spears” versus “britney spears”. We also accept common variants for word breaks such as “nonprofit” versus “non profit”, and “webpage” versus “web page”, assuming that there is no systematic preference for one variant over the others.
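The filtering and normalization steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the procedure actually used to build the dataset: the URL and email patterns are hypothetical heuristics, the order in which punctuation is replaced and other characters are dropped is a guess, and only ASCII letters and digits are kept.

```python
import re
import string

# Hypothetical heuristics: the description does not say how URLs and
# email addresses were recognized.
URL_RE = re.compile(r"^(https?://|www\.)\S+$|^\S+\.(com|org|net|edu|gov)(/\S*)?$")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Map every ASCII punctuation mark (hyphen, apostrophe, underscore, ...) to a space.
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def is_url_or_email(query):
    """Return True if the whole query looks like a URL or an email address."""
    q = query.strip().lower()
    return bool(URL_RE.match(q) or EMAIL_RE.match(q))


def normalize_query(query):
    """Lower-case, replace punctuation with spaces, drop other symbols, tokenize."""
    text = query.lower()
    text = text.translate(PUNCT_TO_SPACE)
    # Assumption: anything that is not an ASCII letter, digit, or white space is removed.
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # Tokenize on white space and re-join with single spaces.
    return " ".join(text.split())
```

Under these assumptions, `normalize_query("Non-Profit  O'Reilly_2008!")` yields `"non profit o reilly 2008"`, and a query such as `"www.example.com"` would be dropped by `is_url_or_email` before normalization.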
Each line of the data file has the following format:
query <tab> suggestion1 <tab> suggestion2 …
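A line in this format could be read with a small helper like the one below. The function name `parse_line` is illustrative, and the handling of trailing tabs and surrounding white space is an assumption rather than something the dataset description specifies.

```python
def parse_line(line):
    """Split one tab-separated line into (query, [suggestion1, suggestion2, ...])."""
    fields = [field.strip() for field in line.rstrip("\n").split("\t")]
    # Assumption: empty fields (e.g. from a trailing tab) carry no information.
    fields = [field for field in fields if field]
    if not fields:
        raise ValueError("empty line")
    return fields[0], fields[1:]
```

For example, `parse_line("web page\tweb page\twebpage\n")` returns `("web page", ["web page", "webpage"])`.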
Here are some statistics of the dataset.
- Number of queries: 5892
- Number of queries that are judged as misspelled: 311
- Number of queries that have at least one spelling suggestion that is different from the original query: 1122
- Number of queries that have 1 spelling suggestion: 5030
- Number of queries that have 2 spelling suggestions: 824
- Number of queries that have 3 spelling suggestions: 35
- Number of queries that have 4 spelling suggestions: 3
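Counts of this kind can be recomputed from the data file once each line has been parsed into a (query, suggestions) pair. The sketch below is one way to do that. The function name `summarize` is illustrative, and treating a query as "misspelled" when it does not appear among its own suggestions is an assumption about how that figure was derived, not something the description states.

```python
from collections import Counter


def summarize(records):
    """Compute summary counts from an iterable of (query, suggestions) pairs."""
    records = list(records)
    # How many queries have 1, 2, 3, ... spelling suggestions.
    by_suggestion_count = Counter(len(suggestions) for _, suggestions in records)
    # Queries with at least one suggestion that differs from the original query.
    with_different = sum(
        1 for query, suggestions in records if any(s != query for s in suggestions)
    )
    # Assumption: a query counts as misspelled if it is absent from its suggestions.
    not_in_suggestions = sum(
        1 for query, suggestions in records if query not in suggestions
    )
    return {
        "queries": len(records),
        "with_different_suggestion": with_different,
        "query_not_in_suggestions": not_in_suggestions,
        "by_suggestion_count": dict(by_suggestion_count),
    }
```

As a sanity check on the published figures, the per-count rows should add up to the total number of queries: 5030 + 824 + 35 + 3 = 5892.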