Participants are welcome to explore other Web retrieval issues, such as distributed IR, queries with misspellings and efficient indexing, within the context of these experiments.
There are two datasets, one for the mixed query task and one for the enterprise task.
The corpus for the mixed query task is the .GOV test collection, distributed by CSIRO.
The corpus for the enterprise task is the W3C test collection, available for download.
Both collections include content (and HTTP headers) retrieved during a breadth-first crawl, and both include the extracted text of PDF and DOC files found during crawling. They are both naturalistic crawls of the kind that might be used in a real .GOV or W3C search service (respectively). They differ in that the .GOV crawl is much larger, while the W3C data is more homogeneous, dealing with a smaller organisation which has organisation-specific publishing patterns. W3C patterns include publishing of web-based email and CVS archives, keeping old versions of some documents online on the main web site, inclusion of personal homepages and probably a better standard of URL readability and cross-linking than in some organisations.
This task involves a single query set comprising: 75 topic distillation queries, 75 homepage finding queries and 75 named page finding queries. (All three types -- TD, HP and NP -- featured in last year's Web Track.) Since queries of the three types are mixed and unlabelled, participants must, for example, find a robust ranking method which works well for all three, or find ways of mixing different methods (one simple mixing scheme is sketched below).
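For illustration only, one simple way to mix methods is a per-query linear combination of normalised scores from two rankers (CombSUM-style fusion). The sketch below is ours, not part of the guidelines; the function names, the min-max normalisation and the alpha parameter are all assumptions.

    def minmax(scores):
        # Min-max normalise a {docno: score} dict to [0, 1] (assumed choice).
        lo, hi = min(scores.values()), max(scores.values())
        if hi == lo:
            return {d: 1.0 for d in scores}
        return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

    def fuse(scores_a, scores_b, alpha=0.5):
        # Linear mixture of two methods' scores for one query, e.g. a
        # TD-oriented ranker and an NP/HP-oriented ranker. alpha is a
        # hypothetical tuning parameter.
        a, b = minmax(scores_a), minmax(scores_b)
        return {d: alpha * a.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0)
                for d in set(a) | set(b)}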
Consider the query 'USGS', which is the acronym for the US Geological Survey. Some interpretations of this query are:
These all differ from a TREC adhoc interpretation of 'usgs', which might be: find me pages about usgs (i.e. containing relevant text). Web tasks differ from adhoc in a number of ways:
For more information on tasks, see last year's guidelines.
Some Topic Distillation History
Before TD began, we had the idea that "distillation" involves finding a short results list and that most answers would be homepages of relevant sites dedicated to the topic.
The TD2002 guidelines asked for 'a short list of key resources'. The lesson we learned was that it is more important to concentrate on the homepage idea, because otherwise judges will 1) choose queries which have no dedicated .GOV sites and 2) judge such that most correct answers are non-homepages.
In TD2003 we required answers to be "a good entry point to a website principally devoted to the topic", i.e. homepages of relevant sites. Judges were also asked to pick topics which have at least one .GOV site. We still get a 'short list', but 'short' is not mentioned in the guidelines. Perhaps the short list is a side effect of preferring homepages rather than a goal in itself, if distillation amounts to site finding.
NIST will develop the topics. Example topic format:
<top>
<num> Number: WT04-1
<title> usgs
</top>
Only the title field should be supplied to your system as the query.
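For illustration, here is a minimal Python sketch of extracting the topic number and title from a file of topics in the format above; the function name and regular expressions are our assumptions, not part of the guidelines.

    import re

    def parse_topics(text):
        # Return (number, title) pairs from topics in the format above.
        topics = []
        for block in re.findall(r"<top>(.*?)</top>", text, re.DOTALL):
            num = re.search(r"<num>\s*Number:\s*(\S+)", block).group(1)
            title = re.search(r"<title>\s*([^<\n]+)", block).group(1).strip()
            topics.append((num, title))
        return topics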
These measures will be averaged across all three query types. There will also be some analysis by query type, to see, for example, whether reasonable performance is maintained for each. Within-type analysis will use the above measures, as well as others such as P@10.
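A within-type P@10 analysis might look like the following Python sketch, assuming simple in-memory representations of the run, the qrels and the (withheld) query type labels; all names here are illustrative.

    from collections import defaultdict

    def p_at_10(ranked, relevant):
        # Precision at rank 10 for a single query.
        return sum(1 for d in ranked[:10] if d in relevant) / 10.0

    def per_type_p10(run, qrels, qtype):
        # run:   {topic: [docno, ...] in rank order}
        # qrels: {topic: set of relevant docnos}
        # qtype: {topic: 'TD' | 'HP' | 'NP'}
        by_type = defaultdict(list)
        for topic, ranked in run.items():
            by_type[qtype[topic]].append(p_at_10(ranked, qrels.get(topic, set())))
        return {t: sum(v) / len(v) for t, v in by_type.items()}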
In addition to retrieval runs, participants interested in query classification may submit classification runs, consisting of a label for each query: NP, HP or TD.
In last year's Web Track workshop, significant interest was shown in enterprise search, including possibly email search. Using Enron data was suggested, but it was difficult to obtain and consisted of scanned and OCRed text rather than more usual enterprise formats. Luckily W3C were willing to make their data available for experimentation, and this also provides a more positive and web-savvy corpus.
The collection, which has N documents and is M megabytes (X when compressed), can be downloaded here. [recrawl+rebuild underway, more info very soon]
Querying this data is "unusual for TREC" in two ways. First, since the content is quite technical, it is probably impossible for NIST assessors to construct and judge queries representative of real W3C users. Second, it is unclear what information needs are specific to email search (or wiki search or CVS archive search), or to enterprise search in general. It would be nice to think of appropriate enterprise information needs, in the same way that homepage finding was appropriate for Web data, but so far we don't know what these information needs are.
The solution to both these problems is to allow participants to generate topics -- inventing information needs in whichever way seems sensible -- and judge the results using a simple web-based judging interface. Thus participants explore not only what works in enterprise search, but also what types of search make sense.
For topic generation, we assume a simple keyword search with a dropdown list for search scope. The scope can be one of:
Topics should be of the form:
<top>
<num> Number: 1
<scope> all|www|dev|people|esw|lists
<title> entity ampersand
<desc> Description:
Is ampersand always represented the same way? Relevant documents will describe or discuss ways of representing the ampersand character in e.g. HTML and XML.
</top>
The title and scope (but not desc) can be supplied to your system as the query.
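One plausible way to honour the scope field is to filter results by the subcollection each document belongs to. The sketch below assumes a docno-to-subcollection mapping, which is our assumption about how a system might be organised, not something the guidelines prescribe.

    def apply_scope(ranked, doc_scope, scope):
        # ranked: docnos in rank order; doc_scope: assumed mapping from
        # docno to its W3C subcollection ('www', 'dev', 'people', 'esw'
        # or 'lists'). A scope of 'all' keeps everything.
        if scope == "all":
            return ranked
        return [d for d in ranked if doc_scope.get(d) == scope]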
Relevance judgments will be on a 4-point scale:
In cases where there are no degrees of relevance, e.g. known item search, this can be collapsed into two points: 0 and 3.
We will attempt to strike a balance between completeness of judgments and effort required from participants, perhaps by sacrificing completeness.
There are no indexing restrictions. You may index all of each document or exclude certain fields as you wish.
It is permissible to submit runs based on manual querying/interaction. Automatic runs (produced with no manual intervention whatsoever) are the focus of evaluation, but manual runs can be interesting and are encouraged, since they increase the diversity of the judged document pool. Manual runs should be:
It would also be interesting to submit a manual query classification in the mixed query task, to see whether people can distinguish between TD, HP and NP queries.
Each line of a retrieval run has the form:

topic-id Q0 docno rank sim tag

topic - topic number
Q0 - unused field (the literal `Q0')
docno - document id taken from the DOCNO field of the text
rank - rank assigned to the document
sim - similarity computed between the document and the topic
tag - run tag

If you choose to submit a query classification, you may submit up to 5 runs, and the format is:
topic type tag

topic - topic number
type - 'HP' or 'NP' or 'TD'
tag - run tag
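As a worked example, the following Python sketch writes both file types in the formats above; the dict representations, paths and run tag are illustrative assumptions.

    def write_retrieval_run(results, tag, path):
        # results: {topic: [(docno, sim), ...] in descending sim order}.
        with open(path, "w") as f:
            for topic, ranked in sorted(results.items()):
                for rank, (docno, sim) in enumerate(ranked, start=1):
                    f.write(f"{topic} Q0 {docno} {rank} {sim:.4f} {tag}\n")

    def write_classification_run(labels, tag, path):
        # labels: {topic: 'HP', 'NP' or 'TD'}.
        with open(path, "w") as f:
            for topic, qtype in sorted(labels.items()):
                f.write(f"{topic} {qtype} {tag}\n")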
nickcr at microsoft . com
david . hawking at csiro . au