TREC-2004 Web Track Guidelines

1. This Year's Aims

  1. Mixed query task: Evaluate web search with a mixed query stream: 75 homepage finding, 75 named page finding and 75 topic distillation queries. Find approaches which work well over the 225 queries, without knowledge of which is which.
  2. Increase the available queries/judgments for the .GOV test collection.
  3. Enterprise task: Begin exploring retrieval on an enterprise test collection: a crawl of The World Wide Web Consortium (W3C). An important feature of this data is that -- although it addresses a single organisation -- it contains several distinct types of Web page: an official web site, a wiki web, a web interface to their public cvs tree and a web email archive (of public lists). There is also the opportunity to study search on subsets (e.g. email only). The test collection will be downloadable via HTTP. Because W3C documents are too technical for NIST assessors, we plan that participating groups will each generate a small number of queries and judge relevance using a simple Web UI.

Participants are welcome to explore other Web retrieval issues, such as distributed IR, queries with misspellings and efficient indexing, within the context of these experiments.

2. Datasets

There are two datasets, one for the mixed query task and one for the enterprise task.

The corpus for the mixed query task is the .GOV test collection, distributed by CSIRO.

The corpus for the enterprise task is the W3C test collection, available for download.

Both collections include content (and HTTP headers) retrieved during a breadth-first crawl. Both include the extracted text of PDF and DOC files found during crawling. They are both naturalistic crawls which might be used in a real GOV or W3C search service (respectively). They differ in that the .GOV crawl is much larger, while the W3C data is more homogeneous, dealing with a smaller organisation which has organisation-specific publishing patterns. W3C patterns include publishing of web-based email and cvs archives, keeping old versions of some documents online on the main web site, inclusion of personal homepages, and probably a better standard of URL-readability and cross-linking than in some organisations.

3. Mixed query task

This task involves a single query set comprising 75 topic distillation queries, 75 homepage finding queries and 75 named page finding queries. (All three types -- TD, HP and NP -- featured in last year's Web Track.) Since queries of the three types are mixed and unlabelled, participants must, for example, find a robust ranking method which works well for all three, or find ways of mixing different methods.

Consider the query 'USGS', which is the acronym for the US Geological Survey. Some interpretations of this query are:

Homepage finding
Find me the URL of the USGS homepage ( I have forgotten or don't know that URL, or I prefer typing 'usgs' to typing the full URL.
Topic distillation
Find me homepages of sites describing aspects of the US Geological Survey. Possible answers include,,, (Another example: 'literacy', might mean find me .GOV sites about literacy, with answers including, and
Named page finding
Find me the URL of a non-homepage e.g. searching for with the query 'introduction to usgs'. The query is the name of the page in question (rather than e.g. words describing its topic).

These all differ from a TREC adhoc interpretation of 'usgs', which might be: find me pages about usgs (i.e. containing relevant text). Web tasks differ from adhoc in a number of ways:

Homepage bias
In TD and HP tasks, a homepage is a good answer, because the user has a search and browse strategy. The strategy is to perform a simple query, by general topic or by site name, then if the results contain a good site, to browse/interact with that site. The function of a homepage is to act as the entry point for the useful site.
Query by name
In HP and NP tasks, the query is the name of the site or page in question. For the USGS site it's 'usgs' or 'us geological survey', rather than 'Earth resources environment survey' or 'federal geological agency'. In general these would be valid queries for known item search, but they're not "query by name".
TD topic selection
Since the idea of TD is to find relevant sites, it only applies when such sites exist. Many topics like 'impact exchange rates cotton industry' have relevant pages, but no dedicated .GOV sites, so shouldn't be used in a TD task. Only some topics have relevant .GOV sites e.g. 'cotton industry'.

For more information on tasks, see last year's guidelines.

Some Topic Distillation History

Before TD began, we had the idea that "distillation" involves finding a short results list and that most answers would be homepages of relevant sites dedicated to the topic.

The TD2002 guidelines asked for 'a short list of key resources'. The lesson we learned was that it's more important to concentrate on the homepage idea, because otherwise judges will 1) choose queries which have no dedicated .GOV sites and 2) judge such that most correct answers are non-homepages.

In TD2003 we required answers to be "a good entry point to a website principally devoted to the topic" i.e. homepages of relevant sites. Judges were also asked to pick topics which have at least one .GOV site. We still get a 'short list', but 'short' is not mentioned in the guidelines. If distillation = site finding, perhaps the short list is a side effect of preferring homepages, rather than a goal in itself.

NIST will develop the topics. Example topic format:

<num> Number: WT04-1
<title> usgs

The title field only should be supplied to your system as the query.
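
As an illustration of the topic format above, the following is a minimal sketch of reading topics and extracting the title as the query. The `Topic` record and the `parse_topics` function are hypothetical names introduced here, and the sketch assumes each topic consists of exactly a `<num>` line followed by a `<title>` line, as in the example.

```python
import re
from typing import List, NamedTuple


class Topic(NamedTuple):
    number: str  # e.g. "WT04-1"
    title: str   # the only field supplied to the system as the query


def parse_topics(text: str) -> List[Topic]:
    """Collect (<num>, <title>) pairs from topic text.

    Assumes the two-line layout shown in the example topic; any other
    lines are ignored.
    """
    topics = []
    number = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        num_match = re.match(r"<num>\s*Number:\s*(\S+)", line)
        if num_match:
            number = num_match.group(1)
            continue
        title_match = re.match(r"<title>\s*(.+)", line)
        if title_match and number is not None:
            topics.append(Topic(number, title_match.group(1).strip()))
            number = None
    return topics


sample = """<num> Number: WT04-1
<title> usgs
"""
topics = parse_topics(sample)
queries = [topic.title for topic in topics]  # title only, per the rule above
```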


Average precision
Average precision is a standard TREC measure. It can be applied to topic distillation queries and is equivalent to MRR in cases where there is one relevant answer (HP and NP queries). It will be calculated on the whole run (1000 ranks), but the measure puts a natural emphasis on the top-ranked documents.
Success@5
Since NP and HP queries have one answer each, and distillation queries have only a few good answers each, a simple uniform measure is whether anything was found in the top 5 results. The top 5 is what might typically appear on the results page of a web search system, without the user needing to scroll ("above the fold"). If a correct answer appears in the top 5 for 90 of 225 queries, then S@5=0.4.
Success@1 and Success@10
It is desirable that the first result seen is a good answer (S@1). An "embarrassment factor" comes from finding nothing in the top 10, which typically is the first page of search results, so we also measure S@10.

These measures will be averaged across all three query types. Also, there will be some analysis by query type, to see e.g. whether reasonable performance is maintained for each. Within-type analysis will use the above measures, as well as others such as P@10.
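
The measures above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the official evaluation code: the function names and the toy document identifiers are invented here, and each ranking is assumed to list a document at most once.

```python
from typing import List, Set


def average_precision(ranking: List[str], relevant: Set[str], depth: int = 1000) -> float:
    """Average of precision values at the ranks where relevant documents occur.

    Relevant documents that are not retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking[:depth], start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def success_at(ranking: List[str], relevant: Set[str], k: int) -> float:
    """1.0 if any correct answer appears in the top k results, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranking[:k]) else 0.0


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# One relevant answer (HP or NP query): AP equals the reciprocal rank.
hp_ranking = ["d3", "d1", "d2"]
hp_ap = average_precision(hp_ranking, {"d1"})   # answer at rank 2 -> 1/2

# A few good answers (TD query): precision 1/1 at rank 1 and 2/3 at rank 3.
td_ap = average_precision(["a", "b", "c", "d"], {"a", "c"})

# S@1 fails and S@5 succeeds for the HP example above.
hp_s1 = success_at(hp_ranking, {"d1"}, 1)
hp_s5 = success_at(hp_ranking, {"d1"}, 5)

# The worked example from the text: success on 90 of 225 queries.
s_at_5 = mean([1.0] * 90 + [0.0] * 135)
```

Averaging the per-query values over all 225 queries gives the overall figure; restricting the average to the queries of one type gives the within-type analysis.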

In addition to retrieval runs, participants interested in query classification may submit classification runs, consisting of a label for each query: NP, HP or TD.

4. The enterprise task

In last year's Web Track workshop, significant interest was shown in enterprise search, including possibly email search. It was suggested to use Enron data, but this was difficult to get and consisted of scanned+OCRed text, rather than more usual enterprise formats. Luckily W3C were willing to make their data available for experimentation, and this also provides a more positive and web-savvy corpus.

The collection, which has N documents and is M megabytes (X when compressed), can be downloaded here. [recrawl+rebuild underway, more info very soon]

Querying this data is "unusual for TREC" in two ways. First, since the content is quite technical, it is probably impossible for NIST assessors to construct and judge queries representative of real W3C users. Second, it is unclear what information needs there are which are specific to email search (or wiki search or cvs archive search) or specific to enterprise search in general. It would be nice to think of appropriate enterprise information needs, in the same way that homepage finding was appropriate for Web data. But so far, we don't know what these information needs are.

The solution to both these problems is to allow participants to generate topics -- inventing information needs in whichever way seems sensible -- and judge the results using a simple web-based judging interface. Thus participants are exploring what works in enterprise search, but also what types of search make sense.

For topic generation, we assume a simple keyword search with a dropdown list for search scope. The scope can be one of:

the whole crawl
just the main site
the cvs interface at
the personal pages at
the wiki relating to the semantic web, quality assurance, rdf and other topics at
the email archives on the Web at

Topics should be of the form:

<num> Number: 1
<scope> all|www|dev|people|esw|lists
<title> entity ampersand
<desc> Description:
Is ampersand always represented the same way?
Relevant documents will describe or discuss
ways of representing the ampersand character
in e.g. HTML and XML.

The title and scope (but not desc) can be supplied to your system as the query.
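
A minimal sketch of reading an enterprise topic and building the query from it is shown below. The names `parse_enterprise_topic` and `to_query` are hypothetical, and the sketch assumes that a concrete topic carries exactly one of the scope values listed in the `<scope>` line of the template.

```python
import re
from typing import Dict

VALID_SCOPES = {"all", "www", "dev", "people", "esw", "lists"}


def parse_enterprise_topic(text: str) -> Dict[str, str]:
    """Split a topic into its num, scope, title and desc fields.

    Lines without a tag are treated as continuations of the previous field,
    which is how the multi-line description is collected.
    """
    fields: Dict[str, str] = {}
    current = None
    for line in text.splitlines():
        match = re.match(r"\s*<(num|scope|title|desc)>\s*(.*)", line)
        if match:
            current = match.group(1)
            # Drop the "Number:" / "Description:" labels if present.
            value = re.sub(r"^(Number|Description):\s*", "", match.group(2))
            fields[current] = value.strip()
        elif current is not None and line.strip():
            fields[current] = (fields[current] + " " + line.strip()).strip()
    if fields.get("scope") not in VALID_SCOPES:
        raise ValueError("unknown or missing scope: %r" % fields.get("scope"))
    return fields


def to_query(topic: Dict[str, str]) -> Dict[str, str]:
    """Keep only what may be supplied to the system: title and scope."""
    return {"scope": topic["scope"], "title": topic["title"]}


sample = """<num> Number: 1
<scope> all
<title> entity ampersand
<desc> Description:
Is ampersand always represented the same way?
Relevant documents will describe or discuss
ways of representing the ampersand character
in e.g. HTML and XML.
"""
topic = parse_enterprise_topic(sample)
query = to_query(topic)
```

Keeping `desc` out of `to_query` makes it harder to leak the description into an automatic run by accident.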

Relevance judgments will be on a 4 point scale:

  0. Irrelevant document. The document does not contain any information about the topic.
  1. Marginally relevant document. The document only points to the topic. It does not contain more or other information than the topic statement.
  2. Fairly relevant document. The document contains more information than the topic statement but the presentation is not exhaustive. In the case of a multifaceted topic, only some of the subthemes are covered.
  3. Highly relevant document. The document discusses the themes of the topic exhaustively. In the case of multifaceted topics, all or most subthemes are covered.

In cases where there are no degrees of relevance, e.g. known item search, this can be collapsed into two points: 0 and 3.
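
One possible way to represent the scale in code is sketched below. The `Relevance` enum and the checking function are hypothetical, and the numbering assumes the four levels are scored 0 to 3 in the order listed (the text only names the end points, 0 and 3).

```python
from enum import IntEnum
from typing import Dict, List


class Relevance(IntEnum):
    # Assumed numbering: levels scored in the order they are listed.
    IRRELEVANT = 0
    MARGINAL = 1
    FAIR = 2
    HIGH = 3


# For topics without degrees (e.g. known item search) only the end points apply.
TWO_POINT_GRADES = {Relevance.IRRELEVANT, Relevance.HIGH}


def grades_outside_two_point_scale(judgments: Dict[str, int]) -> List[str]:
    """Return the documents whose grade is neither 0 nor 3.

    Raises ValueError if a grade is not on the 4 point scale at all.
    """
    return sorted(
        doc for doc, grade in judgments.items()
        if Relevance(grade) not in TWO_POINT_GRADES
    )


# Toy judgments for a known item topic; document names are made up.
flagged = grades_outside_two_point_scale({"doc-a": 0, "doc-b": 3, "doc-c": 2})
```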

We will attempt to strike a balance between completeness of judgments and effort required from participants, perhaps by sacrificing completeness.

5. Indexing and Manual Interaction Restrictions

There are no indexing restrictions. You may index all of each document or exclude certain fields as you wish.

It is permissible to submit runs based on manual querying/interaction. Automatic runs (produced with no manual intervention whatsoever) are the focus of evaluation, but manual runs can be interesting and are encouraged since they increase the diversity of the judged document pool. Manual runs should be:

  1. Marked as manual when submitted.
  2. Quarantined from automatic runs. Automatic runs should not benefit (or change in any way) based on what was learned during manual interaction with this year's topic set.

It is also an interesting idea to submit a manual query classification in the mixed query task, to see whether people can distinguish between TD, HP and NP queries.

6. Submissions and Judgments

  1. All submissions are due at NIST on or before XX August 2004.
  2. Submission information:
  3. For the mixed query task it is likely that NIST will accept up to 5 official submissions for each task, but the number of fully judged runs per group will depend upon the number of submissions, the degree of overlap and the judging resources available. Hopefully it will be possible to judge two topic distillation runs and two home/named page runs per group.
  4. For the mixed query task, judging will be performed by NIST assessors. Enterprise task queries will be judged by participants.
  5. Judgments in the mixed query task will be binary. Key resource OR Not key resource. Home/named page OR Not home/named page.

Updated: 2004-06-17

nickcr at microsoft . com
david . hawking at csiro . au