W3C Test Collection

W3C is a TREC test collection for "enterprise search" experiments, available for use in the TREC Enterprise Track.

Scope   Corpus size (GB)     Docs  Avg doc size (KB)  Zipped size (MB)  Bundles  Compression (gzip/full)
lists              1.855  198,394                9.8             221.8      119                    0.117
dev                2.578   62,509               43.2             300.5      164                    0.114
www                1.043   45,975               23.8             195.9       67                    0.183
esw                0.181   19,605                9.7              12.9       12                    0.069
other              0.047    3,538               14.1               6.0        4                    0.124
people             0.003    1,016                3.6               0.4        1                    0.111
all                  5.7  331,037               18.1             737.5      367                    0.126
Table 1: W3C collection by scope: size in GB, document count, average document size, compressed size, number of compressed bundles, and compression ratio.
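
The derived columns of Table 1 can be recomputed from the others. The sketch below is illustrative only: it assumes binary units (1 gig = 1024 megs = 1024 x 1024 kb), which the page does not state, and the helper names avg_doc_kb and compression are made up for this example. Under that assumption the larger scopes reproduce the published figures after rounding. For the smallest scopes, where the corpus size is given to only one or two significant figures, the recomputed values can drift from the published ones.

```python
# Figures copied from Table 1: (scope, corpus size in gigs, docs, zipped size in megs).
ROWS = [
    ("lists", 1.855, 198_394, 221.8),
    ("dev", 2.578, 62_509, 300.5),
    ("www", 1.043, 45_975, 195.9),
    ("esw", 0.181, 19_605, 12.9),
    ("other", 0.047, 3_538, 6.0),
    ("people", 0.003, 1_016, 0.4),
]


def avg_doc_kb(size_gigs, docs):
    """Average document size in kb, ASSUMING 1 gig = 1024 * 1024 kb."""
    return size_gigs * 1024 * 1024 / docs


def compression(size_gigs, zipped_megs):
    """Zipped size divided by full size, ASSUMING 1 gig = 1024 megs."""
    return zipped_megs / (size_gigs * 1024)


# Print the recomputed derived columns for each scope.
for scope, size_gigs, docs, zipped_megs in ROWS:
    print(
        f"{scope:<7}"
        f"{avg_doc_kb(size_gigs, docs):7.1f} kb/doc"
        f"{compression(size_gigs, zipped_megs):8.3f}"
    )

# The per-scope document counts should add up to the "all" row.
total_docs = sum(docs for _, _, docs, _ in ROWS)
print("total docs:", total_docs)
```

Comparing the printed values against the table is a quick way to catch transcription errors when the figures are copied into a paper or script.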

Mime types

The crawl included the following non-text pages: 2,875 PDF, 702 PS, 217 PPT, 90 Word, 3 RTF and 2 XLS. In all cases the binaries were removed and, where possible, replaced with HTML versions. After extraction we ran content-based mime type checking:

Mime type                    Docs   Share
text/html                 311,134   94.0%
text/plain                 19,810    6.0%
message/rfc822                 71    0.0%
text/x-roff                    18    0.0%
message/news                    3    0.0%
application/octet-stream        1    0.0%
Total                     331,037
Table 2: True mime types after we extracted HTML from the PDFs etc.
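
The percentages in Table 2 follow directly from the counts. As a minimal sketch (the names MIME_COUNTS and shares are hypothetical, introduced only for this example), the shares and the total can be recomputed like this:

```python
# Document counts copied from Table 2.
MIME_COUNTS = {
    "text/html": 311_134,
    "text/plain": 19_810,
    "message/rfc822": 71,
    "text/x-roff": 18,
    "message/news": 3,
    "application/octet-stream": 1,
}


def shares(counts):
    """Return each type's percentage of the total document count."""
    total = sum(counts.values())
    return {mime: 100.0 * n / total for mime, n in counts.items()}


# Print each mime type with its count and share, then the overall total.
mime_total = sum(MIME_COUNTS.values())
for mime, pct in shares(MIME_COUNTS).items():
    print(f"{mime:<26}{MIME_COUNTS[mime]:>9,}{pct:7.1f}%")
print(f"{'Total':<26}{mime_total:>9,}")
```

The total matches the document count in the "all" row of Table 1, so every document in the collection is accounted for by one of these six types.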

Updated: 2005-04-28

Nick Craswell