W3C is a TREC test collection for "enterprise search" experiments. It is available for use in the TREC Enterprise Track.
| Scope | Corpus size (GB) | Docs | Avg doc size (KB) | Zipped size (MB) | Bundles | Compression (gzip/full) |
|---|---|---|---|---|---|---|
| lists | 1.855 | 198,394 | 9.8 | 221.8 | 119 | 0.117 |
| dev | 2.578 | 62,509 | 43.2 | 300.5 | 164 | 0.114 |
| www | 1.043 | 45,975 | 23.8 | 195.9 | 67 | 0.183 |
| esw | 0.181 | 19,605 | 9.7 | 12.9 | 12 | 0.069 |
| other | 0.047 | 3,538 | 14.1 | 6.0 | 4 | 0.124 |
| people | 0.003 | 1,016 | 3.6 | 0.4 | 1 | 0.111 |
| all | 5.7 | 331,037 | 18.1 | 737.5 | 367 | 0.126 |
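The derived columns in the table can be checked from the raw ones. The sketch below is only a sanity check, not part of the collection's tooling: it assumes binary units (1 GB = 1024 MB = 1024 × 1024 KB), and the function names are made up for illustration. Under that assumption the larger scopes reproduce the published average document size and compression ratio; the smallest scopes can differ in the last digit or more, because the table's inputs are themselves rounded.

```python
# Per-scope figures copied from the table above:
# (corpus size in GB, documents, zipped size in MB, bundles)
SCOPES = {
    "lists":  (1.855, 198_394, 221.8, 119),
    "dev":    (2.578,  62_509, 300.5, 164),
    "www":    (1.043,  45_975, 195.9,  67),
    "esw":    (0.181,  19_605,  12.9,  12),
    "other":  (0.047,   3_538,   6.0,   4),
    "people": (0.003,   1_016,   0.4,   1),
}


def avg_doc_kb(size_gb, docs):
    """Average document size in KB, assuming 1 GB = 1024 * 1024 KB."""
    return size_gb * 1024 * 1024 / docs


def compression_ratio(zipped_mb, size_gb):
    """Zipped size divided by full size, assuming 1 GB = 1024 MB."""
    return zipped_mb / (size_gb * 1024)


def totals(scopes):
    """Sum each raw column over all scopes (the 'all' row of the table)."""
    size_gb = sum(row[0] for row in scopes.values())
    docs = sum(row[1] for row in scopes.values())
    zipped_mb = sum(row[2] for row in scopes.values())
    bundles = sum(row[3] for row in scopes.values())
    return size_gb, docs, zipped_mb, bundles


if __name__ == "__main__":
    for name, (size_gb, docs, zipped_mb, _bundles) in SCOPES.items():
        print(
            f"{name:7s} avg doc {avg_doc_kb(size_gb, docs):5.1f} KB, "
            f"compression {compression_ratio(zipped_mb, size_gb):.3f}"
        )
    print("totals:", totals(SCOPES))
```

The per-scope document counts, zipped sizes and bundle counts sum exactly to the "all" row.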
The crawl included the following non-text documents: 2,875 PDF, 702 PS, 217 PPT, 90 Word, 3 RTF and 2 XLS. In all cases the binaries were removed and, where possible, replaced with HTML versions. After extraction we did content-based MIME type checking:
| MIME type | Docs | Share of docs |
|---|---|---|
| text/html | 311,134 | 94.0% |
| text/plain | 19,810 | 6.0% |
| message/rfc822 | 71 | 0.0% |
| text/x-roff | 18 | 0.0% |
| message/news | 3 | 0.0% |
| application/octet-stream | 1 | 0.0% |
| Total | 331,037 | 100.0% |
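Content-based checking means deciding a document's type from its bytes rather than from its URL or file extension. The source does not say which tool was used, so the following is only a rough sketch of the idea: `sniff_mime`, its heuristics and its marker lists are all hypothetical, and it covers just four of the types in the table.

```python
import re

# Header names that commonly open an e-mail message (an assumption for this sketch).
_MAIL_HEADER = re.compile(
    rb"^(From|Received|Return-Path|Message-ID|Subject|Date): ", re.IGNORECASE
)

# Byte sequences taken as evidence of HTML (also an assumption).
_HTML_MARKERS = (b"<!doctype html", b"<html", b"<head", b"<body", b"<title")


def sniff_mime(data: bytes, probe: int = 1024) -> str:
    """Guess a MIME type from the first `probe` bytes of a document."""
    head = data[:probe]
    # NUL bytes almost never occur in text, so treat them as binary.
    if b"\x00" in head:
        return "application/octet-stream"
    # A message header at the very start suggests a raw e-mail.
    if _MAIL_HEADER.match(head.lstrip()):
        return "message/rfc822"
    # Any HTML marker near the top suggests an HTML page.
    lowered = head.lower()
    if any(marker in lowered for marker in _HTML_MARKERS):
        return "text/html"
    # Everything else that looks like text falls back to plain text.
    return "text/plain"


if __name__ == "__main__":
    samples = [
        b"<!DOCTYPE html>\n<html><body>hi</body></html>",
        b"Received: from a by b\nSubject: x\n\nbody",
        b"just some notes\n",
    ]
    for sample in samples:
        print(sniff_mime(sample))
```

The order of the checks matters: a raw message whose body contains HTML is reported as `message/rfc822` here because the header test runs first. A real checker would use a much larger signature database.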
Updated: 2005-04-28
Nick Craswell