Many modern information retrieval analyses must operate on web-scale data collections. These collections are large enough to make single-computer implementations impractical, which would seem to necessitate custom distributed implementations.
Instead, we have implemented a collection of information retrieval analyses atop DryadLINQ, a research LINQ provider layer over Dryad, a reliable and scalable computational middleware. Our implementations are relatively simple data-parallel adaptations of traditional algorithms and, owing entirely to the scalability of Dryad and DryadLINQ, scale up to very large data sets. The current version of the toolkit, available for download below, has been successfully tested against the ClueWeb corpus.
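As a rough illustration of what a data-parallel adaptation of a traditional algorithm can look like, the sketch below computes document frequency (the number of documents containing each term) in two steps: an independent count per partition, followed by an associative merge. It is written in plain Python, not DryadLINQ, and is not taken from the toolkit; the function names `partition_counts` and `document_frequency`, the whitespace tokenization, and the sample documents are all assumptions made for illustration.

```python
from collections import Counter
from functools import reduce


def partition_counts(docs):
    """Per-partition step: for each term, count the documents that contain it.

    Each partition is processed independently, so in a distributed setting
    this step could run on a separate machine per partition.
    """
    counts = Counter()
    for text in docs:
        # set() ensures a term is counted at most once per document.
        counts.update(set(text.lower().split()))
    return counts


def document_frequency(partitions):
    """Merge step: sum the per-partition counts.

    Counter addition is associative and commutative, so partial results
    can be combined in any order.
    """
    return reduce(lambda a, b: a + b, map(partition_counts, partitions), Counter())


# Hypothetical toy corpus, split into two partitions.
partitions = [
    ["the quick brown fox", "the lazy dog"],
    ["the fox jumps", "a quick dog"],
]
df = document_frequency(partitions)
```

Because the per-partition step shares no state and the merge does not depend on order, the same computation can be spread over many machines without changing its result, which is the property that lets a simple formulation like this scale with the underlying platform.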
While we hope these tools prove useful to researchers who want to work with larger data sets, they are also intended to be instructional in the use of DryadLINQ.
- Dennis Fetterly and Frank McSherry, A Data-Parallel Toolkit for Information Retrieval, in Proceedings of SIGIR, Association for Computing Machinery, Inc., 19 July 2010