A Data-Parallel Toolkit for Information Retrieval

Dennis Fetterly and Frank McSherry

Abstract

In this work, we describe the collection of information retrieval algorithms we have implemented using DryadLINQ. DryadLINQ is a data-parallel processing system that lets programmers write distributed programs without having to implement the underlying distributed system themselves. It executes programs containing SQL-like Language Integrated Query (LINQ) statements by shipping the computation to nodes in the cluster for parallel execution. Because a computation can be broken into many pieces, each processed on an individual machine, even a small number of computers can reduce the time needed to process large collections.
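The idea of breaking a computation into independently processed pieces can be illustrated with a minimal Python sketch. This is not DryadLINQ code and does not reflect the toolkit's API: the function names (`partition`, `count_terms`, `parallel_term_counts`) are hypothetical, term counting is chosen only as a familiar information retrieval task, and local threads stand in for cluster nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(docs, num_parts):
    """Split the collection into num_parts pieces (round-robin)."""
    return [docs[i::num_parts] for i in range(num_parts)]


def count_terms(docs):
    """Per-partition work: count term occurrences in one piece."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def parallel_term_counts(docs, num_parts=4):
    """Count terms over the whole collection by merging partial counts."""
    parts = partition(docs, num_parts)
    # Assumption: threads stand in for cluster nodes. A real system
    # would ship count_terms to separate machines instead.
    with ThreadPoolExecutor(max_workers=num_parts) as pool:
        partials = list(pool.map(count_terms, parts))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

Each partition is processed without reference to the others, and only the small per-partition summaries are merged at the end, which is what makes this kind of computation easy to spread across machines.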

Details

Publication type: Inproceedings
Published in: Proceedings of SIGIR
Publisher: Association for Computing Machinery, Inc.