A Data-Parallel Toolkit for Information Retrieval

  • Dennis Fetterly
  • Frank McSherry

Proceedings of SIGIR

Published by Association for Computing Machinery, Inc.

In this work, we describe the collection of information retrieval algorithms we have implemented using DryadLINQ. DryadLINQ is a data-parallel processing system that lets programmers write distributed programs without having to implement the underlying distributed system themselves. It executes programs containing SQL-like Language Integrated Query (LINQ) statements by shipping the computation to nodes in the cluster for parallel execution. Because a computation can be broken into many pieces that are processed on individual machines, even a small number of computers can reduce the time needed to process large collections.
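To make the data-parallel idea concrete, the following is a minimal sketch in Python of one simple retrieval-style task, counting term occurrences across a document collection. It is not DryadLINQ code and does not use its API: the function names (partition, count_terms, parallel_term_counts) are hypothetical, the sample documents are made up, and a local thread pool stands in for the nodes of a cluster. The sketch only shows the general pattern of splitting the input into pieces, processing each piece independently, and merging the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(items, num_parts):
    """Split items into num_parts slices (round-robin assignment)."""
    return [items[i::num_parts] for i in range(num_parts)]


def count_terms(docs):
    """Per-partition work: count term occurrences in a slice of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def parallel_term_counts(docs, num_workers=4):
    """Count terms by processing each partition independently, then merging."""
    parts = partition(docs, num_workers)
    # Assumption: worker threads stand in for cluster nodes in this sketch.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(count_terms, parts))
    total = Counter()
    for partial in partials:
        total.update(partial)  # merge step: add up the per-partition counts
    return total


# Made-up sample collection, for illustration only.
docs = [
    "data parallel retrieval",
    "parallel query execution",
    "retrieval of large collections",
]
totals = parallel_term_counts(docs, num_workers=2)
```

Because each partition is counted without reference to the others, the merged result matches what a single sequential pass over the whole collection would produce; this independence is what allows the per-partition work to be shipped to separate machines.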