Book Search

Algorithms and systems for the domain-specific searching and browsing of collections of digitized books.

Project overview

Through mass-digitization projects and with the use of OCR technologies, the full texts of digitized books are becoming available on the Web and in digital libraries. The unprecedented scale of these efforts, the unique characteristics of the digitized material, as well as the unexplored possibilities of user interactions raise a rich set of research questions. These range from questions on appropriate indexing and retrieval algorithms to user interface issues.

The goal of our project is to investigate book search as a specialized IR domain, and to develop and evaluate retrieval functions and complete search systems for collections of digitized books. On the indexing side, we are particularly interested in investigating an extended feature space that includes external sources, such as Wikipedia, and structural information internal to the books, such as the back-of-book index. On the user interface side, our focus is on supporting users in reading-related activities, such as browsing and active reading.

Project highlights

  • We developed a complete book search system, which is publicly available. The system allows users to search and browse a collection of over 50,000 digitized books, and to annotate the books and their pages.
  • We defined a topical closeness measure by applying a random walk model to an extended Wikipedia graph, which connects the user's query with books in a target corpus through the link graph of Wikipedia. The measure was found to provide a good indicator of relevance, boosting the retrieval score of relevant books. Read more on this in our WSDM'09 paper.
  • We developed a multi-field inverted index structure and built an experimental retrieval platform. Using this platform and a collection of 10,000 digitized books, we investigated the contribution of structural book features to retrieval effectiveness. Read more on this in our ECIR'08 paper.
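
To illustrate the idea behind the topical closeness measure, here is a minimal sketch of a random walk over a small link graph. It is not the model from the WSDM'09 paper: the random-walk-with-restart formulation, the restart probability, the function name `random_walk_scores`, and the toy graph (including all page and book identifiers) are assumptions made for illustration only.

```python
def random_walk_scores(edges, seeds, restart=0.15, iterations=50):
    """Random walk with restart over a directed graph (illustrative only).

    edges: dict mapping a node to the list of nodes it links to.
    seeds: nodes the walk restarts from, e.g. Wikipedia pages matching the query.
    Returns a dict mapping each node to its visiting probability.
    """
    seeds = set(seeds)
    nodes = set(edges) | {t for targets in edges.values() for t in targets} | seeds
    restart_dist = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    probs = dict(restart_dist)

    for _ in range(iterations):
        # With probability `restart` the walker jumps back to a seed node.
        nxt = {n: restart * restart_dist[n] for n in nodes}
        for node in nodes:
            mass = (1.0 - restart) * probs[node]
            targets = edges.get(node, [])
            if targets:
                # Otherwise it follows one outgoing link chosen uniformly.
                for t in targets:
                    nxt[t] += mass / len(targets)
            else:
                # Nodes without outgoing links (here: books) return to the seeds.
                for s in seeds:
                    nxt[s] += mass / len(seeds)
        probs = nxt
    return probs


# Hypothetical toy graph: Wikipedia pages link to each other and to books.
graph = {
    "wiki:Impressionism": ["wiki:Claude_Monet", "wiki:Paris"],
    "wiki:Claude_Monet": ["wiki:Impressionism", "book:monet_biography"],
    "wiki:Paris": ["book:paris_travel_guide", "wiki:France"],
    "wiki:France": ["wiki:Paris"],
    "wiki:Steam_engine": ["book:engine_manual"],
}

# Assume the query was matched to the Claude Monet page.
scores = random_walk_scores(graph, seeds=["wiki:Claude_Monet"])
books = sorted(
    (n for n in scores if n.startswith("book:")), key=scores.get, reverse=True
)
print(books)
```

In this sketch, books that are reachable from the query's pages through short link paths collect more probability mass, while a book with no path from the query collects none. Such a score could then be combined with a text-based retrieval score; how the two are combined is left open here.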
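
A multi-field inverted index can be pictured as a mapping from each term to the book fields in which it occurs. The sketch below is a simplified illustration, not the project's actual index structure: the class `MultiFieldIndex`, the field names, the whitespace tokenizer, and the weighted term-frequency scoring are all assumptions.

```python
from collections import defaultdict


class MultiFieldIndex:
    """Toy inverted index that keeps separate postings per book field."""

    def __init__(self):
        # term -> field -> book_id -> term frequency
        self.postings = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def add(self, book_id, fields):
        """Index a book given a dict of field name -> text."""
        for field, text in fields.items():
            for term in text.lower().split():
                self.postings[term][field][book_id] += 1

    def search(self, query, field_weights):
        """Rank books by a weighted sum of per-field term frequencies."""
        scores = defaultdict(float)
        for term in query.lower().split():
            for field, docs in self.postings.get(term, {}).items():
                weight = field_weights.get(field, 0.0)
                if weight == 0.0:
                    continue  # fields without a weight do not contribute
                for book_id, tf in docs.items():
                    scores[book_id] += weight * tf
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)


index = MultiFieldIndex()
index.add("book1", {
    "title": "A History of Steam",
    "back_of_book_index": "boiler piston steam valve",
    "body": "the steam engine changed industry",
})
index.add("book2", {
    "title": "Gardening Basics",
    "body": "steam can sterilise soil",
})

# Hypothetical weights: structural fields count more than body text.
results = index.search(
    "steam", {"title": 3.0, "back_of_book_index": 2.0, "body": 1.0}
)
print(results)
```

Keeping postings separate per field makes it possible to switch individual structural features on or off, or to reweight them, when measuring their contribution to retrieval effectiveness.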

Project team and collaborators

  • Jamie Costello (Microsoft Research)
  • Nick Craswell (Microsoft Research)
  • Gabriella Kazai (Microsoft Research)
  • Marijn Koolen (University of Amsterdam)
  • Natasa Milic-Frayling (Microsoft Research)
  • Thomas Roelleke (Queen Mary, University of London)
  • Michael Taylor (Microsoft Research)
  • Hengzhi Wu (Queen Mary, University of London)