Algorithms and systems for the domain-specific searching and browsing of collections of digitized books.
Through mass-digitization projects and with the use of OCR technologies, the full texts of digitized books are becoming available on the Web and in digital libraries. The unprecedented scale of these efforts, the unique characteristics of the digitized material, as well as the unexplored possibilities of user interactions raise a rich set of research questions. These range from questions on appropriate indexing and retrieval algorithms to user interface issues.
The goal of our project is to investigate book search as a specialized IR domain, and to develop and evaluate retrieval functions and complete search systems for collections of digitized books. On the indexing side, we are particularly interested in investigating an extended feature space that includes external sources, such as Wikipedia, and structural information internal to the books, such as the back-of-book index. On the user interface side, our focus is on supporting users in reading-related activities, such as browsing and active reading.
- We developed a complete book search system, which is publicly available at http://www.booksearch.org.uk/. The system allows users to search and browse a collection of over 50,000 digitized books, and to annotate the books and their pages.
- Our topical closeness measure is defined by applying a random walk model to an extended Wikipedia graph, which connects the user's query with books in a target corpus through the link graph of Wikipedia. The measure was found to provide a good indicator of relevance, boosting the retrieval score of relevant books. Read more on this in our WSDM'09 paper.
- We developed a multi-field inverted index structure and built an experimental retrieval platform. Using this platform and a collection of 10,000 digitized books, we investigated the contribution of structural book features to retrieval effectiveness. Read more on this in our ECIR'08 paper.
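The idea behind the topical closeness measure described above can be illustrated with a small sketch. This is not the model from the WSDM'09 paper: it is a generic random walk with restart, and the graph layout, the `wiki:` and `book:` node prefixes, the restart probability, the number of steps, and the function name `random_walk_scores` are all assumptions made for illustration. The walk starts at the Wikipedia pages that match the query, follows links, and the probability mass that reaches a book node is read as that book's closeness to the query.

```python
from collections import defaultdict


def random_walk_scores(graph, start_nodes, restart=0.15, steps=20):
    """Random walk with restart over a directed graph {node: [neighbours]}.

    Hypothetical helper: returns the probability of being at each node
    after `steps` steps, starting uniformly from `start_nodes`.
    """
    start = {n: 1.0 / len(start_nodes) for n in start_nodes}
    probs = dict(start)
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, p in probs.items():
            out = graph.get(node, [])
            if not out:
                # Node without outgoing links (here: books).
                # Assumption: all of its mass returns to the start nodes.
                for s, w in start.items():
                    nxt[s] += p * w
                continue
            # With probability `restart`, jump back to a start node.
            for s, w in start.items():
                nxt[s] += restart * p * w
            # Otherwise follow one outgoing link chosen uniformly.
            share = (1.0 - restart) * p / len(out)
            for neighbour in out:
                nxt[neighbour] += share
        probs = dict(nxt)
    return probs


# Toy extended graph: Wikipedia pages link to each other and to books.
graph = {
    "wiki:Information retrieval": ["wiki:Search engine", "book:A"],
    "wiki:Search engine": ["wiki:Information retrieval", "book:B"],
    "wiki:Gardening": ["book:C"],
}

# Assumption: the query was matched to a single Wikipedia entry page.
scores = random_walk_scores(graph, ["wiki:Information retrieval"])
book_scores = {n: s for n, s in scores.items() if n.startswith("book:")}
```

In this toy graph, the book linked directly from the entry page ends up with a higher score than the book that is two links away, and the book reachable only from an unrelated page receives no score at all. In a retrieval setting, such a score could be combined with a text-based retrieval score, although how the two are combined is not specified here.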
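As a rough illustration of what a multi-field inverted index can look like, the sketch below keeps separate postings for each book field, so that matches in different structural parts of a book can be weighted differently at query time. It is a minimal sketch under stated assumptions, not the experimental platform from the ECIR'08 paper: the class name `MultiFieldIndex`, the field names (`title`, `back_index`, `body`), the whitespace tokenizer, and the weighted term-frequency scoring are all hypothetical.

```python
from collections import defaultdict


class MultiFieldIndex:
    """Toy inverted index with one postings list per (term, field) pair."""

    def __init__(self):
        # term -> field -> book_id -> term frequency
        self.postings = defaultdict(
            lambda: defaultdict(lambda: defaultdict(int))
        )

    def add(self, book_id, fields):
        """Index a book given as {field_name: text}."""
        for field, text in fields.items():
            # Assumption: lowercase whitespace tokenization is enough here.
            for term in text.lower().split():
                self.postings[term][field][book_id] += 1

    def search(self, query, field_weights):
        """Rank books by the field-weighted sum of query term frequencies."""
        scores = defaultdict(float)
        for term in query.lower().split():
            for field, docs in self.postings.get(term, {}).items():
                weight = field_weights.get(field, 0.0)
                for book_id, tf in docs.items():
                    scores[book_id] += weight * tf
        # Highest score first; ties broken by book id for stable output.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


index = MultiFieldIndex()
index.add("book1", {
    "title": "Gardening Basics",
    "body": "soil and roses and soil",
    "back_index": "roses soil",
})
index.add("book2", {
    "title": "History of Roses",
    "body": "roses in art",
})

# Hypothetical weights: a title match counts more than a body match.
results = index.search("roses", {"title": 3.0, "back_index": 2.0, "body": 1.0})
```

Keeping the field as part of the postings makes it possible to switch individual fields on or off through the weights, which is one simple way to measure how much each structural feature contributes to retrieval effectiveness.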
Project team and collaborators
- Jamie Costello (Microsoft Research)
- Nick Craswell (Microsoft Research)
- Gabriella Kazai (Microsoft Research)
- Marijn Koolen (University of Amsterdam)
- Natasa Milic-Frayling (Microsoft Research)
- Thomas Roelleke (Queen Mary, University of London)
- Michael Taylor (Microsoft Research)
- Hengzhi Wu (Queen Mary, University of London)
Publications
- Gabriella Kazai, Natasa Milic-Frayling, and Jamie Costello, Towards methods for the collective gathering and quality control of relevance assessments, in Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (Boston, MA, USA, July 19-23, 2009). SIGIR '09, Association for Computing Machinery, Inc., July 2009.
- Marijn Koolen, Gabriella Kazai, and Nick Craswell, Wikipedia Pages as Entry Points for Book Search, in Proceedings of the Second ACM International Conference on Web Search and Data Mining (WSDM'09), Association for Computing Machinery, Inc., February 2009.
- Hengzhi Wu, Gabriella Kazai, and Thomas Roelleke, Modelling Anchor Text Retrieval in Book Search based on Back-of-Book Index, in SIGIR 2008 Workshop on Focused Retrieval, Association for Computing Machinery, Inc., July 2008.
- Hengzhi Wu, Gabriella Kazai, and Michael Taylor, Book Search Experiments: Investigating IR Methods for the Indexing and Retrieval of Books, in Advances in Information Retrieval, 30th European Conference on IR Research, ECIR 2008, Springer, April 2008.