Improving Cross-Language Information Retrieval by Transliteration Generation and Mining

  • K. Saravanan
  • Raghavendra Udupa
  • A. Kumaran

LNCS volume on FIRE-2010 Proceedings

Published by Springer

While state-of-the-art Cross-Language Information Retrieval (CLIR) systems are reasonably accurate and largely robust, they typically make mistakes in handling proper or common nouns. Such terms suffer from a compounding of errors across the query translation phase and the document retrieval phase. In this paper, we propose two techniques, transliteration generation and transliteration mining, to effectively handle query terms that may occur in their transliterated form in the target corpus. We systematically explore the effect of these transliteration techniques on the overall retrieval performance of a baseline state-of-the-art CLIR system. The baseline CLIR engine is a language-modeling-based system that uses query-likelihood-based document ranking and a probabilistic translation lexicon learned from a bilingual parallel corpus. The transliteration generation approach generates possible transliteration equivalents for out-of-vocabulary (OOV) terms during the query translation phase. The mining approach mines potential transliteration equivalents for the OOV terms from the documents returned by a first-pass retrieval over the target corpus, and uses them in a final retrieval. An implementation of this integrated system was used by the Microsoft Research India (MSR India) team in the cross-language information retrieval shared task organized by the Forum for Information Retrieval Evaluation 2010[1]. MSR India participated in two cross-language evaluation tasks, namely the Hindi-English and Tamil-English tasks, in addition to the English-English monolingual task. Using all of the title, description, and narrative information in the topic, our system achieved a peak retrieval performance of a MAP of 0.5133 in the monolingual English-English task. In the cross-language tasks, the system achieved a peak MAP of 0.4977 in the Hindi-English task and 0.4145 in the Tamil-English task.
The Hindi-English cross-language retrieval performance improved from 92% to 97% of the English-English monolingual retrieval performance, underscoring the effectiveness of the integrated CLIR system in enhancing the performance of the baseline CLIR system.
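
The relative-performance figure can be checked from the peak MAP values reported in the abstract. The short Python sketch below is illustrative only: the function name `relative_performance` and the dictionary layout are assumptions made for this example, not part of the authors' system, and only the three MAP values come from the abstract.

```python
# Peak MAP values as reported in the abstract.
MONOLINGUAL_MAP = 0.5133  # English-English
CROSS_LANGUAGE_MAP = {
    "Hindi-English": 0.4977,
    "Tamil-English": 0.4145,
}


def relative_performance(cross_map: float, mono_map: float) -> float:
    """Return cross-language MAP as a percentage of monolingual MAP.

    Hypothetical helper, named here for illustration only.
    """
    return 100.0 * cross_map / mono_map


for task, value in CROSS_LANGUAGE_MAP.items():
    pct = relative_performance(value, MONOLINGUAL_MAP)
    print(f"{task}: {pct:.1f}% of monolingual MAP")
```

For the Hindi-English task the ratio rounds to 97%, which agrees with the figure quoted above. The Tamil-English ratio, roughly 81%, is derived here from the reported MAP values and is not stated in the abstract itself.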