Crosslingual Information Retrieval System Enhanced with Transliteration Generation and Mining

This report documents the participation of Microsoft Research India (MSR India) in the Crosslingual Information Retrieval (CLIR) evaluation organized by the Forum for Information Retrieval Evaluation 2010 [FIRE 2010]. MSR India participated in two crosslingual evaluation tasks, namely the Hindi-English and Tamil-English crosslingual tasks, in addition to the English-English monolingual task. Our core CLIR engine employed a language-modeling approach, using query-likelihood document ranking and a probabilistic translation lexicon learned from English-Hindi and English-Tamil parallel corpora. In addition, we employed two specific techniques to deal with out-of-vocabulary (OOV) terms in the crosslingual runs: first, generating transliterations directly or transitively, and second, mining possible transliteration equivalents from the documents retrieved in the first pass. We show experimentally that each of these techniques significantly improved the overall retrieval performance of our crosslingual IR system. Our system, using all of the topic, description, and narrative information, achieved a peak retrieval performance of a MAP of 0.5133 in the monolingual English-English task; in the crosslingual tasks, our systems achieved a peak MAP of 0.4977 in Hindi-English and 0.4145 in Tamil-English. The post-task analyses indicate that mining appropriate transliterations from the top results of the first-pass retrieval enhanced the overall crosslingual performance of our system, in addition to improving the performance of more individual queries. Our Hindi-English crosslingual retrieval performance was nearly equal (~97%) to the English-English monolingual retrieval performance, indicating the effectiveness of our approaches to handling OOV terms in enhancing the baseline performance of our CLIR system.
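As a rough illustration of the kind of ranking the abstract describes, the sketch below scores documents by query likelihood, mapping each source-language query term to target-language words through a probabilistic translation lexicon. It is not the authors' implementation: the function names, the toy lexicon, the linear (Jelinek-Mercer style) smoothing, and the weight `lam` are all assumptions made for this example. Query terms with no lexicon entry are simply skipped, which is the gap that transliteration generation and mining are meant to fill.

```python
import math
from collections import Counter


def score(query_terms, doc_tokens, coll_counts, coll_len, lexicon, lam=0.5):
    """Log query likelihood of a source-language query given a target document.

    lexicon maps a source term to {target word: P(source term | target word)}.
    The smoothing scheme and lam are assumptions, not taken from the paper.
    """
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    log_p = 0.0
    for q in query_terms:
        p_q = 0.0
        for w, p_trans in lexicon.get(q, {}).items():
            p_doc = doc_counts[w] / doc_len if doc_len else 0.0
            p_coll = coll_counts[w] / coll_len if coll_len else 0.0
            # Smoothed document language model, weighted by translation probability.
            p_q += p_trans * ((1 - lam) * p_doc + lam * p_coll)
        if p_q == 0.0:
            # OOV term (or no translation found in the collection):
            # it contributes nothing to the score in this sketch.
            continue
        log_p += math.log(p_q)
    return log_p


def rank(query_terms, docs, lexicon, lam=0.5):
    """Rank documents (id -> token list) by descending query likelihood."""
    coll_counts = Counter(tok for toks in docs.values() for tok in toks)
    coll_len = sum(coll_counts.values())
    scored = [
        (doc_id, score(query_terms, toks, coll_counts, coll_len, lexicon, lam))
        for doc_id, toks in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: romanized Hindi query terms and a hypothetical lexicon.
toy_lexicon = {
    "nadi": {"river": 1.0},
    "pani": {"water": 0.9, "aqua": 0.1},
}
toy_docs = {
    "d1": "the river water is clean".split(),
    "d2": "the city market is busy".split(),
}
ranking = rank(["nadi", "pani", "kolkata"], toy_docs, toy_lexicon)
```

In this toy run, "kolkata" has no lexicon entry and therefore cannot influence the ranking; a generated or mined transliteration would add such a term to the lexicon so that it could.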

In the Forum for Information Retrieval Evaluation (FIRE-2010) Workshop, Kolkata, India

Details

Type: Article