K Saravanan, Raghavendra Udupa, and A Kumaran
This report documents the participation of Microsoft Research India (MSR India) in the Crosslingual Information Retrieval (CLIR) evaluation organized by the Forum for Information Retrieval Evaluation 2010 [FIRE 2010]. MSR India participated in two crosslingual evaluation tasks, namely the Hindi-English and Tamil-English crosslingual tasks, in addition to the English-English monolingual task. Our core CLIR engine employed a language modeling based approach, using query likelihood based document ranking and a probabilistic translation lexicon learned from English-Hindi and English-Tamil parallel corpora. In addition, we employed two specific techniques to deal with out-of-vocabulary (OOV) terms in the crosslingual runs: first, generating transliterations directly or transitively, and second, mining possible transliteration equivalents from the documents retrieved in the first pass. We show experimentally that each of these techniques significantly improved the overall retrieval performance of our crosslingual IR system. Our system, using all of the topic-description-and-narrative information, achieved a peak retrieval performance of a MAP of 0.5133 in the monolingual English-English task; in the crosslingual tasks, our systems achieved a peak MAP of 0.4977 in Hindi-English and 0.4145 in Tamil-English. The post-task analyses indicate that mining appropriate transliterations from the top results of the first-pass retrieval enhanced the crosslingual performance of our system overall, in addition to enhancing the individual performance of more queries. Our Hindi-English crosslingual retrieval performance was nearly equal (~97%) to the English-English monolingual retrieval performance, indicating the effectiveness of our approaches to handling OOV terms in enhancing the baseline performance of our CLIR system.
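As a rough illustration of the query likelihood ranking with a probabilistic translation lexicon described in the abstract, the following minimal sketch scores English documents against a source-language query. It is not the MSR India system. The function names, the toy lexicon and documents, the direction of the lexicon probabilities P(s | t), the Jelinek-Mercer smoothing, and the weight `lam` are all assumptions made for illustration. Query terms missing from the lexicon are simply skipped here; the report's transliteration generation and mining techniques for such OOV terms are not implemented.

```python
import math
from collections import Counter


def score_document(query_terms, doc_tokens, coll_counts, coll_len, lexicon, lam=0.7):
    """Log query likelihood of a source-language query given a target-language document.

    Assumed model: P(s | D) = sum_t P(s | t) * P(t | D), where P(t | D) is a
    Jelinek-Mercer mixture of the document and collection language models.
    """
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    log_score = 0.0
    for s in query_terms:
        p_s = 0.0
        # lexicon[s] maps each target term t to the (hypothetical) probability P(s | t)
        for t, p_s_given_t in lexicon.get(s, {}).items():
            p_t_doc = doc_counts[t] / doc_len if doc_len else 0.0
            p_t_coll = coll_counts[t] / coll_len if coll_len else 0.0
            p_s += p_s_given_t * (lam * p_t_doc + (1.0 - lam) * p_t_coll)
        if p_s == 0.0:
            # OOV term (no usable translation): skipped in this sketch
            continue
        log_score += math.log(p_s)
    return log_score


def rank(query_terms, docs, lexicon, lam=0.7):
    """Return (doc_id, score) pairs sorted from best to worst."""
    coll_counts = Counter(tok for toks in docs.values() for tok in toks)
    coll_len = sum(coll_counts.values())
    scored = [
        (doc_id, score_document(query_terms, toks, coll_counts, coll_len, lexicon, lam))
        for doc_id, toks in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data, invented for illustration only (romanized source terms).
lexicon = {
    "nadi": {"river": 1.0},
    "paani": {"water": 0.9, "rain": 0.1},
}
docs = {
    "d1": "river water level rises".split(),
    "d2": "stock market news today".split(),
}
ranking = rank(["nadi", "paani"], docs, lexicon)
for doc_id, score in ranking:
    print(doc_id, round(score, 4))
```

In this toy setup the document containing translations of both query terms ranks first. The collection model keeps the other document's score finite rather than zeroing it out.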
Published in: the Forum for Information Retrieval Evaluation (FIRE-2010) Workshop, Kolkata, India