Raghavendra Udupa, K Saravanan, A Kumaran, and Jagadeesh Jagarlamudi
Named Entities (NEs) form a significant fraction of query terms in Information Retrieval (IR) systems, and their retrieval has been shown to correlate highly with IR system performance. NEs are even more important in Cross Language Information Retrieval (CLIR), in addition to being a significant component of query terms. In recent times, the large quantity and perpetual availability of news corpora in many of the world’s languages simultaneously has spurred interest in a promising alternative to NE translation or transliteration: the mining of Named Entity Transliteration Equivalents (NETEs) from such news corpora (Klementiev and Roth, 2006; Tao et al., 2006). Formally, comparable news corpora are time-aligned news stories in a pair of languages, collected over a reasonably long duration. NETEs mined from comparable news corpora could be valuable in many tasks, such as CLIR and Machine Translation (MT), where they effectively complement bilingual dictionaries and machine transliteration systems. This opportunity is precisely what we address in our work. We introduce a novel method, called MINT (MIning Named-entity Transliteration equivalents), with the following innovations for effective mining of NETEs from comparable corpora: MINT relies on few linguistic resources, requiring a Named Entity Recognizer (NER) in only one language; hence NETEs from even a resource-poor language may be mined when it is paired with a language for which an NER is available.
|Published in||the 17th ACM Conference on Information and Knowledge Management (CIKM 2008), Napa Valley, USA|
|Publisher||Association for Computing Machinery, Inc.|
Copyright © 2007 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or firstname.lastname@example.org. The definitive version of this paper can be found at ACM’s Digital Library --http://www.acm.org/dl/.