The Multilingual Systems Group explores software technologies to enable seamless content-creation, storage, search, access, and interaction with multiple languages.
Kalika Bali
Monojit Choudhury
A Kumaran
Raghavendra Udupa
Research Overview
At the Multilingual Systems (MLS) group in Microsoft Research India, we believe that Multilingual Information Access is critical for the acquisition, dissemination, exchange, and understanding of knowledge in the global information society. The accelerated growth in the size, content and reach of the Internet, the diversity of user demographics, and the skew in the availability of information across languages all point to an increasingly critical need for Multilingual Information Access. We are an interdisciplinary research group focusing on technologies for Multilingual Information Access, such as Cross-language Information Retrieval, Multilingual Information Extraction, and Machine Translation, that transparently bridge the gap between available information and user needs across languages.
To this end, we carry out cutting-edge research on a) several aspects of Cross-language Information Retrieval and Multilingual Information Extraction, including query expansion, domain adaptation, automatic alignment of multilingual corpora, multilingual named entity extraction, and machine transliteration; b) automated and collaborative creation of parallel corpora for Machine Translation; and c) fundamental properties of languages and language phenomena, including language acquisition and evolution, structural properties of corpora in the framework of complex networks, and the interaction between syntax and prosody.
In addition, we are interested in robust fundamentals, especially annotation standards, data collection efforts and basic tools for research in Indian languages. More importantly, we would like to enable and be a part of a strong research eco-system in Multilingual Information Access in India.
projects
CLIR-Cross Lingual Information Retrieval: As the heterogeneity of the data available on the web increases, it is becoming increasingly important to enable users to access relevant information across languages. Cross Lingual Information Retrieval research focuses on data-driven approaches to help users easily organize and access information when needed. Parallel data, though very useful, is expensive to obtain and hence is not available in sufficient quantity. Comparable corpora, that is, documents on the same topic in different languages, are on the other hand available in abundance. We try to exploit both of these resources to bridge the language barrier for cross-lingual applications.
We are currently experimenting with a Hindi-to-English CLIR task, in which the query is expressed in Hindi and the relevant documents are retrieved from an English document collection. The Hindi query is translated into English word by word, with the help of statistically learned word alignments. The relevant documents are then retrieved using a Language Modelling based retrieval algorithm. On the CLEF 2007 data set, the cross-lingual system achieved 73.4% of the monolingual IR system's performance.
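The two-stage pipeline described above (word-by-word query translation followed by language-model retrieval) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the romanized translation table, the toy documents, the function names, and the choice of a unigram query-likelihood model with Jelinek-Mercer smoothing. It is not the group's actual system, whose details are not given here.

```python
import math
from collections import Counter

# Hypothetical translation table: source word -> {target word: P(target | source)}.
# In a real system these probabilities would come from statistically learned word alignments.
TRANSLATION_TABLE = {
    "pani": {"water": 0.9, "rain": 0.1},
    "nadi": {"river": 0.8, "stream": 0.2},
}

# Hypothetical toy English collection: document id -> list of tokens.
DOCS = {
    "d1": ["the", "river", "carries", "water"],
    "d2": ["rain", "fell", "on", "the", "city"],
    "d3": ["water", "policy", "debate"],
}

def translate_query(query_words, table):
    """Word-by-word translation: keep the most probable target word per source word."""
    translated = []
    for word in query_words:
        candidates = table.get(word)
        if candidates:  # words missing from the table are skipped
            translated.append(max(candidates, key=candidates.get))
    return translated

def score(query, doc_tokens, collection_counts, collection_len, lam=0.5):
    """Log-likelihood of the query under a smoothed unigram document model."""
    doc_counts = Counter(doc_tokens)
    doc_len = max(len(doc_tokens), 1)
    total = 0.0
    for term in query:
        p_doc = doc_counts[term] / doc_len
        p_coll = collection_counts[term] / collection_len
        p = lam * p_doc + (1 - lam) * p_coll  # Jelinek-Mercer interpolation
        if p == 0.0:
            return float("-inf")  # term unseen anywhere in the collection
        total += math.log(p)
    return total

def retrieve(query_words, docs, table):
    """Translate the query, then rank documents by query likelihood (best first)."""
    english_query = translate_query(query_words, table)
    collection_counts = Counter(tok for doc in docs.values() for tok in doc)
    collection_len = sum(collection_counts.values())
    ranked = sorted(
        docs,
        key=lambda d: score(english_query, docs[d], collection_counts, collection_len),
        reverse=True,
    )
    return english_query, ranked

english_query, ranking = retrieve(["nadi", "pani"], DOCS, TRANSLATION_TABLE)
print(english_query, ranking)
```

Taking only the single most probable translation per word keeps the sketch short; a fuller system might instead carry several weighted translations into the retrieval step.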
Along with this, we are also working on domain adaptation, cross-lingual text modelling, automatic alignment of comparable corpora, and machine transliteration.
Mining Linguistic Data: In this project area, we attempt to mine the Internet for parallel data. Our mining techniques are statistical in nature, in contrast to the ad-hoc heuristics typically used in mining Internet data. Specifically, we mine multilingual news-wire articles on the same time-scale to yield specific types of parallel data, namely Parallel Sentences, Sub-sentential Fragments, and Named Entity Transliterations. Our experiments on many language pairs from different language families (for example, English-Spanish, English-Russian, English-Hindi and English-Kannada) indicated great potential for such approaches and highlighted the applicability of our methods to many of the world's languages.
IL-POST-Indian Language Parts of Speech Tagset: Parts-of-Speech (POS) tagging is an important process for most Natural Language Processing (NLP) tasks. POS annotations capture the morphosyntactic features of the words from the given context in a text and hence can provide useful information for subsequent stages of processing such as chunking, named entity detection, and parsing.
As a part of this project we have designed IL-POST, a universal Parts-of-Speech tagset framework covering most of the Indian languages (ILs) and following a hierarchical and decomposable tagset schema. Despite their significant numbers of speakers, most ILs have no workable POS tagset or tagger, although these serve as fundamental building blocks for NLP research. Existing IL POS tagsets are often designed for a specific language; the few that have been designed for multiple languages cover only shallow linguistic features, ignoring linguistic richness and idiosyncrasies. The framework proposed here balances fine granularity of the tags against user- or language-specific needs in an efficient and principled manner. We follow a hierarchical schema that makes the framework flexible enough to capture the rich features of a language or language family while still capturing the shared linguistic structures in a methodical way. The proposed common framework further facilitates the sharing and reuse of scarce resources in these languages and ensures cross-linguistic compatibility.
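To make the idea of a hierarchical and decomposable tag concrete, here is a minimal sketch of one way such a tag could be represented. The three levels (category, subtype, attributes), the labels "N" and "NC", and the feature names are hypothetical placeholders chosen for illustration; they are not taken from the actual IL-POST inventory.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tag:
    """A tag with three levels of detail (all labels here are hypothetical)."""
    category: str           # coarsest level, e.g. "N" for noun
    subtype: str = ""       # second level, e.g. "NC" for common noun
    attributes: tuple = ()  # finest level: sorted (feature, value) pairs

    def render(self, depth=3):
        """Decompose the tag: emit only the first `depth` levels."""
        parts = [self.category]
        if depth >= 2 and self.subtype:
            parts.append(self.subtype)
        if depth >= 3:
            parts.extend(f"{name}={value}" for name, value in self.attributes)
        return ".".join(parts)

def make_tag(category, subtype="", **attributes):
    """Build a tag, storing attributes in a canonical (sorted) order."""
    return Tag(category, subtype, tuple(sorted(attributes.items())))

tag = make_tag("N", "NC", number="sg", gender="fem")
print(tag.render())   # full, fine-grained tag
print(tag.render(2))  # category and subtype only
print(tag.render(1))  # coarse category only
```

Because every fine-grained tag can be truncated to a coarser level, annotations made at different granularities, or for different languages, could in principle still be compared at the level they share.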
This project is a collaborative effort between Linguists, Computational Linguists, and Computer Scientists from MSRI, AU-KBC, Delhi University, IIT Bombay, Jawaharlal Nehru University, and Tamil University.
The Multilingual Systems Group has released POS-tagged data in two Indian languages, Hindi and Bangla, for the research community. To learn more about IL-POST, and how to obtain the data and related tools, please click here.
Linguistic Corpora Collection-WikiBABEL: Language-independent methodologies in Computational Linguistics and NLP pave the way for quick adaptation of language technologies across languages. Hence, such approaches might prove to be of great advantage to resource-poor languages, as a generic system may be adopted for many languages, quickly and transparently. While the linguistic data required for training such systems is usually hand-created by linguists, in several cases such data may instead be created by those who are fluent in a language. WikiBABEL is a project in the MLS group that provides a generic platform for collecting parallel linguistic data from the Internet population. The objective of this project is to induce the Internet population to contribute parallel data which may be used for training the Statistical MT system being developed in MSR.
WikiBABEL is available as an alpha release at http://translator/WikiBABEL.
Complex Networks and Language: In recent times, networks have been used extensively for modeling complex systems. We apply complex network theory to understand the structure and evolution of human languages. We construct word networks based on the distributional hypothesis and study their universal properties. Clustering of these networks further helps in understanding the natural morphosyntactic classes present in a language, which in turn facilitates the design of POS tagsets and the creation of POS taggers and tag dictionaries in an unsupervised manner. We are also studying the self-organization of consonant inventories within the framework of complex networks. Since linguistic systems as well as many other natural systems can be modeled as bipartite networks, we are investigating the theory behind the structure and growth of certain special kinds of bipartite networks.
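As a rough illustration of a word network built on the distributional hypothesis, the sketch below links two words when the contexts they occur in are similar. The toy corpus, the one-word context window, the cosine similarity measure, and the 0.5 threshold are all assumptions made for this example; the construction actually used in the project may differ.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy corpus of tokenized sentences.
CORPUS = [
    ["the", "cat", "sleeps"],
    ["the", "dog", "sleeps"],
    ["a", "cat", "runs"],
    ["a", "dog", "runs"],
]

def context_vectors(sentences, window=1):
    """Count, for each word, the (relative position, neighbour) pairs around it."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            start, end = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][(j - i, sent[j])] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_network(sentences, threshold=0.5, window=1):
    """Return weighted edges between words whose context similarity meets the threshold."""
    vectors = context_vectors(sentences, window)
    edges = {}
    for u, v in combinations(sorted(vectors), 2):
        similarity = cosine(vectors[u], vectors[v])
        if similarity >= threshold:
            edges[(u, v)] = similarity
    return edges

edges = build_network(CORPUS)
print(sorted(edges))
```

On this toy corpus the connected pairs fall into determiner-like, noun-like, and verb-like groups, which is the kind of morphosyntactic grouping that clustering a much larger network is meant to surface.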
Indic Machine Translation: This project explores the issues in automatic translation of text between English and Indian Languages, using statistical machine translation technologies. MLS pursues three different projects in Indic Machine Translation. The first is on core MT and involves installing the Tree2String translation system developed by MSR-Redmond (Quirk et al. 2005) at MSRI and training it for Hindi and other languages of interest to us. In the other two projects, we focus on specific issues in MT, i) morphosyntactic agreement and ii) Word Sense Disambiguation (WSD), to develop appropriate models for them. These models will then be integrated with the MT system for better performance.
publications
- Rishiraj Saha Roy, Niloy Ganguly, Monojit Choudhury, and Srivatsan Laxman, An IR-based Evaluation Framework for Web Search Query Segmentation, in SIGIR 2012, ACM, August 2012
- K Saravanan, Monojit Choudhury, Raghavendra Udupa, and A Kumaran, An Empirical Study of the Occurrence and Co-Occurrence of Named Entities in Natural Language Corpora, in Proceedings of LREC 2012, European Language Resources Association, 27 May 2012
- A Kumaran, Naren Datha, Ashwani Sharma, and Vikram Dendi, WikiBhasha: Our Experiences with Multilingual Content Creation Tool for Wikipedia, in Proceedings of Wikipedia Conference India, Wikimedia Foundation, November 2011
- Min Zhang, Haizhou Li, A Kumaran, and Ming Liu, Report of NEWS 2011 Machine Transliteration Shared Task, in Proceedings of the IJCNLP 2011 Named Entities WorkShop (NEWS-2011), Chiang Mai, Thailand, ACL/SIGPARSE, November 2011
- Dipak L. Chaudhari, Om P. Damani, and Srivatsan Laxman, Lexical Co-occurrence, Statistical Significance, and Word Association, in Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), Edinburgh, UK, ACL/SIGPARSE, July 2011
- Nitin Dua, Kanika Gupta, Monojit Choudhury, and Kalika Bali, Query Completion without Query Logs for Song Search, in Companion of World Wide Web Conference, WWW 2011, March 2011
- Nikita Mishra, Rishiraj Saha Roy, Niloy Ganguly, Srivatsan Laxman, and Monojit Choudhury, Unsupervised query segmentation using only query logs, in Proceedings of the Twentieth International World Wide Web Conference (WWW 2011), Companion Volume, Hyderabad, Mar 28-Apr 1, ACM, 2011
- A Kumaran, Mitesh Khapra, and Pushpak Bhattacharyya, Compositional Machine Transliteration, in ACM Transactions on Asian Language Information Processing (TALIP) Journal, Association for Computing Machinery, Inc., January 2011
- A Kumaran, Naren Datha, B Ashok, K Saravanan, Anil Ande, Ashwani Sharma, Sridhar Vedantham, Vidya Natampally, Vikram Dendi, and Sandor Maurice, WikiBABEL: A System for Multilingual Wikipedia Content, in Proceedings of the 'Collaborative Translation: technology, crowdsourcing, and the translator perspective' Workshop (co-located with AMTA 2010 Conference), Denver, Colorado, Association for Machine Translation in the Americas, 31 October 2010
- Haizhou Li, A Kumaran, Vladimir Pervouchine, and Min Zhang, Report of NEWS 2010 Machine Transliteration Shared Task, in the ACL 2010 Named Entities WorkShop (NEWS-2010), Uppsala, Sweden, Association for Computational Linguistics, July 2010
- A Kumaran, Mitesh Khapra, and Haizhou Li, Report of NEWS 2010 Transliteration Mining Shared Task, in the ACL 2010 Named Entities WorkShop (NEWS-2010), Uppsala, Sweden, Association for Computational Linguistics, July 2010
- Mitesh Khapra, A Kumaran, and Pushpak Bhattacharyya, Everybody loves a rich cousin: An empirical study of Transliteration through Bridge Languages, in the 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-2010), Los Angeles, USA, Association for Computational Linguistics, June 2010
- K Saravanan, Raghavendra Udupa, and A Kumaran, Crosslingual Information Retrieval System Enhanced with Transliteration Generation and Mining, in the Forum for Information Retrieval Evaluation (FIRE-2010) Workshop, Kolkata, India, February 2010
- Haizhou Li, A Kumaran, Vladimir Pervouchine, and Min Zhang, Report of NEWS 2009 Machine Transliteration Shared Task, in the ACL/IJCNLP-2009 Named Entities WorkShop (NEWS-2009), Singapore, Singapore, Association for Computational Linguistics, August 2009
- A Kumaran, Naren Datha, K Saravanan, Vikram Dendi, and Sandor Maurice, WikiBABEL: A Wiki-style Platform for Creation of Parallel Data, in the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL/IJCNLP-2009), Singapore, Singapore, Association for Computational Linguistics, August 2009
- Raghavendra Udupa, K Saravanan, A Kumaran, and Jagadeesh Jagarlamudi, MINT: A Method for Effective and Scalable Mining of Named Entity Transliterations from Large Comparable Corpora, in 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2009), Athens, Greece, Association for Computational Linguistics, March 2009
- A Kumaran, Ranbeer Makin, Vijay Pattisapu, Shaik Sharif, and Lucy Vanderwende, Evaluating the Quality of Automatically Extracted Synonymy Information, in Journal for Language Technology and Computational Linguistics (JLDV), December 2008
- Kalika Bali, Sankaran Baskaran, and A Kumaran, Dependency Treelet-based Phrasal SMT: Evaluation and Issues in English-Hindi Language Pair, in the 6th International Conference on Natural Language Processing (ICON-2008), Pune, India, December 2008
- Raghavendra Udupa, K Saravanan, A Kumaran, and Jagadeesh Jagarlamudi, Mining Named Entity Transliteration Equivalents from Comparable Corpora, in the 17th ACM conference on Information and knowledge management (CIKM 2008), Napa Valley, USA, Association for Computing Machinery, Inc., October 2008
- A Kumaran, K Saravanan, and Sandor Maurice, WikiBABEL: Community Creation of Multilingual Data, in the WikiSYM 2008 Conference, Porto, Portugal, Association for Computing Machinery, Inc., September 2008
- Tanuja Joshi, Joseph Joy, Tobias Kellner, Udayan Khurana, A Kumaran, and Vibhuti Sengar, Crosslingual Location Search, in the 31st annual international ACM SIGIR conference on Research and Development in Information Retrieval (SIGIR 2008), Singapore, Singapore, Association for Computing Machinery, Inc., July 2008
- Abhishek Sharma, Ranjita Bhagwan, Monojit Choudhury, Leana Golubchik, Ramesh Govindan, and Geoffrey M. Voelker, Automatic Request Characterization in Internet Services, in Proceedings of the 1st HotMetrics Workshop, Association for Computing Machinery, Inc., June 2008
- K Saravanan and A Kumaran, Some Experiments in Mining Named Entity Transliteration Pairs from Comparable Corpora, in the 2nd International Workshop on Crosslingual Information Access, Hyderabad, India, January 2008
- Jagadeesh Jagarlamudi and A Kumaran, Crosslingual Information Retrieval System for Indian Languages, in the 8th Workshop of the Cross-Language Evaluation Forum (CLEF 2007), Budapest, Hungary, Springer Verlag, September 2007
- A Kumaran and Tobias Kellner, Babel: A Machine Transliteration Workbench, in the 30th annual international ACM SIGIR conference on Research and Development in Information Retrieval (SIGIR 2007), Amsterdam, Netherlands, Association for Computing Machinery, Inc., July 2007
careers
The Multilingual Systems Group at MSR India strives to do world-class research that develops a truly natural-language-neutral approach to all aspects of language-based computing, and enables adding Indic-language functionality to Microsoft products.
Researchers work independently or with a team to conduct high-quality academic research in their field. They work to enhance their presence in their field of research outside of Microsoft through paper publication, conference attendance, and other interaction with the international academic community. As members of the world-wide MSR family of researchers, they collaborate with researchers in all our labs and with universities around the world. In addition, in the Multilingual Systems Group, they need to interface with other public and private research institutions and government agencies for coordination and standardization activities.







