The Multilingual Systems Group explores software technologies to enable seamless content-creation, storage, search, access, and interaction with multiple languages.
At the Multilingual Systems (MLS) group in Microsoft Research India, we believe that Multilingual Information Access is critical for the acquisition, dissemination, exchange, and understanding of knowledge in the global information society. The accelerated growth in the size, content, and reach of the Internet, the diversity of user demographics, and the skew in the availability of information across languages all point to the increasingly critical need for Multilingual Information Access. We are an interdisciplinary research group focusing on technologies for Multilingual Information Access, such as Cross-language Information Retrieval, Multilingual Information Extraction, and Machine Translation, that transparently bridge the gap between available information and users' needs across languages.
To this end, we carry out cutting-edge research on a) several aspects of Cross-language Information Retrieval and Multilingual Information Extraction, including query expansion, domain adaptation, automatic alignment of multilingual corpora, multilingual named entity extraction, and machine transliteration; b) automated and collaborative creation of parallel corpora for Machine Translation; and c) fundamental properties of languages and language phenomena, including language acquisition and evolution, structural properties of corpora in the framework of complex networks, and the interaction between syntax and prosody.
In addition, we are interested in robust fundamentals, especially annotation standards, data collection efforts, and basic tools for research in Indian languages. More importantly, we would like to enable and be a part of a strong research ecosystem in Multilingual Information Access in India.
CLIR-Cross Lingual Information Retrieval: As the heterogeneity of the data available on the web increases, it is becoming increasingly important to enable users to access relevant information available across languages. Cross Lingual Information Retrieval research focuses on data-driven approaches to help users easily organize and access information when needed. Parallel data, even though very useful, is expensive to obtain and hence is not available in sufficient amounts. On the other hand, comparable corpora, that is, documents that discuss the same topic in different languages, are available in abundance. Here we try to exploit both these resources to bridge the language barrier for cross-lingual applications.
Currently we are experimenting with a Hindi-to-English CLIR task, where the query is expressed in Hindi and the relevant documents need to be retrieved from an English document collection. The Hindi query is translated into English word by word with the help of statistically learned word alignments. The relevant documents are then retrieved using a Language Modelling based retrieval algorithm. On the CLEF 2007 data set we found that the cross-lingual system was able to achieve 73.4% of the monolingual IR system's performance.
Along with this, we are also working on domain adaptation, cross-lingual text modelling, automatic alignment of comparable corpora, and machine transliteration.
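The Hindi-to-English pipeline described above can be illustrated with a minimal sketch. It is not the group's implementation: the translation table, the romanized Hindi query terms, the toy documents, and the choice of Dirichlet-smoothed query likelihood as the language-model scorer are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical translation table (romanized Hindi -> English candidates with
# probabilities). A real table would come from statistical word alignment.
TRANSLATION_TABLE = {
    "paani": {"water": 0.8, "rain": 0.2},
    "pradushan": {"pollution": 0.9, "contamination": 0.1},
}


def translate_query(hindi_terms, table, top_k=2):
    """Word-by-word translation: keep the top_k English candidates per term."""
    weighted = {}
    for term in hindi_terms:
        candidates = sorted(table.get(term, {}).items(), key=lambda kv: -kv[1])
        for english, prob in candidates[:top_k]:
            weighted[english] = weighted.get(english, 0.0) + prob
    return weighted


def score(query_weights, doc_tokens, collection_counts, collection_len, mu=10.0):
    """Weighted query log-likelihood with Dirichlet smoothing (an assumed scorer)."""
    doc_counts = Counter(doc_tokens)
    total = 0.0
    for word, weight in query_weights.items():
        p_collection = collection_counts.get(word, 0) / collection_len
        if p_collection == 0.0:
            continue  # word never occurs in the collection, so it cannot rank documents
        p_word = (doc_counts[word] + mu * p_collection) / (len(doc_tokens) + mu)
        total += weight * math.log(p_word)
    return total


def retrieve(hindi_terms, docs, table):
    """Translate the query, then rank English documents by descending score."""
    collection_counts = Counter(tok for tokens in docs.values() for tok in tokens)
    collection_len = sum(collection_counts.values())
    query_weights = translate_query(hindi_terms, table)
    return sorted(
        docs,
        key=lambda doc_id: score(query_weights, docs[doc_id], collection_counts, collection_len),
        reverse=True,
    )


# Toy English collection (hypothetical).
docs = {
    "d1": "water pollution harms rivers".split(),
    "d2": "cricket match result today".split(),
    "d3": "rain water harvest".split(),
}
ranking = retrieve(["paani", "pradushan"], docs, TRANSLATION_TABLE)
print(ranking)
```

Keeping several weighted candidates per query term, rather than a single best translation, is one simple way to let the retrieval step absorb translation ambiguity.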
Mining Linguistic Data: In this project area, we attempt to mine the Internet for parallel data. Our mining techniques are statistical in nature, in contrast to the ad-hoc heuristics typically used in mining Internet data. Specifically, we mine multilingual news-wire articles on the same time-scale to yield specific types of parallel data, namely Parallel Sentences, Sub-sentential Fragments, and Named Entity Transliterations. Our experiments in many language pairs from different language families (for example, English-Spanish, English-Russian, English-Hindi, and English-Kannada) indicated great potential for such approaches, in addition to highlighting the applicability of our methods to many of the world's languages.
IL-POST-Indian Language Parts of Speech Tagset: Parts-of-Speech (POS) tagging is an important process for most Natural Language Processing (NLP) tasks. POS annotations capture the morphosyntactic features of the words from the given context in a text and hence can provide useful information for subsequent stages of processing such as chunking, named entity detection, and parsing.
As a part of this project we have designed IL-POST, a universal Parts-of-Speech tagset framework covering most of the Indian languages (ILs) and following a hierarchical and decomposable tagset schema. In spite of their significant number of speakers, most ILs have no workable POS tagset and tagger, even though these are fundamental building blocks for NLP research. Existing IL POS tagsets are often designed for a specific language; the few that have been designed for multiple languages cover only shallow linguistic features, ignoring linguistic richness and idiosyncrasies. The framework proposed here addresses the question of fine granularity of the tags, balanced with user- or language-specific needs, in an efficient and principled manner. We follow a hierarchical schema that makes the framework flexible enough to capture the rich features of a language or language family, while still capturing the shared linguistic structures in a methodical way. The proposed common framework further facilitates the sharing and reusability of scarce resources in these languages and ensures cross-linguistic compatibility.
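To make the idea of a hierarchical and decomposable tagset concrete, here is a small sketch. The category, type, and attribute names and the dotted string format are hypothetical and are not taken from the IL-POST specification; the sketch only shows how a single fine-grained tag can be read at coarser levels or restricted to the attributes that a particular language or application needs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    """A tag with three hypothetical levels: category, type, attributes."""
    category: str
    subtype: str
    attributes: tuple = ()  # ordered (name, value) pairs

    def render(self, level="full"):
        """Render the tag at the requested granularity."""
        if level == "category":
            return self.category
        base = f"{self.category}.{self.subtype}"
        if level == "type" or not self.attributes:
            return base
        return base + "." + ".".join(f"{k}={v}" for k, v in self.attributes)


def project(tag, allowed_attributes):
    """Keep only the attributes that a given language or application uses."""
    kept = tuple((k, v) for k, v in tag.attributes if k in allowed_attributes)
    return Tag(tag.category, tag.subtype, kept)


# Illustrative tag for a masculine singular common noun in direct case.
noun = Tag("N", "common", (("gender", "masc"), ("number", "sg"), ("case", "direct")))

coarse = noun.render("category")             # category level only
medium = noun.render("type")                 # category plus type
fine = noun.render()                         # all attributes
reduced = project(noun, {"number"}).render() # only the attributes one user needs
print(coarse, medium, fine, reduced)
```

Because the coarser renderings are derived from the same annotation, data tagged at full granularity could still be compared across languages at the category or type level.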
This project is a collaborative effort between Linguists, Computational Linguists, and Computer Scientists from MSRI, AU-KBC, Delhi University, IIT Bombay, Jawaharlal Nehru University, and Tamil University.
The Multilingual Systems Group has released POS-tagged data in two Indian languages, namely Hindi and Bangla, for the research community. To learn more about IL-POST, and how to obtain the data and related tools, please click here.
Linguistic Corpora Collection-WikiBABEL: Language-independent methodologies in Computational Linguistics and NLP pave the way for quick adaptation of language technologies across languages. Hence, such approaches might prove to be of great advantage to resource-poor languages, as a generic system may be adopted for many languages, quickly and transparently. While the linguistic data required for training such systems is typically hand-created by linguists, in several cases such data may be created by anyone who is fluent in a language. WikiBABEL is a project in the MLS group that provides a generic platform for collecting parallel linguistic data from the Internet population. The objective of this project is to induce the Internet population to contribute parallel data that may be used for training the Statistical MT system being developed in MSR.
WikiBABEL is available as an Alpha release at http://translator/WikiBABEL.
Complex Networks and Language: In recent times, networks have been used extensively for modeling complex systems. We apply complex network theory to understand the structure and evolution of human languages. We construct word networks based on the distributional hypothesis and study their universal properties. Clustering of these networks further helps in understanding the natural morphosyntactic classes present in a language, which in turn facilitates the design of POS tagsets and the creation of POS taggers and tag dictionaries in an unsupervised manner. We are also studying the self-organization of consonant inventories within the framework of complex networks. Since linguistic systems, as well as many other natural systems, can be modeled as bipartite networks, we are investigating the theory behind the structure and growth of certain special kinds of bipartite networks.
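As a rough illustration of a word network built on the distributional hypothesis, the sketch below links two words when their position-sensitive neighbour counts are similar and then reads off connected components as candidate word classes. The toy corpus, the window size, the cosine threshold, and the use of connected components in place of a proper clustering algorithm are all simplifying assumptions, not the group's method.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=1):
    """Count, for every word, its neighbours together with their relative position."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    vectors[word][(j - i, tokens[j])] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_word_network(vectors, threshold=0.5):
    """Connect two words when their context vectors are similar enough."""
    words = sorted(vectors)
    edges = defaultdict(set)
    for i, u in enumerate(words):
        for v in words[i + 1:]:
            if cosine(vectors[u], vectors[v]) >= threshold:
                edges[u].add(v)
                edges[v].add(u)
    return words, edges


def connected_components(words, edges):
    """Depth-first search over the network; each component is a candidate class."""
    seen, components = set(), []
    for start in words:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(edges[node] - seen)
        components.append(component)
    return components


# Toy corpus (hypothetical): determiners, nouns, and verbs share contexts.
corpus = [s.split() for s in ["the cat sleeps", "the dog sleeps", "a cat runs", "a dog runs"]]
words, edges = build_word_network(context_vectors(corpus))
classes = connected_components(words, edges)
print(classes)
```

On this toy corpus, words that occur in the same slots end up in the same component, which mirrors the idea that clusters of the network can suggest morphosyntactic classes without supervision.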
Indic Machine Translation: This project explores the issues in automatic translation of text between English and Indian languages, using statistical machine translation technologies. MLS focuses on three different projects in Indic Machine Translation. The first is on core MT and involves installing the Tree2String translation system developed by MSR-Redmond (Quirk et al. 2005) at MSRI and training it for Hindi and other languages of interest. In the other two projects, we focus on specific issues in MT, i) morphosyntactic agreement and ii) Word Sense Disambiguation (WSD), to develop appropriate models for them. These models will then be integrated with the MT system for better performance.
- A Kumaran, Melissa Dunsmore, and Shaishav Kumar, Online Gaming for Crowdsourcing Phrase-equivalents, in the 25th International Conference on Computational Linguistics, ACL (Association for Computational Linguistics), August 2014
- Rishiraj Saha Roy, Rahul Katare, Niloy Ganguly, and Monojit Choudhury, Automatic Discovery of Adposition Typology, in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, Coling 2014, August 2014
- Rohan Ramanath, Monojit Choudhury, Kalika Bali, and Rishiraj Saha Roy, Crowd Prefers the Middle Path: A New IAA Metric for Crowdsourcing Reveals Turker Biases in Query Segmentation, in Proceedings of ACL, Association for Computational Linguistics, July 2013
- Rishiraj Saha Roy, Niloy Ganguly, Monojit Choudhury, and Srivatsan Laxman, An IR-based Evaluation Framework for Web Search Query Segmentation, in SIGIR 2012, ACM, August 2012
- A Kumaran, Sujay Kumar Jauhar, and Sumit Basu, Doodling: A Gaming Paradigm for Generating Language Data, in Proceedings of the Human Computation Workshop 2012, American Association for Artificial Intelligence, July 2012
- Min Zhang, Haizhou Li, A Kumaran, and Ming Liu, Report of NEWS 2012 Machine Transliteration Shared Task, in proceedings of the ACL 2012 Named Entities WorkShop (NEWS), Jeju Island, South Korea, Association for Computational Linguistics, June 2012
- K Saravanan, Monojit Choudhury, Raghavendra Udupa, and A Kumaran, An Empirical Study of the Occurrence and Co-Occurrence of Named Entities in Natural Language Corpora, in Proceedings of LREC 2012, European Language Resources Association, May 2012
- Jagadeesh Jagarlamudi, Hal Daume, and Raghavendra Udupa, Incorporating Lexical Priors into Topic Models, in EACL 2012, ACL/SIGPARSE, 2012
- Santosh Vysyaraju and Raghavendra Udupa, Extracting Advertising Keywords from URL Strings, in WWW 2012, ACM, 2012
- A Kumaran, Naren Datha, Ashwani Sharma, and Vikram Dendi, WikiBhasha: Our Experiences with Multilingual Content Creation Tool for Wikipedia, in Proceedings of Wikipedia Conference India, Wikimedia Foundation, November 2011
- Min Zhang, Haizhou Li, A Kumaran, and Ming Liu, Report of NEWS 2011 Machine Transliteration Shared Task, in Proceedings of the IJCNLP 2011 Named Entities WorkShop (NEWS-2011), Chiang Mai, Thailand, ACL/SIGPARSE, November 2011
- Shaishav Kumar and Raghavendra Udupa, Learning Hash Functions for Cross-View Similarity Search, in IJCAI-11, IJCAI, 20 July 2011
- Dipak L. Chaudhari, Om P. Damani, and Srivatsan Laxman, Lexical Co-occurrence, Statistical Significance, and Word Association, in Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), Edinburgh, UK, ACL/SIGPARSE, July 2011
- Jagadeesh Jagarlamudi, Raghavendra Udupa, and Hal Daumé III, Generalization of CCA via Spectral Embedding, in The Learning Workshop along with AISTATS 2011, AISTATS, July 2011
- Jagadeesh Jagarlamudi, Raghavendra Udupa, Hal Daumé III, and Abhijit Bhole, Improving Bilingual Projections via Sparse Covariance Matrices, in EMNLP 2011, Association for Computational Linguistics, July 2011
- Jagadeesh Jagarlamudi, Hal Daumé III, and Raghavendra Udupa, From Bilingual Dictionaries to Interlingual Document Representations, in ACL HLT 2011, Association for Computational Linguistics, June 2011
- Nitin Dua, Kanika Gupta, Monojit Choudhury, and Kalika Bali, Query Completion without Query Logs for Song Search, in Companion of World Wide Web Conference, WWW 2011, March 2011
- A Kumaran, Mitesh Khapra, and Pushpak Bhattacharyya, Compositional Machine Transliteration, in ACM Transactions on Asian Language Information Processing (TALIP) Journal, Association for Computing Machinery, Inc., January 2011
- Nikita Mishra, Rishiraj Saha Roy, Niloy Ganguly, Srivatsan Laxman, and Monojit Choudhury, Unsupervised query segmentation using only query logs, in Proceedings of the Twentieth International World Wide Web Conference (WWW 2011), Companion Volume, Hyderabad, Mar 28-Apr 1, ACM, 2011
- Abhijit Bhole, Goutham Tholpadi, and Raghavendra Udupa, Mining Multi-word NEs from comparable corpora, in NEWS 2011, ACL/SIGPARSE, 2011
- A Kumaran, Naren Datha, B Ashok, K Saravanan, Anil Ande, Ashwani Sharma, Sridhar Vedantham, Vidya Natampally, Vikram Dendi, and Sandor Maurice, WikiBABEL: A System for Multilingual Wikipedia Content, in Proceedings of the 'Collaborative Translation: technology, crowdsourcing, and the translator perspective' Workshop (co-located with AMTA 2010 Conference), Denver, Colorado, Association for Machine Translation in the Americas, 31 October 2010
- Shaishav Kumar and Raghavendra Udupa, Multilingual People Search, in SIGIR 2010, July 2010
- Haizhou Li, A Kumaran, Vladimir Pervouchine, and Min Zhang, Report of NEWS 2010 Machine Transliteration Shared Task, in the ACL 2010 Named Entities WorkShop (NEWS-2010), Uppsala, Sweden, Association for Computational Linguistics, July 2010
- A Kumaran, Mitesh Khapra, and Haizhou Li, Report of NEWS 2010 Transliteration Mining Shared Task, in the ACL 2010 Named Entities WorkShop (NEWS-2010), Uppsala, Sweden, Association for Computational Linguistics, July 2010
- Santosh Vysyaraju, Shaishav Kumar, and Raghavendra Udupa, Suggesting Related Topics in Web Search, in SIGIR 2010, July 2010
The Multilingual Systems Group at MSR India strives to do world-class research that develops a truly natural-language-neutral approach in all aspects of language-based computing, and enables adding Indic-language functionalities in Microsoft products.
Researchers work independently or with a team to conduct high-quality academic research in their field. They work to enhance their presence in their field of research outside of Microsoft through paper publication, conference attendance, and other interaction with the international academic community. As members of the worldwide MSR family of researchers, they collaborate with researchers in all our labs, and with universities around the world. In addition, in the Multilingual Systems Group, they need to interface with other public and private research institutions and government agencies for coordination and standardization activities.