Multilingual Systems

The Multilingual Systems Group explores software technologies to enable seamless content-creation, storage, search, access, and interaction with multiple languages.

















Research Overview

At the Multilingual Systems (MLS) group in Microsoft Research India, we believe that Multilingual Information Access is critical for the acquisition, dissemination, exchange, and understanding of knowledge in the global information society. The accelerated growth in the size, content, and reach of the Internet, the diversity of user demographics, and the skew in the availability of information across languages all point to the increasingly critical need for Multilingual Information Access. We are an interdisciplinary research group focusing on technologies for Multilingual Information Access, such as Cross-language Information Retrieval, Multilingual Information Extraction, and Machine Translation, that transparently bridge the gap between available information and user needs across languages.

To this end, we carry out cutting-edge research on (a) several aspects of Cross-language Information Retrieval and Multilingual Information Extraction, including query expansion, domain adaptation, automatic alignment of multilingual corpora, multilingual named entity extraction, and machine transliteration; (b) automated and collaborative creation of parallel corpora for Machine Translation; and (c) fundamental properties of languages and language phenomena, including language acquisition and evolution, structural properties of corpora in the framework of complex networks, and the interaction between syntax and prosody.

In addition, we are interested in robust fundamentals, especially annotation standards, data collection efforts, and basic tools for research in Indian languages. More importantly, we would like to enable and be a part of a strong research ecosystem in Multilingual Information Access in India.



CLIR-Cross Lingual Information Retrieval: As the heterogeneity of the data available on the web increases, it is becoming increasingly important to enable users to access relevant information available across languages. Cross Lingual Information Retrieval research focuses on data-driven approaches to help users easily organize and access information when needed. Parallel data, even though very useful, is expensive to obtain and hence is not available in sufficient quantity. On the other hand, comparable corpora, that is, documents discussing the same topic in different languages, are available in abundance. Here we try to exploit both of these resources to bridge the language barrier for cross-lingual applications.

Currently we are experimenting with a Hindi-to-English CLIR task, where the query is expressed in Hindi and the relevant documents need to be retrieved from an English document collection. The Hindi query is translated into English word by word with the help of statistically learned word alignments. The relevant documents are then retrieved using a Language Modelling based retrieval algorithm. On the CLEF 2007 data set, we found that the cross-lingual system was able to achieve 73.4% of the monolingual IR system's performance.
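The two steps described above, word-by-word query translation followed by language-model retrieval, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the toy translation table, the romanized Hindi terms, the function names, and the choice of query likelihood with Dirichlet smoothing as the language-model scorer. It is not the group's actual system.

```python
import math
from collections import Counter

# Hypothetical toy translation table: romanized Hindi word -> {English word: probability}.
# In a real system these probabilities would be learned from statistical word alignments.
TRANSLATION_TABLE = {
    "nadi": {"river": 0.8, "stream": 0.2},
    "pradushan": {"pollution": 0.9, "contamination": 0.1},
}


def translate_query(hindi_terms, table, top_k=1):
    """Translate word by word, keeping the top_k most probable English words per term."""
    english_terms = []
    for term in hindi_terms:
        candidates = table.get(term, {})  # unknown terms are simply dropped
        best = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        english_terms.extend(word for word, _ in best)
    return english_terms


def query_log_likelihood(query_terms, doc_tokens, coll_counts, coll_len, mu=2000.0):
    """Query likelihood with Dirichlet smoothing (an assumed scoring function)."""
    doc_counts = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        p_coll = coll_counts.get(term, 0) / coll_len
        p = (doc_counts.get(term, 0) + mu * p_coll) / (len(doc_tokens) + mu)
        if p == 0.0:
            return float("-inf")  # term unseen in the whole collection
        score += math.log(p)
    return score


def rank(hindi_terms, docs, table):
    """Translate the Hindi query, then rank English documents by query likelihood."""
    query = translate_query(hindi_terms, table)
    coll_counts = Counter(tok for toks in docs.values() for tok in toks)
    coll_len = sum(coll_counts.values())
    scored = [
        (doc_id, query_log_likelihood(query, toks, coll_counts, coll_len))
        for doc_id, toks in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = {
    "d1": ["river", "pollution", "report"],
    "d2": ["cricket", "match", "report"],
}
ranking = rank(["nadi", "pradushan"], docs, TRANSLATION_TABLE)
```

Keeping only the single most probable translation per term mirrors the word-by-word translation described above; raising `top_k` would turn the same sketch into a simple form of query expansion.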

Along with this, we are also working on domain adaptation, cross-lingual text modelling, automatic alignment of comparable corpora, and machine transliteration.


Mining Linguistic Data: In this project area, we attempt to mine the Internet for parallel data. Our mining techniques are statistical in nature, in contrast to the ad-hoc heuristics typically used in mining Internet data. Specifically, we mine multilingual news-wire articles on the same time-scale to yield specific types of parallel data, namely parallel sentences, sub-sentential fragments, and named entity transliterations. Our experiments in many language pairs from different language families (for example, English-Spanish, English-Russian, English-Hindi, and English-Kannada) indicated great potential for such approaches, in addition to highlighting the applicability of our methods to many of the world's languages.


IL-POST-Indian Language Parts of Speech Tagset: Parts-of-Speech (POS) tagging is an important process for most Natural Language Processing (NLP) tasks. POS annotations capture the morphosyntactic features of the words from the given context in a text and hence can provide useful information for subsequent stages of processing such as chunking, named entity detection, and parsing.
As a part of this project, we have designed IL-POST, a universal Parts-of-Speech tagset framework covering most of the Indian languages (ILs) and following a hierarchical and decomposable tagset schema. In spite of their significant number of speakers, most ILs have no workable POS tagset or tagger, even though these serve as fundamental building blocks for NLP research. Existing IL POS tagsets are often designed for a specific language; the few that have been designed for multiple languages cover only shallow linguistic features, ignoring linguistic richness and idiosyncrasies. The framework proposed here balances fine granularity of the tags against user-specific or language-specific needs in an efficient and principled manner. We follow a hierarchical schema that makes the framework flexible enough to capture the rich features of a language or language family, while still capturing the shared linguistic structures in a methodical way. The proposed common framework further facilitates the sharing and reuse of scarce resources in these languages and ensures cross-linguistic compatibility.
This project is a collaborative effort between linguists, computational linguists, and computer scientists from MSRI, AU-KBC, Delhi University, IIT Bombay, Jawaharlal Nehru University, and Tamil University.
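To make the idea of a hierarchical and decomposable tagset concrete, here is a minimal sketch of one possible representation: a tag has a category, an optional subtype, and attribute-value pairs, and it can be coarsened by dropping levels a given language or application does not need. The category names, attribute names, and label format below are hypothetical and are not taken from the actual IL-POST inventory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    """A decomposable POS tag: category > subtype > attribute-value pairs."""
    category: str
    subtype: str = ""
    attributes: tuple = ()  # sorted tuple of (name, value) pairs

    def label(self):
        """Render the tag as a dotted label, from coarsest to finest level."""
        parts = [self.category]
        if self.subtype:
            parts.append(self.subtype)
        parts.extend(f"{name}={value}" for name, value in self.attributes)
        return ".".join(parts)


def make_tag(category, subtype="", **attributes):
    """Build a tag with attributes stored in a canonical (sorted) order."""
    return Tag(category, subtype, tuple(sorted(attributes.items())))


def project(tag, keep_subtype=True, keep_attributes=()):
    """Coarsen a tag by dropping the subtype and/or attributes that are not needed."""
    attrs = tuple((n, v) for n, v in tag.attributes if n in keep_attributes)
    return Tag(tag.category, tag.subtype if keep_subtype else "", attrs)


# Hypothetical fine-grained tag for a feminine singular common noun.
fine = make_tag("N", "common", gender="fem", number="sg")
coarse = project(fine, keep_attributes=("number",))
coarsest = project(fine, keep_subtype=False)
```

Because every fine-grained tag projects deterministically onto a coarser one, data annotated for one language at a rich level of detail could still be compared with data from another language annotated only at the shared upper levels.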

The Multilingual Systems Group has released POS-tagged data in two Indian languages, Hindi and Bangla, for the research community. To learn more about IL-POST, and how to obtain the data and related tools, please click here.


Linguistic Corpora Collection-wikiBabel: Language-independent methodologies in Computational Linguistics and NLP pave the way for quick adaptation of language technologies across languages. Hence, such approaches might prove to be of great advantage to resource-poor languages, as a generic system may be adopted for many languages quickly and transparently. While the linguistic data required for training such systems are typically hand-created by linguists, in several cases such data may be created by anyone who is fluent in a language. wikiBABEL is a project in the MLS group that provides a generic platform for collecting parallel linguistic data from the Internet population. The objective of this project is to induce the Internet population to contribute parallel data, which may be used for training the Statistical MT system being developed in MSR.


WikiBABEL is available as an alpha release at http://translator/WikiBABEL.


Complex Networks and Language: In recent times, networks have been used extensively for modeling complex systems. We apply complex network theory to understand the structure and evolution of human languages. We construct word networks based on the distributional hypothesis and study their universal properties. Clustering these networks further helps in understanding the natural morphosyntactic classes present in a language, which in turn facilitates the design of POS tagsets and the creation of POS taggers and tag dictionaries in an unsupervised manner. We are also studying the self-organization of consonant inventories within the framework of complex networks. Since linguistic systems, as well as many other natural systems, can be modeled as bipartite networks, we are investigating the theory behind the structure and growth of certain special kinds of bipartite networks.
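As a rough illustration of a word network built on the distributional hypothesis, the sketch below links two words when their context vectors are similar. The window size, the use of position-tagged neighbors as context features, cosine similarity, and the edge threshold are all assumptions chosen for brevity, not the group's published construction.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=1):
    """Map each word to counts of (relative position, neighboring word) features."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][(j - i, sent[j])] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_word_network(sentences, threshold=0.5, window=1):
    """Return weighted edges {(u, v): similarity} between distributionally similar words."""
    vectors = context_vectors(sentences, window)
    words = sorted(vectors)
    edges = {}
    for i, u in enumerate(words):
        for v in words[i + 1:]:
            sim = cosine(vectors[u], vectors[v])
            if sim >= threshold:
                edges[(u, v)] = sim
    return edges


sentences = [
    ["the", "cat", "sleeps"],
    ["the", "dog", "sleeps"],
    ["a", "cat", "runs"],
    ["a", "dog", "runs"],
]
edges = build_word_network(sentences)
```

On this toy corpus the nouns, the determiners, and the verbs each end up connected only to words of their own kind, which is the intuition behind clustering such networks to recover morphosyntactic classes.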


Indic Machine Translation: This project explores the issues in automatic translation of text between English and Indian languages, using statistical machine translation technologies. MLS focuses on three different projects in Indic Machine Translation. The first one is on core MT and involves installing the Tree2String translation system developed by MSR-Redmond (Quirk et al. 2005) in MSRI and training it for Hindi and other languages of interest to us. In the other two projects, we focus on specific issues in MT, namely (i) morphosyntactic agreement and (ii) Word Sense Disambiguation (WSD), and develop appropriate models for them. These models will then be integrated with the MT system for better performance.




The Multilingual Systems Group at MSR India strives to do world-class research that develops a truly natural-language-neutral approach to all aspects of language-based computing and enables the addition of Indic-language functionality to Microsoft products.

Researchers work independently or with a team to conduct high-quality academic research in their field. They work to enhance their presence in their field of research outside of Microsoft through paper publication, conference attendance, and other interaction with the international academic community. As members of the worldwide MSR family of researchers, they collaborate with researchers in all our labs and with universities around the world. In addition, in the Multilingual Systems Group, they need to interface with other public and private research institutions and government agencies for coordination and standardization activities.