Arul Menezes

Arul Menezes heads the Machine Translation team at Microsoft Research. Over the past 15 years, he has driven Machine Translation at Microsoft Research from a basic research project to a web-scale production service with millions of users worldwide and a variety of offerings for consumers and businesses, including Bing Translator, the Microsoft Translator Hub customization service, and the upcoming Skype Translator product.

Machine Translation Research

The Microsoft MT system has a rich history. The first version, initially developed by Arul between 1999 and 2000, was based on shallow semantic predicate-argument structures known as logical forms (LFs), which were produced by the NLPWIN parser. This LF-based MT system aligned source and target language at the LF level and learned LF-based tree transforms from a large corpus of parallel data derived from Microsoft software documentation. This system was eventually used to translate the entire Microsoft Knowledge Base (KB), consisting of 150K+ articles, into multiple languages (Spanish, French, German and Japanese), and the KB was published in these languages as raw, unedited MT. Much to the surprise of skeptics, this proved wildly popular, and user surveys indicated that the MT content was just as useful for solving user problems as human translations.

Our experience with the LF system taught us a number of lessons, the most important of which were: (1) it proved very difficult to incorporate a strong target language model directly in the search process when searching over semantic structures, and (2) the pipelined structure of [source parse to source LF to transfer to target LF to generate] allowed errors to compound.

These lessons led directly to our treelet translation approach, developed jointly with Chris Quirk and Colin Cherry. We simplified our representation from LFs to dependency trees. Because the trees retained all words, we could eliminate the generation module; the search could then operate directly over words, which in turn let us incorporate a target language model directly in the search. The first version of this system, published in 2004, searched over all possible tree rotations, scored by a discriminative reordering model. This was the first syntactic MT system to show BLEU gains (2 BLEU points!) over phrasal SMT.

In 2007 we replaced the discriminative reordering model with an order template model estimated directly from the data, which sped up decoding dramatically and improved generality. The unlexicalized templates generalized well and were unified at runtime with the lexicalized treelets. In 2008 we extended the templates to handle insertion and deletion of function words, in order to handle syntactic phenomena that are realized lexically in one language but not the other.

Machine Translation Product Development

In 2007 we launched the Windows Live Translator (now Bing Translator), a live web service providing free translations on the web, and my personal focus shifted away from basic research and towards delivering our research innovations to users. Today the Bing Translator and Microsoft Translator family of products is used by millions of users and businesses and provides billions of translations daily. The system is still based on the treelet MT approach described above, though there have been many improvements and innovations over the years.

The Microsoft Translator product team integrates research and product development in a single team, covering everything from MT modeling and algorithms to data gathering and delivery of the live web service. This eliminates the traditional “tech transfer” from research to product, and enables the team to get research breakthroughs into customer hands without delay.

A big focus of the team today is the Skype Translator project. In this project we set ourselves an ambitious goal – to enable successful open-domain conversations between Skype users in different parts of the world, speaking different languages. As one might imagine, putting together two error-prone technologies such as speech recognition and machine translation raises some unique challenges.

NLP Interests

I have a strong interest in using syntactic and semantic information to address problems in machine understanding of language. Prior to my work in Machine Translation, I worked on Mindnet, a knowledge graph built automatically from text data (including dictionary definitions and example sentences, Encarta articles, etc.), and on the application of Mindnet to question answering and information retrieval tasks. I have also worked on using semantic representations for summarization and recognizing textual entailment.

Other Interests

Prior to joining MSR, I worked for 8 years on several Microsoft products, including Windows CE, Windows 95, MSN, Microsoft Site Server, Microsoft Commercial Internet Server, Windows 3.11 and the Microsoft At Work Fax project. I was educated at the Indian Institute of Technology at Bombay and at Stanford University, where my focus was on programming languages and parallel processing. During this period I developed an efficient Prolog interpreter and co-developed a set of parallel processing extensions to C++ known as COOL. I retain a keen interest in parallel programming and large-scale distributed systems, which are, not surprisingly, very useful in training MT systems at web scale. Early versions of our MT system were trained on a home-brew distributed processing engine I co-developed, though today we use off-the-shelf infrastructure such as Cosmos, MPI, Dryad-Linq and Hadoop.




    • Xiaodong He, Jianfeng Gao, Chris Quirk, Patrick Nguyen, Arul Menezes, Robert Moore, Kristina Toutanova, Mei Yang, Bill Dolan, Mu Li, Chi-Ho Li, Dongdong Zhang, Long Jiang, and Ming Zhou, The MSR-MSRA MT System for NIST Open Machine Translation 2008 Evaluation, in The 2008 NIST Open Machine Translation Evaluation Workshop, 2008
    • Xiaodong He, Jianfeng Gao, Chris Quirk, Patrick Nguyen, Arul Menezes, Robert Moore, Kristina Toutanova, Mei Yang, Bill Dolan, Mu Li, Chi-Ho Li, Dongdong Zhang, Long Jiang, Ming Zhou, George Foster, Roland Kuhn, Jing Zheng, Wen Wang, Necip Fazil Ayan, Dimitra Vergyri, Nicolas Scheffer, and Andreas Stolcke, The MSR-NRC-SRI MT System for NIST Open Machine Translation 2008 Evaluation, in The 2008 NIST Open Machine Translation Evaluation Workshop, 2008