Arul Menezes heads the Machine Translation team at Microsoft Research. Over the past 15 years, he has driven Machine Translation at Microsoft Research from a basic research project to a web-scale production service with a variety of offerings for consumers and businesses, and millions of users worldwide, including Bing Translator and the Microsoft Translator Hub customization service, as well as the upcoming Skype Translator product.
Machine Translation Research
The Microsoft MT system has a rich history. The first version, initially developed by Arul between 1999 and 2000, was based on shallow semantic predicate-argument structures known as logical forms (LFs), which were produced by the NLPWIN parser. This LF-based MT system aligned source and target language at the LF level, and learned LF-based tree transforms from a large corpus of parallel data derived from Microsoft software documentation. This system was eventually used to translate the entire Microsoft knowledge base (KB), consisting of 150K+ articles, into multiple languages (Spanish, French, German and Japanese), and the KB was published in these languages as raw, unedited MT. Much to the surprise of skeptics, this proved wildly popular, and user surveys indicated that the MT content was just as useful for solving user problems as human translations.
Our experience with the LF system taught us a number of specific lessons, the most important of which were: (1) it proved very difficult to incorporate a strong target language model directly in the search process when searching over semantic structures; and (2) the pipelined structure of [source parse to source LF to transfer to target LF to generate] allowed errors to compound.
These lessons led directly to our treelet translation system, developed jointly with Chris Quirk and Colin Cherry. We simplified our representation from LF to dependency trees. Because the trees retained all words, we were able to eliminate the generation module. This let the search operate directly over words, which in turn let us incorporate a target language model directly in the search. The first version of this system, published in 2004, searched over all possible tree rotations, scored by a discriminative reordering model. This was the first syntactic MT system to show BLEU gains (2 BLEU points!) over phrasal SMT.
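The idea of searching over reorderings of a dependency tree can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the actual treelet system: the `Node` type, the `orderings` and `best_order` functions, and the bigram-counting scorer are all hypothetical, and the scorer is a simple stand-in for the discriminative reordering model and target language model mentioned above. The sketch enumerates every word order in which each subtree stays contiguous, then keeps the highest-scoring one.

```python
from dataclasses import dataclass, field
from itertools import permutations
from typing import Callable, Iterator, List, Tuple


@dataclass
class Node:
    """A dependency tree node: one word plus its dependents (hypothetical type)."""
    word: str
    children: List["Node"] = field(default_factory=list)


def _expand(perm, head_word: str) -> Iterator[Tuple[str, ...]]:
    """Concatenate, left to right, the orderings of each item in `perm`."""
    if not perm:
        yield ()
        return
    first, rest = perm[0], perm[1:]
    # None marks the head word itself; anything else is a child subtree.
    firsts = [(head_word,)] if first is None else orderings(first)
    for f in firsts:
        for r in _expand(rest, head_word):
            yield f + r


def orderings(node: Node) -> Iterator[Tuple[str, ...]]:
    """Yield every word order of the subtree in which subtrees stay contiguous."""
    if not node.children:
        yield (node.word,)
        return
    items = [None] + list(node.children)  # the head plus each dependent
    for perm in permutations(items):
        yield from _expand(perm, node.word)


def best_order(root: Node, score: Callable[[Tuple[str, ...]], float]) -> Tuple[str, ...]:
    """Exhaustively pick the highest-scoring order (factorial cost; toy sizes only)."""
    return max(orderings(root), key=score)


# Toy stand-in for a target language model: count known bigrams (assumed data).
GOOD_BIGRAMS = {("the", "red"), ("red", "car")}


def bigram_score(words: Tuple[str, ...]) -> int:
    return sum((a, b) in GOOD_BIGRAMS for a, b in zip(words, words[1:]))


# Target-side words attached to a head noun, in no particular order yet.
tree = Node("car", [Node("red"), Node("the")])
best = best_order(tree, bigram_score)
print(best)  # ('the', 'red', 'car')
```

Because the search runs directly over words, a word-level scorer such as `bigram_score` can be applied to each candidate order, which is the property the move from LFs to dependency trees made possible. Exhaustive enumeration grows factorially with the number of dependents, so a real decoder would need pruning or a more compact model.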
In 2007 we replaced the discriminative reordering model with an order template model estimated directly from the data, which sped up decoding dramatically and improved generality. The unlexicalized templates generalized well and were unified at runtime with the lexicalized treelets. In 2008 we extended the templates to handle insertion and deletion of function words, in order to handle syntactic phenomena that are realized lexically in one language but not the other.
Machine Translation Product Development
In 2007 we launched the Windows Live Translator (now Bing Translator), a live web service providing free translations on the web, and my personal focus shifted away from basic research and towards delivering our research innovations to users. Today the Bing Translator and Microsoft Translator family of products is used by millions of users and businesses and provides billions of translations daily. The system is still based on the treelet MT approach described above, though there have been many improvements and innovations over the years.
The Microsoft Translator product team integrates research and product development in a single team, covering everything from MT modeling and algorithms to data gathering and delivery of the live web service. This eliminates the traditional “tech transfer” from research to product, and enables the team to get research breakthroughs into customer hands without delay.
A big focus of the team today is the Skype Translator project. In this project we set ourselves an ambitious goal – to enable successful open-domain conversations between Skype users in different parts of the world, speaking different languages. As one might imagine, putting together two error-prone technologies such as speech recognition and machine translation raises some unique challenges.
I have a strong interest in using syntactic and semantic information to address problems in machine understanding of language. Prior to my work in Machine Translation, I worked on MindNet, a knowledge graph built automatically from text data, including dictionary definitions and example sentences, Encarta articles, etc., and on the application of MindNet to question answering and information retrieval tasks. I have also worked on using semantic representations for summarization and recognizing textual entailment.
Prior to joining MSR, I worked for 8 years on several Microsoft products, including Windows CE, Windows 95, MSN, Microsoft Site Server, Microsoft Commercial Internet Server, Windows 3.11 and the Microsoft At Work Fax project. I was educated at the Indian Institute of Technology at Bombay and at Stanford University, where my focus was on programming languages and parallel processing. During this period I developed an efficient Prolog interpreter and co-developed a set of parallel processing extensions to C++ known as COOL. I retain a keen interest in parallel programming and large-scale distributed systems, which are, not surprisingly, very useful in training MT systems at web scale. Early versions of our MT system were trained on a home-brew distributed processing engine I co-developed, though today we use off-the-shelf infrastructure such as Cosmos, MPI, DryadLINQ and Hadoop.
Publications
- Hany Hassan and Arul Menezes, Social Text Normalization using Contextual Graph Random Walks, in The 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), Association for Computational Linguistics, 4 August 2013
- Arul Menezes and Chris Quirk, Syntactic Models for Structural Word Insertion and Deletion during Translation, in Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Honolulu, Hawaii, October 2008
- Xiaodong He, Jianfeng Gao, Chris Quirk, Patrick Nguyen, Arul Menezes, Robert Moore, Kristina Toutanova, Mei Yang, Bill Dolan, Mu Li, Chi-Ho Li, Dongdong Zhang, Long Jiang, and Ming Zhou, The MSR-MSRA MT System for NIST Open Machine Translation 2008 Evaluation, in The 2008 NIST Open Machine Translation Evaluation Workshop, 2008
- Xiaodong He, Jianfeng Gao, Chris Quirk, Patrick Nguyen, Arul Menezes, Robert Moore, Kristina Toutanova, Mei Yang, Bill Dolan, Mu Li, Chi-Ho Li, Dongdong Zhang, Long Jiang, Ming Zhou, George Foster, Roland Kuhn, Jing Zheng, Wen Wang, Necip Fazil Ayan, Dimitra Vergyri, Nicolas Scheffer, and Andreas Stolcke, The MSR-NRC-SRI MT System for NIST Open Machine Translation 2008 Evaluation, in The 2008 NIST Open Machine Translation Evaluation Workshop, 2008
- Chris Quirk, Raghavendra Udupa, and Arul Menezes, Generative Models of Noisy Translations with Applications to Parallel Fragment Extraction, in Proceedings of MT Summit XI, European Association for Machine Translation, September 2007
- Arul Menezes and Chris Quirk, Using Dependency Order Templates to Improve Generality in Translation, in Proceedings of the Second Workshop on Statistical Machine Translation at ACL 2007, Association for Computational Linguistics, July 2007
- Chris Quirk and Arul Menezes, Do we need phrases? Challenging the conventional wisdom in Statistical Machine Translation, in Proceedings of HLT-NAACL 2006, ACL/SIGPARSE, May 2006
- Rion Snow, Lucy Vanderwende, and Arul Menezes, Effectively using syntax for recognizing false entailment, Association for Computational Linguistics, May 2006
- Chris Quirk and Arul Menezes, Dependency Treelet Translation: The convergence of statistical and example-based machine translation?, in Machine Translation, vol. 20, pp. 43–65, March 2006
- Xiaodong He, Arul Menezes, Chris Quirk, Anthony Aue, Simon Corston-Oliver, Jianfeng Gao, and Patrick Nguyen, Microsoft Research Treelet Translation System: NIST MT Evaluation 06, National Institute of Standards and Technology, March 2006
- Arul Menezes, Kristina Toutanova, and Chris Quirk, Microsoft research treelet translation system: NAACL 2006 Europarl evaluation, in WMT 2006, 2006
- Lucy Vanderwende, Arul Menezes, and Rion Snow, Microsoft Research at RTE-2: Syntactic Contributions in the Entailment Task: an implementation, in Proceedings of the Second PASCAL Recognising Textual Entailment Challenge Workshop, 2006
- Arul Menezes and Chris Quirk, Microsoft Research Treelet Translation System: IWSLT Evaluation, in Proceedings of the International Workshop on Spoken Language Translation, October 2005
- Lucy Vanderwende, Gary Kacmarcik, Hisami Suzuki, and Arul Menezes, MindNet: an automatically-created lexical resource, in HLT/EMNLP Interactive Demonstrations Proceedings, October 2005
- Chris Quirk, Arul Menezes, and Colin Cherry, Dependency Treelet Translation: Syntactically Informed Phrasal SMT, in Proceedings of ACL, Association for Computational Linguistics, June 2005
- 鈴木久美, Gary Kacmarcik, Lucy Vanderwende, and Arul Menezes, Mindnet/mnex: Tools for automatic construction and analysis of semantic relations database (意味関係データベースの自動構築と解析のためのツール), in 言語処理学会第11回全国大会論文集, March 2005
- Arul Menezes and Chris Quirk, Dependency treelet translation: the convergence of statistical and example-based machine-translation, in Proceedings of the 10th Machine Translation Summit Workshop on Example-Based Machine Translation, pp. 99–108, 2005
- Chris Quirk, Arul Menezes, and Colin Cherry, Dependency Tree Translation: Syntactically Informed Phrasal SMT, no. MSR-TR-2004-113, November 2004
- Anthony Aue, Arul Menezes, Robert Moore, Chris Quirk, and Eric Ringger, Statistical Machine Translation Using Labeled Semantic Dependency Graphs, ACL/SIGPARSE, October 2004
- Lucy Vanderwende, Michele Banko, and Arul Menezes, Event-centric summary generation, in Working notes of the Document Understanding Conference 2004, ACL, 2004
- Chris Brockett, Takako Aikawa, Anthony Aue, Arul Menezes, Chris Quirk, and Hisami Suzuki, English-Japanese Example-Based Machine Translation Using Abstract Semantic Representations, International Conference on Computational Linguistics, October 2002
- Arul Menezes, Better contextual translation using machine learning, Springer-Verlag, October 2002
- William B. Dolan, Jessie Pinkham, Stephen D. Richardson, and Arul Menezes, Achieving commercial-quality translation with example-based methods, European Association for Machine Translation, September 2001
- William Dolan, Stephen D. Richardson, Arul Menezes, and Monica Corston-Oliver, Overcoming the customization bottleneck using example-based MT, Association for Computational Linguistics, July 2001
- Arul Menezes and Stephen D. Richardson, A best-first alignment algorithm for automatic extraction of transfer mappings from bilingual corpora, Association for Computational Linguistics, January 2001
- William Dolan, Stephen D. Richardson, Arul Menezes, and Monica Corston-Oliver, Overcoming the customization bottleneck using example-based MT, Workshop on Data-Driven Methods in Machine Translation, 2001