Machine Translation

The principal focus of the Natural Language Processing group is to build a machine translation system that automatically learns translation mappings from bilingual corpora.


Overview

The Machine Translation (MT) project at Microsoft Research is focused on creating MT systems and technologies that cater to the multitude of translation scenarios today. Data-driven systems, in particular those with a statistical core engine, have proven to be the most efficient, because they adapt to a wide range of domains and can be trained on new language pairs within a matter of weeks. This team works closely with research and development partners worldwide, making the system accessible to a variety of products and services.

Research:

Machine Translation has been a major focus of the NLP group since 1999. Our approach to MT has always been “data-driven”. Rather than writing explicit rules to translate natural language, we train our algorithms on human-translated parallel texts, which allows them to automatically learn how to translate. Our first generation Logical Form based system learned translation patterns at the level of abstract parsed structures, and was used to translate the entire Microsoft support knowledge base into several languages. Our recent research has focused on Statistical Machine Translation (SMT).

Syntax-Based SMT. Translating content from English into as many foreign languages as possible is a high priority for Microsoft, not to mention the billions of people around the world who do not read English. The Treelet Translation System leverages an English natural language parser to help guide this process. This technology is currently used in several places across Microsoft, including the Live translation system for computer-related texts and the Microsoft Support site. Ongoing research has produced major improvements in the choice of word inflections and word ordering in this system.

Phrase-Based SMT. Many leading SMT systems do not use any linguistic resources, such as dictionaries, grammars, or parsers. These so-called “phrase-based” systems try to learn translations of arbitrary word sequences directly from parallel texts. By improving the methods used to prune the search for the best translation in this type of system, we have shown how to find better translations in less time than previous systems.
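To make the idea of pruning the search concrete, here is a minimal Python sketch of the two textbook pruning rules commonly applied to a decoder's stack of partial translations: threshold pruning (drop hypotheses scoring too far below the best) and histogram pruning (keep only the top few). This is a generic illustration, not the improved pruning method referred to above; the function name `prune_stack`, its parameters, and the example scores are all assumptions made for the sketch.

```python
import heapq

def prune_stack(hypotheses, beam_size, threshold):
    """Prune a stack of (log_score, partial_translation) hypotheses.

    Threshold pruning: discard anything scoring more than `threshold`
    below the best hypothesis. Histogram pruning: keep at most
    `beam_size` of the survivors, best first.
    """
    if not hypotheses:
        return []
    best = max(score for score, _ in hypotheses)
    survivors = [h for h in hypotheses if h[0] >= best - threshold]
    return heapq.nlargest(beam_size, survivors, key=lambda h: h[0])

# Hypothetical log-probability scores for four partial translations.
stack = [(-1.0, "the house"), (-1.5, "the home"),
         (-4.0, "house the"), (-2.0, "a house")]
kept = prune_stack(stack, beam_size=2, threshold=2.0)
print(kept)
```

A tighter beam makes decoding faster but raises the risk of discarding the hypothesis that would have led to the best full translation, which is the trade-off that pruning research tries to improve.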

Word Alignment. SMT systems learn translations from existing bodies of translated data. For most modern systems, identifying the word correspondences or word alignments in this translated data is a crucial step in training systems. Our group has produced pioneering work in both discriminative and generative approaches to word alignment, resulting in faster alignment algorithms with state-of-the-art quality.
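As a rough illustration of what a generative word alignment model does, the Python sketch below trains IBM Model 1, a classic textbook model, with expectation-maximization on a three-sentence toy corpus. It is not one of the alignment models developed by this group; the function names, the toy English-German data, and the omission of the NULL word are simplifying assumptions.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """Estimate t(target_word | source_word) with EM (no NULL word)."""
    target_vocab = {t for _, tgt in pairs for t in tgt}
    uniform = 1.0 / len(target_vocab)
    # Start from a uniform table over all co-occurring word pairs.
    t_prob = {(t, s): uniform for src, tgt in pairs for t in tgt for s in src}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected counts of each (target, source) link.
        for src, tgt in pairs:
            for t in tgt:
                norm = sum(t_prob[(t, s)] for s in src)
                for s in src:
                    frac = t_prob[(t, s)] / norm
                    count[(t, s)] += frac
                    total[s] += frac
        # M-step: renormalize the expected counts per source word.
        t_prob = {(t, s): c / total[s] for (t, s), c in count.items()}
    return t_prob

def align(src, tgt, t_prob):
    """Link each target word to its most probable source word."""
    return [(max(src, key=lambda s: t_prob.get((t, s), 0.0)), t) for t in tgt]

# Toy parallel corpus (illustrative data only).
pairs = [
    ("the house".split(), "das Haus".split()),
    ("the book".split(), "das Buch".split()),
    ("a book".split(), "ein Buch".split()),
]
table = train_ibm_model1(pairs)
print(align("the house".split(), "das Haus".split(), table))
```

Even though no sentence pair says which word matches which, the co-occurrence statistics across sentence pairs are enough for EM to concentrate probability on the correct word pairs.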

Language Modeling. Large n-gram language models are a crucial component in high-quality SMT systems. Trained on only target language data, they help translation systems select fluent and readable output. MSRLM is a publicly available language modeling toolkit developed at MSR. The toolkit is both fast and scalable, training a 5-gram model from more than one billion pre-tokenized words in about 3 hours on a single machine.
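The way an n-gram model prefers fluent output can be sketched in a few lines of Python. The example below is a toy bigram model with add-one smoothing, trained on two sentences; it does not use MSRLM or reflect its API, and the function names and the corpus are assumptions chosen purely for illustration. Production systems use higher-order n-grams and better smoothing.

```python
import math
from collections import Counter

def train_bigram_counts(sentences):
    """Count bigrams and their histories over tokenized sentences."""
    history_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        vocab.update(padded)
        for prev, word in zip(padded, padded[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return history_counts, bigram_counts, vocab

def log_prob(tokens, history_counts, bigram_counts, vocab):
    """Add-one smoothed bigram log-probability of a tokenized sentence."""
    padded = ["<s>"] + tokens + ["</s>"]
    vocab_size = len(vocab)
    total = 0.0
    for prev, word in zip(padded, padded[1:]):
        numerator = bigram_counts[(prev, word)] + 1
        denominator = history_counts[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total

# Toy target-language corpus (illustrative data only).
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
model = train_bigram_counts(corpus)
fluent = log_prob("the cat sat on the rug".split(), *model)
scrambled = log_prob("rug the on sat cat the".split(), *model)
print(fluent > scrambled)
```

Both candidate sentences contain the same words, but the model scores the first higher because its word sequences were observed in the training data. A translation system uses this kind of score to choose among candidate outputs.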

MSR MT System

Other research areas:

Some languages have their own special challenges; for instance, word boundaries are not indicated in normal Chinese texts. MSRSeg can both segment Chinese words and identify names of entities such as people and organizations, capabilities that are very useful in machine translation. More detail on our Japanese MT work can be found here.
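To show why missing word boundaries are a problem that needs an explicit solution, here is a minimal Python sketch of greedy forward maximum matching, one of the simplest dictionary-based segmentation heuristics. It is not how MSRSeg works and it does no named-entity identification; the function name `max_match`, the `max_len` parameter, and the four-word lexicon are assumptions for the example.

```python
def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest substring (up to max_len
    characters) found in the lexicon; fall back to a single
    character when nothing matches.
    """
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words

# Tiny illustrative lexicon: "machine", "translation",
# "machine translation", "system".
lexicon = {"机器", "翻译", "机器翻译", "系统"}
segmented = max_match("机器翻译系统", lexicon)
print(segmented)
```

A greedy heuristic like this fails on ambiguous strings and on words missing from the lexicon, such as names of people and organizations, which is why dedicated segmenters combine segmentation with entity identification.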

Currently our systems are trained on parallel texts that supply sentence-for-sentence translations of the original information. We have developed accurate methods of finding parallel sentences among mostly parallel documents. We have also begun research in extracting parallel data from pairs of “comparable” documents, which contain some information in common, but are not direct translations of each other.

Products and Integration Scenarios:

Microsoft Translator, a free translation portal and a web service that powers many other translation scenarios, is the latest result of the work done by our research and product teams. The goal is to create the simplest, most intuitively integrated and useful translation services available to end users, while making ongoing improvements to translation quality. This service allows Live Search users to translate foreign language search results by clicking on “Translate this Page”. Users can also translate words, search queries, paragraphs or entire web pages through the Microsoft Translator portal. The Bilingual Viewer interface features a unique, side-by-side web page viewer that quickly translates entire Web pages across 25 language pairs. In addition, there is a Windows Live Toolbar Button, an add-in that puts a button on users’ websites, allowing their visitors to translate their web page using our service, and a Windows Live Messenger Translator Bot prototype that lets users translate IM conversations in a number of popular languages.

Portions of the technology behind MSR-MT, including parsing, Logical Forms (LFs), and MindNet, have been used in the grammar checkers in Word, in the natural language query function of Encarta, and in other MS products.

The system has already proven its value within Microsoft, having been used in 2003 to translate nearly 140,000 customer-support Knowledge Base articles into Spanish. (Go to the web site, click on International Support, and choose Spain as your country. You can then enter Spanish queries for the KB and receive back machine-translated hits.) The effort was extended to Japanese the next year and to French and German in 2005. Microsoft’s Knowledge Base materials have now been translated into nine languages by MSR-MT. This approach lowered the cost barrier to obtaining customized, higher-quality MT, and Microsoft's support group is now able to provide usable translations for its entire online KB. It can also keep current with updates and additions on a weekly basis, something that was previously unthinkable in terms of both time and expense.

You can also visit the MSR Machine Translation blog to keep track of our ongoing product and scenario related work.

Select Publications:

  • Anthony Aue, Arul Menezes, Robert Moore, Chris Quirk, Eric Ringger. Statistical Machine Translation Using Labeled Semantic Dependency Graphs October 2004
  • Arul Menezes, Chris Quirk. Microsoft Research Treelet Translation System: IWSLT Evaluation October 2005 Proceedings of the International Workshop on Spoken Language Translation
  • Arul Menezes, Chris Quirk. Using Dependency Order Templates to Improve Generality in Translation July 2007 Proceedings of the Second Workshop on Statistical Machine Translation at ACL 2007
  • Arul Menezes, Stephen D. Richardson. A best-first alignment algorithm for automatic extraction of transfer mappings from bilingual corpora September 2001
  • Arul Menezes, Stephen D. Richardson. A best-first alignment algorithm for automatic extraction of transfer mappings from bilingual corpora January 2001
  • Arul Menezes. Better contextual translation using machine learning October 2002
  • Chris Brockett, Takako Aikawa, Anthony Aue, Arul Menezes, Chris Quirk, Hisami Suzuki. English-Japanese Example-Based Machine Translation Using Abstract Semantic Representations October 2002
  • Chris Quirk, Arul Menezes, Colin Cherry. Dependency Tree Translation: Syntactically Informed Phrasal SMT June 2005 Ann Arbor, MI Proceedings of ACL
  • Chris Quirk, Arul Menezes, Colin Cherry. Dependency Tree Translation: Syntactically Informed Phrasal SMT November 2004
  • Chris Quirk, Arul Menezes. Do we need phrases? Challenging the conventional wisdom in Statistical Machine Translation May 2006 New York, New York, USA Proceedings of HLT-NAACL 2006
  • Chris Quirk, Arul Menezes. Dependency Treelet Translation: The convergence of statistical and example-based machine translation? March 2006 Machine Translation 20 pp. 43-65
  • Chris Quirk, Raghavendra Udupa, Arul Menezes. Generative Models of Noisy Translations with Applications to Parallel Fragment Extraction September 2007 Copenhagen, Denmark Proceedings of MT Summit XI
  • Chris Quirk, Simon Corston-Oliver. The impact of parse quality on syntactically-informed statistical machine translation July 2006 Sydney, Australia Proceedings of EMNLP 2006
  • Chris Quirk. Training a Sentence-Level Machine Translation Confidence Measure May 2004
  • David Rojas, Takako Aikawa. Predicting MT Quality as a Function of the Source Language May 2006
  • E. Brill, G. Kacmarcik, C. Brockett. Learning to Extract Katakana-English Word Pairs from Non-Aligned Web Queries Using a Noisy-Channel Model of Back-Transliteration 2001 Proceedings of NLPRS 2001
  • Einat Minkov, Kristina Toutanova, Hisami Suzuki. Generating Complex Morphology for Machine Translation February 2008
  • Hisami Suzuki, Kristina Toutanova. Learning to Predict Case Markers in Japanese July 2006
  • Kristina Toutanova, Hisami Suzuki. Generating Case Markers in Machine Translation April 2007
  • Masaki Itagaki, Takako Aikawa, Anthony Aue. Detecting Inter-domain Semantic Shift using Syntactic Similarity May 2006
  • Masaki Itagaki, Takako Aikawa, Xiaodong He. Automatic Validation of Terminology Translation Consistency with Statistical Method September 2007
  • Patrick Nguyen, Milind Mahajan, Xiaodong He. Training Non-Parametric Features for Statistical Machine Translation June 2007
  • Robert C. Moore, Chris Quirk. Faster Beam-Search Decoding for Phrasal Statistical Machine Translation September 2007 Copenhagen, Denmark Proceedings of MT Summit XI
  • Robert C. Moore, Chris Quirk. An Iteratively-Trained Segmentation-Free Phrase Translation Model for Statistical Machine Translation July 2007 Prague, Czech Republic Proceedings of the Second Workshop on Statistical Machine Translation at ACL 2007
  • Simon Corston-Oliver, Michael Gamon, Eric Ringger, Robert Moore. An overview of Amalgam: A machine-learned generation module. July 2002
  • Takako Aikawa, Lee Schwartz, Ronit King, Monica Corston-Oliver, Carmen Lozano. Impact of controlled language on translation quality and post-editing in a statistical machine translation environment October 2007
  • Takako Aikawa, Maite Melero, Lee Schwartz, Andi Wu. Sentence Generation for Multilingual Machine Translation September 2001
  • William B. Dolan, Jessie Pinkham, Stephen D. Richardson, Arul Menezes. Achieving commercial-quality translation with example-based methods September 2001
  • William Dolan, Stephen D. Richardson, Arul Menezes, Monica Corston-Oliver. Overcoming the customization bottleneck using example-based MT July 2001
  • Xiaodong He, Arul Menezes, Chris Quirk, Anthony Aue, Simon Corston-Oliver, Jianfeng Gao, Patrick Nguyen. Microsoft Research Treelet Translation System: NIST MT Evaluation 06 March 2006
  • Xiaodong He. Using Word-Dependent Transition Models in HMM based Word Alignment for Statistical Machine Translation June 2007
  • Xiaodong He. Using Word Dependent Transition Models in HMM based Word Alignment for Statistical Machine Translation July 2007