Machine Translation
Overview
The Machine Translation (MT) project at Microsoft Research focuses on creating MT systems and technologies that serve the many translation scenarios in use today. Data-driven systems, in particular those with a statistical core engine, have proven to be the most efficient, because they adapt to a wide range of domains and can be trained for new language pairs within a matter of weeks. This team works closely with research and development partners worldwide, making the system accessible to a variety of products and services.
Research:
Machine Translation has been a major focus of the
NLP group since 1999. Our
approach to MT has always been “data-driven”.
Rather than writing explicit rules to translate natural language,
we train our algorithms on human-translated parallel texts, which allows
them to automatically learn how to translate. Our first generation
Logical Form based system learned translation patterns at the level of
abstract parsed structures, and was used to translate the entire
Microsoft support knowledge base
into several languages. Our
recent research has focused on Statistical Machine Translation (SMT).
Syntax-Based SMT. Translating content from English into as many
foreign languages as possible is a high priority for Microsoft, not to
mention the billions of people around the world who do not read English.
The Treelet Translation System leverages an English natural language
parser to help guide this process.
This technology is currently used in several places across
Microsoft, including the Live
translation system for computer-related texts and the
Microsoft Support site.
Ongoing research has produced major improvements in the choice of word
inflections and word
ordering in this system.
Phrase-Based SMT. Many leading SMT systems do not use any
linguistic resources, such as dictionaries, grammars, or parsers. These
so-called “phrase-based” systems try to learn translations of arbitrary
word sequences directly from parallel texts.
By improving the methods used to prune the search for the best
translation in this type of system, we have shown how to find
better translations in less time than previous systems.
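The idea of learning translations of arbitrary word sequences from parallel text can be illustrated with a minimal sketch of textbook phrase-pair extraction. This is not the MSR system's code: the function name `extract_phrase_pairs`, the `max_len` limit, and the toy sentence pair are assumptions made for illustration, and the sketch omits the usual extension over unaligned target words. It assumes a word alignment is already available as a set of (source index, target index) links.

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=3):
    """Collect phrase pairs that are consistent with a word alignment.

    src, tgt   -- lists of tokens
    alignment  -- set of (src_index, tgt_index) links
    max_len    -- hypothetical cap on phrase length, in tokens
    """
    pairs = set()
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to any word in the source span.
            linked = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency check: no word inside the target span may be
            # linked to a source word outside the source span.
            inconsistent = any(
                t_start <= t <= t_end and not (s_start <= s <= s_end)
                for (s, t) in alignment
            )
            if inconsistent:
                continue
            pairs.add((" ".join(src[s_start:s_end + 1]),
                       " ".join(tgt[t_start:t_end + 1])))
    return pairs


# Toy example with reordering (illustrative data, not from the source).
example = extract_phrase_pairs(
    ["green", "house"], ["casa", "verde"], {(0, 1), (1, 0)}
)
```

In a full system, the extracted pairs would be counted over a large corpus to estimate phrase translation probabilities; the sketch stops at extraction.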
Word Alignment. SMT systems learn translations from existing
bodies of translated data. For
most modern systems, identifying the word correspondences or word
alignments in this translated data is a crucial step in training
systems. Our group has
produced pioneering work in both
discriminative and
generative approaches to word alignment, resulting in faster
alignment algorithms with state-of-the-art quality.
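As a rough illustration of what a generative approach to word alignment looks like, here is a minimal sketch of IBM Model 1 trained with EM. It is a standard textbook model, not the group's own algorithm. The function names, the three-sentence toy corpus, and the iteration count are assumptions, and the NULL source word is left out to keep the sketch short.

```python
from collections import defaultdict


def train_ibm_model1(corpus, iterations=10):
    """Estimate t[(tgt_word, src_word)] = P(tgt_word | src_word) with EM.

    corpus -- list of (src_tokens, tgt_tokens) sentence pairs
    """
    tgt_vocab = {w for _, tgt in corpus for w in tgt}
    # Uniform starting point over the target vocabulary.
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected link counts
        total = defaultdict(float)   # expected counts per source word
        # E-step: distribute each target word over the source words.
        for src, tgt in corpus:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalize the expected counts.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t


def align(src, tgt, t):
    """Link each target word to its most probable source word."""
    return [
        (max(range(len(src)), key=lambda i: t[(tw, src[i])]), j)
        for j, tw in enumerate(tgt)
    ]


# Toy parallel corpus (illustrative only).
toy_corpus = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
toy_table = train_ibm_model1(toy_corpus, iterations=20)
toy_links = align(["the", "house"], ["das", "Haus"], toy_table)
```

Each returned link is a (source index, target index) pair. Discriminative aligners replace the EM-estimated probabilities with a scoring function trained on hand-aligned data, which this sketch does not attempt.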
Language Modeling. Large n-gram language models are a crucial
component in high-quality SMT systems.
Trained on only target language data, they help translation
systems select fluent and readable output.
MSRLM is a publicly available language modeling toolkit developed at
MSR. The toolkit is both fast and scalable, training a 5-gram model from
more than one billion pre-tokenized words in about 3 hours on a single
machine.
MSR MT System
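The role of an n-gram language model in preferring fluent output can be sketched with a tiny bigram model. This is an illustration only and does not use MSRLM or its API: the function names, the add-one smoothing, and the three training sentences are assumptions, and a production system would use higher-order n-grams, far more data, and better smoothing.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count histories and bigrams over target-language sentences only."""
    histories, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for sent in sentences:
        words = sent.split()
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        histories.update(tokens[:-1])              # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return histories, bigrams, len(vocab)


def log_prob(sentence, model):
    """Add-one smoothed bigram log-probability of a sentence.

    Intended for sentences made of in-vocabulary words; unknown words
    are not modeled properly in this sketch.
    """
    histories, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (histories[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


# Toy monolingual data (illustrative only).
lm = train_bigram_lm([
    "the house is small",
    "the book is small",
    "the house is old",
])
fluent = log_prob("the house is small", lm)
scrambled = log_prob("house the small is", lm)
```

Both candidates contain the same words, but the model assigns a higher score to the word order it has seen in training, which is how a decoder can use it to rank competing translations.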
Other research areas
Some languages have their own special challenges;
for instance, word boundaries are not indicated in normal Chinese texts.
MSRSeg can both segment Chinese words and identify names of entities
such as people and organizations, capabilities that are very useful in
machine translation.
Currently, our systems are trained on parallel texts
that supply sentence-for-sentence translations of the original
information. We have developed accurate methods of finding parallel
sentences among mostly parallel documents. We have also begun research in
extracting parallel data from pairs of “comparable” documents, which
contain some information in common, but are not direct translations of
each other.
Products and Integration Scenarios:
Windows Live Translator, a free translation portal and a web service that powers many other translation scenarios, is the latest result of the work done by our research and product teams. The goal is to create the simplest, most intuitively integrated and useful translation services available to end users, while making ongoing improvements to translation quality. This service allows Live Search users to translate foreign-language search results by clicking on “Translate this Page”. Users can also translate words, search queries, paragraphs or entire web pages through the Windows Live Translator portal. The Bilingual Viewer interface features a side-by-side web page viewer that quickly translates entire web pages across 25 language pairs. In addition, there is a Windows Live Toolbar Button, a widget that puts a button on users’ websites, allowing their visitors to translate their web page using our service, and a Windows Live Messenger Translator Bot prototype that lets users translate IM conversations in a number of popular languages.
Portions of the technology behind MSR-MT, including parsing, LFs, and MindNet, have been used in the grammar checkers in Word, in the natural language query function of Encarta, and in other Microsoft products.
The system has already proven its value within Microsoft, having been used in 2003 to translate nearly 140,000 customer-support Knowledge Base articles into Spanish. (On the web site, click on International Support and choose Spain as your country; you can then enter Spanish queries for the KB and receive back machine-translated hits.) The effort was extended to Japanese the next year and to French and German in 2005. Now, Microsoft’s Knowledge Base materials have been translated into nine languages by MSR-MT. This approach lowered the cost barrier to obtaining customized, higher-quality MT, and Microsoft's support group is now able to provide usable translations for its entire online KB. It can also keep current with updates and additions on a weekly basis, something that was previously unthinkable in terms of both time and expense.
You can also visit the MSR Machine Translation blog to keep track of our ongoing product- and scenario-related work.
People
Select Publications:
Associated Groups