

Natural Language Computing

Overview

The information era has brought us vast amounts of digitized text that is generated, propagated, exchanged, stored, and accessed through the Internet every day across the world. The accumulation of this data is making information acquisition increasingly difficult, and language is becoming a critical obstacle to growth. To overcome these difficulties, the Natural Language Computing (NLC) Group focuses on a variety of research topics, including multi-language text analysis, machine translation, cross-language information retrieval, and question answering. Over the years, the group has made significant contributions to Microsoft products, including a Japanese and Chinese Input Method Editor (IME), an English writing assistant for Office 2007, a Chinese couplet game for Windows Live, a Chinese word breaker, pinyin search and a search speller for the MSN search engine, text mining for SQL Server and SharePoint, and metadata extraction for MSN. Our research has been published at the most prestigious NLP conferences, including 21 papers at ACL and eight papers at SIGIR between 2000 and 2007. In recognition of these achievements, the group received the MSRA “stamina award” in 2006.
 

People

Primary Contact: Ming Zhou



Collins, John
Jiang, Long
Li, Mu

   



Areas of Focus

Our research strategy is data-driven and built on statistical learning: we collect large-scale monolingual and bilingual corpora from the web and third parties, and use machine learning approaches to acquire linguistic and translation knowledge. This knowledge is then used to support our research projects. Below is an introduction to our main research areas.

Corpus Collection, Classification, and Annotation

This is a continuous effort to build a large text corpus as the infrastructure for statistical learning. Text can be acquired from various documents and from the Web. Text classification by topic and writing style is useful for constructing a balanced corpus as well as various domain-specific corpora. Corpus annotation is a challenging task. It includes word segmentation, named entity identification, part-of-speech tagging, syntactic parsing, word sense tagging, and anaphora tagging. The tagging tools can be used directly in a number of natural language applications, and the annotated corpora can serve as supervised training data for statistical language modeling for different purposes.
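As a concrete illustration of the first annotation step, word segmentation, here is a minimal sketch of forward maximum matching, a classical dictionary-based baseline. It is not the group's segmenter, which the text describes only as statistical; the function name `segment`, the `max_len` parameter, and the four-entry lexicon are assumptions made for this example.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching: take the longest lexicon word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character is always accepted.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy lexicon, invented for illustration only.
toy_lexicon = {"自然", "语言", "自然语言", "处理"}
print(segment("自然语言处理", toy_lexicon))
```

A greedy matcher like this cannot resolve ambiguous word boundaries, which is one reason annotated corpora and statistical models are used in practice.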

Asian Language Natural Language Processing

Text Information Mining and Extraction (TIME) is a platform used to extract key information from a variety of documents, such as web pages, Word documents, and PowerPoint presentations, in different languages. The extracted information can be used to support information retrieval and search engines, machine translation, summarization, and question answering. The platform covers a variety of technologies such as tokenization, named entity identification, semantic labeling or skeleton information extraction, key term extraction, and summarization.

Statistical Machine Translation

The Statistical Machine Translation project focuses on helping non-native English users, such as Chinese, Japanese, and Korean speakers, search, read, and write English more fluently. To this end, the NLC Group has applied statistical machine translation to provide meaningful translation solutions at the word level, the phrase or collocation level, and the sentence level. Supported by these translation technologies, the group is conducting research into new applications for search engines, such as Multilingual Search. This application works at the word level for input queries and at the sentence level for translating returned snippets.
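The word-level step for queries can be pictured with a small sketch: each query term is replaced by its most probable translation from a probabilistic bilingual lexicon. This is only an illustration under stated assumptions, not the Multilingual Search implementation; `translate_query`, the lexicon layout, and the probability values are all hypothetical.

```python
def translate_query(query, lexicon):
    """Replace each query term with its most probable translation; keep unknown terms."""
    translated = []
    for term in query.split():
        candidates = lexicon.get(term)
        if candidates:
            # Pick the translation with the highest probability.
            translated.append(max(candidates, key=candidates.get))
        else:
            translated.append(term)
    return " ".join(translated)


# Toy lexicon with made-up probabilities, for illustration only.
toy_lexicon = {
    "机器": {"machine": 0.9, "robot": 0.1},
    "翻译": {"translation": 0.8, "interpret": 0.2},
}
print(translate_query("机器 翻译", toy_lexicon))
```

Translating each term independently ignores context, which is why the phrase and collocation level mentioned above matters for ambiguous terms.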

Information Retrieval

Our goal is to explore the use of natural language processing (NLP) technologies to improve the performance of classical information retrieval (IR), including indexing, query suggestion, spelling, and relevance ranking. We will try these approaches in a vertical domain first and gradually extend them to open domains. We have explored the best indexing terms for Chinese, new approaches to query expansion, mining word association and similarity from a text corpus, fusion of retrieval results from different IR systems, base NP identification, and accurate query translation using statistical and example-based approaches. We participated in the cross-lingual tracks of TREC-9 and NTCIR-III and achieved the best results on cross-language information retrieval, focusing on query translation and on optimizing indexing for a Chinese IR system. We also participated in the Web track of TREC-10. Based on the above-mentioned technologies, we have built a successful linguistic search engine (lingo) for English as a Second Language (ESL) writing.
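One common way to mine word association from a text corpus is pointwise mutual information (PMI) over sentence-level co-occurrence. The sketch below shows that general technique; the text does not say which association measure the group used, so the choice of PMI, the function name `pmi_table`, and the whitespace tokenization are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(sentences):
    """Return PMI (base 2) for every unordered word pair that co-occurs in a sentence."""
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        # Count each word once per sentence; sorting gives a canonical pair order.
        words = sorted(set(sentence.split()))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))
    n = len(sentences)
    table = {}
    for (left, right), count in pair_counts.items():
        joint = count / n
        independent = (word_counts[left] / n) * (word_counts[right] / n)
        table[(left, right)] = math.log2(joint / independent)
    return table


# Toy corpus, invented for illustration only.
corpus = ["query translation", "query expansion", "machine translation"]
for pair, score in sorted(pmi_table(corpus).items()):
    print(pair, round(score, 3))
```

Pairs with high PMI co-occur more often than chance would predict, which makes them candidates for query expansion terms.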

Question Answering

Question answering is a key technology being developed for the next generation of search engines. Given a question, a search engine user hopes to get an exact answer rather than face a huge number of query results. The NLC Group is developing question reformulation, paraphrasing, and various answer extraction techniques for factoid and non-factoid questions. Based on this work, the group also hopes to build domain-specific chatbots with question answering technologies that mine text forums, web blogs, and other web resources.

Language Gaming

Can you imagine a computer capable of generating Chinese couplets? The NLC Group has made this a reality, for the first time in the world, by creating Chinese Couplet Generation software as part of its language gaming project for Internet and mobile games (http://duilian.msra.cn). The software accepts a sentence provided by a user and generates a matching couplet sentence. This technology can be used to further Chinese language learning by entertaining and engaging users.






©2008 Microsoft Corporation. All rights reserved. Terms of Use |Trademarks |Privacy Statement