The information era has brought us vast amounts of digitized text that is generated, propagated, exchanged, stored, and accessed through the Internet each day across the world. The accumulation of this data is making information acquisition increasingly difficult, with language becoming a critical obstacle to growth. To overcome these difficulties, the Natural Language Computing (NLC) Group is focusing its efforts on a variety of research topics, including multi-language text analysis, machine translation, cross-language information retrieval, and question answering. Over the years, the group has made significant contributions to Microsoft products, including a Japanese and Chinese Input Method Editor (IME), an English writing assistant for Office 2007, a Chinese couplet game for Windows Live, a Chinese word breaker, pinyin search and a search speller for the MSN search engine, text mining for SQL Server and SharePoint, and metadata extraction for MSN. Our research has been published at the most prestigious NLP conferences, including 21 papers at ACL and eight papers at SIGIR between 2000 and 2007. The group received the MSRA "stamina award" in 2006 in recognition of these achievements.
Areas of Focus
Our research strategy is data-driven and based on statistical learning: we collect large-scale monolingual and bilingual corpora from the web and third parties, and use machine learning approaches to acquire linguistic and translation knowledge. This knowledge is then used to support our research projects. Below is an introduction to our main research areas.
Corpus Collection, Classification, and Annotation
This is a continuous effort to build a large text corpus as the infrastructure for statistical learning. Text can be acquired from various documents and from the Web. Text classification by topic and writing style is useful for constructing a balanced corpus as well as various domain-specific corpora. Corpus annotation is a challenging task. It includes word segmentation, named entity identification, part-of-speech tagging, syntactic parsing, word sense tagging, and anaphora tagging. The individual tagging tools can be used directly in a number of natural language applications, and the resulting annotated corpora can serve as supervised training data for statistical language modeling for different purposes.
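To make the word segmentation step concrete, here is a minimal sketch of forward maximum matching, a simple dictionary-based baseline. It is not the group's method (their published work uses statistical models); the function name `segment`, the toy lexicon, and the `max_len` parameter are assumptions chosen for illustration.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching over a word lexicon."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback,
            # so the loop is guaranteed to advance.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon; a real system would load a large dictionary.
toy_lexicon = {"自然", "语言", "自然语言", "处理"}
tokens = segment("自然语言处理", toy_lexicon)
print(tokens)
```

Because the matcher is greedy, it prefers the longer entry "自然语言" over "自然" followed by "语言". Statistical segmenters exist largely to resolve the ambiguities that this kind of baseline gets wrong.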
Asian Language Natural Language Processing
Text Information Mining and Extraction (TIME) is a platform used to extract key information from a variety of documents, such as web pages, Word documents, and PowerPoint presentations, in different languages. The extracted information can be used to support information retrieval and search engines, machine translation, summarization, and question answering. The platform covers a variety of technologies, such as tokenization, named entity identification, semantic labeling or skeleton information extraction, key term extraction, and summarization.
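As a rough illustration of key term extraction, the sketch below ranks the terms of one document by TF-IDF against a small collection. TF-IDF is a generic baseline assumed here for illustration; the source does not say which technique TIME uses. The function name `key_terms` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def key_terms(documents, doc_index, top_k=3):
    """Rank the terms of one document by TF-IDF within a small collection."""
    tokenized = [doc.lower().split() for doc in documents]

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    n_docs = len(tokenized)
    term_freq = Counter(tokenized[doc_index])
    scores = {
        term: count * math.log(n_docs / doc_freq[term])
        for term, count in term_freq.items()
    }
    # Sort by descending score, then alphabetically so ties are deterministic.
    return sorted(scores, key=lambda term: (-scores[term], term))[:top_k]


# Hypothetical sample documents.
docs = [
    "couplet game generates couplet for the user",
    "the search engine returns results for the user",
    "the translation system helps the user",
]
print(key_terms(docs, 0))
```

Terms that appear in every document, such as "the" and "user", receive a score of zero, so the ranking favors words that distinguish the chosen document from the rest of the collection.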
Statistical Machine Translation
The focus of the Statistical Machine Translation project is on helping non-native English users, such as Chinese, Japanese, and Korean speakers, search, read, and write English more fluently. To this end, the NLC Group has applied statistical machine translation to provide meaningful translation solutions at the word, phrase or collocation, and sentence levels. Supported by these translation technologies, the group is conducting research into new applications for search engines, such as Multilingual Search. This application works at the word level for input queries and at the sentence level for translating the returned snippets.
Our goal is to explore the use of natural language processing (NLP) technologies to improve the performance of classical information retrieval (IR), including indexing, query suggestion, spelling correction, and relevance ranking. We will try these approaches in a vertical domain first and gradually extend them to open domains. We have explored the best indexing terms for Chinese, new approaches to query expansion, mining word association and similarity from a text corpus, methods for fusing the retrieval results of different IR systems, base NP identification, and accurate query translation using statistical and example-based approaches. We participated in the cross-lingual tracks of TREC-9 and NTCIR-III, where we focused on query translation and on optimizing indexing for a Chinese IR system, and obtained the best results on cross-language information retrieval. We also participated in the Web track of TREC-10. Based on the above-mentioned technologies, we have built a successful linguistic search engine (lingo) for English as a Second Language (ESL) writing.
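One common way to mine word association from a corpus is pointwise mutual information (PMI) over co-occurrence counts. The sketch below is a generic illustration under that assumption, not the group's published method; the function name `pmi_table`, the choice of the sentence as co-occurrence window, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(sentences):
    """PMI (base 2) for word pairs that co-occur in the same sentence."""
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        # Count each word once per sentence; sorting fixes the pair order.
        vocab = sorted(set(sentence.lower().split()))
        word_counts.update(vocab)
        pair_counts.update(combinations(vocab, 2))

    n = len(sentences)
    scores = {}
    for (a, b), joint in pair_counts.items():
        # PMI = log2( P(a, b) / (P(a) * P(b)) ), with probabilities
        # estimated as sentence-level relative frequencies.
        scores[(a, b)] = math.log2(joint * n / (word_counts[a] * word_counts[b]))
    return scores


# Hypothetical toy corpus.
corpus = [
    "machine translation quality",
    "machine translation system",
    "search engine query",
    "search engine ranking",
]
scores = pmi_table(corpus)
print(scores[("machine", "translation")])
```

Pairs that never co-occur, such as "machine" and "search", are simply absent from the table. Associations like these can feed query expansion by suggesting terms strongly linked to the words of the original query.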
Question answering is a key technology being developed for the next generation of search engines. Given a question, a search engine user hopes to get an exact answer rather than face a huge number of query results. The NLC Group is developing question reformulation, paraphrasing, and various answer extraction techniques for factoid and non-factoid questions. Based on this work, the group also hopes to build domain-specific chatbots with question answering technologies that mine text forums, web blogs, and other web resources.
Can you imagine a computer capable of generating Chinese couplets? The NLC Group has made this a reality for the first time in the world by creating Chinese Couplet Generation software as part of its language gaming project for the Internet and mobile games (http://duilian.msra.cn). The software works by accepting a sentence provided by a user and then extrapolating a couplet sentence. This technology can be used to further Chinese language learning by entertaining and engaging users.
- Ya-Juan Lv and Ming Zhou, "Collocation Translation Acquisition Using Monolingual Corpora", 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, Jul. 2004.
- Dong-Hui Feng, Ya-Juan Lv, and Ming Zhou, "A New Approach for English-Chinese Named Entity Alignment", 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, Jul. 2004.
- Jianfeng Gao, Jian-Yun Nie, Guangyuan Wu, and Guihong Cao, "Dependence language model for information retrieval", In SIGIR-2004. Sheffield, UK, July 25-29, 2004.
- Jianfeng Gao, Andi Wu, Mu Li, Chang-Ning Huang, Hongqiao Li, Xinsong Xia, and Haowei Qin, "Adaptive Chinese word segmentation", 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, Jul. 2004.
- Jianfeng Gao and Hisami Suzuki, "Capturing long distance dependency for language modeling: an empirical study", In IJCNLP-04. Sanya City, Hainan Island, China, March 22-24, 2004.
- Hongqiao Li, Chang-Ning Huang, Jianfeng Gao, and Xiaozhong Fan, "The use of SVM for Chinese new word identification", In IJCNLP-04. Sanya City, Hainan Island, China, March 22-24, 2004.
- Hang Li and Cong Li, "Word Translation Disambiguation Using Bilingual Bootstrapping", Computational Linguistics 30(1), 1-22, 2004.
- Qiang Yang, Charles X. Ling and Jianfeng Gao. "Mining web logs for actionable knowledge". To appear as a book chapter.
- Jianfeng Gao, Mu Li, and Chang-Ning Huang, "Improved Source-Channel Models for Chinese Word Segmentation", 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, July 7-12, 2003.
- Cong Li, Ji-Rong Wen, and Hang Li, "Text Classification Using Stochastic Keyword Generation", Proc. of ICML'03, 464-471.
- Yunbo Cao, Hang Li, and Li Lian, "Uncertainty Reduction in Collaborative Bootstrapping: Measure and Algorithm", Proc. of ACL'03, 327-334.
- Hang Li, Yunbo Cao, and Cong Li, "Using Bilingual Web Data to Mine and Rank Translations", IEEE Intelligent Systems, Vol. 18(4), 54-59, (2003).
- Hang Li and Kenji Yamanishi, "Topic Analysis Using a Finite Mixture Model", Information Processing & Management, 39(4), 521-541, (2003).