The Redmond-based Natural Language Processing group is focused on developing efficient algorithms to process texts and to make their information accessible to computer applications. Since text can contain information at many different granularities, from simple word- or token-based representations, to rich hierarchical syntactic representations, to high-level logical representations across document collections, the group seeks to work at the right level of analysis for the application concerned.
Overview
The goal of the Natural Language Processing (NLP) group is to design and build software that will analyze, understand, and generate languages that humans use naturally, so that eventually you will be able to address your computer as though you were addressing another person.
This goal is not easy to reach. "Understanding" language means, among other things, knowing what concepts a word or phrase stands for and knowing how to link those concepts together in a meaningful way. It's ironic that natural language, the symbol system that is easiest for humans to learn and use, is hardest for a computer to master. Long after machines have proven capable of inverting large matrices with speed and grace, they still fail to master the basics of our spoken and written languages.
The challenges we face stem from the highly ambiguous nature of natural language. As an English speaker you effortlessly understand a sentence like "Flying planes can be dangerous". Yet this sentence presents difficulties to a software program that lacks both your knowledge of the world and your experience with linguistic structures. Is the more plausible interpretation that the pilot is at risk, or that the danger is to people on the ground? Should "can" be analyzed as a verb or as a noun? Which of the many possible meanings of "plane" is relevant? Depending on context, "plane" could refer to, among other things, an airplane, a geometric object, or a woodworking tool. How much and what sort of context needs to be brought to bear on these questions in order to adequately disambiguate the sentence?
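To make the scale of the problem concrete, the following minimal Python sketch enumerates the part-of-speech assignments that a small lexicon permits for the example sentence. The lexicon, the tag names, and the function name are hypothetical and chosen only for illustration; they do not describe the group's systems.

```python
from itertools import product

# Hypothetical toy lexicon: each word maps to the parts of speech it might take.
LEXICON = {
    "flying": ["VERB", "ADJ"],   # "flying (the) planes" vs. "planes that fly"
    "planes": ["NOUN", "VERB"],  # an aircraft or tool vs. "she planes the board"
    "can": ["VERB", "NOUN"],     # the modal verb vs. a metal container
    "be": ["VERB"],
    "dangerous": ["ADJ"],
}

def tag_sequences(sentence):
    """Return every combination of tags the lexicon allows for the sentence."""
    words = sentence.lower().split()
    options = [LEXICON[word] for word in words]
    return [list(zip(words, tags)) for tags in product(*options)]

readings = tag_sequences("Flying planes can be dangerous")
print(len(readings))  # 2 * 2 * 2 * 1 * 1 = 8 candidate tag sequences
```

Even this five-word sentence yields eight candidate analyses before any word senses or syntactic structures are considered, which is why context and world knowledge are needed to select among them.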
We address these problems using a mix of knowledge-engineered and statistical/machine-learning techniques to disambiguate and respond to natural language input. Our work has implications for applications like text critiquing, information retrieval, question answering, summarization, gaming, and translation. The grammar checkers in Office for English, French, German, and Spanish are outgrowths of our research; Encarta uses our technology to retrieve answers to user questions; IntelliShrink uses natural language technology to compress cell phone messages; and Microsoft Product Support uses our machine translation software to translate the Microsoft Knowledge Base into other languages. As our work evolves, we expect it to be useful in any area where human users can benefit from communicating with their computers in a natural way.
Selected current projects
Machine Translation is currently a major focus of the group. In contrast to most existing commercial MT systems, we are pursuing a data-driven approach in which all translation knowledge is learned from existing bilingual text.
The ESL Assistant presents a new paradigm of grammar correction in which large-scale statistical models and web services offer writing assistance for learners of English as a second or foreign language. The service is now available online. Additional information can be found on the team website. Updates on the project will also be available from time to time on the ESL Assistant team blog on MSDN.
Recognizing Textual Entailment has been proposed as a generic task that captures major semantic inference needs across many natural language processing applications. In conjunction with our work in this area, we have made available to the research community Manually Word Aligned RTE 2006 Data Sets (described in Brockett, 2007).
Paraphrase recognition and generation are crucial to creating applications that approximate our understanding of language. We have released a corpus of approximately 5000 sentence pairs that have been annotated by humans to indicate whether or not they can be considered paraphrases. Alignment phrase tables created using the data described in Quirk et al. (2004) and Dolan et al. (2004) are now also available for download.
MindNet aims to formalize the representation of word meanings by developing methods for automatically building semantic networks from text and then exploring their structure. MindNets constructed from Japanese and English dictionary data are available for online browsing.
The Japanese NLP project page summarizes areas of research we are working on in processing Japanese.
Older projects
Amalgam is a novel system developed in the Natural Language Processing group at Microsoft Research for sentence realization during natural language generation that employs machine learning techniques. Sentence realization is the process of generating (realizing) a fluent sentence from a semantic representation.
IntelliShrink is a product that uses linguistic analysis to abbreviate an email message so that it can be displayed on a cell phone. IntelliShrink analyzes messages in English, French, German, or Spanish.
- Microsoft Research Question-Answering Corpus, 13 November 2008
- Multi-System, Machine-Translated, Word-Order Collection, 28 March 2008
- NLP Data Sets for Comparative Study of Parameter-Estimation Methods, 2 June 2007
- Microsoft Research Paraphrase Phrase Tables, 10 October 2006
- ESL 123 Mass Noun Examples, 18 July 2006
- Microsoft Research Paraphrase Corpus, 3 March 2005
- Microsoft Research IME Corpus, 21 December 2005
- Bilingual Sentence Aligner, 14 May 2003
- Unification Grammar Sentence Realization Algorithms, 6 May 2003
- Arnd Christian König, Michael Gamon, and Qiang Wu, Click-Through Prediction for News Queries, in SIGIR'09: the 32nd Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, Inc., July 2009
- Michael Gamon and Arnd Christian König, Navigation Patterns from and to Social Media, in 3rd AAAI Conference on Weblogs and Social Media, American Association for Artificial Intelligence, May 2009
- Colin Cherry and Chris Quirk, Discriminative, Syntactic Language Modeling through Latent SVMs, in Proceedings of AMTA, Association for Machine Translation in the Americas, 23 October 2008
- Arul Menezes and Chris Quirk, Syntactic Models for Structural Word Insertion and Deletion during Translation, in Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Honolulu, Hawaii, October 2008
- Qiang Wu, Christopher J.C. Burges, Krysta Svore, and Jianfeng Gao, Ranking, Boosting, and Model Adaptation, no. MSR-TR-2008-109, October 2008
- Robert C. Moore and Chris Quirk, Random Restarts in Minimum Error Rate Training for Statistical Machine Translation, in Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), Coling 2008 Organizing Committee, Manchester, UK, August 2008
- Hao Zhang, Chris Quirk, Robert C. Moore, and Daniel Gildea, Bayesian Learning of Non-Compositional Phrases with Synchronous Parsing, in Proceedings of ACL-08: HLT, Association for Computational Linguistics, Columbus, Ohio, June 2008
- Kristina Toutanova, Hisami Suzuki, and Achim Ruopp, Applying Morphology Generation Models to Machine Translation, in Proceedings of ACL, Association for Computational Linguistics, June 2008
- Michael Gamon, Sumit Basu, Dmitriy Belenko, Danyel Fisher, Matthew Hurst, and Arnd Christian König, BLEWS: Using Blogs to Provide Context for News Articles, in 2nd AAAI Conference on Weblogs and Social Media, American Association for Artificial Intelligence, April 2008
- Michael Gamon, Jianfeng Gao, Chris Brockett, Alexander Klementiev, William Dolan, Dmitriy Belenko, and Lucy Vanderwende, Using Contextual Speller Techniques and Language Modeling for ESL Error Correction, in Proceedings of IJCNLP, Hyderabad, India, Asia Federation of Natural Language Processing, January 2008
- Kristina Toutanova and Mark Johnson, A Bayesian LDA-based Model for Semi-Supervised Part-of-speech Tagging, in Proceedings of NIPS, MIT Press, January 2008
- Patrick Nguyen, Jianfeng Gao, and Milind Mahajan, MSRLM: a scalable language modeling toolkit, no. MSR-TR-2007-144, November 2007
- Xiaodong He and Li Deng, Discriminative Learning in Speech Recognition, no. MSR-TR-2007-129, October 2007
- Takako Aikawa, Lee Schwartz, Ronit King, Monica Corston-Oliver, and Carmen Lozano, Impact of controlled language on translation quality and post-editing in a statistical machine translation environment, European Association for Machine Translation, October 2007
- Robert C. Moore and Chris Quirk, Faster Beam-Search Decoding for Phrasal Statistical Machine Translation, in Proceedings of MT Summit XI, European Association for Machine Translation, September 2007
- Chris Quirk, Raghavendra Udupa, and Arul Menezes, Generative Models of Noisy Translations with Applications to Parallel Fragment Extraction, in Proceedings of MT Summit XI, European Association for Machine Translation, September 2007
- Masaki Itagaki, Takako Aikawa, and Xiaodong He, Automatic Validation of Terminology Translation Consistency with Statistical Method, European Association for Machine Translation, September 2007
- Robert C. Moore and Chris Quirk, An Iteratively-Trained Segmentation-Free Phrase Translation Model for Statistical Machine Translation, in Proceedings of the Second Workshop on Statistical Machine Translation at ACL 2007, Association for Computational Linguistics, July 2007
- Pi-Chuan Chang and Kristina Toutanova, A Discriminative Syntactic Word Order Model for Machine Translation, Association for Computational Linguistics, June 2007
- Chris Brockett, Aligning the RTE 2006 Corpus, no. MSR-TR-2007-77, June 2007
- Einat Minkov, Kristina Toutanova, and Hisami Suzuki, Generating Complex Morphology for Machine Translation, Association for Computational Linguistics, June 2007
- Xiaodong He, Using Word-Dependent Transition Models in HMM based Word Alignment for Statistical Machine Translation, Association for Computational Linguistics, June 2007
- Patrick Nguyen, Milind Mahajan, and Xiaodong He, Training Non-Parametric Features for Statistical Machine Translation, Association for Computational Linguistics, June 2007
- Kristina Toutanova and Hisami Suzuki, Generating Case Markers in Machine Translation, Association for Computational Linguistics, April 2007
- Kristina Toutanova, Chris Brockett, Michael Gamon, Jagadeesh Jagarlamudi, Hisami Suzuki, and Lucy Vanderwende, The Pythy Summarization System: Microsoft Research at DUC 2007, Association for Computational Linguistics, April 2007



