The Redmond-based Natural Language Processing group is focused on developing efficient algorithms to process texts and to make their information accessible to computer applications. Since text can contain information at many different granularities, from simple word or token-based representations, to rich hierarchical syntactic representations, to high-level logical representations across document collections, the group seeks to work at the right level of analysis for the application concerned.
The goal of the Natural Language Processing (NLP) group is to design and build software that will analyze, understand, and generate languages that humans use naturally, so that eventually you will be able to address your computer as though you were addressing another person.
This goal is not easy to reach. "Understanding" language means, among other things, knowing what concepts a word or phrase stands for and knowing how to link those concepts together in a meaningful way. It's ironic that natural language, the symbol system that is easiest for humans to learn and use, is hardest for a computer to master. Long after machines have proven capable of inverting large matrices with speed and grace, they still fail to master the basics of our spoken and written languages.
The challenges we face stem from the highly ambiguous nature of natural language. As an English speaker you effortlessly understand a sentence like "Flying planes can be dangerous". Yet this sentence presents difficulties to a software program that lacks both your knowledge of the world and your experience with linguistic structures. Is the more plausible interpretation that the pilot is at risk, or that the danger is to people on the ground? Should "can" be analyzed as a verb or as a noun? Which of the many possible meanings of "plane" is relevant? Depending on context, "plane" could refer to, among other things, an airplane, a geometric object, or a woodworking tool. How much and what sort of context needs to be brought to bear on these questions in order to adequately disambiguate the sentence?
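To make the scale of this ambiguity concrete, the short Python sketch below enumerates every part-of-speech assignment that a small hand-written lexicon permits for the example sentence. The lexicon, its tag names, and the function `candidate_taggings` are hypothetical and chosen only for illustration; they do not come from any real tagger or from the group's own systems.

```python
from itertools import product

# Hypothetical toy lexicon: each word maps to the parts of speech it might take.
# The tag sets are illustrative assumptions, not the output of a real tagger.
TOY_LEXICON = {
    "flying": ["VERB", "ADJ"],
    "planes": ["NOUN", "VERB"],
    "can": ["VERB", "NOUN"],
    "be": ["VERB"],
    "dangerous": ["ADJ"],
}


def candidate_taggings(sentence):
    """Return every tag sequence the toy lexicon allows for the sentence."""
    words = sentence.lower().split()
    options = [TOY_LEXICON[word] for word in words]
    # The Cartesian product pairs each word with each of its possible tags.
    return [list(zip(words, tags)) for tags in product(*options)]


taggings = candidate_taggings("Flying planes can be dangerous")
print(len(taggings))  # 2 * 2 * 2 * 1 * 1 = 8 candidate analyses
```

Even with only two options for three of the five words, there are eight candidate analyses before word senses or syntactic structure are considered, which is why additional context or knowledge is needed to choose among them.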
We address these problems using a mix of knowledge-engineered and statistical/machine-learning techniques to disambiguate and respond to natural language input. Our work has implications for applications like text critiquing, information retrieval, question answering, summarization, gaming, and translation. The grammar checkers in Office for English, French, German, and Spanish are outgrowths of our research; Encarta uses our technology to retrieve answers to user questions; IntelliShrink uses natural language technology to compress cell phone messages; and Microsoft Product Support uses our machine translation software to translate the Microsoft Knowledge Base into other languages. As our work evolves, we expect it to enable any area where human users can benefit by communicating with their computers in a natural way.
Selected current projects
Machine Translation is currently a major focus of the group. In contrast to most existing commercial MT systems, we are pursuing a data-driven approach in which all translation knowledge is learned from existing bilingual text.
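As a rough illustration of what learning translation knowledge from bilingual text can mean at its simplest, the Python sketch below scores word pairs by how often they co-occur in aligned sentence pairs, using the Dice coefficient. This is a basic word-association statistic and not the group's system; the three-sentence corpus and the names `dice_scores` and `best_translation` are assumptions made up for this example.

```python
from collections import Counter

# Hypothetical sentence-aligned English-Spanish corpus (illustrative only).
PARALLEL = [
    ("the house", "la casa"),
    ("the book", "el libro"),
    ("a house", "una casa"),
]


def dice_scores(pairs):
    """Score each (source word, target word) pair by the Dice coefficient."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Count every source/target word pair seen in the same sentence pair.
        joint.update((s, t) for s in src_words for t in tgt_words)
    return {
        (s, t): 2 * both / (src_count[s] + tgt_count[t])
        for (s, t), both in joint.items()
    }


def best_translation(word, scores):
    """Return the target word with the highest score for a source word."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    return max(candidates, key=candidates.get)


scores = dice_scores(PARALLEL)
print(best_translation("house", scores))  # casa
```

With so little data most words remain ambiguous ("book" co-occurs equally with "el" and "libro"), which hints at why data-driven systems depend on large bilingual collections and on models much richer than co-occurrence counts.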
The ESL Assistant presents a new paradigm of grammar correction in which large-scale statistical models and web services offer writing assistance for learners of English as a second or foreign language. The service is now available online. Additional information can be found on the team website. Updates on the project will also be available from time to time on the ESL Assistant team blog on MSDN.
Recognizing Textual Entailment has been proposed as a generic task that captures major semantic inference needs across many natural language processing applications. In conjunction with our work in this area, we have made available to the research community Manually Word Aligned RTE 2006 Data Sets (described in Brockett, 2007).
Paraphrase recognition and generation are crucial to creating applications that approximate our understanding of language. We have released a corpus of approximately 5000 sentence pairs that have been annotated by humans to indicate whether or not they can be considered paraphrases. Alignment phrase tables created using the data described in Quirk et al. (2004) and Dolan et al. (2004) are now also available for download.
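For readers unfamiliar with the task, a minimal word-overlap baseline for paraphrase recognition might look like the Python sketch below. It is a common starting point and not the method used to build or annotate the corpus; the function names, the 0.5 threshold, and the example sentences are all assumptions invented for illustration.

```python
def jaccard(sentence_a, sentence_b):
    """Jaccard similarity between the lowercased word sets of two sentences."""
    words_a = set(sentence_a.lower().split())
    words_b = set(sentence_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # Two empty sentences are treated as identical.
    return len(words_a & words_b) / len(words_a | words_b)


def is_paraphrase(sentence_a, sentence_b, threshold=0.5):
    """Label a pair as a paraphrase when word overlap reaches the threshold.

    The threshold is an arbitrary illustrative value, not a tuned setting.
    """
    return jaccard(sentence_a, sentence_b) >= threshold


# Invented example sentences, not drawn from the released corpus.
print(is_paraphrase("the company bought the startup",
                    "the startup was bought by the company"))  # True
print(is_paraphrase("the cat sat", "stocks fell sharply"))     # False
```

A baseline like this confuses pairs that share most of their words but differ in meaning, which is one reason human-annotated sentence pairs are valuable for training and evaluating better models.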
MindNet aims to formalize the representation of word meanings by developing methods for automatically building semantic networks from text and then exploring their structure. MindNets constructed from Japanese and English dictionary data are available for online browsing.
The Japanese NLP project page summarizes areas of research we are working on in processing Japanese.
Amalgam is a novel system, developed in the Natural Language Processing group at Microsoft Research, that employs machine learning techniques for sentence realization during natural language generation. Sentence realization is the process of generating (realizing) a fluent sentence from a semantic representation.
IntelliShrink is a product that uses linguistic analysis to abbreviate an email message so that it can be displayed on a cell phone. IntelliShrink analyses messages in English, French, German or Spanish.
- Microsoft Research Question-Answering Corpus, 13 November 2008
- Multi-System, Machine-Translated, Word-Order Collection, 28 March 2008
- NLP Data Sets for Comparative Study of Parameter-Estimation Methods, 2 June 2007
- Microsoft Research Paraphrase Phrase Tables, 10 October 2006
- ESL 123 Mass Noun Examples, 18 July 2006
- Microsoft Research Paraphrase Corpus, 3 March 2005
- Microsoft Research IME Corpus, 21 December 2005
- Bilingual Sentence Aligner, 14 May 2003
- Unification Grammar Sentence Realization Algorithms, 6 May 2003
- Moontae Lee, Xiaodong He, Wen-tau Yih, Jianfeng Gao, Li Deng, and Paul Smolensky, Reasoning in Vector Space: An Exploratory Study of Question Answering, in Proceedings of the International Conference on Learning Representations (ICLR) 2016, 2 May 2016.
- Huan Sun, Hao Ma, Xiaodong He, Wen-tau Yih, Yu Su, and Xifeng Yan, Table Cell Search for Question Answering, in Proceedings of the companion publication of the 25th international conference on World Wide Web, ACM – Association for Computing Machinery, 11 April 2016.
- Ting-Hao Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende, Michel Galley, and Margaret Mitchell, Visual Storytelling, in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ACL – Association for Computational Linguistics, 1 April 2016.
- Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan, A Diversity-Promoting Objective Function for Neural Conversation Models, in NAACL HLT 2016 (forthcoming), March 2016.
- Sauleh Eetemadi, William Lewis, Kristina Toutanova, and Hayder Radha, Survey of Data-Selection Methods in Statistical Machine Translation, in Machine Translation, Springer, December 2015.
- Yi Yang, Wen-tau Yih, and Christopher Meek, WikiQA: A Challenge Dataset for Open-Domain Question Answering, in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, ACL – Association for Computational Linguistics, 21 September 2015.
- Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon, Representing Text for Joint Embedding of Text and Knowledge Bases, in Empirical Methods in Natural Language Processing (EMNLP), ACL – Association for Computational Linguistics, 17 September 2015.
- Emre Kıcıman and Matthew Richardson, Towards Decision Support and Goal Achievement: Identifying Action-Outcome Relationships from Social Media, in Knowledge Discovery and Data Mining (KDD), ACM – Association for Computing Machinery, August 2015.
- Kristina Toutanova, Waleed Ammar, Pallavi Choudhury, and Hoifung Poon, Model Selection for Type-Supervised Learning with Application to POS Tagging, in The SIGNLL Conference on Computational Natural Language Learning, ACL – Association for Computational Linguistics, 30 July 2015.
- Kristina Toutanova and Danqi Chen, Observed Versus Latent Features for Knowledge Base and Text Inference, in 3rd Workshop on Continuous Vector Space Models and Their Compositionality, ACL – Association for Computational Linguistics, 30 July 2015.
- Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao, Semantic Parsing via Staged Query Graph Generation: Question Answering with Knowledge Base, in Proceedings of the Joint Conference of the 53rd Annual Meeting of the ACL and the 7th International Joint Conference on Natural Language Processing of the AFNLP, ACL – Association for Computational Linguistics, 28 July 2015.
- Igor Labutov, Sumit Basu, and Lucy Vanderwende, Deep Questions without Deep Understanding, to appear in Proceedings of ACL 2015, July 2015.
- Lucy Vanderwende, Arul Menezes, and Chris Quirk, An AMR parser for English, French, German, Spanish and Japanese and a new AMR-annotated corpus, in Proceedings of NAACL 2015, June 2015.
- Sauleh Eetemadi and Kristina Toutanova, Detecting Translation Direction: A Cross-Domain Study, in NAACL Student Research Workshop, ACL – Association for Computational Linguistics, 1 June 2015.
- Ankur P. Parikh, Hoifung Poon, and Kristina Toutanova, Grounded Semantic Parsing for Complex Knowledge Extraction, in NAACL HLT 2015, ACL – Association for Computational Linguistics, 1 June 2015.
- Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Meg Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan, A Neural Network Approach to Context-Sensitive Generation of Conversational Responses, in Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL-HLT 2015), 1 June 2015.
- Wen-tau Yih, Xiaodong He, and Jianfeng Gao, Deep Learning and Continuous Representations for NLP (Tutorial for NAACL-HLT-2015), 31 May 2015.
- Huan Sun, Hao Ma, Wen-tau Yih, Chen-Tse Tsai, Jingjing Liu, and Ming-Wei Chang, Open Domain Question Answering via Semantic Enrichment, in Proceedings of the companion publication of the 24th international conference on World Wide Web, ACM – Association for Computing Machinery, May 2015.
- Ryen W. White, Matthew Richardson, and Wen-tau Yih, Questions vs. Queries in Informational Search Tasks, in Proceedings of the companion publication of the 24th international conference on World Wide Web, ACM – Association for Computing Machinery, May 2015.
- Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng, Embedding Entities and Relations for Learning and Inference in Knowledge Bases, in Proceedings of the International Conference on Learning Representations (ICLR) 2015, May 2015.
- Chen-Tse Tsai, Wen-tau Yih, and Christopher J.C. Burges, Web-based Question Answering: Revisiting AskMSR, no. MSR-TR-2015-20, April 2015.
- Lucy Vanderwende, NLPwin – an introduction, no. MSR-TR-2015-23, March 2015.
- Hoifung Poon, Kristina Toutanova, and Chris Quirk, Distant Supervision for Cancer Pathway Extraction from Text, in Pacific Symposium on Biocomputing (PSB), 4 January 2015.
- Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng, Learning Multi-Relational Semantics Using Neural-Embedding Models, in NIPS 2014 workshop on Learning Semantics, 12 December 2014.
- Daniel Guo, Gokhan Tur, Wen-tau Yih, and Geoffrey Zweig, Joint Semantic Utterance Classification and Slot Filling with Recursive Neural Networks, in 2014 IEEE Spoken Language Technology Workshop (SLT 2014), IEEE – Institute of Electrical and Electronics Engineers, 10 December 2014.