Sent2Vec maps a pair of short text strings (e.g., sentences or query-answer pairs) to a pair of feature vectors in a continuous, low-dimensional space where the semantic similarity between the text strings is computed as the cosine similarity between their vectors in that space. Sent2Vec performs the mapping using the Deep Structured Semantic Model (DSSM) proposed in (Huang et al. 2013), or the DSSM with convolutional-pooling structure (CDSSM) proposed in (Shen et al. 2014; Gao et al. 2014).
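
The similarity computation described above can be illustrated with a minimal sketch. It assumes the two feature vectors have already been produced by some model and are plain lists of floats; the function name `cosine_similarity` is hypothetical and is not part of the Sent2Vec tool itself.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Convention chosen for this sketch: an all-zero vector has similarity 0.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, regardless of vector length.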

      MSR SPLAT. The Statistical Parsing and Linguistic Analysis Toolkit (SPLAT) is a linguistic analysis toolkit. Its main goal is to allow easy access to the linguistic analysis tools produced by the Natural Language Processing group at Microsoft Research. The tools include both traditional linguistic analysis tools, such as part-of-speech taggers and parsers, and more recent developments, such as sentiment analysis (identifying whether a particular piece of text has positive or negative sentiment towards its focus). Refer to our paper for a detailed description of SPLAT.

       Microsoft Web N-gram Services are jointly developed by Microsoft Research and Microsoft Bing. We invite the whole community to use the Web N-gram services, made available via a cloud-based platform, to drive discovery and innovation in web search, natural language processing, speech, and related areas. The services support research on real-world web-scale data, and regular data updates benefit projects that depend on dynamic data. Here is my talk given at Johns Hopkins University. The talk is based on our WWW-2010 paper.

       Bayesian Estimators for Unsupervised HMM POS tagger. This toolkit provides six different Bayesian estimators for unsupervised Hidden Markov Model Part-of-Speech taggers, reported in the 2008 paper by Jianfeng Gao and Mark Johnson, "A comparison of Bayesian estimators for unsupervised Hidden Markov Model POS taggers," presented at the 2008 Conference on Empirical Methods in Natural Language Processing.

       The Microsoft Research ESL Assistant is a web service that provides correction suggestions for typical ESL (English as a Second Language) errors. Such errors include, for example, the choice of determiners (the/a) and the choice of prepositions. The web service also provides word choice suggestions from a thesaurus. In order to help the user make decisions on whether to accept a suggestion, the service displays "before and after" web search results so that the user can see real-life examples of the usage of both their original input and the suggested correction. An Outlook plugin that connects to the web service and copies text from an email into the web service UI is also available. For a detailed description of the system, see our paper.

       MSRLM (download here), the Microsoft Research Language Modeling toolkit, is a scalable language modeling toolkit. It implements an efficient method for building large language models from billions of words and upwards. We use these language models for first-pass decoding in statistical machine translation.

       The "Orthant-Wise Limited-memory Quasi-Newton" algorithm (OWL-QN) is a new method for optimizing an L1-regularized loss that is very efficient, even on problems with millions of parameters. Source code for OWL-QN, including a standalone trainer for L1-regularized least-squares or logistic regression, is available for download. Refer to (Andrew and Gao, 2007) for the description of the algorithm, and to (Gao et al., 2007) for its application in several NLP tasks and a comparison with other state-of-the-art parameter estimators.
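
To make the notion of an L1-regularized loss concrete, here is a minimal sketch of the objective and of the pseudo-gradient that orthant-wise methods use in place of the ordinary gradient, since the L1 term is not differentiable at zero. This is my reading of the technique, not the released source code; the function names and the regularization weight `c` are hypothetical, and the gradient of the smooth loss is assumed to be supplied by the caller.

```python
def l1_objective(loss_value, weights, c):
    """Smooth loss plus c times the L1 norm of the weights."""
    return loss_value + c * sum(abs(w) for w in weights)


def pseudo_gradient(weights, loss_grad, c):
    """Pseudo-gradient of loss(w) + c * ||w||_1, computed per coordinate.

    weights   -- current parameter values
    loss_grad -- gradient of the smooth loss at `weights` (assumed given)
    c         -- non-negative L1 regularization weight
    """
    result = []
    for w, g in zip(weights, loss_grad):
        if w > 0:
            result.append(g + c)      # |w| has derivative +1 here
        elif w < 0:
            result.append(g - c)      # |w| has derivative -1 here
        elif g + c < 0:
            result.append(g + c)      # at zero: moving positive lowers the objective
        elif g - c > 0:
            result.append(g - c)      # at zero: moving negative lowers the objective
        else:
            result.append(0.0)        # at zero: no descent direction, weight stays zero
    return result
```

The last branch is what yields sparse solutions: a weight at zero stays there unless the loss gradient is larger in magnitude than `c`.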

       Microsoft Research IME Corpus provides a test data set for the task of Japanese character conversion for text input. For more about the corpus, see our technical report.

       S-MSRSeg is a simplified version of the Chinese word segmenter and named entity recognizer described in (Gao et al., 2005).