·
MSR SPLAT.
The Statistical Parsing and Linguistic Analysis Toolkit (SPLAT) is a linguistic analysis
toolkit. Its main goal is to allow easy access to the linguistic analysis tools
produced by the Natural Language Processing group at Microsoft Research. The
tools include both traditional linguistic analysis tools, such as part-of-speech
taggers and parsers, and more recent developments, such as sentiment analysis
(identifying whether a particular piece of text has positive or negative sentiment
towards its focus). Refer to our paper for a
detailed description of SPLAT.
·
Microsoft
Web N-gram Services are jointly developed by Microsoft Research and
Microsoft Bing. We invite the whole community to use
the Web N-gram services, made available via a cloud-based platform, to drive
discovery and innovation in web search, natural language processing,
speech, and related areas by conducting research on real-world web-scale data,
taking advantage of regular data updates for projects that benefit from dynamic
data. Here is my talk given at Johns Hopkins
University. The talk is based on our WWW-2010
paper.
·
Bayesian
Estimators for Unsupervised HMM POS Taggers. This toolkit provides six different Bayesian estimators for
unsupervised Hidden Markov Model part-of-speech taggers, as reported in the 2008
paper by Jianfeng Gao and Mark Johnson, “A comparison of Bayesian estimators for
unsupervised Hidden Markov Model POS taggers,” presented at the 2008
Conference on Empirical Methods in Natural Language Processing.
·
The
Microsoft Research ESL Assistant is a
web service that provides correction suggestions for typical ESL (English as a
Second Language) errors. Such errors include, for example, the choice of
determiners (the/a) and the choice of prepositions. The web service also
provides word choice suggestions from a thesaurus. In order to help the user
make decisions on whether to accept a suggestion, the service displays
"before and after" web search results so that the user can see
real-life examples of the usage of both their original input and the suggested
correction. An Outlook plugin that connects to the web service and copies text
from an email into the web service UI is also available. For a detailed
description of the system, see our paper.
·
MSRLM
(download here), the Microsoft Research Language Modeling toolkit, is a
scalable language modeling toolkit. The
toolkit implements an efficient method to build large language models, from
billions of words and upwards. We use these language models for first-pass
decoding in statistical machine translation.
·
"Orthant-Wise Limited-memory Quasi-Newton" algorithm (OWL-QN) is a new
method for optimizing an L1-regularized loss that is very efficient, even on problems
with millions of parameters. Source code for OWL-QN, including a standalone
trainer for L1-regularized least-squares or logistic regression, is available for download.
Refer to (Andrew
and Gao, 2007) for a description
of the algorithm, and to (Gao et al., 2007) for its application to several NLP
tasks and a comparison with other state-of-the-art parameter estimators.
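To make the idea of an L1-regularized loss concrete, here is a minimal Python sketch of the objective and of the pseudo-gradient that orthant-wise methods use where the L1 term is not differentiable (at zero weights). It is only an illustration of that one ingredient, not the released OWL-QN source code and not a full optimizer. The function names `l1_objective` and `pseudo_gradient` and the parameter `c` (the regularization weight) are hypothetical.

```python
from typing import List


def l1_objective(loss: float, w: List[float], c: float) -> float:
    """L1-regularized objective: loss(w) + c * sum_i |w_i|."""
    return loss + c * sum(abs(wi) for wi in w)


def pseudo_gradient(w: List[float], grad_loss: List[float], c: float) -> List[float]:
    """Pseudo-gradient of loss(w) + c * ||w||_1 (illustrative sketch).

    grad_loss is the gradient of the smooth loss term at w.
    For w_i != 0 the L1 term is differentiable and contributes c * sign(w_i).
    For w_i == 0 we pick the one-sided derivative that permits descent,
    or 0 if neither side does.
    """
    pg = []
    for wi, gi in zip(w, grad_loss):
        if wi > 0:
            pg.append(gi + c)
        elif wi < 0:
            pg.append(gi - c)
        else:
            left = gi - c   # derivative approaching zero from the negative side
            right = gi + c  # derivative approaching zero from the positive side
            if left > 0:
                pg.append(left)
            elif right < 0:
                pg.append(right)
            else:
                pg.append(0.0)
    return pg


if __name__ == "__main__":
    w = [1.0, -2.0, 0.0, 0.0, 0.0]
    g = [0.5, 0.5, 2.0, -2.0, 0.3]
    print(pseudo_gradient(w, g, 1.0))
```

A zero weight whose loss gradient is smaller in magnitude than `c` gets a pseudo-gradient of zero, so it stays at zero. This is how L1 regularization yields sparse models.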
·
The Microsoft Research IME Corpus provides a test
data set for the task of Japanese character conversion for text input. For more
about the corpus, see our technical report.
·
S-MSRSeg is a simplified version of the Chinese word segmenter
and named entity recognizer described in (Gao et al., 2005).