MSRLM: a scalable language modeling toolkit

Patrick Nguyen, Jianfeng Gao, and Milind Mahajan

Abstract

MSRLM is the release of the internal language modeling tool chain used at Microsoft Research. It was used in our submission for NIST MT 2006. The main difference from other freely available tools is that it was designed to scale to large amounts of data: on high-end hardware, we successfully built a language model from 40 billion words of web data within 8 hours. It supports only a minimal set of features. Large gigaword language models may be used in first-pass machine translation decoding without further processing. This document briefly describes the implementation and usage of the tools. The LM tool is available at: http://research.microsoft.com/research/downloads/details/78e26f9c-fc9a-44bb-80a7-69324c62df8c/details.aspx.

Details

Publication type: TechReport
Number: MSR-TR-2007-144
Pages: 19
Institution: Microsoft Research