MSRLM: a Scalable Language Modeling Toolkit

MSR-TR-2007-144

MSRLM is the release of the internal language modeling tool chain used at Microsoft Research. It was used in our submission to NIST MT 2006. The main difference from other freely available tools is that it was designed to scale to large amounts of data: on high-end hardware, we successfully built a language model from 40 billion words of web data within 8 hours. The toolkit supports only a minimal set of features. Large gigaword language models can be used in first-pass machine translation decoding without further processing. This document briefly describes the implementation and usage of the tools.