Microsoft Research Treelet Translation System: NIST MT Evaluation 06

MSR-TR-2007-144


MSRLM is the release of the internal language modeling tool chain used at Microsoft Research. It was used in our submission for NIST MT 2006. The main difference from other freely available tools is that it was designed to scale to large amounts of data: on high-end hardware, we built a language model from 40 billion words of web data within 8 hours. It supports only a minimal set of features. Large gigaword language models may be consumed in first-pass machine translation decoding without further processing. This document briefly describes the implementation and usage of the tools. It is our stated goal and hope that this release will be useful to the scientific community. The tool may not be used in a commercial product, to build models used in a commercial product, or for any commercial purpose. In addition, we require that you kindly cite this technical report when publishing results derived with this language model tool chain.