A Language-Modeling Approach to Inverse Text Normalization and Data Cleanup

In this paper we address two related problems in a multimodal local search application on mobile devices: first, displaying business names correctly, and second, harvesting language model training data from an inconsistently labeled corpus. We present a quantitative investigation of the impact of common text normalization and language model training procedures. Our proposed language model framework eliminates the need for inverse text normalization, or "pretty print," while achieving superior accuracy. We also demonstrate that the same framework automatically salvages, or cleans up, dirty language model training data. Our new language model is 25% more accurate and 25% smaller.
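The abstract gives no implementation detail, but the core idea of skipping inverse text normalization is to build the language model over display forms directly and let it score candidate renderings of a recognized name. Below is a minimal, hypothetical sketch of that idea, not the paper's actual method: a smoothed bigram model trained on a toy corpus of display-form business names (the corpus entries, function names, and candidates are all illustrative assumptions).

```python
import math
from collections import Counter

# Hypothetical training data: business names in the display form they
# should appear in on screen. In the paper's setting this would be the
# (cleaned) listing corpus; these entries are purely illustrative.
DISPLAY_CORPUS = [
    "Joe's Bar & Grill",
    "5th Avenue Deli",
    "Barnes & Noble",
    "Pizza & Pasta Express",
]

def bigram_counts(corpus):
    """Count display-form token unigrams and bigrams with sentence markers."""
    unigrams, bigrams = Counter(), Counter()
    for name in corpus:
        tokens = ["<s>"] + name.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

UNIGRAMS, BIGRAMS = bigram_counts(DISPLAY_CORPUS)
VOCAB = len(UNIGRAMS)

def log_prob(candidate):
    """Add-one-smoothed bigram log-probability of a display-form string."""
    tokens = ["<s>"] + candidate.split() + ["</s>"]
    score = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        score += math.log((BIGRAMS[(prev, cur)] + 1) /
                          (UNIGRAMS[prev] + VOCAB))
    return score

def pick_display_form(candidates):
    """Choose the candidate the display-form LM scores highest, standing in
    for inverse text normalization of a recognized business name."""
    return max(candidates, key=log_prob)

if __name__ == "__main__":
    # Two display-form hypotheses for the spoken "joe's bar and grill";
    # the model prefers the form seen in the display-form corpus.
    print(pick_display_form(["Joe's Bar and Grill", "Joe's Bar & Grill"]))
```

Because the model is trained on display forms, the same scoring function can also flag implausible (dirty) training entries, which is the intuition behind the data-cleanup use described in the abstract.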

IS080913.pdf (PDF)

Publisher: International Speech Communication Association
© 2007 ISCA. Personal use of this material is permitted. However, permission to reprint or republish this material for advertising or promotional purposes, to create new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from ISCA and/or the author.

Details

Type: Inproceedings