A Language-Modeling Approach to Inverse Text Normalization and Data Cleanup

Yun-Cheng Ju and Julian Odell

Abstract

In this paper we address two related problems in multimodal local search applications on mobile devices: first, correctly displaying business names, and second, harvesting language model training data from inconsistently labeled corpora. We present a quantitative investigation into the impact of common text normalization and language model training procedures. Our proposed language model framework eliminates the need for a separate inverse text normalization, or "pretty print," step while achieving superior accuracy. We also demonstrate that the same framework automatically salvages, or cleans up, dirty language model training data. The resulting language model is 25% more accurate and 25% smaller.
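To make the task concrete, the sketch below illustrates what inverse text normalization (pretty print) does for local search: converting a recognizer's spoken-form output into display form. This is a minimal rule-based toy, not the paper's language-modeling approach, and the patterns and business names are hypothetical examples.

```python
import re

# Hypothetical spoken-form patterns and their written-form replacements.
# A traditional ITN component applies rules like these as a post-process;
# the paper's approach instead builds the written forms into the language
# model itself, so no separate pretty-print step is needed.
RULES = [
    (re.compile(r"\btwenty four seven\b"), "24/7"),
    (re.compile(r"\bh and r block\b"), "H&R Block"),
    (re.compile(r"\bseven eleven\b"), "7-Eleven"),
]

def pretty_print(spoken: str) -> str:
    """Map a spoken-form query to its display (written) form."""
    text = spoken.lower()
    for pattern, written in RULES:
        text = pattern.sub(written, text)
    return text

print(pretty_print("twenty four seven fitness"))   # -> "24/7 fitness"
print(pretty_print("h and r block near me"))       # -> "H&R Block near me"
```

Rule-based pipelines like this are brittle when training transcripts are inconsistently labeled (some written-form, some spoken-form), which is the data-cleanup problem the paper's unified framework also addresses.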

Details

Publication type: Inproceedings
Publisher: International Speech Communication Association