Chao Huang, Eric Chang, Jianlai Zhou, Kai-Fu Lee, and Shuo Di
Large vocabulary continuous Mandarin speech recognition has been an important problem for speech recognition researchers for several reasons. First, Mandarin is a tonal language that requires special treatment for the modeling of tones: there are five tones in Mandarin, which are necessary to disambiguate between otherwise confusable words. Second, the difficulty of entering Chinese by keyboard presents a great opportunity for speech recognition to improve computer usability. Previous approaches to modeling tones have included using a separate tone classifier and incorporating pitch directly into the feature vector. In this paper, we describe a large vocabulary Mandarin speech recognition system based on Microsoft’s Whisper system. We compare several alternatives for modeling tones and their error rates on continuous speech. The experimental results show a character error rate of 7.32% on a test set of 50 speakers and 1000 sentences when no special tone processing is performed in the acoustic model. When the final syllable model set is expanded to include tones, the error rate drops to 6.43% (a relative error rate reduction of 12.2%). When pitch information and the larger final syllable set are used in combination, the error rate is 6.03% (a cumulative relative error rate reduction of 17.6%). This result suggests that other sources of information such as energy and duration can also contribute toward disambiguating between different tones.
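The reduction figures quoted above are relative to the 7.32% baseline, not absolute differences in percentage points. The short sketch below reproduces that arithmetic from the three character error rates reported in the abstract; the function and variable names are illustrative and do not come from the paper.

```python
def relative_error_reduction(baseline: float, improved: float) -> float:
    """Return the error rate reduction relative to the baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return 100.0 * (baseline - improved) / baseline


# Character error rates (in percent) as reported in the abstract.
BASELINE_CER = 7.32                  # no special tone processing
TONAL_FINALS_CER = 6.43              # final syllable set expanded with tones
TONAL_FINALS_PLUS_PITCH_CER = 6.03   # expanded final set plus pitch information

tone_reduction = relative_error_reduction(BASELINE_CER, TONAL_FINALS_CER)
cumulative_reduction = relative_error_reduction(
    BASELINE_CER, TONAL_FINALS_PLUS_PITCH_CER
)

print(f"Tonal finals:         {tone_reduction:.1f}% relative reduction")
print(f"Tonal finals + pitch: {cumulative_reduction:.1f}% relative reduction")
```

Both values round to the figures given in the text (12.2% and 17.6%), which confirms that the cumulative reduction is measured against the original baseline rather than against the 6.43% intermediate result.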