Deep Learning for Pronunciation Training and Evaluation

Computer Aided Language Learning (CALL)

A New DNN-based High Quality Pronunciation Evaluation for Computer-Aided Language Learning (CALL)

Interspeech 2013

Wenping Hu, Yao Qian, Frank K. Soong


In this paper, we propose to use a Deep Neural Net (DNN), which has recently been shown to reduce speech recognition errors significantly, in Computer-Aided Language Learning (CALL) to evaluate English learners' pronunciations. Multi-layer, stacked Restricted Boltzmann Machines (RBMs) are first trained as nonlinear basis functions to represent speech signals succinctly, and the output layer is discriminatively trained to optimize the posterior probabilities of correct, sub-phonemic "senone" states. Three Goodness of Pronunciation (GOP) scores, namely the likelihood-based posterior probability, averaged frame-level posteriors of the DNN output-layer "senone" nodes, and the log-likelihood ratio of correct and competing models, are tested with recordings of both native and non-native speakers, along with manual grading of pronunciation quality. The experimental results show that the GOP estimated from averaged frame-level posteriors of "senones" correlates best with human scores. Compared with GOPs estimated with non-DNN (i.e., GMM-HMM) based models, the new approach improves the correlations relatively by 22.0% and 15.6% at the word and sentence levels, respectively. In addition, the frame-level posteriors, which need neither a decoding lattice nor the corresponding forward-backward computations, are suitable for supporting fast, on-line, multi-channel applications.
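The best-correlating score above, the averaged frame-level posterior GOP, is cheap to compute once a forced alignment is available. A minimal numpy sketch (function and variable names are illustrative, not from the paper): average the log posterior that the DNN assigns to each frame's aligned target state.

```python
import numpy as np

def gop_from_posteriors(frame_posteriors, alignment):
    """GOP as the mean log posterior of the aligned target states.

    frame_posteriors: (T, S) array, DNN output posteriors per frame
                      over S senone states.
    alignment: length-T sequence of target senone indices obtained by
               forced alignment against the reference transcript.
    """
    T = len(alignment)
    # pick each frame's posterior for its aligned senone, floor to avoid log(0)
    target_post = frame_posteriors[np.arange(T), alignment]
    return float(np.log(np.clip(target_post, 1e-10, 1.0)).mean())

# toy example: 4 frames, 3 senone states
post = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
])
align = [0, 0, 1, 1]
score = gop_from_posteriors(post, align)  # closer to 0 = better pronounced
```

A higher (less negative) score means the learner's frames match the expected senones more closely; per-word or per-sentence GOPs are the same average restricted to the aligned span.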


App in Bing Dictionary V3.1.1 (我爱说英语, "I Love Speaking English")




Wenping Hu, Yao Qian, Frank K. Soong

In this paper we investigate a Deep Neural Network (DNN) based approach to acoustic modeling of a tonal language and assess its speech recognition performance with different features and modeling techniques. Mandarin Chinese, the most widely spoken tonal language, is chosen for testing the tone-related ASR performance. Furthermore, the DNN-trained, tone-sensitive model is evaluated in automatic detection of mispronunciations among L2 Mandarin learners. The best DNN-HMM acoustic model of tonal syllables (initial and tonal final), trained with embedded F0 features, shows improved ASR performance when compared with the baseline DNN system trained on 39 MFCC features. The proposed system achieves better ASR performance than the baseline system, i.e., 32% and 35% relative tone error rate reduction and 20% and 23% relative tonal syllable error rate reduction, for female and male speakers, respectively. On a speech database of L2 Mandarin learners (native speakers of European languages), a 2% absolute equal error rate reduction, from 27.5% to 25.5%, has been obtained with our DNN-HMM system in detecting mispronunciations, compared with the baseline system.
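The "embedded F0 features" idea amounts to augmenting each acoustic frame with a continuous pitch track before DNN training. A minimal sketch of one common way to do this (the exact feature pipeline in the paper may differ; names here are illustrative): interpolate F0 through unvoiced regions, take its log, and append it plus a voicing flag to the cepstral features.

```python
import numpy as np

def add_f0_features(mfcc, f0):
    """Append interpolated log-F0 and a voicing flag to MFCC frames.

    mfcc: (T, D) array of cepstral features.
    f0:   length-T array of F0 values in Hz, 0 where unvoiced.
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    t = np.arange(len(f0))
    # linearly interpolate F0 across unvoiced frames so every frame has a value
    f0_interp = np.interp(t, t[voiced], f0[voiced])
    log_f0 = np.log(f0_interp)
    return np.hstack([mfcc, log_f0[:, None], voiced[:, None].astype(float)])

# toy example: 4 frames of 3-dim MFCCs, two unvoiced frames in the middle
mfcc = np.zeros((4, 3))
feats = add_f0_features(mfcc, [100.0, 0.0, 0.0, 200.0])
```

Feeding the tone-bearing F0 dimension to the DNN is what lets the senone posteriors discriminate between tonal finals that are segmentally identical.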


A New Neural Network Based Logistic Regression Classifier For Improving Mispronunciation Detection of L2 Language Learners


Wenping Hu, Yao Qian, Frank K. Soong

In this paper, we propose a Neural Network (NN) based Logistic Regression (LR) classifier for improving the phone mispronunciation detection rate in a Computer-Aided Language Learning (CALL) system. A general neural network with multiple hidden layers for extracting useful speech features is first trained on pooled training data, and then phone-dependent, 2-class logistic regression classifiers are trained as individual, phoneme-specific nodes at the output layer. This new NN-based classifier with shared hidden layers streamlines the time-consuming work of training multiple individual classifiers separately, i.e., one per phoneme, and learns a common feature representation via the shared hidden layers. Its improved performance, compared with independently trained, phoneme-specific classifiers, is verified on a test database of isolated English words recorded by non-native English learners. Compared with the conventional Goodness of Pronunciation (GOP)-based approach, the NN-based LR classifier improves precision and recall by 37.1% and 11.7% (absolute), respectively. On the same test data, it also outperforms a Support Vector Machine (SVM)-based classifier, which is widely used for mispronunciation detection: at a slightly better precision, recall improves by 10.6% absolute, a 21.6% relative improvement.
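The architecture described above, shared hidden layers feeding phone-dependent 2-class logistic regression heads, can be sketched in a few lines of numpy. This is a forward-pass-only illustration with toy random weights and made-up phone labels, not the paper's trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_hidden(x, weights, biases):
    """Pass input features through hidden layers shared by all phone heads."""
    h = x
    for W, b in zip(weights, biases):
        h = np.tanh(h @ W + b)
    return h

def phone_lr_score(h, w_phone, b_phone):
    """Phone-dependent logistic regression head on the shared representation:
    returns P(correct pronunciation) for that phone."""
    return 1.0 / (1.0 + np.exp(-(h @ w_phone + b_phone)))

# toy dimensions: 39-dim input, two shared hidden layers, heads for 3 phones
Ws = [rng.standard_normal((39, 64)) * 0.1, rng.standard_normal((64, 32)) * 0.1]
bs = [np.zeros(64), np.zeros(32)]
heads = {p: (rng.standard_normal(32) * 0.1, 0.0) for p in ["AH", "IY", "TH"]}

x = rng.standard_normal(39)              # one segment-level feature vector
h = shared_hidden(x, Ws, bs)             # representation shared by all phones
scores = {p: phone_lr_score(h, w, b) for p, (w, b) in heads.items()}
```

In training, only the head for the phone actually spoken in a segment receives a gradient, while the shared layers accumulate gradients from all phones, which is what gives the common feature representation the paper attributes its gains to.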