
Jun Du, Ren-Hua Wang, and Li-Rong Dai


It is well known that the accuracy of an automatic speech recognition (ASR) system decreases significantly in noisy environments if no countermeasures are taken. The main objective of this work is to make ASR systems more robust to the detrimental effects of various noise distortions. Noise-robust methods fall into two broad classes: feature-domain approaches and model-domain approaches. In this thesis, we study both classes in depth; the main contributions are as follows:

First, we propose a novel feature normalization method based on an implicit model, Cepstral Shape Normalization (CSN). We find that the shape of speech feature distributions changes in noisy environments relative to the clean condition. CSN therefore normalizes the shape of the feature distributions by applying an exponential factor. This method proves more effective than traditional approaches such as HEQ and HOCMN, especially at low SNRs.

Next, we turn to a new feature compensation method based on an explicit distortion model, Piecewise Linear Approximation (PLA). By approximating the explicit model with piecewise linear functions, we obtain a more accurate approximation than two classical approaches, the VTS and MAX approximations. We derive formulations for maximum likelihood estimation (MLE) of the noise model parameters and minimum mean square error (MMSE) estimation of clean speech. We also propose a hybrid approach that uses different approximations for different types of noisy speech segments, which further improves performance. Besides the speech recognition experiments, we also apply PLA to speech enhancement and obtain good results in both subjective and objective evaluations.
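To make the comparison between approximations concrete, the following is a minimal sketch, not the formulation derived in the thesis. It assumes the commonly used log-spectral distortion model for additive noise in a single filter-bank channel, y = x + log(1 + exp(n - x)), with channel distortion ignored. The function names, the expansion point, and the knot positions are all illustrative assumptions.

```python
import math
from bisect import bisect_right


def mismatch(d):
    """g(d) = log(1 + exp(d)) with d = n - x, computed in a numerically stable way."""
    return max(d, 0.0) + math.log1p(math.exp(-abs(d)))


def sigmoid(d):
    """Derivative of g(d), computed without overflow for large |d|."""
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def noisy_exact(x, n):
    """Assumed explicit model: noisy log-energy from clean speech x and noise n."""
    return x + mismatch(n - x)


def noisy_max(x, n):
    """MAX approximation: the larger of speech and noise dominates."""
    return max(x, n)


def noisy_vts1(x, n, x0, n0):
    """First-order Taylor (VTS-style) expansion around the point (x0, n0)."""
    d0 = n0 - x0
    slope = sigmoid(d0)
    return x0 + mismatch(d0) + (1.0 - slope) * (x - x0) + slope * (n - n0)


# Hypothetical knot positions; denser near d = 0, where g curves most.
KNOTS = [-8.0, -4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0, 8.0]


def mismatch_pla(d, knots=KNOTS):
    """Generic piecewise linear interpolation of g(d) between the knots.

    Outside the knot range the asymptotes g -> 0 and g -> d are used,
    which leaves a very small jump at the outermost knots.
    """
    if d <= knots[0]:
        return 0.0
    if d >= knots[-1]:
        return d
    i = bisect_right(knots, d) - 1
    lo, hi = knots[i], knots[i + 1]
    t = (d - lo) / (hi - lo)
    return (1.0 - t) * mismatch(lo) + t * mismatch(hi)


def noisy_pla(x, n):
    """Piecewise linear approximation of the assumed explicit model."""
    return x + mismatch_pla(n - x)
```

In this sketch the first-order expansion is accurate only near its expansion point and the MAX approximation is least accurate when speech and noise have similar energy, whereas the interpolation error of the piecewise linear version is controlled everywhere by the knot spacing.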

PLA has a drawback, however: all of its formulations are carried out in the log Mel filter bank (LMFB) domain, without considering the correlations among filter-bank channels. To obtain an accurate approximation from a different angle, we propose a novel High-Order Vector Taylor Series (HOVTS) approximation of the explicit model. It has the following advantages: 1) the explicit model includes both noise and channel distortions; 2) the nonlinear distortion function can be approximated by HOVTS to any order; 3) correlations among different filter-bank channels can be taken into account.
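The effect of the expansion order can be illustrated with a scalar sketch. This is not the HOVTS formulation of the thesis: it assumes a single filter-bank channel with the commonly used model y = x + h + log(1 + exp(n - x - h)), where h is the channel distortion, and it ignores the cross-channel correlations that the vector form handles. The function names and the cap at third order are assumptions made for brevity.

```python
import math


def softplus(d):
    """g(d) = log(1 + exp(d)), computed in a numerically stable way."""
    return max(d, 0.0) + math.log1p(math.exp(-abs(d)))


def sigmoid(d):
    """First derivative of g, computed without overflow for large |d|."""
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def softplus_derivatives(d0):
    """Values of g, g', g'' and g''' at d0, written in terms of s = sigmoid(d0)."""
    s = sigmoid(d0)
    return [softplus(d0), s, s * (1.0 - s), s * (1.0 - s) * (1.0 - 2.0 * s)]


def noisy_exact(x, n, h):
    """Assumed explicit model with additive noise n and channel distortion h."""
    return x + h + softplus(n - x - h)


def noisy_taylor(x, n, h, d0, order):
    """Taylor expansion of the assumed model, truncated at the given order.

    Because x + h is linear and d = n - x - h is linear in (x, n, h),
    expanding y in (x, n, h) is the same as expanding g in d around d0.
    """
    if not 1 <= order <= 3:
        raise ValueError("this sketch supports orders 1 to 3 only")
    d = n - x - h
    derivs = softplus_derivatives(d0)
    g = sum(derivs[k] * (d - d0) ** k / math.factorial(k) for k in range(order + 1))
    return x + h + g


if __name__ == "__main__":
    # Expand around d0 = 1 and evaluate at d = 2 (x = 0, h = 0, n = 2).
    exact = noisy_exact(0.0, 2.0, 0.0)
    for order in (1, 2, 3):
        approx = noisy_taylor(0.0, 2.0, 0.0, d0=1.0, order=order)
        print(order, abs(approx - exact))
```

For this evaluation point the absolute error shrinks as the order increases. Note that the Taylor series of this function has a finite radius of convergence, so higher orders help only reasonably close to the expansion point.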

Finally, we investigate the noise robustness of discriminatively trained HMMs. As background, we first introduce Minimum Divergence (MD), a new discriminative training (DT) criterion that we proposed. Our experiments show that MD outperforms the popular MPE criterion in the clean condition on both small and large tasks. We then discuss issues related to noise-robust discriminative training, including the comparison between MD and MWE/MPE, how to design the ML baseline, and how to treat the silence/background model.

All of the techniques proposed above are evaluated on small tasks such as Aurora2 and Aurora3, which are connected digit string databases designed to evaluate noise-robust methods. For completeness, we also compare different methods on Aurora4, an LVCSR database. These preliminary experiments show that noise robustness for LVCSR remains a difficult open problem.


Publication type: PhD Thesis
Title: Noise-Robust Methods for Automatic Speech Recognition (自动语音识别中的噪声鲁棒性方法)