Towards High-Accuracy Low-Cost Noisy Robust Speech Recognition Exploiting Structured Model

ICML Workshop 2011

It is well known that the distorted speech y can be considered as generated from the clean speech x with the additive noise n and the convolutive channel h as y = x * h + n. In this paper, we present our recent study on using this structured model of physical distortion for robust automatic speech recognition. Three methods, with different online computation costs, are introduced for joint compensation of additive and convolutive distortions (JAC): JAC model adaptation, GMM-based JAC model adaptation, and JAC feature enhancement. All of these algorithms consist of two main steps. First, the noise and channel parameters are estimated using a nonlinear environment distortion model in the cepstral domain together with the vector Taylor series (VTS) linearization technique. Second, the estimated noise and channel parameters are used either to adapt the hidden Markov model (HMM) parameters or to clean the distorted speech features.
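The distortion model and its linearization can be illustrated with a minimal sketch. This is not the implementation used in the paper: it assumes NumPy, works in the log power spectral domain rather than the cepstral domain (the cepstral form would additionally apply a DCT matrix and its inverse), ignores the phase between speech and noise, and uses hypothetical function names (`distort`, `log_spectral_distortion`, `vts_linearize`) chosen only for illustration.

```python
import numpy as np


def distort(x, h, n):
    """Time-domain model y = x * h + n, where '*' denotes convolution.

    x: clean speech samples, h: channel impulse response,
    n: additive noise samples (same length as x).
    """
    # Truncate the full convolution so y has the same length as x.
    return np.convolve(x, h)[: len(x)] + n


def log_spectral_distortion(x_log, h_log, n_log):
    """Nonlinear distortion in the log power spectral domain (assumed form).

    With the phase term ignored, |Y|^2 = |X|^2 |H|^2 + |N|^2, so
    y = x + h + log(1 + exp(n - x - h)) for log power spectra x, h, n.
    """
    return x_log + h_log + np.log1p(np.exp(n_log - x_log - h_log))


def vts_linearize(x0, h0, n0):
    """First-order VTS expansion of the model around (x0, h0, n0).

    Returns the value y0 at the expansion point and the per-bin gain
    G = dy/dx = dy/dh; the noise sensitivity is dy/dn = 1 - G.
    """
    y0 = log_spectral_distortion(x0, h0, n0)
    G = 1.0 / (1.0 + np.exp(n0 - x0 - h0))
    return y0, G


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    x0 = np.array([0.0, 2.0])
    h0 = np.array([0.5, -0.5])
    n0 = np.array([1.0, -3.0])
    y0, G = vts_linearize(x0, h0, n0)
    print("y0 =", y0)
    print("G  =", G)
```

Under these assumptions, G is close to 1 in bins where speech dominates the noise and close to 0 where noise dominates, which is what makes the linearized model usable both for adapting model means and for estimating the noise and channel parameters.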

In the experimental evaluation on the standard Aurora 2 task, the proposed JAC algorithms all achieve around 89% accuracy using the clean-trained complex HMM backend, which compares favorably with previously developed techniques. Moreover, the JAC feature enhancement method has a much lower computation cost than the other two methods and can be used as a high-accuracy, low-cost noise-robust front end. Detailed analysis of the experimental results shows that updating all of the noise and channel distortion parameters online is critical to the success of our algorithms.