Jasha Droppo, Alex Acero, and Li Deng
In this paper we present a new statistical model that describes how additive noise corrupts the Mel-frequency spectral features used in speech recognition. The model explicitly represents the effect of the unknown phase, together with the unobserved clean speech and noise, as three hidden variables. We use this model to produce noise-robust features for automatic speech recognition. The model is constructed in the log Mel-frequency feature domain. Features in this domain are linearly related to MFCC recognition parameters, and working there brings the advantages of low dimensionality and independence of the corruption across feature dimensions. We illustrate the surprising result that the traditional spectral subtraction formula is flawed even when the true noise Mel-frequency spectral feature is known. We show that the new model can be used to derive a spectral subtraction formula that produces superior error rates and is less sensitive to tuning parameters. Finally, we present results demonstrating that the new model is more general than spectral subtraction and can take advantage of a prior noise estimate to produce robust features, rather than relying on point estimates of noise.
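The claim that spectral subtraction is flawed even with the true noise known can be illustrated with a minimal sketch. It is not the paper's model: it assumes a single frequency bin rather than a Mel filter-bank output, and it relies only on the identity |X + N|^2 = |X|^2 + |N|^2 + 2|X||N|cos(theta), where theta is the phase angle between speech and noise. The function names, the flooring constant, and the uniform phase distribution are all assumptions made for illustration.

```python
import cmath
import math
import random


def power_with_phase(x_mag, n_mag, theta):
    """Return |X + N|^2 for a speech component of magnitude x_mag and a
    noise component of magnitude n_mag separated by phase angle theta."""
    x = complex(x_mag, 0.0)
    n = cmath.rect(n_mag, theta)  # noise rotated by theta relative to speech
    return abs(x + n) ** 2


def naive_spectral_subtraction(y_power, n_power, floor=1e-10):
    """Power-domain subtraction that ignores the phase cross term.
    The floor (a hypothetical value) keeps the estimate positive."""
    return max(y_power - n_power, floor)


def log_errors(x_mag=1.0, n_mag=1.0, trials=10000, seed=0):
    """Log-domain error of the naive estimate when the TRUE noise power
    is supplied and only the phase is unknown (assumed uniform here)."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        theta = rng.uniform(-math.pi, math.pi)
        y_power = power_with_phase(x_mag, n_mag, theta)
        estimate = naive_spectral_subtraction(y_power, n_mag ** 2)
        errors.append(math.log(estimate) - math.log(x_mag ** 2))
    return errors


if __name__ == "__main__":
    errs = log_errors()
    print("min log error:", min(errs))
    print("max log error:", max(errs))
```

Because the cross term can take either sign, the naive estimate over- or under-shoots the clean power depending on the phase, even though the noise power passed in is exact. Treating the phase as a hidden variable, as the abstract describes, is one way to account for that spread.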
Published in: Proc. International Conference on Spoken Language Processing