B. Frey, Li Deng, T. Kristjansson, and Alex Acero
September 2001
One approach to robust speech recognition is to use a simple
speech model to remove the distortion before applying the
speech recognizer. Previous attempts at this approach have relied
on unimodal or point estimates of the noise for each utterance.
In challenging acoustic environments, e.g., an airport,
the spectrum of the noise changes rapidly during an utterance,
making a point estimate a poor representation. We show how
an iterative form of Laplace’s method can be used to estimate
the clean speech, using a time-varying probability model of the
log-spectra of the clean speech, noise and channel distortion.
We use this method, called ALGONQUIN, to denoise speech
features and then feed these features into a large vocabulary
speech recognizer whose word error rate (WER) on the clean Wall Street Journal
data is 4.9%. When 10 dB of noise consisting of an airplane engine
shutting down is added to the data, the recognizer obtains
a WER of 28.8%. ALGONQUIN reduces the WER to 12.6%,
well below the WER of 25.0% obtained by our spectral subtraction
algorithm, and close to the WER of 9.7% obtained by
the slow procedure of retraining the recognizer on training data
corrupted by the exact same noise. In fact, if ALGONQUIN is
used to denoise the noisy training data before the recognizer is
retrained, the WER is improved to 8.5%. For 10 dB of additive
uniform white noise, our spectral subtraction algorithm reduces
the WER from 55.1% to 33.8%. ALGONQUIN reduces the
WER to 14.2%. The recognizer trained on noisy data obtains
a WER of 14%, whereas the recognizer trained on noisy data
denoised by ALGONQUIN obtains a WER of 9.9%.
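
The abstract does not give the model equations, so the following is only a minimal sketch of what an iterated Laplace-style estimate might look like for a single frequency bin of a single frame. It assumes, as simplifications not stated above, that the noisy log-spectrum is approximately y = ln(exp(x) + exp(n)) for clean speech x and noise n, that x and n each have one Gaussian prior rather than a time-varying mixture, and that channel distortion is omitted. The names `mix`, `denoise_frame`, and `psi` (observation variance) are hypothetical. Each pass linearizes the interaction at the current estimate, solves for the Gaussian posterior mean, and relinearizes there.

```python
import math


def mix(x, n):
    """Assumed interaction model: ln(exp(x) + exp(n)), computed stably."""
    m = max(x, n)
    return m + math.log(math.exp(x - m) + math.exp(n - m))


def denoise_frame(y, mu_x, var_x, mu_n, var_n, psi=0.01, iters=10):
    """Estimate clean speech x and noise n from one noisy log-spectral value y.

    Priors (illustrative assumption): x ~ N(mu_x, var_x), n ~ N(mu_n, var_n).
    Observation (illustrative assumption): y = mix(x, n) + N(0, psi).
    """
    x, n = mu_x, mu_n  # start the linearization at the prior means
    for _ in range(iters):
        # Partial derivatives of mix() at the current point; they sum to 1.
        gx = 1.0 / (1.0 + math.exp(n - x))
        gn = 1.0 - gx
        # Linearized observation: r is approximately gx * x + gn * n.
        r = y - mix(x, n) + gx * x + gn * n
        # Posterior precision (2x2, symmetric) under the linearized model.
        a = 1.0 / var_x + gx * gx / psi
        b = gx * gn / psi
        d = 1.0 / var_n + gn * gn / psi
        # Right-hand side: prior information plus observation information.
        u = mu_x / var_x + gx * r / psi
        v = mu_n / var_n + gn * r / psi
        # Solve the 2x2 system for the new posterior mean, then relinearize.
        det = a * d - b * b
        x = (d * u - b * v) / det
        n = (a * v - b * u) / det
    return x, n


# Illustrative values only: speech near 5, noise near 2 (log-spectral units).
y_obs = mix(5.0, 2.0)
x_hat, n_hat = denoise_frame(y_obs, mu_x=4.0, var_x=4.0, mu_n=2.0, var_n=0.25)
```

In this toy setting the estimate `x_hat` moves from the prior mean toward a value that, combined with `n_hat`, explains the observation. The system described in the abstract applies the idea to full log-spectral vectors with a richer, time-varying model.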
In Proc. of the Eurospeech Conference
| Type | Inproceedings |