ALGONQUIN: Iterating Laplace's Method to Remove Multiple Types of Acoustic Distortion for Robust Speech Recognition

One approach to robust speech recognition is to use a simple speech model to remove the distortion, before applying the speech recognizer. Previous attempts at this approach have relied on unimodal or point estimates of the noise for each utterance. In challenging acoustic environments, e.g., an airport, the spectrum of the noise changes rapidly during an utterance, making a point estimate a poor representation. We show how an iterative form of Laplace’s method can be used to estimate the clean speech, using a time-varying probability model of the log-spectra of the clean speech, noise and channel distortion. We use this method, called ALGONQUIN, to denoise speech features and then feed these features into a large vocabulary speech recognizer whose WER on the clean Wall Street Journal data is 4.9%. When 10 dB of noise consisting of an airplane engine shutting down is added to the data, the recognizer obtains a WER of 28.8%. ALGONQUIN reduces the WER to 12.6%, well below the WER of 25.0% obtained by our spectral subtraction algorithm, and close to the WER of 9.7% obtained by the slow procedure of retraining the recognizer on training data corrupted by the exact same noise. In fact, if ALGONQUIN is used to denoise the noisy training data before the recognizer is retrained, the WER is improved to 8.5%. For 10 dB of additive uniform white noise, our spectral subtraction algorithm reduces the WER from 55.1% to 33.8%. ALGONQUIN reduces the WER to 14.2%. The recognizer trained on noisy data obtains a WER of 14%, whereas the recognizer trained on noisy data denoised by ALGONQUIN obtains a WER of 9.9%.
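The WER figures above are easier to compare as relative reductions against the noisy baseline. The following is a minimal sketch of that arithmetic, using only the numbers reported in the abstract; the helper name `relative_reduction` and the dictionary layout are illustrative assumptions, not part of the paper.

```python
# WER values (in percent) copied from the abstract.
# The dictionary keys and the helper below are illustrative names only.
WER = {
    "airplane engine, 10 dB": {
        "no denoising": 28.8,
        "spectral subtraction": 25.0,
        "ALGONQUIN": 12.6,
    },
    "uniform white noise, 10 dB": {
        "no denoising": 55.1,
        "spectral subtraction": 33.8,
        "ALGONQUIN": 14.2,
    },
}


def relative_reduction(baseline: float, system: float) -> float:
    """Return the relative WER reduction, in percent, of `system` over `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline - system) / baseline


if __name__ == "__main__":
    for condition, results in WER.items():
        baseline = results["no denoising"]
        for method in ("spectral subtraction", "ALGONQUIN"):
            gain = relative_reduction(baseline, results[method])
            print(f"{condition}: {method} cuts WER by {gain:.1f}% relative")
```

On these figures, ALGONQUIN removes roughly 56% of the baseline errors for the airplane engine noise and roughly 74% for the white noise, against roughly 13% and 39% for spectral subtraction.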

2001-frey-eurospeech.pdf
PDF file

In Proc. of the Eurospeech Conference

Details

Type: Inproceedings