Feature Compensation Using Linear Combination of Speaker and Environment Dependent Correction Vectors

ICASSP

In this paper, we study a novel way to compensate speech features to counter the effects of speaker variations and environment distortions in speech recognition. For each homogeneous cluster of speech data, e.g., a specific speaker and environment combination, a set of correction vectors is learnt. A correction vector measures the deviation of features in a small region of the feature space due to speaker and environment effects. From a heterogeneous training set, dozens of sets of correction vectors are learnt, each from a homogeneous subset of the data. During testing, these correction vector sets are linearly combined to compensate the test feature vectors. The combination weights are estimated by maximizing the likelihood (ML) of the compensated features with respect to a reference model, which is a simplified version of the acoustic model used for speech recognition. In addition, variance compensation is applied to condition the variances of the compensated features during weight estimation. Experimental results on the Aurora-4 multi-condition training task show that the proposed correction vector combination method reduces the word error rate (WER) on the noisy test sets 2-7 to 14.97%, from 16.32% for the mean and variance normalization baseline. Moreover, the proposed ML weight estimation consistently outperforms the posterior weights used in previous studies, such as multi-environment SPLICE.
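The core compensation step can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: it assumes each of S speaker/environment clusters contributes one correction vector per feature-space region, assigns a test frame to a region by nearest centroid (a simplification of the GMM-posterior partitioning typical of SPLICE-style methods), and takes the combination weights as given rather than estimating them by ML against a reference model. All array shapes and names are hypothetical.

```python
import numpy as np

def compensate(x, centroids, corrections, weights):
    """Compensate one test frame by a weighted combination of
    per-cluster correction vectors.

    x           : (D,)       test feature frame
    centroids   : (R, D)     centers of the R feature-space regions
    corrections : (S, R, D)  correction vectors, one set per cluster
    weights     : (S,)       combination weights (ML-estimated in the paper)
    """
    # Assign the frame to the nearest feature-space region (simplified
    # hard assignment instead of GMM posteriors).
    r = np.argmin(np.linalg.norm(centroids - x, axis=1))
    # Linearly combine the S clusters' correction vectors for region r.
    delta = weights @ corrections[:, r, :]
    return x + delta

# Toy usage with random data: 13-dim features, 4 regions, 3 clusters.
rng = np.random.default_rng(0)
D, R, S = 13, 4, 3
x = rng.normal(size=D)
centroids = rng.normal(size=(R, D))
corrections = rng.normal(scale=0.1, size=(S, R, D))
weights = np.array([0.5, 0.3, 0.2])  # nonnegative, sum to 1
y = compensate(x, centroids, corrections, weights)
print(y.shape)  # (13,)
```

In the paper the weights are chosen to maximize the likelihood of the compensated features under a reference model; the sketch above only shows how fixed weights would be applied at test time.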