Feature normalization using structured full transforms for robust speech recognition

Proc. Interspeech

Classical mean and variance normalization (MVN) uses a di-
agonal transform and a bias vector to normalize the mean and
variance of noisy features to reference values. Because MVN uses a
diagonal transform, it ignores correlation between feature dimensions.
Although a full transform can exploit feature correlation, its large
number of parameters may not be estimated reliably from a short
observation, e.g. one utterance. We propose a novel structured full
transform that has the same number of free parameters as a diagonal
transform while still being able to capture
correlation between feature dimensions. The proposed struc-
tured transform can be estimated reliably from one utterance by
maximizing the likelihood of the normalized features under a reference
Gaussian mixture model. Experimental results on the Aurora-4 task
show that the structured transform produces consistently better
speech recognition results than the diagonal transform and also
outperforms the advanced front-end (AFE) feature extractor.
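
To make the MVN baseline concrete, the following is a minimal sketch of per-utterance MVN written as a diagonal transform plus a bias, y = a * x + b in each dimension. It is an illustration only, not the proposed structured transform, whose structure is not specified in this abstract. The function names, the toy two-dimensional "utterance", the zero-mean and unit-variance reference values, and the 39-dimensional parameter count in the usage note are all assumptions made for the example.

```python
import math
import random


def mvn_params(frames, ref_mean=0.0, ref_std=1.0):
    """Estimate per-dimension scale a[d] and bias b[d] so that
    y = a * x + b has the reference mean and standard deviation.
    ref_mean and ref_std are assumed reference values (hypothetical)."""
    n, dim = len(frames), len(frames[0])
    scales, biases = [], []
    for d in range(dim):
        column = [f[d] for f in frames]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        scale = ref_std / math.sqrt(var + 1e-12)  # small floor avoids division by zero
        scales.append(scale)
        biases.append(ref_mean - scale * mean)
    return scales, biases


def apply_diagonal(frames, scales, biases):
    """Apply the diagonal transform and bias to every frame."""
    return [[a * x + b for x, a, b in zip(f, scales, biases)] for f in frames]


def correlation(frames, i, j):
    """Pearson correlation between feature dimensions i and j."""
    n = len(frames)
    mi = sum(f[i] for f in frames) / n
    mj = sum(f[j] for f in frames) / n
    cov = sum((f[i] - mi) * (f[j] - mj) for f in frames) / n
    vi = sum((f[i] - mi) ** 2 for f in frames) / n
    vj = sum((f[j] - mj) ** 2 for f in frames) / n
    return cov / math.sqrt(vi * vj)


def num_free_params(dim, full):
    """Matrix entries plus bias entries for a full or a diagonal transform."""
    return dim * dim + dim if full else 2 * dim


# Toy "utterance": 500 frames of two deliberately correlated features.
rng = random.Random(0)
utterance = []
for _ in range(500):
    x0 = rng.gauss(3.0, 2.0)
    x1 = 0.8 * x0 + rng.gauss(-1.0, 0.5)
    utterance.append([x0, x1])

scales, biases = mvn_params(utterance)
normalized = apply_diagonal(utterance, scales, biases)
```

In this sketch, `correlation(normalized, 0, 1)` equals `correlation(utterance, 0, 1)` up to rounding: a positive per-dimension scaling and shift cannot change the correlation between dimensions, which is the limitation of MVN noted above. For an illustrative 39-dimensional feature vector, `num_free_params(39, full=False)` gives 78 parameters, while `num_free_params(39, full=True)` gives 1560, which is why an unconstrained full transform is hard to estimate from a single utterance.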