Xiong Xiao, Jinyu Li, et al.
Classical mean and variance normalization (MVN) uses a diagonal transform and a bias vector to normalize the mean and variance of noisy features to reference values. Because MVN uses a diagonal transform, it ignores correlation between feature dimensions. Although a full transform can exploit feature correlation, its large number of parameters may not be estimated reliably from a short observation, e.g., one utterance. We propose a novel structured full transform that has the same number of free parameters as a diagonal transform while still capturing correlation between feature dimensions. The proposed structured transform can be estimated reliably from one utterance by maximizing the likelihood of the normalized features on a reference Gaussian mixture model. Experimental results on the Aurora-4 task show that the structured transform produces consistently better speech recognition results than the diagonal transform and also outperforms the advanced front-end (AFE) feature extractor.
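As an illustration of the baseline only, classical per-utterance MVN can be sketched as a diagonal scale plus a bias vector. This is a minimal sketch under stated assumptions: the reference values are taken to be zero mean and unit variance, the input is synthetic data rather than real speech features, and the function name `mvn` is hypothetical. It does not implement the proposed structured transform, whose form is not given in the abstract. The final lines show that a diagonal transform leaves the correlation between feature dimensions unchanged.

```python
import numpy as np


def mvn(features, ref_mean=0.0, ref_std=1.0, eps=1e-8):
    """Per-utterance mean and variance normalization (hypothetical helper).

    features: array of shape (num_frames, num_dims).
    Returns the normalized features, the diagonal of the transform,
    and the bias vector, so that normalized = features * scale + bias.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    scale = ref_std / (std + eps)    # diagonal entries of the transform
    bias = ref_mean - scale * mean   # bias vector
    return features * scale + bias, scale, bias


# Synthetic correlated features standing in for one noisy utterance
# (assumption: 200 frames, 3 dimensions).
rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.8, 0.0],
                   [0.0, 1.0, 0.5],
                   [0.0, 0.0, 1.0]])
noisy = rng.standard_normal((200, 3)) @ mixing * 3.0 + 5.0

normalized, scale, bias = mvn(noisy)

# Correlation between dimensions before and after normalization.
corr_before = np.corrcoef(noisy, rowvar=False)
corr_after = np.corrcoef(normalized, rowvar=False)
print(np.allclose(corr_before, corr_after))
```

Each dimension is scaled and shifted independently, so the off-diagonal correlations survive normalization untouched, which is the limitation of the diagonal transform that the abstract points to.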
Published in: Proc. Interspeech