Dong Yu, Shizhen Wang, Zahi karam, and Li Deng
We present a novel language identification technique using our recently developed deep-structured conditional random fields (CRFs). The deep-structured CRF is a multi-layer CRF model in which each higher layer’s input observation sequence consists of the lower layer’s observation sequence and the resulting lower layer’s frame-level marginal probabilities. In this paper we extend the original deep-structured CRF by allowing for distinct state representations at different layers and demonstrate its benefits. We propose an unsupervised algorithm to pre-train the intermediate layers by casting it as a multi-objective programming problem that is aimed at minimizing the average frame-level conditional entropy while maximizing the state occupation entropy. Empirical evaluation on a seven-language/dialect voice mail routing task showed that our approach can achieve a routing accuracy (RA) of 86.4% and average equal error rate (EER) of 6.6%. These results are significantly better than the 82.5% RA and 7.5% average EER obtained using the Gaussian mixture model trained with the maximum mutual information criterion but slightly worse than the 87.7% RA and 6.4% EER achieved using the support vector machine with model pushing on the Gaussian super vector (GSV).
© 2008 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. http://www.ieee.org/