Soft margin estimation on improving environment structures for ensemble speaker and speaking environment modeling

ACM Proc. the 3rd International Universal Communication Symposium

Recently, we proposed an ensemble speaker and speaking
environment modeling (ESSEM) approach to enhance the
robustness of automatic speech recognition (ASR) under adverse
conditions. The ESSEM framework comprises two phases: offline
and online. In the offline phase, we prepare an
environment structure that is formed by multiple sets of hidden
Markov models (HMMs). Each HMM set represents a particular
speaker and speaking environment. In the online phase, ESSEM
estimates a mapping function to transform the prepared
environment structure to a set of HMMs for the unknown testing
condition. In this study, we incorporate soft margin estimation
(SME) to increase the discriminative power of the environment
structure in the offline phase and thereby enhance the overall
ESSEM performance. We evaluated the performance on the
Aurora-2 connected digit database. With the SME-refined
environment structure, ESSEM outperforms the original
framework. Using our best online mapping function, ESSEM
achieves a word error rate (WER) of 4.62%, a 14.60% relative
WER reduction over the best baseline WER of 5.41%.
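
The relative WER reduction quoted above follows from the two absolute WER figures. The short sketch below reproduces that arithmetic; the function name `relative_wer_reduction` is a hypothetical helper introduced only for illustration and is not part of the ESSEM work.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction, in percent, of new_wer over baseline_wer.

    Both arguments are absolute WERs expressed in percent (e.g. 5.41 for 5.41%).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Figures reported in the abstract: baseline 5.41% WER, ESSEM with SME 4.62% WER.
reduction = relative_wer_reduction(5.41, 4.62)
print(f"{reduction:.2f}%")  # prints 14.60%
```

The absolute reduction is 0.79 percentage points; dividing by the 5.41% baseline gives the 14.60% relative figure.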