Soft margin estimation on improving environment structures for ensemble speaker and speaking environment modeling

Yu Tsao, Jinyu Li, Chin-Hui Lee, and Satoshi Nakamura

Abstract

Recently, we proposed an ensemble speaker and speaking environment modeling (ESSEM) approach to enhance the robustness of automatic speech recognition (ASR) under adverse conditions. The ESSEM framework comprises two phases: offline and online. In the offline phase, we prepare an environment structure formed by multiple sets of hidden Markov models (HMMs), where each HMM set represents a particular speaker and speaking environment. In the online phase, ESSEM estimates a mapping function that transforms the prepared environment structure into a set of HMMs for the unknown testing condition. In this study, we incorporate soft margin estimation (SME) to increase the discriminative power of the environment structure in the offline phase and thereby enhance overall ESSEM performance. We evaluated the performance on the Aurora-2 connected digit database. With the SME-refined environment structure, ESSEM outperforms the original framework. Using our best online mapping function, ESSEM achieves a word error rate (WER) of 4.62%, a 14.60% relative WER reduction over the best baseline WER of 5.41%.
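
The relative reduction quoted above can be checked with a short calculation. The sketch below assumes the usual definition of relative WER reduction, (baseline - new) / baseline; the function name `relative_wer_reduction` is illustrative and does not come from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction, in percent.

    Assumes the common definition: (baseline - new) / baseline * 100.
    Both inputs are WERs expressed in percent.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# WER figures reported in the abstract: baseline 5.41%, SME-refined ESSEM 4.62%.
reduction = relative_wer_reduction(5.41, 4.62)
print(f"{reduction:.2f}%")
```

With the two WER values from the abstract, this reproduces the reported 14.60% figure to two decimal places.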

Details

Publication type: Inproceedings
Published in: ACM Proc. the 3rd International Universal Communication Symposium