Ensemble speaker and speaking environment modeling approach with advanced online estimation process

Recently, we proposed an ensemble speaker and speaking environment modeling (ESSEM) framework to characterize speaker variability and speaking environments. In contrast to multi-style training, ESSEM uses single-style training to prepare multiple sets of environment-specific acoustic models. The ensemble of these acoustic models forms a prior structure of the environment that enables flexible prediction of an unknown environment during testing. In this study, we present methods to further improve the precision of model characterization. We first study a weighted N-best information technique that makes better use of the N-best transcription hypotheses in unsupervised adaptation. Next, we introduce cohort selection and environment space adaptation techniques that improve the resolution and coverage of the prior structure online. By integrating the proposed methods, we further improve ESSEM performance over our previous study. On the Aurora-2 task, ESSEM achieves an average word error rate (WER) of 4.64%, a 15.64% relative WER reduction (from 5.50% to 4.64%) over our best baseline result, which was obtained with multi-condition training.
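The relative reduction quoted in the abstract follows directly from the two WER figures. A minimal Python check is sketched here; the function name `relative_wer_reduction` is a hypothetical helper introduced for illustration, not something taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction of new_wer over baseline_wer, in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# WER figures as reported in the abstract (Aurora-2, percent).
baseline = 5.50  # best baseline, multi-condition training
essem = 4.64     # ESSEM with the proposed methods

print(f"{relative_wer_reduction(baseline, essem):.2f}%")  # prints 15.64%
```

The absolute improvement is 0.86 percentage points; dividing by the 5.50% baseline gives the relative figure.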


In Proc. ICASSP

Details

Type: Inproceedings