NIPS 2008 WORKSHOP
Speech and Language: Learning-based Methods and Systems
Friday, December 12, 2008
Whistler, British Columbia, Canada
Authors: Dong Yu, Li Deng, and Alex Acero
We present the maximum entropy (MaxEnt) model with continuous features. We show that for continuous features the weights should be continuous functions rather than single values. We propose a spline-interpolation-based solution to the resulting optimization problem, which contains continuous weight functions, and show that it can be converted into a standard log-linear problem, without continuous weights, in a higher-dimensional space.
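The conversion the abstract describes can be illustrated with a piecewise-linear (first-order) spline: a continuous feature value is mapped to interpolation coefficients over a set of knots, so the continuous weight function w(x) becomes a fixed-dimensional dot product with per-knot weights — i.e., a standard log-linear model in a higher-dimensional feature space. This is a minimal sketch of the idea; the knot placement and spline order here are illustrative assumptions, not the paper's exact construction.

```python
def hat_basis(x, knots):
    """Map a continuous feature value x to linear-interpolation
    coefficients over the knots, so that w(x) = sum_k coef[k] * w_k
    for any per-knot weights w_k (a piecewise-linear spline).
    Each coefficient vector sums to 1 and has at most two nonzeros."""
    coef = [0.0] * len(knots)
    if x <= knots[0]:          # clamp below the first knot
        coef[0] = 1.0
        return coef
    if x >= knots[-1]:         # clamp above the last knot
        coef[-1] = 1.0
        return coef
    for k in range(len(knots) - 1):
        lo, hi = knots[k], knots[k + 1]
        if lo <= x <= hi:
            t = (x - lo) / (hi - lo)
            coef[k] = 1.0 - t
            coef[k + 1] = t
            return coef

# Expanding every continuous feature this way yields a plain MaxEnt
# model whose trainable parameters are the per-knot weights w_k.
expanded = hat_basis(0.5, [0.0, 1.0, 2.0])
```

Any standard log-linear trainer can then be run on the expanded features without modification.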
Authors: Sangyun Hahn and Mari Ostendorf
Recently, semi-supervised learning has been an active research topic in the natural language processing community, both to save hand-labeling effort in data-driven learning and to exploit the large amount of readily available unlabeled text. In this paper, we apply EM-based semi-supervised learning algorithms such as traditional EM, co-EM, and cross-validation EM to the task of agreement/disagreement classification of multi-party conversational speech, using discriminative models such as support vector machines and multi-layer perceptrons. We experimentally compare these algorithms and discuss their advantages and weaknesses when used with different amounts of unlabeled data.
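The EM variants the abstract names share a common loop: train on the labeled data, label the unlabeled pool with the current model, and retrain on both. The sketch below shows that loop with a deliberately tiny stand-in classifier (a 1-D nearest-centroid model); the real work uses SVMs and multi-layer perceptrons, and the co-EM/cross-validation variants differ in how the pseudo-labels are exchanged between views or folds.

```python
from statistics import mean

def centroid_fit(labeled):
    """Fit a trivial 1-D two-class model: one centroid per class.
    labeled is a list of (x, y) pairs with y in {0, 1}."""
    c0 = mean(x for x, y in labeled if y == 0)
    c1 = mean(x for x, y in labeled if y == 1)
    return (c0, c1)

def predict(model, x):
    c0, c1 = model
    return 0 if abs(x - c0) <= abs(x - c1) else 1

def self_train(labeled, unlabeled, rounds=5):
    """EM-style self-training: repeatedly pseudo-label the unlabeled
    pool with the current model and retrain on labeled + pseudo data.
    The original labels are kept fixed; pseudo-labels are recomputed
    each round."""
    model = centroid_fit(labeled)
    for _ in range(rounds):
        pseudo = [(x, predict(model, x)) for x in unlabeled]
        model = centroid_fit(labeled + pseudo)
    return model
```

Swapping `centroid_fit`/`predict` for a discriminative learner gives the basic EM baseline the paper compares against its variants.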
Authors: Hui Lin, Li Deng, Jasha Droppo, Dong Yu, and Alex Acero
One key issue in developing learning methods for multilingual acoustic modeling in large vocabulary automatic speech recognition (ASR) applications is to maximize the benefit of pooling acoustic training data from multiple source languages while minimizing the negative effects of data impurity arising from language “mismatch”. In this paper, we introduce two learning methods, semi-automatic unit selection and a global phonetic decision tree, to address this issue via effective utilization of acoustic data from multiple languages. The semi-automatic unit selection aims to combine the merits of both data-driven and knowledge-driven approaches to identifying the basic units in multilingual acoustic modeling. The global decision-tree method allows clustering of cross-center phones and cross-center states in the HMMs, offering the potential to discover a better sharing structure beneath the mixed acoustic dynamics and context mismatch caused by the use of multiple languages’ acoustic data. Our preliminary experimental results show that both of these learning methods improve the performance of multilingual speech recognition.
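Decision-tree clustering of HMM states of the kind described above is typically driven by yes/no phonetic questions, greedily choosing the question whose split most improves a likelihood-based criterion. The sketch below uses a count-weighted variance of 1-D state means as a stand-in for that criterion; the state statistics, question sets, and scoring here are illustrative assumptions, not the paper's implementation.

```python
def variance_cost(states):
    """Count-weighted sum of squared deviations of state means from
    their pooled mean — a simple stand-in for the log-likelihood
    criterion used in tree-based state tying.
    Each state is a (phone, count, mean) triple."""
    total = sum(c for _, c, _ in states)
    pooled = sum(c * m for _, c, m in states) / total
    return sum(c * (m - pooled) ** 2 for _, c, m in states)

def best_split(states, questions):
    """Greedily pick the phonetic question whose yes/no split most
    reduces the clustering cost. questions maps a question name to
    the set of phones that answer 'yes'. Returns (name, gain) or
    None if no question splits the states."""
    base = variance_cost(states)
    best = None
    for name, phone_set in questions.items():
        yes = [s for s in states if s[0] in phone_set]
        no = [s for s in states if s[0] not in phone_set]
        if not yes or not no:
            continue  # question does not split this node
        gain = base - variance_cost(yes) - variance_cost(no)
        if best is None or gain > best[1]:
            best = (name, gain)
    return best
```

In a global (cross-language) tree, the phone inventory and question sets simply span all source languages, which is what permits cross-center sharing.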
Authors: Matthew Miller and Alexander Stoytchev
Human beings have an apparently innate ability to segment continuous audio speech into words, and that ability is present in infants as young as 8 months old. This propensity towards audio segmentation seems to lay the groundwork for language learning in human beings. To artificially reproduce this ability would be both practically useful and theoretically enlightening. In this paper we propose an algorithm for the unsupervised segmentation of audio speech, based on the Voting Experts (VE) algorithm, which was originally designed to segment sequences of discrete tokens into categorical episodes. We demonstrate that our procedure is capable of inducing breaks with an accuracy substantially greater than chance, and suggest possible avenues of exploration to further increase the segmentation quality. We also show that this algorithm can reproduce results obtained from segmentation experiments performed with 8-month-old infants.
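The Voting Experts idea can be sketched compactly: slide a fixed-size window over the sequence, let each expert vote for its preferred split point inside the window, tally the votes per position, and cut where the tally is high. The toy below uses a single frequency-based expert as a stand-in for the algorithm's internal- and boundary-entropy experts, so it is only a simplified illustration of the voting mechanism, not the authors' method.

```python
from collections import Counter

def voting_experts_votes(seq, window=3):
    """Slide a window over seq; inside each window, one 'expert' votes
    for the split whose two chunks occur most often elsewhere in the
    sequence (a crude stand-in for the entropy-based experts).
    Returns a vote tally for every boundary position."""
    freq = Counter()
    for n in range(1, window + 1):            # chunk frequency table
        for i in range(len(seq) - n + 1):
            freq[seq[i:i + n]] += 1
    votes = [0] * (len(seq) + 1)
    for i in range(len(seq) - window + 1):
        win = seq[i:i + window]
        best_j, best_score = 1, -1
        for j in range(1, window):            # candidate splits
            score = freq[win[:j]] + freq[win[j:]]
            if score > best_score:
                best_j, best_score = j, score
        votes[i + best_j] += 1
    return votes

def segment(seq, window=3, threshold=1):
    """Cut wherever the vote tally exceeds the threshold."""
    votes = voting_experts_votes(seq, window)
    cuts = [p for p in range(1, len(seq)) if votes[p] > threshold]
    pieces, last = [], 0
    for c in cuts + [len(seq)]:
        pieces.append(seq[last:c])
        last = c
    return pieces
```

On a repetitive token stream the votes pile up at the recurring chunk boundaries, which is the effect the full algorithm exploits on continuous speech features.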
Authors: Yong Zhao and Xiaodong He
This paper proposes using n-gram posterior probabilities, estimated over translation hypotheses from multiple machine translation (MT) systems, to improve the performance of system combination. Two ways of using n-gram posteriors in confusion network decoding are presented. The first is based on an n-gram posterior language model built per source sentence; the second, called n-gram segment voting, boosts word posterior probabilities with n-gram occurrence frequencies. The two n-gram posterior methods are incorporated into the confusion network as individual features of a log-linear combination model. Experiments on the Chinese-to-English MT task show that both methods yield significant improvements in translation performance, and a combination of the two features produces the best translation performance.
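The core quantities can be sketched as follows: an n-gram posterior is the total posterior mass of the hypotheses containing that n-gram, and segment voting boosts a word's posterior by the posterior of the n-gram it completes. The estimator and the multiplicative boost below are simplifying assumptions for illustration; the paper's exact formulas and feature weighting may differ.

```python
from collections import defaultdict

def ngram_posteriors(hyps, n=2):
    """Estimate n-gram posteriors as the total posterior mass of the
    hypotheses in which each n-gram occurs.
    hyps is a list of (word_list, posterior) pairs pooled from the
    component MT systems."""
    post = defaultdict(float)
    for words, p in hyps:
        seen = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        for g in seen:                  # count each n-gram once per hypothesis
            post[g] += p
    return post

def boost(word_post, prev, word, bigram_post, weight=0.5):
    """n-gram segment voting (sketch): scale a word's posterior in the
    confusion network by the posterior of the bigram it completes.
    `weight` is a hypothetical feature weight for illustration."""
    return word_post * (1.0 + weight * bigram_post.get((prev, word), 0.0))
```

In the full system this boosted score enters the confusion network decoder as one feature of the log-linear combination model, alongside the per-sentence n-gram posterior language model.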