Speech and Language: Learning-based Methods and Systems

Friday, December 12, 2008

Whistler, British Columbia, Canada


New Multi-level Models for High-dimensional Sequential Data

Geoffrey Hinton, University of Toronto


I will describe recent developments in learning algorithms for multilevel nonlinear generative models of sequential data. The models are learned greedily, one layer of features at a time, and each additional layer of nonlinear features improves the overall generative model of the data. In earlier work (Taylor et al. 2006), the basic module used for learning each layer of representation was a restricted Boltzmann machine in which both the hidden and visible units have biases that are dynamically determined by previous frames of data. This simple learning module has now been generalized to allow more complicated, multiplicative interactions, so that hidden variables at one level can control the interactions between variables at the level below. These models have not yet been applied to speech, but they work well on other data such as broadcast video and sequences of joint angles derived from motion-capture markers. (Joint work with Roland Memisevic, Graham Taylor and Ilya Sutskever.)
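
As a rough illustration of the earlier module described above, the sketch below shifts a unit's static bias by a linear function of the preceding frames and then computes hidden-unit activation probabilities. The function names, the linear form of the shift, and the binary hidden units are assumptions made for this sketch, not details taken from the talk.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def dynamic_biases(static_bias, weights_from_past, past_frames):
    """Static bias plus a linear function of the concatenated past frames (assumed form)."""
    history = [value for frame in past_frames for value in frame]
    shift = matvec(weights_from_past, history)
    return [b + s for b, s in zip(static_bias, shift)]

def hidden_probabilities(visible, W, hidden_bias):
    """p(h_j = 1 | v) = sigmoid(bias_j + sum_i v_i * W[i][j]) for binary hidden units."""
    probabilities = []
    for j, bias in enumerate(hidden_bias):
        activation = bias + sum(v * W[i][j] for i, v in enumerate(visible))
        probabilities.append(1.0 / (1.0 + math.exp(-activation)))
    return probabilities

# Hypothetical toy sizes: 2 visible units, 2 hidden units, 1 past frame.
past = [[2.0, 3.0]]
bias_now = dynamic_biases([0.0, 0.0], [[1.0, 0.0], [0.0, -1.0]], past)
probs = hidden_probabilities([0.0, 0.0], [[0.5, -0.5], [0.1, 0.2]], bias_now)
```

The same `dynamic_biases` helper can be applied to the visible biases with a second, separate weight matrix from the past frames.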



Log-linear Approach to Discriminative Training

Ralf Schlüter, RWTH Aachen University


The objective of this talk is to establish a log-linear modeling framework in the context of discriminative training criteria, with examples from automatic speech recognition and concept tagging. The talk covers three major aspects. First, the acoustic models of state-of-the-art speech recognition systems conventionally use generative Gaussian HMMs. In the past few years, discriminative models such as Conditional Random Fields (CRFs) have been proposed to refine acoustic models. This talk addresses to what extent such less restricted models add flexibility compared with their generative counterparts. Certain equivalence relations between Gaussian and log-linear HMMs are established, including context conditional models. Second, it will be shown how conventional discriminative training criteria in speech recognition, such as the Minimum Phone Error criterion or the Maximum Mutual Information criterion, can be extended to incorporate a margin term. As a result, large-margin training in speech recognition can be performed using the same efficient algorithms for accumulation and optimization, and the same software, as for conventional discriminative training. We show that the proposed criteria are equivalent to Support Vector Machines with suitable smooth loss functions, approximating the non-smooth hinge loss function or the hard error (e.g. phone error). Third, CRFs are often estimated using an entropy-based criterion in combination with Generalized Iterative Scaling (GIS). GIS offers, among others, the immediate advantages that it is locally convergent, completely parameter-free, and guarantees an improvement of the criterion in each step. Here, GIS is extended to allow for training log-linear models with hidden variables and for optimizing discriminative training criteria other than Maximum Entropy/Maximum Mutual Information, including Minimum Phone Error (MPE).
Finally, experimental results are provided for different tasks, including the European Parliament Plenary Sessions task as well as Mandarin Broadcasts.
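
To make the idea of a margin term concrete, here is a minimal sketch of one common way to add a margin to a Maximum Mutual Information style criterion: each competing hypothesis has its score raised in proportion to its error before the posterior of the correct hypothesis is computed. This particular form, the function names, and the toy scores are assumptions for illustration and may differ from the formulation presented in the talk.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def margin_mmi(scores, errors, correct, margin=0.0):
    """Log-posterior of the correct hypothesis after boosting each
    hypothesis score by margin * error (the correct one has error 0)."""
    boosted = [s + margin * e for s, e in zip(scores, errors)]
    return boosted[correct] - log_sum_exp(boosted)

# Hypothetical log-scores for two hypotheses; index 0 is the reference.
scores = [2.0, 1.0]
errors = [0.0, 1.0]
plain = margin_mmi(scores, errors, correct=0, margin=0.0)
with_margin = margin_mmi(scores, errors, correct=0, margin=1.0)
```

With `margin=0.0` the function reduces to the ordinary log-posterior, so the same accumulation code serves both cases; a positive margin lowers the criterion unless the reference beats its competitors by a wider gap.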



On the Role of Local Learning for Language Modeling

Mari Ostendorf, University of Washington


Local learning methods, such as nearest-neighbor methods and their variants, are known to be very powerful for many problems, particularly for problems where good models are not available. They can also be very useful for problems with a high degree of variability over the input space. In language modeling for speech recognition, local learning has not been particularly useful, in part because of the tremendous power of the n-gram when given large amounts of training data, and in part due to the difficulty of defining distance or similarity measures for word sequences. However, language is quite variable, depending on both topic and genre, such that a model trained in one domain may be of little use in another. With the large amount of data available on the web, and the large number of possible topic/genre combinations, it is of interest to consider local learning for language model adaptation. In this talk, we look at leveraging the similarity function in language model adaptation to benefit from a small neighborhood without losing the power of a large training corpus.



Ensemble Machine Learning Methods for Acoustic Modeling of Speech

Yunxin Zhao, University of Missouri


Improving the accuracy with which computers recognize human speech has been a long-standing challenge. Over the past few decades, tremendous research effort has gone into the optimization of acoustic models. Meanwhile, ensemble classifier design is becoming an important direction in machine learning. Unlike the commonly adopted approach of optimizing a single classifier, ensemble methods achieve pattern discrimination by synergistically combining many classifiers that are complementary in nature. Ensemble methods have shown advantages in classification accuracy and robustness in a variety of application contexts. In line with this direction, combining output word hypotheses from multiple speech recognition systems is increasingly used in ASR to boost accuracy. Nonetheless, the complexity of speech sound distributions warrants exploring ensemble methods for building robust and accurate acoustic models, where the component models of an ensemble can be combined in computing the acoustic scores during decoding search, for example, at the speech frame level, so that a single recognition system would suffice. Recently, innovative progress has been made in this direction, producing promising results and revealing attractive properties of ensemble acoustic models. This talk will address several basic issues in ensemble acoustic modeling, including constructing acoustic model ensembles, combining acoustic models in an ensemble, and measuring ensemble quality. Experimental findings will be provided for a conversational speech recognition task, and research opportunities along this path will be discussed.
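
The frame-level combination mentioned above can be pictured with a small sketch. It assumes, purely for illustration, that each component model supplies a log-likelihood for the same frame and state and that these are merged by a weighted average in the probability domain; the function name and the combination rule are assumptions, and other rules (for example, averaging in the log domain) are equally plausible.

```python
import math

def combine_frame_scores(log_likelihoods, weights=None):
    """Return log(sum_k w_k * exp(ll_k)) for one frame and one state,
    computed stably; uniform weights are used when none are given."""
    k = len(log_likelihoods)
    if weights is None:
        weights = [1.0 / k] * k
    m = max(log_likelihoods)
    total = sum(w * math.exp(ll - m) for w, ll in zip(weights, log_likelihoods))
    return m + math.log(total)

# Hypothetical per-model likelihoods 0.2 and 0.4 for the same frame.
combined = combine_frame_scores([math.log(0.2), math.log(0.4)])
```

Because the combined value is again a single acoustic score per frame and state, a decoder can consume it exactly as it would the score of a single model.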



Relations Between Graph Triangulation, Stack Decoding, and Synchronous Decoding

Jeff Bilmes, University of Washington


Speech recognition systems have historically used essentially one of two decoding strategies. Stack decoding (also called asynchronous decoding) allows internal decoding hypotheses whose end times span a potentially wide range of time frames. Such strategies are amenable to techniques such as A*-search, assuming a reasonable continuation heuristic is available. An alternative decoding strategy is the time-synchronous approach, in which every active hypothesis has a similar or identical ending time. In this talk, we relate these two decoding strategies to inference procedures in dynamic graphical models (which include dynamic Bayesian networks and hidden conditional random fields). In particular, we see that under a hybrid search/belief-propagation inference scheme, the underlying triangulation of the graph determines which of the above two decoding strategies is active. The triangulation, moreover, also suggests decoding strategies that lie somewhere between strictly synchronous and asynchronous approaches.



Markov Logic Networks: A Unified Approach to Language Processing

Pedro Domingos, University of Washington


Language processing systems typically have a pipeline architecture, where errors accumulate as information progresses through the pipeline. The ideal solution is to perform fully joint learning and inference across all stages of the pipeline (part-of-speech tagging, parsing, coreference resolution, semantic role labeling, etc.). To make this possible without collapsing under the weight of complexity, we need a modeling language that provides a common representation for all the stages and makes it easy to combine them. Markov logic networks accomplish this by attaching weights to formulas in first-order logic and viewing them as templates for features of Markov random fields. In this talk, I will describe some of the main inference and learning algorithms for Markov logic, show how Markov logic can be used to implement an end-to-end NLP system, and present the state-of-the-art results we have obtained with the components we have implemented so far.
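
The way weighted formulas act as features of a Markov random field can be shown on a toy case. The sketch below assumes the formulas have already been grounded over the constants, so each one is a Boolean function of a possible world, and it assigns each world a probability proportional to the exponential of the summed weights of the formulas it satisfies. The atom names and the single rule are hypothetical, and exhaustive enumeration is only feasible at this toy scale; it is not one of the inference algorithms covered in the talk.

```python
import itertools
import math

def world_probabilities(atoms, weighted_formulas):
    """Enumerate every truth assignment to the ground atoms and normalise
    exp(sum of weights of satisfied ground formulas) over all worlds."""
    worlds = [
        dict(zip(atoms, values))
        for values in itertools.product([False, True], repeat=len(atoms))
    ]
    scores = [
        math.exp(sum(weight for weight, formula in weighted_formulas if formula(world)))
        for world in worlds
    ]
    z = sum(scores)
    return [(world, score / z) for world, score in zip(worlds, scores)]

# Hypothetical ground rule: if word 1 is a determiner, word 2 is a noun.
atoms = ["Det(w1)", "Noun(w2)"]
rule = lambda world: (not world["Det(w1)"]) or world["Noun(w2)"]
distribution = world_probabilities(atoms, [(math.log(2.0), rule)])
```

With weight ln 2, each of the three worlds that satisfy the rule is twice as likely as the one world that violates it, so the violating world has probability 1/7. The rule is soft: raising the weight makes violations rarer without forbidding them.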



Some Machine Learning Issues in Discriminative Bilingual Word Alignment

Robert Moore, Microsoft Research


Bilingual word alignment is the task of identifying which word tokens are translations of each other in a corpus of sentence pairs that are themselves translations of each other. After being dominated by generative models since the early 1990s, this task has, beginning in 2005, been addressed by a number of discriminative approaches, resulting in substantially reduced alignment error rates. In most cases, these discriminative approaches have used a few hundred parallel sentence pairs with annotated word alignments, plus hundreds of thousands of parallel sentence pairs with no word-level annotation, making this task a prime example of semi-supervised learning. In this talk, we will look in detail at some of the machine learning issues in one of the most successful efforts at discriminative word alignment, including the benefits of stacking learners and refinements of the averaged perceptron approach to learning classifiers with structured outputs.
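
For readers unfamiliar with the averaged perceptron, the following is a minimal sketch of the basic algorithm, without the refinements the talk discusses. It assumes the structured output can be chosen from an explicit candidate list and that features are sparse dictionaries; the function names and the toy alignment data are hypothetical.

```python
from collections import defaultdict

def train_averaged_perceptron(examples, candidates, features, epochs=5):
    """examples: list of (x, gold); candidates(x): possible outputs;
    features(x, y): dict of feature name -> value.
    Returns the weights averaged over every training step."""
    weights = defaultdict(float)
    totals = defaultdict(float)
    steps = 0
    for _ in range(epochs):
        for x, gold in examples:
            def score(y):
                return sum(weights.get(name, 0.0) * value
                           for name, value in features(x, y).items())
            predicted = max(candidates(x), key=score)
            if predicted != gold:
                # Move weights toward the gold output and away from the prediction.
                for name, value in features(x, gold).items():
                    weights[name] += value
                for name, value in features(x, predicted).items():
                    weights[name] -= value
            steps += 1
            for name, value in weights.items():
                totals[name] += value
    return {name: total / steps for name, total in totals.items()}

# Hypothetical toy task: pick the target word aligned to a source word.
examples = [(("chat", ("the", "cat")), 1), (("le", ("the", "cat")), 0)]
candidates = lambda x: range(len(x[1]))
features = lambda x, y: {x[0] + "~" + x[1][y]: 1.0}
averaged = train_averaged_perceptron(examples, candidates, features)
```

Averaging over all steps, rather than keeping only the final weights, damps the effect of late updates; the naive accumulation used here costs time proportional to the number of features at every step, which practical implementations avoid with lazy updates.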



Machine Learning for Speaker Recognition

Andreas Stolcke, SRI International


This talk will review some of the main ML techniques employed in state-of-the-art speaker recognition systems, in terms of both modeling and feature design. For modeling, the two main paradigms currently in use are Gaussian mixture models with joint factor analysis, and support vector machines. The latter in particular have enabled a wealth of approaches that model speakers via high-dimensional feature vectors drawn from a wide range of observation spaces, including cepstral, phonetic, prosodic, and lexical features. A pervasive problem in feature design is how to collapse a variable-length stream of observations into a fixed-length feature vector. SVM kernels designed for this situation are based on features generated by polynomial expansion, N-gram frequencies, and GMM mixture weights. Miscellaneous other issues include parameter smoothing (prior modeling) and model combination. It is hoped that the talk will give a glimpse into a fascinating application domain for machine learning methods, and encourage ML researchers to contribute to advances in speaker recognition.
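
Of the three feature constructions named above, N-gram frequencies are the simplest to sketch. The example below maps a token sequence of any length to a vector of relative N-gram frequencies over a fixed N-gram inventory, so every utterance yields a vector of the same length. The function name, the inventory, and the phone-like tokens are assumptions for illustration.

```python
from collections import Counter

def ngram_frequency_vector(tokens, inventory, n=2):
    """Relative frequency of each n-gram in `inventory` within `tokens`.
    The result always has len(inventory) entries, whatever len(tokens) is."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    total = len(ngrams)
    return [counts[g] / total if total else 0.0 for g in inventory]

# Hypothetical bigram inventory over two phone-like symbols.
inventory = [("a", "b"), ("b", "a"), ("b", "b")]
short_vector = ngram_frequency_vector(["a", "b"], inventory)
long_vector = ngram_frequency_vector(["a", "b", "a", "b"], inventory)
```

Both vectors have three entries even though the inputs differ in length, which is what allows them to be passed to a kernel that expects fixed-length inputs.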

