**NIPS 2008 WORKSHOP**

**Speech and Language: Learning-based Methods and Systems**

**Friday, December 12, 2008**

**Whistler, British Columbia, Canada**

**INVITED TALKS**

**New Multi-level Models for High-dimensional Sequential Data**

Geoffrey Hinton, University of Toronto

Abstract:

I will describe recent
developments in learning algorithms for multilevel nonlinear generative models
of sequential data. The models are learned greedily, one layer of features at a time, and each additional layer of nonlinear features improves the overall generative model of the data. In earlier work (Taylor et al., 2006)
the basic module used for learning each layer of representation was a
restricted Boltzmann machine in which both the hidden and visible units have
biases that are dynamically determined by previous frames of data. This simple
learning module has now been generalized to allow more complicated,
multiplicative interactions so that hidden variables at one level can control
the interactions between variables at the level below. These models have not
yet been applied to speech but they work well on other data such as broadcast
video and sequences of joint-angles derived from motion capture markers. (Joint work with Roland Memisevic, Graham Taylor and Ilya
Sutskever).
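As a rough illustration of the basic module described above (a sketch with made-up toy dimensions, not the authors' code), the biases of an RBM can be made dynamic linear functions of the concatenated previous frames, conditioning the model on history:

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid, n_past = 4, 3, 2      # toy sizes, illustration only

W = rng.normal(0, 0.1, (n_vis, n_hid))           # static RBM weights
A = rng.normal(0, 0.1, (n_past * n_vis, n_vis))  # past frames -> visible biases
B = rng.normal(0, 0.1, (n_past * n_vis, n_hid))  # past frames -> hidden biases

def visible_bias(past_frames):
    """Dynamic visible bias, used when reconstructing the current frame."""
    return past_frames.reshape(-1) @ A

def hidden_probs(v, past_frames):
    """P(h = 1 | v, past): the hidden biases are a linear function of the
    concatenated previous frames, so the RBM is conditioned on history."""
    b_hid = past_frames.reshape(-1) @ B
    return 1.0 / (1.0 + np.exp(-(v @ W + b_hid)))

past = rng.normal(size=(n_past, n_vis))   # two previous frames
v = rng.normal(size=n_vis)                # current visible frame
p = hidden_probs(v, past)
```

The multiplicative generalization mentioned in the abstract would additionally let hidden variables gate the weights themselves, which this linear-bias sketch does not attempt.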

**Log-linear Approach to Discriminative Training**

Ralf Schlüter, RWTH Aachen University

Abstract:

The objective of this talk
is to establish a log-linear modeling framework in the context of
discriminative training criteria, with examples from automatic speech
recognition and concept tagging. The talk covers three major aspects. First,
the acoustic models of state-of-the-art speech recognition systems conventionally use generative Gaussian HMMs. In the past few years, discriminative models such as Conditional Random Fields (CRFs) have
been proposed to refine acoustic models. This talk addresses to what extent
such less restricted models add flexibility to the model compared with the
generative counterpart. Certain equivalence relations between Gaussian and
log-linear HMMs are established, including context conditional models. Second,
it will be shown how conventional discriminative training criteria in speech
recognition such as the Minimum Phone Error criterion or the Maximum Mutual
Information criterion can be extended to incorporate a margin term. As a
result, large-margin training in speech recognition can be performed using the
same efficient algorithms for accumulation and optimization and using the same
software as for conventional discriminative training. We show that the proposed
criteria are equivalent to Support Vector Machines with suitable smooth loss
functions, approximating the non-smooth hinge loss function or the hard error
(e.g. phone error). Third, CRFs are often estimated using an entropy-based criterion in combination with Generalized Iterative Scaling (GIS). Among other advantages, GIS is locally convergent, completely parameter-free, and guarantees an improvement of the criterion in each step. Here, GIS is extended
to allow for training log-linear models with hidden variables and optimization
of discriminative training criteria different from Maximum Entropy/Maximum
Mutual Information, including Minimum Phone Error (MPE). Finally, experimental
results are provided for different tasks, including the European Parliament
Plenary Sessions task as well as Mandarin Broadcasts.
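The margin extension described above can be sketched for a single utterance as follows (a toy version with hypothetical names, not the RWTH implementation): the denominator of the MMI criterion is modified so that competitors are boosted in proportion to their error count.

```python
import math

def margin_mmi(scores, errors, ref, rho=1.0):
    """Toy margin-extended MMI criterion for one utterance.
    scores[w]: joint log-score log p(x, w) of hypothesis w
    errors[w]: error count A(w, ref), zero for the reference ref
    The term rho * errors[w] boosts erroneous competitors in the
    denominator; rho = 0 recovers plain MMI."""
    num = scores[ref]
    den = math.log(sum(math.exp(scores[w] + rho * errors[w]) for w in scores))
    return num - den
```

Because only the denominator accumulation changes, the same lattice-based accumulation and optimization machinery as for conventional MMI can, in principle, be reused, which is the practical point made in the abstract.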

**On the Role of Local Learning for
Language Modeling**

Mari Ostendorf, University of Washington

Abstract:

Local learning methods, such
as nearest-neighbor and variants, are known to be very powerful for many
problems, particularly for problems where good models are not available. They
can also be very useful for problems with a high degree of variability over the
input space. In language modeling for speech recognition, local learning has
not been particularly useful, in part because of the tremendous power of the
n-gram when given large amounts of training data, and in part due to the
difficulty of defining distance or similarity measures for word sequences. However, language is quite variable,
depending on both topic and genre, such that a model trained in one domain may
be of little use in another. With the large amount of data available on the
web, and the large number of possible topic/genre combinations, it is of
interest to consider local learning for language model adaptation. In this
talk, we look at leveraging the similarity function in language model
adaptation to benefit from a small neighborhood without losing the power of a
large training corpus.
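A minimal sketch of the idea (toy unigram models and a hypothetical word-overlap similarity, chosen only for illustration): the adapted model interpolates a background model with a model estimated from the few corpus documents nearest to the test document.

```python
from collections import Counter

def unigram(text):
    """Maximum-likelihood unigram model of a whitespace-tokenized text."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def jaccard(a, b):
    """Toy document similarity: word-set overlap."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def adapted_prob(word, test_doc, corpus, k=2, lam=0.5):
    """Interpolate a background model with a model estimated from the
    k corpus documents most similar to the test document (the local
    neighborhood), so the small neighborhood sharpens the large model."""
    background = unigram(" ".join(corpus))
    neighborhood = sorted(corpus, key=lambda d: jaccard(d, test_doc),
                          reverse=True)[:k]
    local = unigram(" ".join(neighborhood))
    return lam * local.get(word, 0.0) + (1 - lam) * background.get(word, 0.0)
```

Real systems would use n-gram models and far richer similarity functions; the sketch only shows how a neighborhood model can be blended with a background model without discarding the large corpus.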

**Ensemble Machine Learning Methods for Acoustic Modeling of Speech**

Yunxin Zhao, University of Missouri

Abstract:

Improving recognition
accuracy of human speech by computers has been a long-standing challenge. Over
the past few decades, tremendous research efforts have been made on the
optimization of acoustic models. On the other hand, ensemble classifier design
is becoming an important direction in machine learning. Different from the
commonly adopted approach of optimizing a single classifier, ensemble methods achieve pattern discrimination by synergistically combining many classifiers that are complementary in nature. Ensemble methods
have shown advantages in classification accuracy and robustness in a variety of
application contexts. Aligned with this direction, combining output word
hypotheses from multiple speech recognition systems is increasingly used in ASR to boost recognition accuracy.
Nonetheless, the complexity of speech sound distributions warrants the
exploration of using ensemble methods to build robust and accurate acoustic
models, where the component models of an ensemble can be combined in computing the acoustic scores during decoding search, for example at the speech-frame level, so that a single recognition system suffices. Recently, innovative progress has
been made in this direction, producing promising results and revealing
attractive properties of ensemble acoustic models. This talk will address several basic issues
in ensemble acoustic modeling, including constructing acoustic model ensembles,
combining acoustic models in an ensemble, measuring the ensemble quality, etc.
Experimental findings will be provided for a conversational speech recognition task, and research opportunities along this path will be discussed.
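The frame-level combination mentioned above can be sketched as follows (an illustrative toy with 1-D Gaussians standing in for real acoustic models): the component likelihoods are mixed with fixed weights, and the decoder consumes the single combined log-score.

```python
import math

def gauss_pdf(x, mu, var):
    """Density of a 1-D Gaussian (a stand-in for a real acoustic model)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def ensemble_log_score(x, components, weights):
    """Frame-level ensemble acoustic score: mix the component models'
    likelihoods with fixed weights, so a single decoder can use the
    result exactly as it would use one model's score."""
    p = sum(wt * gauss_pdf(x, mu, var)
            for wt, (mu, var) in zip(weights, components))
    return math.log(p)
```

How to construct the component models, choose the weights, and measure ensemble quality are precisely the open issues the talk enumerates.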

**Relations Between Graph Triangulation, Stack Decoding, and Synchronous Decoding**

Jeff Bilmes, University of Washington

Abstract:

Speech recognition systems have
historically utilized essentially one of two decoding strategies. Stack
decoding (also called asynchronous decoding) allows internal decoding
hypotheses to exist that have an end-time that spans over a potentially wide
range of time frames. Such strategies are amenable to techniques such as
A*-search assuming one has available a reasonable continuation heuristic. An
alternate decoding strategy is the time-synchronous approach, whereby every
active hypothesis has a similar or identical ending time. In this talk, we
relate these two decoding strategies to inference procedures in dynamic
graphical models (which include dynamic Bayesian networks and hidden
conditional random fields). In particular, we see that under a hybrid
search/belief-propagation inference scheme, the underlying triangulation of the
graph determines which of the above two decoding strategies is active. The
triangulation, moreover, also suggests decoding strategies that lie somewhere
between strictly synchronous and asynchronous approaches.
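For reference, the time-synchronous strategy can be sketched as a standard Viterbi pass (a toy illustration, not tied to any particular triangulation): after each frame is processed, every surviving hypothesis ends at exactly that frame.

```python
def viterbi(obs_loglik, log_trans, log_init):
    """Toy time-synchronous (Viterbi) decoder.
    obs_loglik[t][s]: log-likelihood of frame t under state s
    log_trans[r][s]:  log transition score from state r to state s
    Returns the best state sequence."""
    n = len(log_init)
    score = [log_init[s] + obs_loglik[0][s] for s in range(n)]
    backptr = []
    for t in range(1, len(obs_loglik)):            # advance one frame at a time
        new_score, bp = [], []
        for s in range(n):
            prev = max(range(n), key=lambda r: score[r] + log_trans[r][s])
            new_score.append(score[prev] + log_trans[prev][s] + obs_loglik[t][s])
            bp.append(prev)
        score = new_score
        backptr.append(bp)
    s = max(range(n), key=lambda r: score[r])      # best final state
    path = [s]
    for bp in reversed(backptr):                   # backtrace
        s = bp[s]
        path.append(s)
    return path[::-1]
```

A stack (asynchronous) decoder would instead keep a priority queue of partial hypotheses with differing end times, expanding the most promising first, e.g. under an A* continuation heuristic.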

**Markov Logic Networks: A Unified Approach to Language Processing**

Pedro Domingos, University of Washington

Abstract:

Language processing systems
typically have a pipeline architecture, where errors accumulate as information
progresses through the pipeline. The ideal solution is to perform fully joint
learning and inference across all stages of the pipeline (part-of-speech
tagging, parsing, coreference resolution, semantic role labeling, etc.). To make this possible without collapsing under the weight of
complexity, we need a modeling language that provides a common representation
for all the stages and makes it easy to combine them. Markov logic networks
accomplish this by attaching weights to formulas in first-order logic and
viewing them as templates for features of Markov random fields. In this talk, I
will describe some of the main inference and learning algorithms for Markov
logic, show how Markov logic can be used to implement an end-to-end NLP system,
and present the state-of-the-art results we have obtained with the components
we have implemented so far.
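The core semantics above can be sketched in a few lines (a deliberately tiny example, not the speaker's system): a weighted first-order formula is grounded over a small domain, and the probability of a possible world grows exponentially in the number of true groundings.

```python
import math
from itertools import product

# Toy Markov logic network: one weighted formula
# "Smokes(x) => Cancer(x)" with weight w over a two-person domain.
w = 1.5
people = ["A", "B"]
atoms = [(pred, p) for pred in ("Smokes", "Cancer") for p in people]
worlds = [dict(zip(atoms, vals))
          for vals in product([False, True], repeat=len(atoms))]

def n_true_groundings(world):
    """Number of groundings of Smokes(x) => Cancer(x) that hold."""
    return sum(1 for p in people
               if (not world[("Smokes", p)]) or world[("Cancer", p)])

def world_prob(world):
    """P(world) is proportional to exp(w * #true groundings)."""
    z = sum(math.exp(w * n_true_groundings(x)) for x in worlds)
    return math.exp(w * n_true_groundings(world)) / z
```

Worlds violating the formula remain possible but become exponentially less likely as the weight grows, which is how soft, learnable constraints replace hard logical ones.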

**Some Machine Learning Issues in Discriminative Bilingual Word Alignment**

Robert Moore, Microsoft Research

Abstract:

Bilingual word alignment is
the task of identifying the word tokens that are translations of each other in
a corpus of sentence pairs that are translations of each other. Long dominated by generative models dating from the early 1990s, the task has, since 2005, been addressed by a number of discriminative approaches, resulting in substantially reduced alignment
error rates. In most cases, these
discriminative approaches have used a few hundred parallel sentence pairs with
word alignments annotated, plus hundreds of thousands of parallel sentence
pairs with no word-level annotation, making this task a prime example of
semi-supervised learning. In this talk,
we will look in detail at some of the machine learning issues in one of the
most successful efforts at discriminative word alignment, including benefits of
stacking of learners and refinements of the averaged perceptron
approach to learning classifiers with structured outputs.
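The averaged perceptron mentioned above can be sketched on a toy multiclass problem (standing in for the structured word-alignment case; hypothetical data, not the talk's setup): weights are updated on errors as usual, and the average of all intermediate weight vectors is returned instead of the final one, which typically generalizes better.

```python
import numpy as np

def averaged_perceptron(data, n_feats, n_labels, epochs=5):
    """Averaged perceptron: train by error-driven updates, return the
    average of the weight matrix over all (epoch, example) steps."""
    w = np.zeros((n_labels, n_feats))
    total = np.zeros_like(w)
    for _ in range(epochs):
        for x, y in data:
            pred = int(np.argmax(w @ x))
            if pred != y:          # standard perceptron update on errors
                w[y] += x
                w[pred] -= x
            total += w             # accumulate weights for averaging
    return total / (epochs * len(data))
```

In the structured case the argmax ranges over whole alignments rather than labels, but the update and averaging logic are the same.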

**Machine Learning for Speaker Recognition**

Andreas Stolcke, SRI International

Abstract:

This talk will review some
of the main ML techniques employed in state-of-the-art speaker recognition
systems, in terms of both modeling and feature design. For modeling, the two main paradigms
currently in use are Gaussian mixture models with joint factor analysis, and
support vector machines. The latter in
particular have enabled a wealth of approaches that model speakers via high
dimensional feature vectors drawn from a wide range of observation spaces,
including cepstral, phonetic, prosodic, and lexical
features. A pervasive problem in feature design is how to collapse a
variable-length stream of observations into a fixed-length feature vector. SVM
kernels designed for this situation are based on features generated by
polynomial expansion, N-gram frequencies, and GMM mixture weights.
Miscellaneous other issues include parameter smoothing (prior modeling) and
model combination. It is hoped that the talk will give a glimpse into a fascinating application domain for machine learning methods and encourage ML researchers to contribute to advances in speaker recognition.
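One of the feature designs named above, n-gram frequencies, directly illustrates the variable-length-to-fixed-length problem (a minimal sketch; the n-gram vocabulary is assumed fixed in advance):

```python
from collections import Counter

def ngram_vector(tokens, n, vocab):
    """Collapse a variable-length token stream into a fixed-length vector
    of n-gram relative frequencies, suitable as input to an SVM kernel."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(grams.values()) or 1
    return [grams[g] / total for g in vocab]
```

A linear kernel over such vectors compares two utterances of arbitrary length through their n-gram usage; the polynomial-expansion and GMM-weight kernels in the abstract follow the same fixed-length-vector pattern with different feature maps.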