Share on Facebook Tweet on Twitter Share on LinkedIn Share by email
Audio-Visual Graphical Models for Speech Processing

J. Hershey, H. Attias, N. Jojic, and T. Kristjansson


Perceiving sounds in a noisy environment is a challenging problem. Visual lip-reading can provide relevant information but is also challenging because lips are moving and a tracker must deal with a variety of conditions. Typically audio-visual systems have been assembled from individually engineered modules. We propose to fuse audio and video in a probabilistic generative model that implements cross-model self-supervised learning, enabling adaptation to audio-visual data. The video model features a Gaussian mixture model embedded in a linear subspace of a sprite which translates in the video. The system can learn to detect and enhance speech in noise given only a short (30 second) sequence of audio-visual data. We show some results for speech detection and enhancement, and discuss extensions to the model that are under investigation.


Publication typeInproceedings
Published inProc. of the Int. Conf. on Acoustics, Speech, and Signal Processing
> Publications > Audio-Visual Graphical Models for Speech Processing