Managing Spoken Documents

Mari Ostendorf, University of Washington, USA


Facial Image Synthesis, Analysis, and Recognition  

Demetri Terzopoulos, University of California at Los Angeles, USA


Advances in Scalable Video Compression   

Jens-Rainer Ohm, RWTH Aachen University, Germany


Smart Surveillance: Advanced Video Analytics and Middleware for Security and Retail Applications  

Andrew Senior, IBM Research, USA


Network Coding for the Internet and Wireless Networks  

Philip Chou, Microsoft Research, USA


Capturing and Rendering Three-Dimensional Auditory and Visual Scenes  

Ramani Duraiswami, University of Maryland, College Park, USA



OVERVIEW TALK 1

 

When:            Wednesday, October 4, 2006, 9:15 AM-10:00 AM

Where:           Crystal Ballroom

  

Title:               Managing Spoken Documents

 

Speaker:        Mari Ostendorf

                       University of Washington, USA

                       mo@ee.washington.edu

 

Abstract:


As storage costs drop and bandwidth increases, there has been rapid growth in the information available via the web or in online archives, raising problems of finding and interpreting collections of documents. Significant recent progress has been made in text retrieval, analysis, summarization and translation, but much of this work has focused on written language. Increasingly, speech and video signals are also available -- including TV and radio broadcasts, congressional records, oral histories, voicemail, call center recordings, etc. -- which can be thought of as "spoken documents". Because it takes longer to listen to audio than to read text, spoken documents are a prime candidate for automatic indexing, information extraction, and other such technologies. In this talk we overview the speech processing technology that underlies spoken document management, including mathematical frameworks for both word and metadata recognition, and for integrating video and language cues. In addition, we discuss issues that arise in text processing when moving from written to spoken language, and their implications for statistical models of language.

 

Outline:


1.         The role of speech in multimedia

- Driver: the problem of managing unstructured information
- Enablers:
            (a) Advances in text processing
            (b) Advances in speech and speaker recognition

2.       Application challenges in processing speech vs. text

-  Speech recognition errors

-  Conversational speech issues (disfluencies, wording)

-  Opportunities in combining speech with video

3.       Mathematical models of speech and language

-  Key issues
      (a) Discriminative vs. generative modeling
      (b) Symbolic and/or continuous observations
      (c) Supervised vs. unsupervised learning

-  Key modeling problems
      (a) Categorization
      (b) Sequence labeling
      (c) Sequence transformation

4.       Challenges of "real" systems: uncertainty propagation, dynamic nature of language, etc.

5.       From speech to video: challenges and opportunities

 

 

OVERVIEW TALK 2

 

When:            Wednesday, October 4, 2006, 1:30 PM-2:15 PM

Where:           Crystal Ballroom

 

Title:              Facial Image Synthesis, Analysis, and Recognition   

 

Speaker:        Demetri Terzopoulos

University of California at Los Angeles

                        dt@cs.ucla.edu

 

Abstract:       

Methodologies spanning computer graphics, computer vision, and machine learning are of significant interest in multimedia signal processing. In particular, we overview several model-based and image-based methods for synthesizing, analyzing, and recognizing facial imagery. The model-based methods include functional models of the human face/head, which incorporate biomechanical synthetic tissues with embedded muscle actuators, and techniques for applying them to computer animation and expression estimation in video. The image-based methods include new representations of facial image ensembles that disentangle pose, illumination, and expression effects to improve facial image compression and recognition. To this end, we present nonlinear generalizations of principal components analysis (PCA) and independent components analysis (ICA) that are based on multilinear algebra, the algebra of higher-order tensors.

 

 

Outline:        

1.         Introduction and motivation

2.         Facial image synthesis

- Model-based methods: Geometric and biomechanical facial models
- Image-based methods: Video-based synthesis of facial animation

3.         Facial image analysis

- Model-based methods: Expression estimation
- Image-based methods: Facial ensemble analysis and learning

4.         Facial image recognition

- Model-based methods: Morphable models
- Image-based methods: Eigenfaces and TensorFaces

5.         Research directions

 


OVERVIEW TALK 3

 

When:             Thursday, October 5, 2006, 9:15 AM-10:00 AM

Where:           Crystal Ballroom

 

Title:               Advances in Scalable Video Compression

Speaker:        Jens-Rainer Ohm
Institute of Communications Engineering

RWTH Aachen University, Germany
ohm@ient.rwth-aachen.de

 

Abstract:

 

The market for digital video is growing, and compression is a core enabling technology. However, the interrelationship between transmission networks and compression technology poses many problems that remain unsolved. It is well known that an efficient scalable representation of video would be useful to provide flexible multi-dimensional resolution adaptation, to support various network and terminal capabilities, and to provide better error robustness. Scalability in traditional hybrid video coding has so far been hindered mainly by the drift problem, which occurs when different predictions are made at the encoder and decoder sides. It imposes severe constraints, in particular when different dimensions of scalability must be supported, e.g. to enable changes in temporal and spatial resolution as well as in fidelity of quantization. Recent advances in motion-compensated interframe operation include closed-loop prediction with limited drift, and fully open-loop temporal compression methods, such as motion-compensated temporal filtering (MCTF), a framework that extends traditional interframe compression into more general temporal-axis transform schemes. This finally makes efficient scalable video compression a reality. The talk will give an in-depth analysis of the current trends and report on the recent scalability extension of the Advanced Video Coding (AVC/H.264) standard. The new developments for scalable compression have provided deeper insight into the benefits that can be drawn from spatio-temporal coherence over shorter and longer distances. Therefore, some of the new compression tools even have the potential to further improve compression efficiency. On this basis, the talk will give an outlook discussing potential future trends in video compression, which still appears far from reaching its final bounds.
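The drift problem described in the abstract can be illustrated with a toy one-dimensional predictive coder. This is only a sketch under stated assumptions, not the scheme used in any standard: the function names (`quantize`, `encode`, `decode`), the integer "frames", and the step sizes are all hypothetical. The encoder predicts each sample from its own full-quality reconstruction, while a reduced-quality decoder only sees a more coarsely quantized residual, so its reference diverges a little more with every frame.

```python
def quantize(value: int, step: int) -> int:
    """Uniform quantizer: snap value to the nearest multiple of step."""
    return step * round(value / step)


def encode(frames, step):
    """Closed-loop predictive encoder.

    Each frame is predicted from the encoder's own reconstruction of the
    previous frame, and only the quantized prediction residual is sent.
    """
    reference = 0
    residuals = []
    for frame in frames:
        residual = quantize(frame - reference, step)
        residuals.append(residual)
        reference += residual  # encoder-side reconstruction
    return residuals


def decode(residuals, coarse_step=None):
    """Decoder; with coarse_step set, it models a reduced-quality layer
    that receives each residual at a coarser quantization."""
    reference = 0
    output = []
    for residual in residuals:
        if coarse_step is not None:
            residual = quantize(residual, coarse_step)
        reference += residual  # decoder-side reconstruction
        output.append(reference)
    return output


# Toy "video": one sample per frame, rising by 3 each frame.
frames = [3 * i for i in range(1, 11)]
residuals = encode(frames, step=1)

full = decode(residuals)                    # same prediction as the encoder
reduced = decode(residuals, coarse_step=4)  # mismatched prediction

full_error = [abs(f - d) for f, d in zip(frames, full)]
reduced_error = [abs(f - d) for f, d in zip(frames, reduced)]
print("full-quality error:   ", full_error)
print("reduced-quality error:", reduced_error)
```

In this toy setting the full-quality decoder tracks the encoder exactly, while the reduced-quality decoder's error grows with every frame because each mismatch is carried forward through the prediction loop.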

 

Outline:

1.       Advantages of scalable video coding (SVC) from an application perspective

2.       An analysis of why scalable video compression did not work efficiently in traditional motion-compensated coding schemes

3.       Recent developments in motion-compensated temporal filtering, open-loop compression and drift control schemes that enable scalable video coding with high compression efficiency

4.       An overview of the present SVC standardization activities in MPEG and JVT

5.       An outlook on potential future improvements

 

 

 

 

OVERVIEW TALK 4

           

 

When:            Thursday, October 5, 2006, 1:30 PM-2:15 PM

Where:           Crystal Ballroom

 

Title:               Smart Surveillance: Advanced Video Analytics and Middleware for Security and Retail Applications

 

Speaker:        Andrew Senior

                       IBM Research, USA

aws@us.ibm.com

 

Abstract:        

Video surveillance is a burgeoning field in computer vision. Terrorism has spurred investment in video surveillance, and demands for increased scalability, coupled with increased computing power, have led to huge growth in the development of automated surveillance software that can "watch" surveillance video in place of fallible security guards. In this tutorial, I will cover the computer vision techniques that enable the automation of video surveillance, together with the scalability and data management issues that must be addressed to deploy real "smart surveillance" applications, and will end by discussing important issues of performance analysis and privacy.

 

Outline:

1.       Challenges of automated video surveillance

(a) Detection:
        - Background subtraction techniques
        - Alternative detection strategies
(b) Tracking
        - Goals
        - Techniques

2.       Data storage

(a) Metadata extraction

(b) Video storage

3.       Scalability
      (a) Distributed systems
      (b) Edge devices and smart cameras

4.       Building smart surveillance applications
      (a) Smart surveillance capabilities: alerts, search
      (b) Applications: security systems, retail and other

5.       Performance Analysis
      (a) Background subtraction (BGS) performance
      (b) Tracking performance
      (c) Task performance

6.       Privacy
      (a) Privacy-protecting surveillance
      (b) Surveillance privacy certification

 

 

OVERVIEW TALK 5

 

 

When:            Friday, October 6, 2006, 9:15 AM-10:00 AM

Where:           Crystal Ballroom

 
Title:               Network Coding for the Internet and Wireless Networks
 
Speaker:         Philip Chou
                       Microsoft Research
                       pachou@microsoft.com
 
Abstract:

 

In a network, imagine a pair of 100-Kbps streams arriving at a node from different directions and competing to be forwarded out of the node on a single 100-Kbps bottleneck link.  How can both streams be forwarded through the bottleneck?  The answer is: combine the two bit streams by superimposing them as vectors over a finite field, send the combined stream through the bottleneck link at 100 Kbps, and finally untangle the streams at their destinations, using side information. This is an example of Network Coding, in which a small amount of linear processing at some of the nodes in a network can yield increases in throughput, decreases in delay, and decreases in energy (in a wireless network), relative to what can be achieved if only pure forwarding (no coding) is allowed.  In this talk I will give an introduction to this new field in practical terms: how to improve communication in the Internet and wireless networks.
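The bottleneck example in the abstract can be sketched in a few lines. This is a minimal illustration, assuming the simplest finite field, GF(2), where superimposing two bit streams is a bytewise XOR; the helper name `xor_bytes` and the sample payloads are hypothetical, and practical systems typically use larger fields and random linear combinations.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Add two equal-length byte strings over GF(2), i.e. bytewise XOR."""
    if len(a) != len(b):
        raise ValueError("streams must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


# Two streams arrive at the node from different directions.
stream_a = b"packet from A"
stream_b = b"packet from B"

# The node forwards a single combined stream over the bottleneck link.
# It is no longer than either input, so the link rate is not exceeded.
coded = xor_bytes(stream_a, stream_b)

# Each destination already holds one of the streams as side information
# and untangles the other by XORing it out of the coded stream.
recovered_b = xor_bytes(coded, stream_a)  # destination that knows A
recovered_a = xor_bytes(coded, stream_b)  # destination that knows B

print(recovered_a, recovered_b)
```

The coded stream has the same length as each input, which is why both streams pass through the bottleneck at the rate of one.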

 
Outline:

1.               Introduction to network coding theory

2.               How to make network coding practical

3.               Internet and wireless applications

a)      Live broadcasting

b)      File downloading

c)      Storage

d)      Messaging

e)      Interactive communication

f)      Sensor networks

 

 

OVERVIEW TALK 6


 

When:              Friday, October 6, 2006, 1:30 PM-2:15 PM

Where:             Crystal Ballroom

 

Title:               Capturing and Rendering Three-Dimensional Auditory and Visual Scenes
               
Speaker:        Ramani Duraiswami
                       University of Maryland, College Park, USA
                       ramani@umiacs.umd.edu
 

 

Abstract:

Humans are able to perceive all three dimensions of the world they live in using both the auditory and visual modalities. In this talk we will first briefly consider the cues that lead to perception of a three-dimensional (3D) world. Audible sound waves and visible light waves have different characteristics, and different ways of interacting with the 3D world. The results of this interaction are received by the biological sensors; cues or "features" are extracted, and fused with prior knowledge to perceive the structure of an auditory or visual scene. When sound is recorded, or a scene is imaged, these cues must be preserved if the recording is to be replayed or rendered with a three-dimensional sense. We will discuss various methods for sound and image recording and playback that are (to differing degrees) able to reproduce the 3D world. Next, we will discuss several issues within this context, including holography, light-field rendering, stereo, shape from shading, shape from motion, and other issues for 3D graphical rendering; ambisonics, wave-field synthesis, head-related transfer functions, and room impulse responses for headphone-based and speaker-system-based playback; sound capture via microphone arrays; and the capture of visual scenes via multiple cameras.

 
 
Outline:

1.       The biological and physical basis of 3D perception in vision and audio

2.       Similarities and differences in audio and visual perception of dimensions

3.       What is necessary to capture and play back video and audio in a way that preserves 3D information

4.       Modern research developments in audio

a)      Higher order sound capture via spherical microphone arrays

b)      Measurement and approximation of room transfer functions and head-related transfer functions

c)      Developing auditory virtual reality

5.       Modern research developments in vision/graphics using light-field representations to recreate three-dimensional worlds

6.       An outlook for future developments and research