Generative models for 2-D images of 3-D scenes

Anitha Kannan, Ph.D. thesis, University of Toronto, 2006

 

Supervisor: Brendan J. Frey

Committee members: Geoff Hinton and Rich Zemel

External examiner: Bill Freeman

________________________________________________________________________________________________________________________________

 

Thesis Abstract: [Download the thesis]

 

This thesis introduces generative models of appearance for analyzing 2D images of 3D visual scenes. Many 2D images contain multiple objects, and the image components corresponding to each object undergo deformations, e.g., due to changes in the shape and appearance of the object, non-uniform scaling caused by camera zoom, and global transformations such as changes in the location of the object. In this thesis, we use Bayesian networks to model these sources of variability and their potentially noisy interactions in describing an image. Given an observed image, the model is used to infer the distribution over possible explanations in terms of the unobserved sources. However, the number of combinations of these sources is exponentially large, so an exact search over them is practically infeasible. This thesis therefore introduces approximate inference and learning algorithms, based on variational methods, for inferring the underlying explanations of the observed data.
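
As a brief illustration of what variational methods optimize, the standard lower bound on the log-likelihood can be written as follows. The symbols are assumptions made for this sketch and are not taken from the thesis: x denotes the observed image, h the unobserved sources, p the generative model, and q an approximating distribution over h.

```latex
\log p(x) \;=\; \log \sum_{h} p(x, h)
\;\ge\; \sum_{h} q(h) \log \frac{p(x, h)}{q(h)}
\;=\; \log p(x) \;-\; \mathrm{KL}\!\left( q(h) \,\|\, p(h \mid x) \right)
```

Maximizing the bound over a restricted family of distributions q avoids the exhaustive sum over h, at the cost of an approximation gap equal to the KL term.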

 

The first part of this thesis introduces a fast unsupervised algorithm for learning an object’s appearance in a linear subspace while remaining invariant to global transformations such as translations, rotations, and scalings. To model multiple objects in a scene, the second part presents a generative model that explains an image as a layered composition of “cardboard cutouts”. Each object is accounted for by a 2D image model that includes a transparency map specifying which pixels belong to the object. Each such “layer” includes hidden variables that account for sources of variability such as changes in appearance, shape, deformation, and global transformation. The model is learned using a novel unsupervised variational technique that takes a video sequence as its only input. The final part of the thesis introduces a new representation of image data called the “epitome”. An epitome is useful for representing repetitive texture features in an image that has both high- and low-frequency components across a variety of spatial scales. This representation is used to account for both appearance (texture) and 2D shape (the shapes of object boundaries) in a generative model of a single image containing two objects. Inference in this model enables segmenting a single image into two layers of appearance and shape. Thus, in this thesis, we advocate the use of probability models as a natural way to represent, learn, and make inferences about visual data.
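
The layered composition described in the second part can be illustrated with a minimal sketch. This is not the thesis's model: it omits the hidden variables, noise, and transformations, and shows only how per-layer appearance images and transparency maps could be combined back to front. The function name `compose_layers` and the toy arrays are hypothetical, chosen for illustration.

```python
import numpy as np


def compose_layers(appearances, masks):
    """Composite layers back to front (hypothetical helper, not from the thesis).

    appearances: list of HxW float arrays, one appearance image per layer.
    masks: list of HxW float arrays in [0, 1]; 1 means the pixel belongs
           to that layer, 0 means the layers behind it show through.
    """
    image = np.zeros_like(appearances[0], dtype=float)
    for appearance, mask in zip(appearances, masks):
        # Where the mask is 1 the layer covers what is behind it;
        # where it is 0 the already-composited image is kept.
        image = mask * appearance + (1.0 - mask) * image
    return image


# Toy example: a uniform background and a 2x2 foreground square.
background = np.full((4, 4), 0.2)
foreground = np.full((4, 4), 0.9)
background_mask = np.ones((4, 4))   # the background covers every pixel
foreground_mask = np.zeros((4, 4))
foreground_mask[1:3, 1:3] = 1.0     # the foreground occupies the centre

image = compose_layers(
    [background, foreground],
    [background_mask, foreground_mask],
)
```

In this toy case the centre pixels take the foreground value and the rest take the background value. Segmenting an image into layers amounts to inverting this process: inferring the masks and appearances from the composited image alone.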