Generative models for 2-D images of 3-D scenes
Anitha Kannan, Ph.D. thesis, University of Toronto, 2006
Supervisor: Brendan J. Frey
Committee members: Geoff Hinton and Rich Zemel
External examiner: Bill Freeman
________________________________________________________________________________________________________________________________
Thesis Abstract:
This thesis introduces generative models of appearance for analyzing 2D images of 3D visual
scenes. Many 2D images contain multiple objects, and the image components corresponding to
each object undergo deformations, e.g., due to changes in the shape and appearance of the
object, non-uniform scaling caused by camera zoom, and global transformations such as
changes in the location of the object. In this thesis, we use Bayesian
networks to model these sources of variability and their potentially noisy
interactions while describing an image. Given an observed image, the model is
used to infer the distribution over the possible explanations of the unobserved
sources. However, the number of possible combinations of these sources is exponentially
large, so an exact search over them is practically infeasible. This thesis therefore
introduces approximate inference and learning algorithms, based on variational methods,
for inferring the underlying explanations of the observed data.
The first part of this thesis introduces a fast unsupervised algorithm for learning an
object’s appearance in a linear subspace while remaining invariant to global
transformations such as translations, rotations, and scalings.
For modelling multiple objects in the scene, the second part of the
thesis presents a generative model for explaining an image using a layered composition
of “cardboard cutouts”. Each object is accounted for by a 2D image model that
includes a transparency map specifying which pixels belong to the object.
Each such “layer” includes hidden
variables that account for sources of variability such as changes in
appearance, shape, deformation and global transformation. The model is learned
using a novel unsupervised variational technique that
takes a video sequence as its only input. In the final part of the thesis, a new
representation of image data called the “epitome” is introduced. An epitome is
useful for representing repetitive texture features in an image with both high- and
low-frequency components and at a variety of spatial scales. This
representation is used to account for both appearance (texture) and 2D shape (the shapes
of object boundaries) in a generative model of a single image containing two
objects. Inference in this model enables segmenting a single image into two layers
of appearance and shape. Thus, in this thesis, we advocate the use of
probability models as a natural way to represent, learn and make inferences about
visual data.
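
The layered “cut-out” composition described in the abstract can be illustrated with a small sketch. The code below is not the thesis's model: it shows only a deterministic compositing step, under the assumption that each layer contributes its appearance wherever its transparency map is 1 and lets the layers behind it show through wherever the map is 0. The function name composite, the back-to-front ordering and the toy pixel values are hypothetical choices made for illustration. The hidden variables for deformation and transformation, and the variational learning procedure, are omitted.

```python
def composite(layers, background):
    """Compose layers back to front over a background image.

    layers: list of (appearance, mask) pairs ordered from back to front.
    Each appearance and mask is a 2D list of floats with the same shape
    as the background; mask values lie in [0, 1].
    This is an illustrative assumption, not the thesis's exact formulation.
    """
    # Copy the background so the input is not modified.
    image = [row[:] for row in background]
    for appearance, mask in layers:
        for r, row in enumerate(image):
            for c in range(len(row)):
                m = mask[r][c]
                # The layer's pixel where the mask is on; whatever is behind it elsewhere.
                row[c] = m * appearance[r][c] + (1.0 - m) * row[c]
    return image


# Toy example: one bright layer over a dark background.
background = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
appearance = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
mask = [[1.0, 0.5, 0.0], [0.0, 0.0, 1.0]]
result = composite([(appearance, mask)], background)
```

In a generative treatment of this kind, the appearances and masks would be hidden variables inferred from the observed image rather than given, which is where the approximate inference discussed above comes in.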