This thesis investigates two well defined problems in image segmentation, viz. interactive and semantic image segmentation. Interactive segmentation involves power assisting a user in cutting out objects from an image, whereas semantic segmentation involves partitioning pixels in an image into object categories. We investigate various models and energy formulations for both these problems in this thesis.
In order to improve the performance of interactive systems, low level texture features are introduced as a replacement for the more commonly used RGB features. To quantify the improvement obtained by using these texture features, two annotated datasets of images are introduced (one consisting of natural images, and the other consisting of camouflaged objects). A significant improvement in performance is observed when using texture features for the case of monochrome images and images containing camouflaged objects. We also explore adding mid-level cues such as shape constraints into interactive segmentation by introducing the idea of geodesic star convexity, which extends the existing notion of a star convexity prior in two important ways: (i) It allows for multiple star centres as opposed to single stars in the original prior and (ii) It generalises the shape constraint by allowing for Geodesic paths as opposed to Euclidean rays. Global minima of our energy function can be obtained subject to these new constraints. We also introduce Geodesic Forests, which exploit the structure of shortest paths in implementing the extended constraints. These extensions to star convexity allow us to use such constraints in a practical segmentation system. This system is evaluated by means of a "robot user" to measure the amount of interaction required in a precise way, and it is shown that having shape constraints reduces user effort significantly compared to existing interactive systems. We also introduce a new and harder dataset which augments the existing GrabCut dataset with more realistic images and ground truth taken from the PASCAL VOC segmentation challenge.
In the latter part of the thesis, we bring in object category level information in order to make the interactive segmentation tasks easier, and move towards fully automated semantic segmentation. An algorithm to automatically segment humans from cluttered images given their bounding boxes is presented. A top down segmentation of the human is obtained using classifiers trained to predict segmentation masks from local HOG descriptors. These masks are then combined with bottom up image information in a local GrabCut like procedure. This algorithm is later completely automated to segment humans without requiring a bounding box, and is quantitatively compared with other semantic segmentation methods. We also introduce a novel way to acquire large quantities of segmented training data relatively effortlessly using the Kinect. In the final part of this work, we explore various semantic segmentation methods based on learning using bottom up superpixelisations. Different methods of combining multiple super-pixelisations are discussed and quantitatively evaluated on two segmentation datasets. We observe that simple combinations of independently trained classifiers on single super-pixelisations perform almost as good as complex methods based on jointly learning across multiple super-pixelisations. We also explore CRF based formulations for semantic segmentation, and introduce novel visual words based object boundary description in the energy formulation. The object appearance and boundary parameters are trained jointly using structured output learning methods, and the benefit of adding pairwise terms is quantified on two different datasets.
|Institution||University of Oxford|