Nebojsa Jojic, Alessandro Perina, and Vittorio Murino
In order to study the properties of total visual input in humans, a single subject wore a camera for two weeks capturing, on average, an image every 20 seconds. The resulting new dataset contains a mix of indoor and outdoor scenes as well as numerous foreground objects. Our first analysis goal is to create a visual summary of the subject’s two weeks of life using unsupervised algorithms that would automatically discover recurrent scenes, familiar faces or common actions. Direct application of existing algorithms, such as panoramic stitching (e.g. Photosynth) or appearance-based clustering models (e.g. the epitome), is impractical due to either the large dataset size or the dramatic variation in the lighting conditions.
As a remedy to these problems, we introduce a novel image representation, the “stel epitome,” and an associated efficient learning algorithm. In our model, each image or image patch is characterized by a hidden mapping T, which, as in previous epitome models, defines a mapping between the image-coordinates and the coordinates in the large "all-I-have-seen" epitome matrix. The limited epitome real-estate forces the mappings of different images to overlap, with this overlap indicating image similarity. However, in our model the image similarity does not depend on direct pixel-to-pixel intensity/color/feature comparisons as in previous epitome models, but on spatial configuration of scene or object parts, as the model is based on the palette-invariant stel models. As a result, stel epitomes capture structure that is invariant to non-structural changes, such as illumination, that tend to uniformly affect pixels belonging to a single scene or object part.
|Published in||Neural Information Processing Systems|