Picto: A large-scale visual indexing and recognition system

Object image recognition is a challenging but important problem. To address it, we initiated the Picto project. Our research in this project covers three fundamental aspects of the problem: low-level image features, middle-level image representations, and indexing and recognition algorithms. We especially emphasize scalability and applicability in our research.

1. Large-scale indexing techniques

[Figure: Arch of ImageX]

In most object image retrieval systems, images are represented by so-called bag-of-visual-words (BOF) vectors, in which each entry corresponds to a “visual word”. Because BOF vectors are high-dimensional and extremely sparse, most traditional indexing approaches cannot index them efficiently. In this work, we propose to decompose the visual words of a BOF vector into three parts: background words, topic words, and image-specific words [1]. Topic words can be further compressed into a compact vector in a low-dimensional subspace, and the number of image-specific words can be kept small. These properties enable an efficient indexing framework, which is named DocX in our papers. We evaluated the framework on various benchmark datasets and obtained promising results.
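As a rough illustration of the three-part decomposition, the sketch below splits a small BOF matrix into background words, a compact topic code, and a few image-specific words. It is not the model from [1]: the document-frequency threshold, the use of truncated SVD for the low-dimensional subspace, and all names (`decompose_bof`, `background_df`, `n_topics`, `n_specific`) are assumptions made only for this example.

```python
import numpy as np

def decompose_bof(bof, background_df=0.5, n_topics=2, n_specific=3):
    """Split each BOF row into background, topic, and image-specific parts.

    bof: (n_images, vocab_size) array of visual-word counts.
    Returns (background_mask, topic_codes, specific_words).
    """
    bof = np.asarray(bof, dtype=float)

    # Background words (assumption): visual words that occur in more than
    # `background_df` of all images carry little discriminative information.
    doc_freq = (bof > 0).mean(axis=0)
    background = doc_freq > background_df
    rest = np.where(background, 0.0, bof)  # zero out background columns

    # Topic part (assumption): truncated SVD stands in for the topic model.
    # Each image is compressed to an n_topics-dimensional code.
    _, _, vt = np.linalg.svd(rest, full_matrices=False)
    basis = vt[:n_topics]                  # (n_topics, vocab_size)
    topic_codes = rest @ basis.T           # (n_images, n_topics)
    residual = rest - topic_codes @ basis

    # Image-specific part: keep only the few largest residual entries,
    # stored sparsely as {visual_word_id: residual_weight}.
    specific = []
    for row in residual:
        top = np.argsort(-np.abs(row))[:n_specific]
        specific.append({int(w): float(row[w]) for w in top if abs(row[w]) > 1e-9})
    return background, topic_codes, specific

bof = [[5, 2, 0, 0, 1, 0],
       [4, 0, 3, 0, 0, 0],
       [6, 2, 0, 0, 0, 1],
       [3, 0, 3, 1, 0, 0]]
background, topic_codes, specific = decompose_bof(bof)
print(background)         # visual word 0 occurs in every image
print(topic_codes.shape)  # one compact code per image
print(specific[0])        # sparse image-specific words of image 0
```

In this sketch each image is stored as a short dense code plus a handful of sparse entries, which is the property the indexing framework relies on.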

2. Incorporating spatial information in index

Spatial relationships among local features are among the most important kinds of information in an image, yet they are ignored by the bag-of-visual-words representation. To utilize spatial information, most existing systems adopt a two-phase re-ranking approach. In the first phase, they retrieve a set of candidate images by BOF; in the second phase, spatial matching is conducted to verify the coherence between the query image and the candidate images. However, this approach has two obvious shortcomings: it is likely to miss good images in the first phase, and the second phase is expensive in both computation and memory. To overcome these problems, we propose the so-called spatial-bag-of-features (SBOF) [3]. We project the local features of an image along random lines or circles. In this way, we obtain initial features that capture preliminary spatial relationships among local features. By further manipulating these basic features, we obtain advanced features that are invariant to the scale, translation, or rotation of objects. Moreover, the new feature is still a histogram-based representation like BOF, so it can be efficiently indexed by existing techniques. Although the feature is proposed for retrieval, it can be used in many other problems, e.g. object image categorization.
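The projection idea can be sketched as follows. This is a minimal illustration and not the implementation from the SBOF paper: the function names, the min-max normalization along the line, and the cyclic-shift calibration are assumptions chosen to show how ordered histograms can become tolerant to translation, scale, and rotation.

```python
import numpy as np

def linear_sbof(xy, words, vocab_size, theta, n_bins):
    """Project features onto a line at angle theta; one BOF histogram per segment."""
    xy = np.asarray(xy, dtype=float)
    words = np.asarray(words, dtype=int)
    direction = np.array([np.cos(theta), np.sin(theta)])
    t = xy @ direction                     # position of each feature along the line
    span = t.max() - t.min()
    # Assumption: normalizing by the extent of the features removes the
    # effect of translation and scale along the line.
    pos = (t - t.min()) / span if span > 0 else np.zeros_like(t)
    bins = np.minimum((pos * n_bins).astype(int), n_bins - 1)
    hist = np.zeros((n_bins, vocab_size))
    np.add.at(hist, (bins, words), 1)      # count visual words per segment
    return hist

def circular_sbof(xy, words, vocab_size, center, n_sectors):
    """Project features onto a circle around center; one BOF histogram per sector."""
    d = np.asarray(xy, dtype=float) - np.asarray(center, dtype=float)
    words = np.asarray(words, dtype=int)
    angle = np.arctan2(d[:, 1], d[:, 0]) % (2 * np.pi)
    sectors = np.minimum((angle / (2 * np.pi) * n_sectors).astype(int), n_sectors - 1)
    hist = np.zeros((n_sectors, vocab_size))
    np.add.at(hist, (sectors, words), 1)
    return hist

def calibrate(hist):
    """Cyclically shift the bins so that the most populated bin comes first."""
    start = int(np.argmax(hist.sum(axis=1)))
    return np.roll(hist, -start, axis=0)

# Four features on a horizontal line: word 0 on the left, word 1 on the right.
xy = [(0, 0), (1, 0), (2, 0), (3, 0)]
words = [0, 0, 1, 1]
h = linear_sbof(xy, words, vocab_size=2, theta=0.0, n_bins=2)
print(h)                # segment 0 holds word 0, segment 1 holds word 1
print(h.sum(axis=0))    # summing over segments recovers the plain BOF

# Rotating the features by 90 degrees shifts the sectors cyclically.
pts = [(1, 0.2), (1, 0.4), (-0.2, 1), (-1, -0.4)]
rot = [(-y, x) for x, y in pts]
w = [0, 1, 0, 1]
a = calibrate(circular_sbof(pts, w, 2, center=(0, 0), n_sectors=4))
b = calibrate(circular_sbof(rot, w, 2, center=(0, 0), n_sectors=4))
```

Because the result is still a stack of histograms, it can be flattened and fed to the same inverted-index machinery used for BOF.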

3. Supervised dictionary learning

Bag-of-features (BOF) is a powerful and popular representation of images for visual object categorization. To obtain the representation, a dictionary must be generated to quantize the local features of images. However, most existing works adopt unsupervised approaches, e.g. k-means, to generate the dictionaries. Such dictionaries are not optimal for categorization because they ignore supervisory information. To overcome this problem, we propose supervised probabilistic models to generate dictionaries [2]. The proposed models can also be applied to other applications that require training supervised Gaussian Mixture Models (GMM). To further improve the capability of the learned dictionaries, we introduce the max-margin criterion into the objective function. The new objective function integrates classifier training and dictionary learning [4].
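For context, the unsupervised baseline mentioned above can be sketched in a few lines: a k-means dictionary is learned from local descriptors, and an image is then quantized into a BOF histogram. This shows only the baseline, not the supervised models of [2] or [4]; the function names and the toy 2-D descriptors are assumptions made for illustration.

```python
import numpy as np

def assign(x, centers):
    """Index of the nearest dictionary entry for every descriptor."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans_dictionary(descriptors, k, n_iter=20, seed=0):
    """Learn k visual words with plain Lloyd iterations (no class labels used)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(descriptors, dtype=float)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(x, centers)
        for j in range(k):
            members = x[labels == j]
            if len(members):               # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

def bof_histogram(descriptors, centers):
    """Quantize descriptors against the dictionary and count visual words."""
    labels = assign(np.asarray(descriptors, dtype=float), centers)
    return np.bincount(labels, minlength=len(centers))

# Toy descriptors forming two well-separated groups.
train = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers = kmeans_dictionary(train, k=2)
hist = bof_histogram([[0.2, 0.1], [9.5, 10.2], [10.1, 9.9]], centers)
print(centers)
print(hist)
```

Nothing in this procedure looks at category labels, which is the limitation the supervised dictionary learning models are meant to address.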



[1] Efficient indexing for large scale visual search. Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum. ICCV 2009

[2] Probabilistic models for supervised dictionary learning. Xiao-chen Lian, Zhiwei Li, Changhu Wang, Bao-liang Lv, and Lei Zhang. CVPR 2010

[3] Spatial-bag-of-features. Yang Cao, Changhu Wang, Zhiwei Li, Liqing Zhang, and Lei Zhang. CVPR 2010

[4] Max-margin dictionary learning for multiclass image categorization. Xiao-chen Lian, Zhiwei Li, Bao-liang Lv, and Lei Zhang. ECCV 2010