Learning the Discriminative Power-Invariance Trade-Off


1. Introduction

We investigate the problem of classifying photographs of objects and materials, obtained under unknown imaging conditions, into one of a set of pre-specified categories (see Fig. 1). Most solutions to this problem follow a two-stage approach. The first stage focuses on extracting compact and relevant feature descriptors from the given training images. Descriptors must be good at discriminating between objects from different categories while simultaneously being invariant to imaging variations (such as those due to camera rotation or zoom, changes in illumination, etc.) as well as to within-class variability. The second stage then focuses on learning accurate and efficient classifiers based on the extracted feature descriptors. Our objective, in this project, is to learn the balance between descriptor invariance and discriminative power from data and prior constraints, in a formulation geared specifically towards classification.

Figure 1. The image categorization problem is that of classifying a novel image into one of a set of many categories specified by training images. The problem is very difficult because of the numerous sources of variability. Objects within a class might not resemble each other visually (such as chairs) whereas objects from different classes might (schooners and ketches). Furthermore, changes in imaging conditions can dramatically impact the appearance of objects, as is illustrated by the images of the Taj. Non-rigid deformations of the Panda's body add another layer of complexity.

Designing good descriptors is a fundamental problem in visual classification and many successful ones have been proposed in the literature. If one looks past the initial dissimilarities, what really distinguishes one descriptor from another is the trade-off that it achieves between discriminative power and invariance. For instance, image patches, when compared using standard Euclidean distance, have almost no invariance but very high discriminative power. At the other extreme, a constant descriptor has complete invariance but no discriminative power. Most popular descriptors place themselves somewhere along this spectrum according to what they believe is the optimal trade-off.

Figure 2. The trade-off between discriminative power and invariance is task dependent: In (a) a rotationally invariant descriptor would be inappropriate while in (b) it would be necessary. However, if a large training corpus is available for a difficult problem as in (c), then we should revert back to a less invariant and more discriminative descriptor as the data itself would provide the invariance.

However, the trade-off between invariance and discriminative power depends on the specific classification task at hand. It varies according to the training data available as well as prior knowledge and thus no single descriptor can be optimal for all tasks. For example, when classifying digits, one would not like to use a fully rotationally invariant descriptor as a 6 would then be mistaken for a 9 (see Fig. 2). If the task was now simplified to distinguishing between just 4 and 9, then it would be preferable to have full rotational invariance if the digits could occur at any arbitrary orientation. However, 4s and 9s are easily confused. Therefore, if a rich enough training corpus was available with digits present at a large number of orientations, then one could revert back to a more discriminative and less invariant descriptor. In this scenario, the data itself would provide the rotation invariance and even nearest neighbour matching of rotationally variant descriptors would do well. As such, even if an optimal descriptor could be hand-crafted for a given task, it might no longer be optimal as the training set size is varied.

Figure 3. Pinpointing the exact level of invariance required is hard. Just scale or rotation invariance won't suffice as some of the examples will be misclassified. Increasing the level of invariance to similarity or affine might produce even worse classification results as 4s and 9s are easily confused and one would like to use as discriminating a descriptor as possible.

As Fig. 2 (a) and (b) illustrate, it is often easy to arrive at the broad level of invariance or discriminative power necessary for a particular classification task by visual inspection alone. However, figuring out the exact trade-off can be more difficult (see Fig. 3). Let us go back to our example of classifying 4 versus 9. If only rotated copies of both digits were present in the training set, then we could conclude that, broadly speaking, rotationally invariant descriptors would be suited to this task. However, what if some of the rotated digits were now scaled by a small factor, just enough to start causing confusion between the two digits? We might now consider moving up to similarity or affine invariant descriptors. However, this might lead to even poorer classification performance, as such descriptors would have lower discriminative power than just rotation or scale invariant ones.

An ideal solution would be for every descriptor to have a continuously tunable meta parameter controlling its level of invariance. By varying the parameter, one could generate an infinite set of base descriptors spanning the complete range of the trade-off and, from this set, select the single base descriptor corresponding to the optimal trade-off level. The optimal descriptor's kernel matrix should have the same structure as the ideal kernel (essentially corresponding to zero intra-class and infinite inter-class distances) as in kernel target alignment. Unfortunately, most descriptors proposed in the literature don't have such a continuously tunable parameter.
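The ideal kernel mentioned above, and the notion of a kernel matrix matching its structure, can be made concrete with kernel target alignment: the normalised Frobenius inner product between a kernel matrix K and the target yy'. A minimal NumPy sketch (the function name and the toy kernels below are illustrative, not from the paper):

```python
import numpy as np

def kernel_target_alignment(K, y):
    """Alignment A(K, yy') = <K, yy'>_F / (||K||_F * ||yy'||_F),
    where y has entries in {-1, +1}."""
    Y = np.outer(y, y)                    # ideal kernel: +1 within class, -1 across
    num = np.sum(K * Y)                   # Frobenius inner product <K, yy'>_F
    den = np.linalg.norm(K) * np.linalg.norm(Y)
    return num / den

# Toy example with four points, two per class.
y = np.array([1.0, 1.0, -1.0, -1.0])
K_ideal = np.outer(y, y)                  # perfectly aligned with the labels
K_const = np.ones((4, 4))                 # constant kernel: no discriminative power
print(kernel_target_alignment(K_ideal, y))   # 1.0
print(kernel_target_alignment(K_const, y))   # 0.0
```

A perfectly invariant but non-discriminative descriptor (the constant kernel) scores zero, while the ideal kernel scores one; real descriptors fall in between.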

It is nevertheless possible to discretely sample the levels of invariance and generate a finite set of base descriptors. The optimal descriptor can still be approximated, not by selecting one of the base descriptors, but rather by taking their combination. However, approximating the ideal kernel via kernel target alignment is no longer appropriate as the method is not geared for classification.

Our solution instead is to combine a minimal set of base descriptors specifically for classification. Let us return to our 4 versus 9 example of Fig. 3. Starting with base descriptors that are rotationally invariant, scale invariant, affine invariant, etc., our solution is to approximate the optimal descriptor by combining the rotationally invariant descriptor with just the scale invariant one. The combined descriptor would have neither invariance in full. As a result, the distance between a digit and its rotated copy would no longer be zero, but would still be tolerably small. Similarly, small scale changes would lead to small but non-zero within-class distances. However, the combined distance between classes would also be increased, and by a sufficient margin to ensure good classification.

We pose this problem in the kernel learning framework. Rather than explicitly building a combined descriptor, we directly learn a combined kernel (distance) function, as this has many advantages. The optimal kernel for the specified task is approximated as the linear combination of base kernels which minimises the hinge loss on the training set, subject to regularisation. This leads to a convex optimisation problem with a unique global optimum which can be solved efficiently. The learnt base kernel weights indicate the discriminative power-invariance trade-off, while the learnt kernel can directly lead to superior classification.
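To illustrate the flavour of the approach, the following is a deliberately simplified NumPy sketch, not the paper's actual convex programme or solver. It fits kernel-expansion coefficients and non-negative base-kernel weights by subgradient descent on a regularised hinge loss, using two hypothetical toy kernels (one encoding the class structure, one constant):

```python
import numpy as np

def combine_kernels(kernels, d):
    """Linear combination sum_k d_k * K_k of base kernel matrices."""
    return sum(w * K for w, K in zip(d, kernels))

def learn_weights(kernels, y, lam=0.1, steps=500, lr=0.01):
    """Toy sketch: jointly fit coefficients beta of f = K beta and
    base-kernel weights d >= 0 (normalised to sum to one) by subgradient
    descent on  sum_i max(0, 1 - y_i f_i) + lam * beta' K beta."""
    n, m = len(y), len(kernels)
    beta = np.zeros(n)
    d = np.full(m, 1.0 / m)
    for _ in range(steps):
        K = combine_kernels(kernels, d)
        active = (y * (K @ beta)) < 1          # margin-violating examples
        # Subgradient w.r.t. beta: hinge term plus regulariser
        g_beta = -(K @ (y * active)) + 2 * lam * (K @ beta)
        # Subgradient w.r.t. each kernel weight d_k
        g_d = np.array([-(y * active) @ (Kk @ beta) + lam * beta @ Kk @ beta
                        for Kk in kernels])
        beta -= lr * g_beta
        d = np.clip(d - lr * g_d, 0.0, None)   # keep weights non-negative
        if d.sum() > 0:
            d /= d.sum()                       # normalise onto the simplex
    return d, beta

# Toy problem: two hypothetical base kernels.  K_disc encodes the class
# structure (plus a ridge for positive definiteness); K_const is a
# constant kernel with no discriminative power.
y = np.array([1.0, 1.0, -1.0, -1.0])
K_disc = np.outer(y, y) + np.eye(4)
K_const = np.ones((4, 4))
d, beta = learn_weights([K_disc, K_const], y)
# The learnt weights concentrate on the discriminative kernel.
```

The learnt weight vector d plays the role described above: it reports where on the discriminative power-invariance spectrum the combined kernel sits, while the combined kernel itself can be used directly for classification.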



