|Introduction Method Applications|
We start with Nk base descriptors and associated kernel functions K1, ..., KNk. Each descriptor achieves a different trade-off between discriminative power and invariance on the specified task. The optimal descriptor's kernel is approximated as Kopt = Σk dk Kk where the weights d correspond to the trade-off level. The optimisation is carried out in an SVM framework and we set up the following primal cost function
The objective function is near identical to the standard l1 C-SVM objective. Given the misclassification penalty C, it maximises the margin while minimising the hinge loss on the training set (xi, yi). The only addition is an l1 regularisation on the weights d since we would like to discover a minimal set of invariances. Thus, most of the weights will be set to zero depending on the parameters σ which encode our prior preferences for descriptors. The l1 regularisation thus prevents overfitting if many base kernels are included since only a few will end up being used. Also, it can be shown that the quantity wTw is minimised by increasing the weights and letting the support vectors tend to zero. The regularisation prevents this from happening and can therefore be seen as not letting the weights become too large. This could also be achieved by requiring that the weights sum to unity but we prefer not to do this as it restricts the search space.
The constraints are also similar to the standard SVM formulation. Only two additional constraints have been incorporated. The first, d ≥ 0, ensures that the weights are interpretable and also leads to a much more efficient optimisation problem. The second, Ad ≥ p, with some restrictions, lets us encode our prior knowledge about the problem. The final condition is just a restatement of Kopt = Σk dk Kk using the non-linear embedding Φ.
It is straightforward to derive the corresponding dual problem which turns out to be
where the non-zero αs correspond to the support vectors, Y is a diagonal matrix with the labels on the diagonal and Ak is the kth column of A. The dual is convex with a unique global optimum. It is an instance of a Second Order Cone Program and can be solved relatively efficiently by off-the-shelf numerical optimisation packages such as SeDuMi.
However, in order to tackle large scale problems involving hundreds of kernels we adopt the minimax optimisation strategy of [Chapelle et al. ML 2002, Rakotomamonjy et al. ICML 2007]. In their method, the primal is reformulated as Mind T(d) subject to d ≥ 0 and Ad ≥ p, where
|Figure 1. A contour plot of T and a surface plot of -T for a binary classification task (Gerenuk versus Pagoda) on the Caltech 101 database when only two kernels are used.|
Fig.1 shows plots of T for a binary classification task from the Caltech 101 database. The optimisation strategy is now to minimise T using projected gradient descent via the iteration dn+1 = dn - εn∇T taking care to ensure that the constraints are satisfied. The important step then is calculating ∇T. In order to do so, we look to the dual of T which is
By the principle of strong duality T(d) = W(d). Furthermore, if alpha* maximises W, then [Bonnans and Shapiro] have shown that W is differentiable if alpha* is unique (which it is in our case since all the kernel matrices are strictly positive definite). Finally, as proved in Lemma 2 of [Chapelle et al. ML 2002], W can be differentiated with respect to d as if alpha* did not depend on d. We therefore get
The minimax algorithm proceeds in two stages. In the first, d and therefore K = Σkdk Kk are fixed. Since σtd is a constant, W is the standard SVM dual with kernel matrix K. Any large scale SVM solver of choice can therefore be used to maximise W and obtain alpha*. In the second stage, T is minimised by projected gradient descent according to dn+1 = dn - εn∇T taking care to ensure that the constraints are satisfied. The stepsize εn is chosen as a variant of the Armijo rule so as to guarantee convergence. The two stages are repeated until either the algorithm converges or a maximum number of iterations is reached.
A novel point x can now be classified as ± 1 by determining sign( Σi yi α*i Kopt(x,xi) + b). To handle multi-class problems, both 1-vs-1 and 1-vs-All formulations are tried. For 1-vs-1, the task is divided into pairwise binary classification problems and a novel point is classified by taking the majority vote over classifiers. For 1-vs-All, one classifier is learnt per class and a novel point is classified according to its maximal distance from the separating hyperplanes.
|«||Introduction Method Applications||»|