We start with N_{k} base descriptors and associated kernel functions K_{1}, ..., K_{N_k}. Each descriptor achieves a different trade-off between discriminative power and invariance on the specified task. The optimal descriptor's kernel is approximated as K_{opt} = Σ_{k} d_{k} K_{k}, where the weights d correspond to the trade-off level. The optimisation is carried out in an SVM framework and we set up the following primal cost function:
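The primal cost function itself did not survive extraction. A plausible reconstruction, consistent with the description that follows (the σ^{T}d term acts as a weighted l_{1} penalty because d ≥ 0; all symbols are as defined in the text), would be:

```latex
\begin{aligned}
\min_{w,\, b,\, \xi,\, d} \quad & \tfrac{1}{2} w^{T} w \;+\; C \sum_{i} \xi_{i} \;+\; \sigma^{T} d \\
\text{s.t.} \quad & y_{i} \left( w^{T} \Phi(x_{i}) + b \right) \,\ge\, 1 - \xi_{i}, \qquad \xi_{i} \ge 0 \\
& d \ge 0, \qquad A d \ge p, \qquad \Phi^{T}(x)\, \Phi(x') \,=\, \textstyle\sum_{k} d_{k}\, K_{k}(x, x')
\end{aligned}
```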
The objective function is nearly identical to the standard l_{1} C-SVM objective. Given the misclassification penalty C, it maximises the margin while minimising the hinge loss on the training set (x_{i}, y_{i}). The only addition is an l_{1} regularisation on the weights d, since we would like to discover a minimal set of invariances. Thus most of the weights will be set to zero, depending on the parameters σ which encode our prior preferences for the descriptors. The l_{1} regularisation therefore prevents overfitting when many base kernels are included, since only a few end up being used. In addition, it can be shown that the quantity w^{T}w can be minimised simply by increasing the weights while letting the support vectors tend to zero. The regularisation prevents this degenerate behaviour, and can therefore be seen as keeping the weights from becoming too large. The same effect could be achieved by requiring the weights to sum to unity, but we prefer not to do so as it restricts the search space.
The constraints are also similar to those of the standard SVM formulation, with only two additions. The first, d ≥ 0, ensures that the weights are interpretable and also leads to a much more efficient optimisation problem. The second, Ad ≥ p, lets us encode our prior knowledge about the problem, subject to some restrictions on A and p. The final condition is just a restatement of K_{opt} = Σ_{k} d_{k} K_{k} in terms of the non-linear embedding Φ.
It is straightforward to derive the corresponding dual problem which turns out to be
where the non-zero αs correspond to the support vectors, Y is a diagonal matrix with the labels on the diagonal and A_{k} is the k^{th} column of A. The dual is convex with a unique global optimum. It is an instance of a Second Order Cone Program and can be solved relatively efficiently by off-the-shelf numerical optimisation packages such as SeDuMi.
However, in order to tackle large scale problems involving hundreds of kernels, we adopt the minimax optimisation strategy of [Chapelle et al. ML 2002, Rakotomamonjy et al. ICML 2007]. In this approach the primal is reformulated as Min_{d} T(d) subject to d ≥ 0 and Ad ≥ p, where T(d) is the value of the primal objective minimised over w, b and ξ with d held fixed.
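The definition of T is missing from this extract; consistent with the surrounding text (T is the SVM primal objective optimised for a fixed set of weights d), it should take the form:

```latex
T(d) \;=\; \min_{w,\, b,\, \xi} \; \tfrac{1}{2} w^{T} w \;+\; C \sum_{i} \xi_{i} \;+\; \sigma^{T} d
\qquad \text{s.t.} \qquad y_{i} \left( w^{T} \Phi(x_{i}) + b \right) \ge 1 - \xi_{i}, \quad \xi_{i} \ge 0
```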
Figure 1. A contour plot of T and a surface plot of -T for a binary classification task (Gerenuk versus Pagoda) on the Caltech 101 database when only two kernels are used.
Fig. 1 shows plots of T for a binary classification task from the Caltech 101 database. The optimisation strategy is to minimise T by projected gradient descent via the iteration d^{n+1} = d^{n} - ε^{n}∇T, taking care to ensure that the constraints remain satisfied. The key step is therefore the calculation of ∇T. To obtain it, we turn to the dual of T, denoted W.
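The expression for W is missing here; a reconstruction consistent with the text (W is the standard SVM dual for the combined kernel, plus the σ^{T}d term, which is constant for fixed d) would be:

```latex
W(d) \;=\; \max_{\alpha} \; \mathbf{1}^{T} \alpha \;-\; \tfrac{1}{2}\, \alpha^{T} Y K_{d} Y \alpha \;+\; \sigma^{T} d
\qquad \text{s.t.} \qquad \mathbf{1}^{T} Y \alpha = 0, \quad 0 \le \alpha \le C,
\quad K_{d} = \textstyle\sum_{k} d_{k} K_{k}
```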
By the principle of strong duality, T(d) = W(d). Furthermore, if α^{*} maximises W, then [Bonnans and Shapiro] have shown that W is differentiable whenever α^{*} is unique (which it is in our case, since all the kernel matrices are strictly positive definite). Finally, as proved in Lemma 2 of [Chapelle et al. ML 2002], W can be differentiated with respect to d as if α^{*} did not depend on d. We can therefore obtain ∇T directly by differentiating W with α^{*} held fixed.
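The resulting derivative is absent from this extract; differentiating the dual term by term with α^{*} held fixed (only the σ^{T}d term and the quadratic term depend on each d_{k}) gives:

```latex
\frac{\partial T}{\partial d_{k}} \;=\; \frac{\partial W}{\partial d_{k}}
\;=\; \sigma_{k} \;-\; \tfrac{1}{2}\, \alpha^{*T} Y K_{k} Y \alpha^{*}
```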
The minimax algorithm proceeds in two stages. In the first, d, and therefore K = Σ_{k}d_{k}K_{k}, is fixed. Since σ^{T}d is then a constant, W reduces to the standard SVM dual with kernel matrix K, so any large scale SVM solver of choice can be used to maximise W and obtain α^{*}. In the second stage, T is minimised by the projected gradient step d^{n+1} = d^{n} - ε^{n}∇T, projecting back onto the feasible set so that the constraints d ≥ 0 and Ad ≥ p remain satisfied. The stepsize ε^{n} is chosen by a variant of the Armijo rule so as to guarantee convergence. The two stages are repeated until either the algorithm converges or a maximum number of iterations is reached.
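The two-stage loop can be sketched as follows. This is a minimal illustration, not the authors' implementation: scikit-learn's SVC with a precomputed kernel stands in for the large-scale SVM solver, a fixed stepsize replaces the Armijo rule, only the d ≥ 0 constraint is projected (the general Ad ≥ p case is omitted), and the `rbf`/`mkl_minimax` names and toy data are our own.

```python
import numpy as np
from sklearn.svm import SVC

def rbf(X, Z, gamma):
    # Gaussian kernel matrix between the rows of X and Z
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mkl_minimax(Ks, y, sigma, C=1.0, step=0.05, iters=25):
    """Alternate SVM training (stage one) with a projected
    gradient step on the kernel weights d (stage two)."""
    d = np.ones(len(Ks))
    for _ in range(iters):
        # Stage one: fix d, maximise W with a standard SVM solver
        K = sum(dk * Kk for dk, Kk in zip(d, Ks))
        svm = SVC(C=C, kernel="precomputed").fit(K, y)
        sv = svm.support_                    # support-vector indices
        coef = svm.dual_coef_.ravel()        # y_i * alpha_i on the SVs
        # Stage two: dT/dd_k = sigma_k - 0.5 * alpha*^T Y K_k Y alpha*
        grad = np.array([s - 0.5 * coef @ Kk[np.ix_(sv, sv)] @ coef
                         for s, Kk in zip(sigma, Ks)])
        d = np.maximum(d - step * grad, 0.0)  # project onto d >= 0
    # Refit with the final weights before returning
    K = sum(dk * Kk for dk, Kk in zip(d, Ks))
    return d, SVC(C=C, kernel="precomputed").fit(K, y)

# Toy binary problem with two candidate RBF kernels
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
Ks = [rbf(X, X, 0.5), rbf(X, X, 5.0)]
d, svm = mkl_minimax(Ks, y, sigma=np.array([0.1, 0.1]))
print(d)  # learned non-negative kernel weights
```

Note that α^{T}YK_{k}Yα equals coef^{T}K_{k}coef because scikit-learn's `dual_coef_` already stores the products y_{i}α_{i}.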
A novel point x can now be classified as ±1 by determining sign(Σ_{i} y_{i} α^{*}_{i} K_{opt}(x, x_{i}) + b). To handle multi-class problems, both 1-vs-1 and 1-vs-All formulations are tried. For 1-vs-1, the task is divided into pairwise binary classification problems and a novel point is classified by taking the majority vote over all pairwise classifiers. For 1-vs-All, one classifier is learnt per class and a novel point is assigned to the class whose classifier gives the maximal signed distance from its separating hyperplane.
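As a sanity check of the decision rule, the sign expression can be assembled directly from a trained SVM's dual coefficients and compared against the library's own prediction. The toy data and the single fixed RBF kernel standing in for the learnt K_{opt} are our illustration, not part of the original method.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = np.where(X[:, 0] > 0, 1, -1)

# A single RBF kernel stands in for the learnt K_opt
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
svm = SVC(C=1.0, kernel="precomputed").fit(K, y)

# sign( sum_i y_i alpha*_i K_opt(x, x_i) + b ), built from the duals
sv = svm.support_                       # indices of the support vectors
scores = K[:, sv] @ svm.dual_coef_.ravel() + svm.intercept_[0]
pred = np.where(scores > 0, 1, -1)
print(np.array_equal(pred, svm.predict(K)))  # the two rules agree
```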