uncovering Shared Structures in Multiclass Classification: Difference between revisions
m (→Optimization) |
|||
Line 154: | Line 154: | ||
<center><math>\, \frac{\delta ||X \alpha ||_S}{\delta \alpha} = K \alpha V^tD^{-1}g^{'}(D)V \;\;\; (10)</math></center> | <center><math>\, \frac{\delta ||X \alpha ||_S}{\delta \alpha} = K \alpha V^tD^{-1}g^{'}(D)V \;\;\; (10)</math></center> | ||
Since both <math>\,V</math> and <math>\,D</math> can be found from the SVD of <math>\,\alpha^tK\alpha</math>, using <math>\,(10) | Since both <math>\,V</math> and <math>\,D</math> can be found from the SVD of <math>\,\alpha^tK\alpha</math>, using <math>\,(10)</math>, one can find this gradient in terms of <math>\,K</math> and <math>\,\alpha</math>, and so gradient-based optimizations | ||
can be efficiently applied to a kernel <math>\,k(x, x^{'})</math>. | can be efficiently applied to a kernel <math>\,k(x, x^{'})</math>. | ||
Revision as of 16:58, 23 November 2010
Introduction
In their paper Uncovering Shared Structures in Multiclass Classification, the authors Amit et al. wrote about how hidden structure can be utilized to improve the accuracy in multiclass classification. This notion is often called learning-to-learn or interclass transfer (Thrun, 1996).
The uncovering of such hidden structure was accomplished by a mechanism that learns the underlying characteristics that are shared between the target classes. The benefits of finding such common characteristics was demonstrated in the context of large margin multiclass linear classifiers.
Accurate classification of an instance when there is a large number of target classes is a major challenge in many areas such as object recognition, face identification, textual topic classification, and phoneme recognition.
Usually, when there is a large number of classes, the classes are strongly related to each other and have some underlying common characteristics. As a simple example, even though there are many different kinds of mammals, many of them share common characteristics such as having a striped texture. If such true underlying characteristics that are common to the many different classes can be found, then the effective complexity of the multiclass problem can be significantly reduced.
Simultaneously learning the underlying structure shared by classes is a challenging optimization task. The usual goal of past heuristics was to extract powerful non-linear hidden characteristics. However, this goal was usually not convex, and it often resulted in local minima. In this paper, the authors instead modeled the shared characteristics between classes as linear transformations of the input space, so their model was a linear mapping of the shared features of classes that was followed by a multiclass linear classifier. Their model was not only learned in a convex manner, it also significantly improved the accuracy of multiclass linear classifiers.
Formulation
The ultimate goal of multiclass classification is to learn a mapping [math]\displaystyle{ \,H : \mathcal{X} \mapsto \mathcal{Y} }[/math] from instances in [math]\displaystyle{ \,\mathcal{X} }[/math] to labels in [math]\displaystyle{ \,\mathcal{Y} = \{1, \dots , k\} }[/math].
In this paper, the authors considered a linear classifiers over [math]\displaystyle{ \,\mathcal{X} = R^{n} }[/math] that are parametrized by a vector of weights [math]\displaystyle{ \,W_y \in R^{n} }[/math] for each class [math]\displaystyle{ \,y \in \mathcal{Y} }[/math]. These linear models were of the form:
In the paper, the authors adopted the suggestion of Crammer and Singer (2001), and learned the weights by minimizing a trade-off between an average empirical loss and the following regularizer:
[math]\displaystyle{ \underset{y}\Sigma\;||W_y||^2 = ||W||_{F}^2 }[/math], where [math]\displaystyle{ \,||W||_F }[/math] is the Frobenius norm of [math]\displaystyle{ \,W }[/math] whose columns are the vectors [math]\displaystyle{ \,W_y }[/math]. The loss function suggested by Crammer et al. is the maximal hinge loss made over all comparisons between the correct class and an incorrect class, and is given as [math]\displaystyle{ \,l(W;(x,y)) = \underset{y^' \ne y}max[1 + W_{y^'}^tx - W_{y}^tx]_{+} }[/math], where [math]\displaystyle{ \,[z]_{+} = max(0,z) }[/math]. Then, to find the weights for a trade-off parameter C, the authors used the following learning rule:
Since the authors' goal was to obtain a classifier [math]\displaystyle{ \,W }[/math] that is better at classifying new instances by being able to extract characteristics that are shared among classes, they modeled each common characteristic [math]\displaystyle{ \,r }[/math] as a linear function of the input vectors [math]\displaystyle{ \,x }[/math], denoted [math]\displaystyle{ \,F_r^{t} (x) }[/math]. In this way, the activation of each class [math]\displaystyle{ \,y }[/math] is a linear function [math]\displaystyle{ \,G_y^{t}(F^t x) }[/math] of vectors of common characteristics among classes [math]\displaystyle{ \,F^t x }[/math] rather than being a linear function of the input vectors [math]\displaystyle{ \,x }[/math].
The model in the paper substitutes the weight matrix [math]\displaystyle{ \,W \in \mathbb{R}^{n\times k} }[/math] by [math]\displaystyle{ \,W = FG }[/math], where [math]\displaystyle{ \,F \in \mathbb{R}^{n\times p} }[/math] is a weight matrix whose columns define the [math]\displaystyle{ \,p }[/math] common characteristics among classes, and [math]\displaystyle{ \,G \in \mathbb{R}^{p\times k} }[/math] whose columns predict the classes from on the [math]\displaystyle{ \,p }[/math] common characteristics, as follows:
The authors have showed that regularizing the decomposition [math]\displaystyle{ \,FG }[/math] instead of the Frobenius norm of the weight matrix [math]\displaystyle{ \,W }[/math] can make the resulting model better at generalizing to new instances.
The authors' goal was to simultaneously learn the common characteristics among classes ( [math]\displaystyle{ \,F }[/math] ) and the class weights ( [math]\displaystyle{ \,G }[/math] ). In addition to regularizing [math]\displaystyle{ \,||G||_F }[/math], they also [math]\displaystyle{ \,||F||_F^{2} }[/math], which led to the following learning rule
The number of characteristics shared by classes ( [math]\displaystyle{ \,p }[/math] ) is not limited because, as seen in [math]\displaystyle{ \,(3) }[/math], the regularization uses the norms rather than the dimensionalities of [math]\displaystyle{ \,F }[/math] and [math]\displaystyle{ \,G }[/math].
The optimization problem in [math]\displaystyle{ \,(4) }[/math] is not convex. However, instead of learning [math]\displaystyle{ \,F }[/math] and [math]\displaystyle{ \,G }[/math], the authors considered the trace-norm of [math]\displaystyle{ \,W }[/math], which is [math]\displaystyle{ \,||W||_{\Sigma} = \underset{FG = W} {\min} \;\; \frac{1}{2} (||F||_F^{2} + ||G||_F^{2}) = \underset{i} \Sigma | \gamma_i | }[/math]. The trace-norm is a convex function of [math]\displaystyle{ \,W }[/math], and it can be expressed using [math]\displaystyle{ \,W }[/math]'s singular values, which are the [math]\displaystyle{ \,\gamma_i }[/math]'s.
The authors then re-expressed the optimization problem in [math]\displaystyle{ \,(4) }[/math] as a convex optimization problem that finds [math]\displaystyle{ \,W }[/math], as follows:
The optimization problem in [math]\displaystyle{ \,(5) }[/math] could then be formulated as a semi-definite program(SDP) and be easily solved.
Dualization and Kernelization
Taking into consideration the advantages of large-margin methods, the authors obtained the dual form and then the kernelized form of [math]\displaystyle{ \,(5) }[/math].
Using Lagrange duality (more detailed available here, they obtained the dual of [math]\displaystyle{ \,(5) }[/math], as follows:
[math]\displaystyle{ \, max \;\; \underset{i} \Sigma (-Q_{i_{y_i}}) \;\;\; s.t. \;\;\; \forall_{i,j \ne y_i} \;\; Q_{ij} \ge 0, \;\;\; \forall_i \;\; (-Q_{i_{y_i}}) = \underset{j \ne y_i} \Sigma Q_{ij} \le c, \;\;\; ||XQ||_2 \le 1 \;\;\; (6) }[/math]
In [math]\displaystyle{ \,(6) }[/math], [math]\displaystyle{ \,Q \in \mathbb{R}^{m\times k} }[/math] is the dual Lagrange variable and [math]\displaystyle{ \,||XQ||_2 }[/math] is the spectral norm (the maximal singular value) of [math]\displaystyle{ \,XQ }[/math]. [math]\displaystyle{ \,(5) }[/math] can also be written as a semi-definite program.
The authors re-expressed the spectral norm constraint in (6), which was [math]\displaystyle{ \,||XQ||_2 \le 1 }[/math], as [math]\displaystyle{ \,||Q^t(X^tX)Q||_2 \le 1. }[/math], and re-expressed [math]\displaystyle{ \,(6) }[/math] as:
[math]\displaystyle{ \, max \;\; \underset{i} \Sigma (-Q_{i_{y_i}}) \;\;\; s.t. \;\;\; \forall_{i,j \ne y_i} \;\; Q_{ij} \ge 0, \;\;\; \forall_i \;\; (-Q_{i_{y_i}}) = \underset{j \ne y_i} \Sigma Q_{ij} \le c, \;\;\; ||Q^tKQ||_2 \le 1 \;\;\; (7) }[/math]
In [math]\displaystyle{ \,(7) }[/math], [math]\displaystyle{ \,K = X^tX }[/math] is the Gram matrix. [math]\displaystyle{ \,(7) }[/math] is a convex problem on [math]\displaystyle{ \,Q }[/math] that involves a semidefinite constraint (the spectral-norm constraint) on [math]\displaystyle{ \,Q^tKQ }[/math] that does not depend on the size of the training set, but rather on the number of classes [math]\displaystyle{ \,k }[/math].
Using [math]\displaystyle{ \,(7) }[/math] and Theorem 1 that is given below (the proof is in the paper listed in Reference), the authors were able to
find the optimum weight matrix [math]\displaystyle{ \,W }[/math] in terms of the dual optimum [math]\displaystyle{ \,Q }[/math], and they were thus able to use the kernel mechanism for prediction of new instances.
Theorem 1 (Representer Theorem):
Let [math]\displaystyle{ \,Q }[/math] be the optimum of [math]\displaystyle{ \,(7) }[/math] and [math]\displaystyle{ \,V }[/math] be the matrix of eigenvectors of [math]\displaystyle{ \,Q^'KQ }[/math], then for some diagonal [math]\displaystyle{ \,D \in \mathbb{R}^{k\times k} }[/math], the matrix [math]\displaystyle{ \,W = X(QV^tDV) }[/math] is an optimum of [math]\displaystyle{ \,(6) }[/math], with [math]\displaystyle{ \,||W||_{\Sigma} = \Sigma_r \;|D_{rr}| }[/math].
By substituting [math]\displaystyle{ \,W = XQV^tDV }[/math] into [math]\displaystyle{ \,(6) }[/math], the first term becomes [math]\displaystyle{ \,\Sigma_r \;|D_{rr}| }[/math], while the second term is piecewise linear in [math]\displaystyle{ \,KQV^tDV }[/math]. Therefore, we can easily solve a simple linear program (LP) in the [math]\displaystyle{ \,k }[/math] unknown entries on the diagonal of [math]\displaystyle{ \,D }[/math] to find [math]\displaystyle{ \,D }[/math] and hence find [math]\displaystyle{ \,W }[/math]. It should be noted that the number of variables in this LP depends only on the number of classes ([math]\displaystyle{ \,k }[/math]), and that the entire procedure of solving [math]\displaystyle{ \,(7) }[/math] which consists of extracting [math]\displaystyle{ \,V }[/math] and then finding [math]\displaystyle{ \,D }[/math] and thus [math]\displaystyle{ \,W }[/math] uses only the Gram matrix ([math]\displaystyle{ \,K }[/math]).
Note that, the following corollary immediately follows from Theorem 1:
Corollary 1:
There exists [math]\displaystyle{ \,\alpha \in \mathbb{R}^{m\times k} }[/math] such that [math]\displaystyle{ \,W = X\alpha }[/math] is an optimum of [math]\displaystyle{ \,(5) }[/math].
Learning a Latent Feature Representation
As mentioned above, learning [math]\displaystyle{ \,F }[/math] can be interpreted as learning a latent feature space [math]\displaystyle{ \,F^tX }[/math], which is useful for prediction. Because [math]\displaystyle{ \,F }[/math] is learned jointly over all classes, it effectively transfers knowledge between the classes.
The authors considered [math]\displaystyle{ \,k }[/math] binary classification tasks, and used [math]\displaystyle{ \,W_j }[/math] as a linear predictor for the [math]\displaystyle{ \,j }[/math]th task. Using an SVM to learn each class independently, they came up with the following learning rule:
[math]\displaystyle{ \underset{W}{\min} \;\; \underset{j}\Sigma\; (\frac{1}{2}\; ||W_i||^2 + C\;l_j(W_j)) = \underset{W}{\min}\; \frac{1}{2}\; ||W||_F^2 + C \;\underset{j}\Sigma \;l_j(W_j) \;\;\; (8) }[/math], where [math]\displaystyle{ \,l_j(W_j) }[/math] is the total (hinge) loss of [math]\displaystyle{ \,W_j }[/math] on the training examples for task [math]\displaystyle{ \,j }[/math].
Replacing the Frobenius norm with the trace norm ( [math]\displaystyle{ \, \underset{W}{\min}\; ||W||_{\Sigma} + C\; \underset{j} \Sigma\; l_j(W_j) }[/math] ) corresponds to learning a feature representation [math]\displaystyle{ \,\phi(x) = F^tx }[/math] that allows good, low-norm prediction for all [math]\displaystyle{ \,k }[/math] tasks. After such a feature representation is learned, a new task can be learned directly using the feature vectors
[math]\displaystyle{ \,F^tx }[/math] and standard SVM whilst taking advantage of the transferred knowledge from the other, previously-learned,
tasks. It should be noted that we can learn a feature representation [math]\displaystyle{ \,\phi(x) = F^tx }[/math] even if we do not have the feature
representation [math]\displaystyle{ \,X }[/math]. This is because we only use a kernel [math]\displaystyle{ \,k }[/math] to obtain the Gram matrix [math]\displaystyle{ \,K = X^tX }[/math].
Using Corollary 1 given above, the goal of the authors was to obtain a matrix [math]\displaystyle{ \,\alpha }[/math] such that [math]\displaystyle{ \,W = X \alpha }[/math] is an optimum of [math]\displaystyle{ \,(5) }[/math].
Let [math]\displaystyle{ \,W = UDV }[/math] be the singular value decomposition of
[math]\displaystyle{ \,W }[/math]. Then, [math]\displaystyle{ F = U \sqrt{D} }[/math] is an optimum of [math]\displaystyle{ \,(4) }[/math]. The singular value decomposition of [math]\displaystyle{ \,\alpha^tK\alpha = \alpha^tX^tX\alpha = W^tW = V^tD^2V }[/math] can then be calculated to obtain [math]\displaystyle{ \,D }[/math] and [math]\displaystyle{ \,V }[/math].
The result is [math]\displaystyle{ \,D^{−1/2}V\alpha^tK = D^{−1/2}V(\alpha^tX^t)X = D^{−1/2}VW^tX = D^{−1/2}VV^tDU^tX = D^{1/2}U^tX = F^tX }[/math], which provides an explicit representation of the learned feature space that can be calculated from [math]\displaystyle{ \,K }[/math] and [math]\displaystyle{ \,\alpha }[/math] alone.
Optimization
The optimization problem [math]\displaystyle{ \,(5) }[/math] can be formulated as a semi-definite program (SDP) and easily solved using off-the-shelf SDP solvers to find the optimal [math]\displaystyle{ \,W }[/math]. However, such solvers are typically based on interior point methods and so they scale poorly with the size of the problem. As a result, the authors optimized [math]\displaystyle{ \,(5) }[/math] using using simple though powerful gradient-based methods.
Because [math]\displaystyle{ \,(5) }[/math] is non-differentiable, gradient-based cannot be directly applied to it. The authors therefore considered a
smoothed approximation to [math]\displaystyle{ \,(5) }[/math]. This was done by replacing the trace-norm of [math]\displaystyle{ \,W }[/math] ( given above as [math]\displaystyle{ \,||W||_{\Sigma} = \underset{FG = W} {\min} \;\; \frac{1}{2} (||F||_F^{2} + ||G||_F^{2}) = \underset{i} \Sigma\; | \gamma_i | }[/math] ) with a smooth proxy, and this was in turn done by replacing the non-smooth absolute value in this trace norm with a smooth function [math]\displaystyle{ \,g }[/math] that was defined as:
[math]\displaystyle{ \,g(\gamma) = \left\{\begin{matrix} {\frac {\gamma^{2}}{2r} + \frac {r}{2}} &\text{, if } \gamma \le r \\ |\gamma| &\text{, if otherwise } \end{matrix}\right. }[/math]
In [math]\displaystyle{ \,g(\gamma) }[/math], [math]\displaystyle{ \,r }[/math] is some predefined cutoff point. It is easy to see that [math]\displaystyle{ \,g }[/math] is continuously differentiable.
Using [math]\displaystyle{ \,g }[/math], the authors replaced the trace norm [math]\displaystyle{ \,||W||_{\Sigma} = \underset{i} \Sigma \; | \gamma_i | }[/math] by [math]\displaystyle{ \,||W||_{S} = \underset{i} \Sigma \; g(\gamma_i) }[/math], where the [math]\displaystyle{ \,\gamma_i }[/math]'s are the singular values of [math]\displaystyle{ \,W }[/math].
Ultimately, the authors re-expressed [math]\displaystyle{ \,(5) }[/math] with the following convex and continuously-differentiable function:
In the following figures (taken from the authors' paper given in Reference), the left one illustrates how well the matrix [math]\displaystyle{ \,W }[/math] found by solving [math]\displaystyle{ \,(9) }[/math] using conjugate gradient descent approximates the matrix [math]\displaystyle{ \,W }[/math] found by solving [math]\displaystyle{ \,(5) }[/math] using an interior point SDP solver (SDP3), and the right one illustrates how much better the gradient-based optimization in [math]\displaystyle{ \,(9) }[/math] is compared to the semi-definite programming optimization problem [math]\displaystyle{ \,(5) }[/math] in terms of scaling to an increasing number of training instances.
As mentioned above, a kernelized gradient-based optimization approach can be used to solve [math]\displaystyle{ \,(9) }[/math] if we do not have the feature vectors [math]\displaystyle{ \,X }[/math] and we only have the Gram matrix [math]\displaystyle{ \,K = X^tX }[/math]. Using Corollary 1 given above, we know the optimum of [math]\displaystyle{ \,(5) }[/math] has the form [math]\displaystyle{ \,W = X\alpha }[/math], and so we can substitute [math]\displaystyle{ \,W = X\alpha }[/math] into [math]\displaystyle{ \,(9) }[/math] and then minimize over [math]\displaystyle{ \,\alpha }[/math].
To obtain this kernelized gradient-based approach, the authors let [math]\displaystyle{ \,X\alpha = UDV }[/math] denote the SVD of [math]\displaystyle{ \,X\alpha }[/math], and then obtained the SVD of [math]\displaystyle{ \,\alpha^tK\alpha }[/math] as [math]\displaystyle{ \,V^tD^2V }[/math]. They thus recovered [math]\displaystyle{ \,D }[/math] from the SVD of [math]\displaystyle{ \,α^tKα }[/math], and used [math]\displaystyle{ \,||W||_{S} = \underset{i} \Sigma \; g(\gamma_i) }[/math] to find [math]\displaystyle{ \,||X\alpha||_S }[/math].
The authors computed (details of the steps are given in the paper in Reference) the gradient of [math]\displaystyle{ \,||X\alpha||_S }[/math] with respect to [math]\displaystyle{ \,\alpha }[/math] as:
Since both [math]\displaystyle{ \,V }[/math] and [math]\displaystyle{ \,D }[/math] can be found from the SVD of [math]\displaystyle{ \,\alpha^tK\alpha }[/math], using [math]\displaystyle{ \,(10) }[/math], one can find this gradient in terms of [math]\displaystyle{ \,K }[/math] and [math]\displaystyle{ \,\alpha }[/math], and so gradient-based optimizations can be efficiently applied to a kernel [math]\displaystyle{ \,k(x, x^{'}) }[/math].
Spectral Properties of Trace Norm Regularization
When the authors minimized [math]\displaystyle{ \,||F||_{F}^{2} + ||G||_{F}^{2} }[/math] instead of [math]\displaystyle{ \,||W||_{F}^{2} }[/math], they imposed a regularization preference for an [math]\displaystyle{ \,L_1 }[/math] norm, instead of an [math]\displaystyle{ \,L_2 }[/math] norm, on the spectrum of [math]\displaystyle{ \,W }[/math].
When the various target classes share common characteristics, it should be expected that the spectrum of [math]\displaystyle{ \,W }[/math] would not be uniform, since a large portion of this spectrum must be concentrated on a few eigenvalues. The [math]\displaystyle{ \,L_2 }[/math] spectrum regularization due to the Frobenius norm tends to attenuate the spectrum of [math]\displaystyle{ \,W }[/math], and it is therefore not suitable to be used. On the other hand, the [math]\displaystyle{ \,L_1 }[/math] spectrum regularization due to the trace norm does not have this tendency, and it is therefore much better suited for preserving the underlying structures of characteristics that are shared between the target classes.
The authors performed an experiment (details can be found in their paper listed in Reference), in which they found two matrices [math]\displaystyle{ \,W_F }[/math] and [math]\displaystyle{ \,W_{\Sigma} }[/math] using the Frobenius norm optimization ( [math]\displaystyle{ \,(2) }[/math] given above ) and the trace norm optimization ( [math]\displaystyle{ \,(5) }[/math] given above ). Using 500 test instances, the authors found that the generalization error for [math]\displaystyle{ \,W_F }[/math], which was 47%, was much higher than that for [math]\displaystyle{ \,W_{\Sigma} }[/math], which was 31%.
NOT DONE YET. PLEASE DO NOT CONTRIBUTE. THANKS.