uncovering Shared Structures in Multiclass Classification: Difference between revisions

From statwiki
Jump to navigation Jump to search
Line 66: Line 66:


In <math>\,(6)</math>, <math>\,K = X^tX</math> is the [http://mathworld.wolfram.com/GramMatrix.html Gram matrix].
In <math>\,(6)</math>, <math>\,K = X^tX</math> is the [http://mathworld.wolfram.com/GramMatrix.html Gram matrix].
(6) is a convex problem on <math>\,Q</math> that involves a semidefinite constraint (the spectral-norm constraint) on <math>\,Q^tKQ</math> whose size does not depend on the size of the training set, but rather on the number of classes <math>\,k</math>.

Revision as of 19:40, 22 November 2010

Introduction

In their paper Uncovering Shared Structures in Multiclass Classification, the authors Amit et al. wrote about how hidden structure can be utilized to improve the accuracy in multiclass classification. This notion is often called learning-to-learn or interclass transfer (Thrun, 1996).

The uncovering of such hidden structure was accomplished by a mechanism that learns the underlying characteristics that are shared between the target classes. The benefits of finding such common characteristics was demonstrated in the context of large margin multiclass linear classifiers.

Accurate classification of an instance when there is a large number of target classes is a major challenge in many areas such as object recognition, face identification, textual topic classification, and phoneme recognition.

Usually, when there is a large number of classes, the classes are strongly related to each other and have some underlying common characteristics. As a simple example, even though there are many different kinds of mammals, many of them share common characteristics such as having a striped texture. If such true underlying characteristics that are common to the many different classes can be found, then the effective complexity of the multiclass problem can be significantly reduced.

Simultaneously learning the underlying structure shared by classes is a challenging optimization task. The usual goal of past heuristics was to extract powerful non-linear hidden characteristics. However, this goal was usually not convex, and it often resulted in local minima. In this paper, the authors instead modeled the shared characteristics between classes as linear transformations of the input space, so their model was a linear mapping of the shared features of classes that was followed by a multiclass linear classifier. Their model was not only learned in a convex manner, it also significantly improved the accuracy of multiclass linear classifiers.


Formulation

The ultimate goal of multiclass classification is to learn a mapping [math]\displaystyle{ \,H : \mathcal{X} \mapsto \mathcal{Y} }[/math] from instances in [math]\displaystyle{ \,\mathcal{X} }[/math] to labels in [math]\displaystyle{ \,\mathcal{Y} = \{1, \dots , k\} }[/math].

In this paper, the authors considered a linear classifiers over [math]\displaystyle{ \,\mathcal{X} = R^{n} }[/math] that are parametrized by a vector of weights [math]\displaystyle{ \,W_y \in R^{n} }[/math] for each class [math]\displaystyle{ \,y \in \mathcal{Y} }[/math]. These linear models were of the form:

[math]\displaystyle{ \,H_W(x) = \arg\max_{y \in \mathcal{Y}} W_{y}^{t} x \;\;\; (1) }[/math]


Since the authors' goal was to obtain a classifier [math]\displaystyle{ \,W }[/math] that is better at classifying new instances by being able to extract characteristics that are shared among classes, they modeled each common characteristic [math]\displaystyle{ \,r }[/math] as a linear function of the input vectors [math]\displaystyle{ \,x }[/math], denoted [math]\displaystyle{ \,F_r^{t} (x) }[/math]. In this way, the activation of each class [math]\displaystyle{ \,y }[/math] is a linear function [math]\displaystyle{ \,G_y^{t}(F^t x) }[/math] of vectors of common characteristics among classes [math]\displaystyle{ \,F^t x }[/math] rather than being a linear function of the input vectors [math]\displaystyle{ \,x }[/math].


The model in the paper substitutes the weight matrix [math]\displaystyle{ \,W \in \mathbb{R}^{n\times k} }[/math] by [math]\displaystyle{ \,W = FG }[/math], where [math]\displaystyle{ \,F \in \mathbb{R}^{n\times p} }[/math] is a weight matrix whose columns define the [math]\displaystyle{ \,p }[/math] common characteristics among classes, and [math]\displaystyle{ \,G \in \mathbb{R}^{p\times k} }[/math] whose columns predict the classes from on the [math]\displaystyle{ \,p }[/math] common characteristics, as follows:

[math]\displaystyle{ H_{G,F}(x) = \arg\max_{y \in \mathcal{Y}}G_y^{t} (F^t x) = \arg\max_{y \in \mathcal{Y}}(FG)_y^{t} x \;\;\; (2) }[/math]


The authors have showed that regularizing the decomposition [math]\displaystyle{ \,FG }[/math] instead of the Frobenius norm of the weight matrix [math]\displaystyle{ \,W }[/math] can make the resulting model better at generalizing to new instances.


The authors' goal was to simultaneously learn the common characteristics among classes ([math]\displaystyle{ \,F }[/math]) and the class weights ([math]\displaystyle{ \,G }[/math]). In addition to regularizing [math]\displaystyle{ \,||G||_F }[/math], they also [math]\displaystyle{ \,||F||_F^{2} }[/math], which led to the following learning rule

[math]\displaystyle{ \underset{F,G}{\min} \;\; \frac{1}{2} ||F||_F^{2} + \frac{1}{2} ||G||_F^{2} + C \sum_{i=1}^{m} l(FG;(x_i,y_i)) \;\;\; (3) }[/math]


The number of characteristics shared by classes ([math]\displaystyle{ \,p }[/math]) is not limited because, as seen in [math]\displaystyle{ \,(3) }[/math], the regularization uses the norms rather than the dimensionalities of F and G.


The optimization problem in [math]\displaystyle{ \,(3) }[/math] is not convex. However, instead of learning [math]\displaystyle{ \,F }[/math] and [math]\displaystyle{ \,G }[/math], the authors considered the trace-norm of [math]\displaystyle{ \,W }[/math], which is [math]\displaystyle{ \,||W||_{\Sigma} = \underset{FG = W} {\min} \;\; \frac{1}{2} (||F||_F^{2} + ||G||_F^{2}) }[/math], and they re-expressed the optimization problem in [math]\displaystyle{ \,(3) }[/math] as a convex optimization problem that finds [math]\displaystyle{ \,W }[/math], as follows:

[math]\displaystyle{ \underset{W}{\min} \;\; ||W||_{\Sigma} + C \sum_{i=1}^{m} l(W;(x_i,y_i)) \;\;\; (4) }[/math]


The optimization problem in [math]\displaystyle{ \,(4) }[/math] could then be formulated as a semi-definite program(SDP) and be easily solved.

Dualization and Kernelization

Taking into consideration the advantages of large-margin methods, the authors obtained the dual form and then the kernelized form of [math]\displaystyle{ \,(4) }[/math].

Using Lagrange duality (more detailed available here, they obtained the dual of [math]\displaystyle{ \,(4) }[/math], as follows:

[math]\displaystyle{ \, max \;\; \underset{i} \Sigma (-Q_{i_{y_i}}) \;\;\; s.t. \;\;\; \forall_{i,j \ne y_i} \;\; Q_{ij} \ge 0, \;\;\; \forall_i \;\; (-Q_{i_{y_i}}) = \underset{j \ne y_i} \Sigma Q_{ij} \le c, \;\;\; ||XQ||_2 \le 1 \;\;\; (5) }[/math]


In [math]\displaystyle{ \,(5) }[/math], [math]\displaystyle{ \,Q \in \mathbb{R}^{m\times k} }[/math] is the dual Lagrange variable and [math]\displaystyle{ \,||XQ||_2 }[/math] is the spectral norm (the maximal singular value) of [math]\displaystyle{ \,XQ }[/math]. [math]\displaystyle{ \,(5) }[/math] can also be written as a semi-definite program.


The authors re-expressed the spectral norm constraint in (5), which was [math]\displaystyle{ \,||XQ||_2 \le 1 }[/math], as [math]\displaystyle{ \,||Q^t(X^tX)Q||_2 \le 1. }[/math], and re-expressed [math]\displaystyle{ \,(5) }[/math] as:

[math]\displaystyle{ \, max \;\; \underset{i} \Sigma (-Q_{i_{y_i}}) \;\;\; s.t. \;\;\; \forall_{i,j \ne y_i} \;\; Q_{ij} \ge 0, \;\;\; \forall_i \;\; (-Q_{i_{y_i}}) = \underset{j \ne y_i} \Sigma Q_{ij} \le c, \;\;\; ||Q^tKQ||_2 \le 1 \;\;\; (6) }[/math]


In [math]\displaystyle{ \,(6) }[/math], [math]\displaystyle{ \,K = X^tX }[/math] is the Gram matrix.

(6) is a convex problem on [math]\displaystyle{ \,Q }[/math] that involves a semidefinite constraint (the spectral-norm constraint) on [math]\displaystyle{ \,Q^tKQ }[/math] whose size does not depend on the size of the training set, but rather on the number of classes [math]\displaystyle{ \,k }[/math].