Fix your classifier: the marginal value of training the last weight layer

Introduction

Deep neural networks have become a widely used model for machine learning, achieving state-of-the-art results on many tasks. The most common task for these models is classification, as in the case of convolutional neural networks (CNNs) used to classify images into semantic categories. Typically, a learned affine transformation is placed at the end of such models, yielding a per-class value used for classification. This classifier can have a vast number of parameters, which grows linearly with the number of possible classes, thus requiring increasingly more computational resources.

Brief Overview

In order to alleviate the aforementioned problem, the authors propose that the final layer of the classifier be fixed (up to a global scale constant). They argue that, with little or no loss of accuracy for most classification tasks, the method provides significant memory and computational benefits. In addition, they show that by initializing the classifier with a Hadamard matrix, inference can be made faster as well.
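As a rough illustration of this idea (a minimal sketch, not the authors' exact implementation), the snippet below builds a fixed classifier from a Hadamard matrix using scipy.linalg.hadamard; the feature dimension N, class count C, and all variable names are example choices.

<pre>
# Minimal sketch: a fixed (non-trainable) classifier built from a Hadamard matrix.
# Assumes the feature dimension N is a power of two and N >= C.
import torch
import torch.nn.functional as F
from scipy.linalg import hadamard

N, C = 512, 10                        # feature dimension, number of classes (example values)
H = hadamard(N)                       # N x N matrix of +/-1 entries with orthogonal rows
W_fixed = torch.tensor(H[:C] / N ** 0.5, dtype=torch.float32)  # first C rows, unit L2 norm

x = torch.randn(32, N)                # a batch of feature vectors from the network body
logits = F.linear(x, W_fixed)         # y = x W^T, with no trainable parameters in this layer
</pre>

Because the Hadamard matrix contains only ±1 entries, the product with it can be computed using additions and subtractions alone (or a fast Walsh-Hadamard transform), which is the intuition behind the claimed inference speed-up.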

Previous Work

Training NN models and using them for inference requires large amounts of memory and computational resources; thus, an extensive amount of research has recently been devoted to reducing the size of networks. Notable approaches include:

  • Weight sharing and sparsification (Han et al., 2015)
  • Mixed-precision training to reduce the size of neural networks by half (Micikevicius et al., 2017)
  • Low-rank approximations to speed up CNNs (Tai et al., 2015)
  • Quantization of weights, activations and gradients to further reduce computation during training (Hubara et al., 2016b; Li et al., 2016; Zhou et al., 2016)

Some past works have also shown that predefined (Park & Sandberg, 1991) and random (Huang et al., 2006) projections can be used together with a learned affine transformation to achieve competitive results on many classification tasks. However, the authors' proposal in the current paper is essentially the reverse: the final affine transformation is fixed, while the layers preceding it are learned.

Background

Convolutional neural networks (CNNs) are commonly used to solve a variety of spatial and temporal tasks. CNNs are usually composed of a stack of convolutional parameterized layers, spatial pooling layers and fully connected layers, separated by non-linear activation functions. Earlier CNN architectures (LeCun et al., 1998; Krizhevsky et al., 2012) used a set of fully-connected layers at a later stage of the network, presumably to allow classification based on global features of an image.

Shortcomings of the final classification layer and its solution

Despite the enormous number of trainable parameters these layers added to the model, they are known to have a rather marginal impact on the final performance of the network (Zeiler & Fergus, 2014).

It has been shown previously that these layers can be easily compressed and reduced after a model is trained, by simple means such as matrix decomposition and sparsification (Han et al., 2015). Modern architectures are characterized by the removal of most of the fully connected layers (Lin et al., 2013; Szegedy et al., 2015; He et al., 2016), which leads to better generalization and overall accuracy, together with a large decrease in the number of trainable parameters. Additionally, numerous works have shown that CNNs can be trained in a metric-learning regime (Bromley et al., 1994; Schroff et al., 2015; Hoffer & Ailon, 2015), where no explicit classification layer is introduced and the objective involves only distance measures between intermediate representations. Hardt & Ma (2017) suggested an all-convolutional network variant, in which they kept the original initialization of the classification layer fixed, with no negative impact on performance on the CIFAR-10 dataset.

Proposed Method

The aforementioned works provide evidence that fully-connected layers are in fact largely redundant and play a small role in learning and generalization. In this work, the authors suggest that the parameters used for the final classification transform are completely redundant and can be replaced with a predetermined linear transform. This holds even for large-scale models and classification tasks, such as recent architectures trained on the ImageNet benchmark (Deng et al., 2009).

Using a fixed classifier

Suppose the final representation obtained by the network (the last hidden layer) is represented as [math]\displaystyle{ x = F(z;\theta) }[/math], where [math]\displaystyle{ F }[/math] is assumed to be a deep neural network (e.g., a convolutional network trained by backpropagation) with input [math]\displaystyle{ z }[/math] and parameters [math]\displaystyle{ \theta }[/math].

In common NN models, this representation is followed by an additional affine transformation, [math]\displaystyle{ y = W^T x + b }[/math], where [math]\displaystyle{ W }[/math] and [math]\displaystyle{ b }[/math] are also trained by back-propagation.

For an input [math]\displaystyle{ x }[/math] of length [math]\displaystyle{ N }[/math] and [math]\displaystyle{ C }[/math] different possible outputs, [math]\displaystyle{ W }[/math] is required to be an [math]\displaystyle{ N \times C }[/math] matrix. Training is done using the cross-entropy loss, by feeding the network outputs through a softmax activation

[math]\displaystyle{ v_i = \frac{e^{y_i}}{\sum_{j=1}^{C}{e^{y_j}}}, \quad i \in \{1, \dots, C\} }[/math]

and reducing the expected negative log-likelihood with respect to the ground-truth target [math]\displaystyle{ t \in \{1, \dots, C\} }[/math], by minimizing the loss function:

[math]\displaystyle{ L(x, t) = -\log v_t = -w_t \cdot x - b_t + \log\left(\sum_{j=1}^{C} e^{w_j \cdot x + b_j}\right) }[/math]

where [math]\displaystyle{ w_i }[/math] is the [math]\displaystyle{ i }[/math]-th column of [math]\displaystyle{ W }[/math].
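For reference, a minimal PyTorch sketch of this standard setup is shown below; the dimensions and names are illustrative choices, not taken from the paper.

<pre>
# Minimal sketch of the standard trainable classifier: an affine map W^T x + b
# followed by softmax cross-entropy (F.cross_entropy applies log-softmax + NLL).
import torch
import torch.nn as nn
import torch.nn.functional as F

N, C = 512, 1000                    # feature dimension, number of classes (example values)
classifier = nn.Linear(N, C)        # holds W (C x N) and b (C), both trained by backprop

x = torch.randn(32, N)              # final representations x = F(z; theta) for a batch
t = torch.randint(0, C, (32,))      # ground-truth labels
loss = F.cross_entropy(classifier(x), t)   # -w_t.x - b_t + log sum_j exp(w_j.x + b_j), averaged
loss.backward()
</pre>

Here nn.Linear holds the N·C weights plus C biases that, as argued below, can largely be replaced by a fixed transform.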

Choosing the projection matrix

To evaluate the conjecture regarding the importance of the final classification transformation, the trainable parameter matrix [math]\displaystyle{ W }[/math] is replaced with a fixed orthonormal projection [math]\displaystyle{ Q \in \mathbb{R}^{N \times C} }[/math], such that [math]\displaystyle{ \forall i \neq j : q_i \cdot q_j = 0 }[/math] and [math]\displaystyle{ || q_i ||_{2} = 1 }[/math], where [math]\displaystyle{ q_i }[/math] is the [math]\displaystyle{ i }[/math]-th column of [math]\displaystyle{ Q }[/math]. This is ensured by simple random sampling followed by a singular-value decomposition.
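A small NumPy sketch of this construction (illustrative only; the dimensions are example choices):

<pre>
# Minimal sketch: build a fixed matrix Q in R^{N x C} with orthonormal columns
# by sampling a random Gaussian matrix and using its singular-value decomposition.
import numpy as np

N, C = 512, 10                       # feature dimension, number of classes (example values)
A = np.random.randn(N, C)
U, _, Vt = np.linalg.svd(A, full_matrices=False)   # U has shape (N, C)
Q = U @ Vt                           # orthonormal columns: Q^T Q = I_C

assert np.allclose(Q.T @ Q, np.eye(C), atol=1e-6)  # q_i . q_j = 0 for i != j, ||q_i||_2 = 1
</pre>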

As the columns of the classifier weight matrix are fixed with an equally valued [math]\displaystyle{ L_{2} }[/math] norm, the authors find it beneficial to also restrict the representation [math]\displaystyle{ x }[/math] by normalizing it to reside on the [math]\displaystyle{ N }[/math]-dimensional unit sphere:

[math]\displaystyle{ \hat{x} = \frac{x}{||x||_{2}} }[/math]

This allows faster training and convergence, as the network does not need to account for changes in the scale of its weights. However, it introduces a new issue: [math]\displaystyle{ q_i \cdot \hat{x} }[/math] is now bounded between −1 and 1. This causes convergence problems, as the softmax function is scale-sensitive and the network cannot re-scale its input. The issue can be amended with a fixed scale [math]\displaystyle{ T }[/math] applied to the softmax inputs, [math]\displaystyle{ f(y) = \mathrm{softmax}(\frac{1}{T}y) }[/math], also known as the softmax temperature. However, this introduces an additional hyper-parameter which may differ between networks and datasets. Instead, the authors propose to introduce a single scalar parameter [math]\displaystyle{ \alpha }[/math] to learn the softmax scale, effectively functioning as an inverse of the softmax temperature [math]\displaystyle{ \frac{1}{T} }[/math]. The additional vector of bias parameters [math]\displaystyle{ b \in \mathbb{R}^{C} }[/math] is kept as before, and the model is trained using the traditional negative-log-likelihood criterion. Explicitly, the classifier output is now:

[math]\displaystyle{ v_i = \frac{e^{\alpha q_i \cdot \hat{x} + b_i}}{\sum_{j=1}^{C} e^{\alpha q_j \cdot \hat{x} + b_j}}, \quad i \in \{1, \dots, C\} }[/math]

and the loss to be minimized is:

[math]\displaystyle{ L(x, t) = -\alpha q_t \cdot \frac{x}{||x||_{2}} - b_t + \log\left(\sum_{j=1}^{C} \exp\left(\alpha q_j \cdot \frac{x}{||x||_{2}} + b_j\right)\right) }[/math]

where [math]\displaystyle{ x }[/math] is the final representation obtained by the network for a specific sample, and [math]\displaystyle{ t \in \{1, \dots, C\} }[/math] is the ground-truth label for that sample. The behaviour of the [math]\displaystyle{ \alpha }[/math] parameter over time, which is roughly logarithmic in nature, is shown in Figure 1.
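Putting the pieces together, the following PyTorch sketch implements a fixed classifier with the learned scale [math]\displaystyle{ \alpha }[/math] and bias [math]\displaystyle{ b }[/math], trained with the standard cross-entropy criterion. This is a minimal sketch under the assumptions above (class name, dimensions, and the training snippet are illustrative), not the authors' released implementation.

<pre>
# Minimal sketch: fixed orthonormal classifier with a single learned scale alpha
# and a trainable bias vector b, trained with cross-entropy (negative log-likelihood).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FixedClassifier(nn.Module):
    def __init__(self, Q):                                # Q: (N, C) with orthonormal columns
        super().__init__()
        self.register_buffer("Q", Q)                      # fixed projection, never updated
        self.alpha = nn.Parameter(torch.tensor(1.0))      # learned inverse temperature
        self.bias = nn.Parameter(torch.zeros(Q.shape[1])) # b in R^C

    def forward(self, x):
        x_hat = F.normalize(x, dim=1)                     # x / ||x||_2, on the unit sphere
        return self.alpha * (x_hat @ self.Q) + self.bias  # logits: alpha * q_i . x_hat + b_i

N, C = 512, 10                                            # example dimensions
U, _, Vh = torch.linalg.svd(torch.randn(N, C), full_matrices=False)
clf = FixedClassifier(U @ Vh)                             # random sampling + SVD, as in the text

x = torch.randn(32, N)                                    # final representations from the network body
t = torch.randint(0, C, (32,))                            # ground-truth labels
loss = F.cross_entropy(clf(x), t)                         # matches the negative log-likelihood above
loss.backward()                                           # only alpha and the bias receive gradients
</pre>

Note that the classification layer now contributes only [math]\displaystyle{ C + 1 }[/math] trainable parameters (the bias and the scale) instead of [math]\displaystyle{ N \cdot C + C }[/math], which is the source of the memory savings described earlier.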