Fix your classifier: the marginal value of training the last weight layer
Deep neural networks have become a widely used model for machine learning, achieving state-of-the-art results on many tasks. The most common task these models are used for is to perform classification, as in the case of convolutional neural networks (CNNs) used to classify images to a semantic category. Typically, a learned affine transformation is placed at the end of such models, yielding a per-class value used for classification. This classifier can have a vast number of parameters, which grows linearly with the number of possible classes, thus requiring increasingly more computational resources.
In order to alleviate the aforementioned problem, the authors propose that the final layer of the classifier be fixed (upto a global scale constant). They argue that with little or no loss of accuracy for most classification tasks, the method provides significant memory and computational benefits. In addition, they show that by initializing the classifier with a Hadamard matrix the inference could be made faster as well.
Training NN models and using them for inference requires large amounts of memory and computational resources; thus, extensive amount of research has been done lately to reduce the size of networks which are as follows:
- Weight sharing and specification (Han et al., 2015)
- Mixed precision to reduce the size of the neural networks by half (Micikevicius et al., 2017)
- Low-rank approximations to speed up CNN (Tai et al., 2015)
- Quantization of weights, activations and gradients to further reduce computation during training (Hubara et al., 2016b, Li et al., 2016 and Zhou et al., 2016)
Some of the past works have also put forward the fact that predefined (Park & Sandberg, 1991) and random (Huang et al., 2006) projections can be used together with a learned affine transformation to achieve competitive results on many of the classification tasks. However, the authors' proposal in the current paper is quite reversed.
Convolutional neural networks (CNNs) are commonly used to solve a variety of spatial and temporal tasks. CNNs are usually composed of a stack of convolutional parameterized layers, spatial pooling layers and fully connected layers, separated by non-linear activation functions. Earlier architectures of CNNs (LeCun et al., 1998; Krizhevsky et al., 2012) used a set of fully-connected layers at later stage of the network, presumably to allow classification based on global features of an image.
- Shortcomings of the final classification layer