Difference between revisions of "meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting"

From statwiki
Jump to: navigation, search
m (Results)
(Proposed Approach)
Line 35: Line 35:
 
=Proposed Approach=
 
=Proposed Approach=
 
[[File:7.png|center|600px]]
 
[[File:7.png|center|600px]]
The original back propagation computes the "full gradient" for the input vector and the weight matrix. However For me-Prop, back propagation computes an "approximate gradient" by keeping top-k values of the backward flowed gradient and masking the remaining values to 0. That is, only the top-k elements with the largest absolute values are kept and rest are made 0.
+
The original back propagation computes the "full gradient" for the input vector and the weight matrix. However, For me-Prop, back propagation computes an "approximate gradient" by keeping top-k values of the backward flowed gradient and masking the remaining values to 0. That is, only the top-k elements with the largest absolute values are kept and rest are made 0.
  
 
Authors describe conversion of the traditional back-propagation to the back-propagation used in meProp with a computation unit with one linear transformation and one non-linear transformation as an example as given by eq. (1) and (2):
 
Authors describe conversion of the traditional back-propagation to the back-propagation used in meProp with a computation unit with one linear transformation and one non-linear transformation as an example as given by eq. (1) and (2):
Line 90: Line 90:
  
 
'''Where to apply meProp:'''
 
'''Where to apply meProp:'''
In the learning task Matrix to Matrix and Matrix to Vector multiplications consumes more than 90% time of back propagation. Places where we apply meProp:
+
In the learning task Matrix to Matrix and Matrix to Vector multiplications consumes more than 90% time of back propagation. Places where the authors apply meProp:
-we apply meProp only to the back propagation from the output of the multiplication to its inputs.
+
-they apply meProp only to the back propagation from the output of the multiplication to its inputs.
-we apply meProp (meaning top-k sparsification) to every hidden layer . Because the sparsified gradient will again be dense from one layer to another.
+
-they apply meProp (meaning top-k sparsification) to every hidden layer. Because the sparsified gradient will again be dense from one layer to another.
-k for the output layer could be different from the k of the hidden layers because of the difference in dimensionality of the output layers. Example: For example, there are 10 tags in the MNIST task, so the dimension of the output layer is 10, and we use an MLP with the hidden dimension of 500. Thus, the best k for the output layer could be different from that of the hidden layers.
+
-k for the output layer could be different from the k of the hidden layers because of the difference in the dimensionality of the output layers. Example: For example, there are 10 tags in the MNIST task, so the dimension of the output layer is 10, and we use an MLP with the hidden dimension of 500. Thus, the best k for the output layer could be different from that of the hidden layers.
  
'''Exception:''' For other element-wise operations (e.g., activation functions), the original back propagation procedure remains as it is, be-cause those operations are already fast enough compared
+
'''Exception:''' For other element-wise operations (e.g., activation functions), the original backpropagation procedure remains as it is, because those operations are already fast enough compared
 
with matrix-matrix or matrix-vector multiplication operations.
 
with matrix-matrix or matrix-vector multiplication operations.
  
'''Choice of top-k algorithms:''' A variant (focusing on memory reuse) of min heap based top-k selection method is used. The time complexity is: O(n log k) and space complexity is O(k). This is done to save time on sorting the entire vector. A min-heap is a binary tree such that the data contained in each node is less than (or equal to) the data in that node’s children.
+
'''Choice of top-k algorithms:''' A variant (focusing on memory reuse) of min heap-based top-k selection method is used. The time complexity is: O(n log k) and space complexity is O(k). This is done to save time on sorting the entire vector. A min-heap is a binary tree such that the data contained in each node is less than (or equal to) the data in that node’s children.
  
 
=Experiments and Configurations=
 
=Experiments and Configurations=

Revision as of 14:37, 5 November 2017

Introduction

A simple and effective technique for learning in neural networks has been introduced. After forward computation has been computed as it is typically computed, in back-propagation step only the top k elements (in magnitude) are retained for the computation of model parameters. Since only a small subset of the weight matrix is modified so it leads to a linear reduction in the computational cost. The results of the experiments suggest that accuracy is improved rather than being degraded. The name given to the proposed technique is minimal effort back propagation method (meProp).

20.png

In the process of neural network training, back propagation step entails the high computational cost because for each iteration it needs to compute full gradients and update all model parameters. The main idea of the paper is to find only the most critical and small subset of the gradient information and in each learning step update only this minimal subset of the parameters. This leads to sparsified gradients because only highly relevant parameters are updated and rest remain untouched. Fig. 1 shows an abstract view of the proposed approach.

Two important questions:

1) In the process of stochastic learning how to find that highly relevant subset of parameters from the current sample.

Solution to this question is given by Top-k search method to identify the most imp. parameters. After experiments with this solution the results demonstrate that we only update 1–4% of the weights at each back propagation pass which does not result in a larger number of training iterations.

2) Whether this selection of small subset of model parameters hurt the accuracy?

The results demonstrate that rather than reduction, this sparsification actually improves the accuracy. The inferred reason is that it is probably because the minimal effort update does not modify weakly relevant para-meters in each update, which makes the overfitting less likely, similar to the dropout effect.

The authors demonstrate the proposed approach using deep learning approaches (like LSTM, MLP), optimization approaches (like Adam and Adagrad) and tasks like NLP and Image Recognition.

Related Work

Some of the notable related work to this paper are as follows:

In 1990, Tollenaere et al.[1] proposed SuperSAB: an adaptive acceleration strategy for error back propagation learning. They proved that it may converge orders of magnitude faster than the original back propagation algorithm, and is only slightly instable. In addition, the algorithm is very insensitive to the choice of parameter values, and has excellent scaling properties.

In 1993, Riedmilller et al.[2] and Braun proposed an algorithm called RPROP, to overcome the inherent disadvantages of pure gradient-descent, it performed a local adaptation of the weight-updates according to the behavior of the error function. To be more specific, the authors defined an individual update-value for each weight. When the update of a weight is too large (mathematically, partial derivative of a weight changes its sign), such update-value of this weight will decrease. Otherwise, it'll increase. After the process of adapting update-values is finished, the weight-update process is carried out as the following: when the partial derivative of error function with respect to a specific weight is positive, the original weight decreases by its corresponding update-value, increase otherwise.

In 2014, Srivastava et al.[3] proposed dropout. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much.

The work proposed by the authors of meProp method is quite different than the three related works discussed above.

In 2017, Shazeer et al.[4] presented a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. They used this approach for the machine translation task and concluded that it gave significantly better results. Their method is limited to a specific set of mixture of experts however, the meProp method does not have these sort of limitations.

Proposed Approach

7.png

The original back propagation computes the "full gradient" for the input vector and the weight matrix. However, For me-Prop, back propagation computes an "approximate gradient" by keeping top-k values of the backward flowed gradient and masking the remaining values to 0. That is, only the top-k elements with the largest absolute values are kept and rest are made 0.

Authors describe conversion of the traditional back-propagation to the back-propagation used in meProp with a computation unit with one linear transformation and one non-linear transformation as an example as given by eq. (1) and (2): \begin{align*} y &= W x \quad \quad \quad (1) \end{align*} \begin{align*} z &= \sigma (y) \quad \quad \quad (2) \end{align*} where m is the dimension of the input vector, n is the dimension of the output vector, and σ is a non-linear function (e.g., relu, tanh, and sigmoid). During back propagation, we need to compute the gradient of the parameter matrix W and the input vector x:

\[ \frac{\partial z}{\partial W_{ij}} = \sigma^{'}_{i}x^{T}_{j},\quad i \quad \epsilon \quad [1,n], \quad j \quad \epsilon \quad [1,m] \quad \quad \quad (3) \]

\[ \frac{\partial z}{\partial x_{i}} = \sum\limits_j W^{T}_{ij} \sigma^{'}_{j},\quad i \quad \epsilon \quad [1,n], \quad j \quad \epsilon \quad [1,m] \quad \quad \quad (4) \]

Since the proposed meProp keeps only top-k elements based on the magnitude values so eq. (3) and (4) get transformed to (5) and (6), respectively:

\[ \frac{\partial z}{\partial W_{ij}} \leftarrow \sigma^{'}_{i}x^{T}_{j}, \quad if \quad i \quad \epsilon \quad \{ t_{1}, t_{2},....., t_{k} \} \quad else \quad 0 \quad \quad (5) \]

\[ \frac{\partial z}{\partial x_{i}} \leftarrow \sum\limits_j W^{T}_{ij} \sigma^{'}_{j}, \quad if \quad j \quad \epsilon \quad \{ t_{1}, t_{2},....., t_{k} \} \quad else \quad 0 \quad \quad (6) \]


The original back-propagation computes the gradient of the matrix W and the gradient of the input vector x as shown in eq. (7) and (8), respectively:

\[ \frac{\partial L}{\partial W} = \frac{\partial L}{\partial y} . \frac{\partial y}{\partial W} \quad \quad (7) \]

\[ \frac{\partial L}{\partial x} = \frac{\partial y}{\partial x} . \frac{\partial L}{\partial y} \quad \quad (8) \]

Since, the proposed meProp selects top-k elements of the traditional gradient to approximate it, hence the gradient of W and gradient of vector x transforms to the one shown in eq. (9) and (10):


\[ \frac{\partial L}{\partial W} \leftarrow top_{k}(\frac{\partial L}{\partial y}) . \frac{\partial y}{\partial W} \quad \quad (9) \]

\[ \frac{\partial L}{\partial x} \leftarrow \frac{\partial y}{\partial x} . top_{k}(\frac{\partial L}{\partial y}) \quad \quad (10) \]

The intuition behind the discussed conversions is depicted in Fig. 2.

Where to apply meProp: In the learning task Matrix to Matrix and Matrix to Vector multiplications consumes more than 90% time of back propagation. Places where the authors apply meProp: -they apply meProp only to the back propagation from the output of the multiplication to its inputs. -they apply meProp (meaning top-k sparsification) to every hidden layer. Because the sparsified gradient will again be dense from one layer to another. -k for the output layer could be different from the k of the hidden layers because of the difference in the dimensionality of the output layers. Example: For example, there are 10 tags in the MNIST task, so the dimension of the output layer is 10, and we use an MLP with the hidden dimension of 500. Thus, the best k for the output layer could be different from that of the hidden layers.

Exception: For other element-wise operations (e.g., activation functions), the original backpropagation procedure remains as it is, because those operations are already fast enough compared with matrix-matrix or matrix-vector multiplication operations.

Choice of top-k algorithms: A variant (focusing on memory reuse) of min heap-based top-k selection method is used. The time complexity is: O(n log k) and space complexity is O(k). This is done to save time on sorting the entire vector. A min-heap is a binary tree such that the data contained in each node is less than (or equal to) the data in that node’s children.

Experiments and Configurations

To establish that the approach is general purpose, authors performed experiments on different models (i.e. LSTM, MLP), different training methods (i.e. Adam, Adagrad) and different problem sets (i.e. Part of Speech Tagging, Transition based dependency parsing, MNIST Image Recognition).

POS-Tag: Baseline model: LSTM. Benchmark dataset: Penn Treebank Corpus. For training and testing: Wall Street Journal.

Parsing: Baseline model: MLP. Benchmark dataset: Penn Treebank Corpus. For training, development and testing: Wall Street Journal.

MNIST: Baseline model: MLP. For training, development and testing: MNIST dataset.

In the configuration for Parsing and MNIST authors use same k for the output and hidden layers. For POS-Tag authors use different k for the output and hidden layers. Due to low dimensionality of output layer in POS-Tag meProp isn't applied to it.

Results

13.png

meProp is applied to the linear transformations which actually entail the major computaitonal cost. Authors call linear transformation related backprop time as Backprop Time. It does not include the time required for non linear activations which usually entail less than 2% of the computational cost. The total time of back propagation including non linear activations is reported as Overall Backprop Time.

Through results it was observed that meProp substantially speeds up the backpropagation and provides a linear reduction in computational cost. Authors state the main reason for this reduction to be that meProp does not modify weakly relevant parameters, which makes overfitting less likely similar to the dropout effect. Also, the results depict that the proposed approach is independent of specific optimization methods.

The graphs shown in Fig. 4 depict that meProp addresses the problem of overfitting and it provides better accuracy if the top-k weights are selected instead of random weights. Also, it was inferred that meProp can achieve further improvements over dropout for reducing overfitting and a model should take advantage of both meProp and dropout to reduce overfitting. Adding hidden layers does not hurt the performance of the model. The relevant numerical results have been shown in table 1-5.

Further speed up: For further speeding up the backpropagation on GPUs authors presented a simple unified top-k approach (implementation in PyTorch). The main idea is to treat the entire mini-batch as a "big training example" where top-k operation is based on the averaged values of all examples in the mini-batch. The relevant numerical results have been shown in table 7 and 8.

List of Tables

11.png
12.png
14.png
15.png
16.png
56.png
57.png

Critiques

The main idea behind meProp is to wipe out the backprop mechanism of (n-k) nodes where "n" is the number of nodes in the current layer and "k" is the number of nodes contributing to maximum of the loss in that layer. Referring to equation 10, \[ \frac{\partial L}{\partial x} \leftarrow \frac{\partial y}{\partial x} . top_{k}(\frac{\partial L}{\partial y}) \quad \quad \]

  1. The authors have not proposed any method on how k should be selected, hence it is left to the reader's discretion to possibly take it as a hyperparameter. If so, in a deep layered architecture, where the weights between each layer are randomly initialized during each execution, "k" might change for each layer since the features learnt at each layer may not be the same from the previous layers
  2. If the sum of losses caused by the (n-k) nodes in the current layer exceed any of the losses incurred due to "k" nodes, then it would not be correct to drop the (n-k) nodes as we can assume the aggregate (n-k) nodes as a single opaque node with a composite weight which will incur an aggregated loss greater that any of the "k" nodes.

In essence, the idea of selecting "k" nodes to drop-out prove to be effective as shown by the authors, but the lack of information on the conditions on selecting "k" for each layer given the current state of the layer might lead to lack of consistency in the results.

In addition to this, the authors did not include convolutional neural networks in their experiments. It would have been interesting to see whether similar results were observed on that architecture. Theoretically, the method presented in this paper should only update kernels in parts of an image that contribute the most to the loss.

References

  1. Tollenaere, Tom. "SuperSAB: fast adaptive back propagation with good scaling properties." Neural networks 3.5 (1990): 561-573.
  2. Riedmiller, Martin, and Heinrich Braun. "A direct adaptive method for faster backpropagation learning: The RPROP algorithm." Neural Networks, 1993., IEEE International Conference on. IEEE, 1993.
  3. Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from overfitting." Journal of machine learning research 15.1 (2014): 1929-1958.
  4. Shazeer, Noam, et al. "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer." arXiv preprint arXiv:1701.06538 (2017).