on the difficulty of training recurrent neural networks
Introduction
Training Recurrent Neural Networks (RNN) is difficult, one of the most prominent problem in training RNN has been the vanishing and exploding gradient problem <ref name="yoshua1993">Yoshua Bengio, Paolo Frasconi, and Patrice Simard. The problem of learning long-term dependencies in recurrent networks. In Neural Networks, 1993., IEEE International Conference on, pages 1183–1188. IEEE, 1993.</ref> which prevents nerual networks from learning and fitting the data. In this paper the authors propose a gradient norm cliping stragtegy to deal with the exploding gradients and a soft constraint for the vanishing gradients problem.
Background
A generic recurrent neural network has the form:
[math]\displaystyle{ x_{t} = F(x_{t -1}, u_{t}, \theta) }[/math]
Where:
- [math]\displaystyle{ x_{t} }[/math]: is the state at time [math]\displaystyle{ t }[/math]
- [math]\displaystyle{ u_{t} }[/math]: is the input at time [math]\displaystyle{ t }[/math]
- [math]\displaystyle{ \theta }[/math]: is the parameters
- [math]\displaystyle{ F() }[/math]: is the function that represents a neuron
In the theoreical sections the authors made use of specific parameterization:
[math]\displaystyle{ x_{t} = W_{rec} \sigma(x_{t - 1}) + W_{in} u_{t} + b }[/math]
Where:
- [math]\displaystyle{ W_{rec} }[/math]: is the RNN weights matrix
- [math]\displaystyle{ \sigma }[/math]: is an element wise function
- [math]\displaystyle{ b }[/math]: is the bias
- [math]\displaystyle{ W_{in} }[/math]: is the input weights matrix
The following are gradients equations for using the Back Propagation Through Time (BPTT) algorithm, the authors rewrote the equations in order to highlight the exploding gradents problem:
[math]\displaystyle{ \frac{\delta \varepsilon}{\delta \theta} = \sum_{1 \leq t \leq T} \frac{\delta \varepsilon}{\delta \theta} }[/math]
[math]\displaystyle{ \frac{\delta \varepsilon_{t}}{\delta \theta} = \sum_{1 \leq k \leq T} \left( \frac{\delta \varepsilon_{t}}{\delta x_{t}} \frac{\delta x_{t}}{\delta x_{k}} \frac{\delta^{+} x_{k}}{\delta \theta} \right) }[/math]
[math]\displaystyle{ \frac{\delta x_{t}}{\delta x_{k}} = \prod_{t \leq i \leq k} \frac{\delta x_{i}}{\delta x_{i - 1}} = \prod_{t \leq i \leq k} W^{T}_{rec} \textit{diag}(\sigma^{\prime}(x_{i - 1})) }[/math]
Where:
- [math]\displaystyle{ \varepsilon_{t} }[/math]: is the error obtained from output at time [math]\displaystyle{ t }[/math]
- [math]\displaystyle{ \frac{\delta^{+} x_{k}}{\delta \theta} }[/math]: is the immediate partial derivative of state [math]\displaystyle{ x_{k} }[/math]
Exploding and Vanishing Gradients
From a dynamical systems perspective
Drawing from a dynamical systems perspective similiar to <ref name="yoshua1993"></ref><ref name="doya1993">Kenji Doya. Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on neural networks, 1:75–80, 1993.</ref>, dynamical systems theory states that as “[math]\displaystyle{ \theta }[/math] changes the asymptotic behaviour changes smoothly almost every where except for cetian crucial points where drasic changes occur” <ref name="pascanu"></ref>, this is because crossing these bifurcation has the potential to cause gradients to explode <ref name="doya1993"></ref>.
The authors of this paper argues, however, that crossing these bifurcation does not guarantee a sudden chage in gradients if the model state is not in the basin of an attractor. On the other hand if the model is in the basin of an attractor, crossing boundaries between basins will cause the gradients to explode.
The figure above depicts argument, where the x-axis is the parameter [math]\displaystyle{ b }[/math] (bias) and the y-axis is the asymptotoc state [math]\displaystyle{ x_{\infty} }[/math], the plot line is the momvent of the final point attractor [math]\displaystyle{ x_{\infty} }[/math] as [math]\displaystyle{ b }[/math] changes. What this figure represents is the presence of two attractors, one emerging from [math]\displaystyle{ b_1 }[/math] and another disappearing at [math]\displaystyle{ b_2 }[/math]. The boundary between the two attractors is denoted with the dashed line, where the blue filled circles is Doya’s (1993) original hypothesis of exploding gradients, where a small change in [math]\displaystyle{ \theta }[/math] could (50% chance) cause [math]\displaystyle{ x }[/math] to change suddenly. Where as the unfilled green circles represents Pascanu’s (2013) extension of Doya’s hypothesis, where if the model is in the boundary range at time [math]\displaystyle{ 0 }[/math], a small change in [math]\displaystyle{ \theta }[/math] would result in a sudden large change in [math]\displaystyle{ x_{t} }[/math].
From a geometric perspective
Aside from a dynamical systems prospective to exploding and vanishing graidents, the authors also considered a geometric perspective, where a simple one hidden unit RNN was cosidered.
Reusing the RNN state equation the authors defined:
[math]\displaystyle{ x_{t} = W_{rec} \sigma(x_{t - 1}) + W_{in} u_{t} + b }[/math]
By assuming no input, with [math]\displaystyle{ b = 0 }[/math] and initial state [math]\displaystyle{ x_{0} }[/math], the equation simplifies to:
[math]\displaystyle{ x_{t} = W_{rec}^{t} x_{0} }[/math]
Differentiating the above equation to the first and second order would give:
[math]\displaystyle{ \frac{\delta x_{t}}{\delta \omega} = t W_{rec}^{t - 1} x_{0} }[/math]
[math]\displaystyle{ \frac{\delta^{2} x_{t}}{\delta \omega^{2}} = t (t - 1) W_{rec}^{t - 2} x_{0} }[/math]
Which implies if the first order derivative explodes so will the second derivative. This means when the Stochastic Gradient decent (SGD) approaches the loss error surface and attempts to step into it, it will be deflected away, possibly hindering the learning process. (See figure above).
Dealing with Exploding and Vanishing Gradient
Related Work
- L1/L2 Regularization: Helps with exploding gradients, but limits the RNN to a single point attractor at the origin, and prevents the model to learn generator models or exhibit long term memory traces.
- Teacher Forcing: Proposed by <ref name="doya1993"></ref>, performs a cross bifurcation boundary if the model does not exhibit asymptotic behavior towards a desired target. This assumes the user knows what the behaviour might look like or how to intialize the model to reduce exploding gradients.
- LTSM: The Long-Short Term Memory architecture by <ref name="graves2009">Alex Graves, Marcus Liwicki, Santiago Fern ́andez, Roman Bertolami, Horst Bunke, and Jurgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5):855–868, 2009</ref><ref name="Hochreiter">Sepp Hochreiter and Jurgen Schmidhuber. 9(8):1735–1780, 1997. Long short-term memory.Neural computation,</ref> is an attempt to deal with the vanishing gradient problem by using a linear unit that feedbacks to itself with a weight of [math]\displaystyle{ 1 }[/math]. This solution however does not deal with the exploding gradient
- Hessian-Free optimizer with structural damping: Proposed by <ref>Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011</ref> addresses the vanishing and exploding gradient problem. <ref name="pascanu"></ref> reasons that this approach solves the vanishing gradient problem because of the high dimensionality of the spaces gives rise to a high probability for the long term components to be orthogonal to short term components. Additionally for exploding gradient the curvature of the gradient is taken into account.
- Echo State Networks: <ref>Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.</ref> avoids the exploding and vanishing gradient problem by not learning the input and recurrent weights, they are instead hand crafted distributions that prevent information from getting loss, since a spectral radius for the recurrent weights matrix is usually smaller than 1.
Scaling Down the Gradients
The intuition of this gradient clipping algorithm is simple, obtain the norm of the gradients, if it is larger than the set threshold then scale the gradients by a constant defined as the treshold divided by the norm of gradients. <ref name="pascanu"></ref> suggests using a threshold value from half to ten times the norm.
Vanishing Gradient Regularization
The vanishing gradient regularizer is as follows:
[math]\displaystyle{ \Omega = \sum_{k} \Omega_{k} = \sum_{k} \left( \frac{ \| \frac{\delta \varepsilon}{\delta x_{k + 1}} \frac{\delta x_{k + 1}}{\delta x_{k}} \| } { \| \frac{\delta \varepsilon}{\delta x_{k + 1}} \| } - 1 \right)^{2} }[/math]
The vanishing gradient problem occurs when at time [math]\displaystyle{ t }[/math] the inputs [math]\displaystyle{ u }[/math] may be irrelevant and noisy and the network starts to learn to ignore them. However this is not desirable as the model will end up not learning anything. The authors found that the sensitivity to all inputs [math]\displaystyle{ u_{t} \dots u_{k} }[/math] could be increased by increasing the norm of [math]\displaystyle{ \frac{\delta x_t}{\delta x_{t}} }[/math]. This imples that in order to increses the [math]\displaystyle{ \frac{\delta x_t}{\delta x_{t}} }[/math] norm the error must remain large, this however would prevent the model from converging, thus the authors argue a regularizer is a more natural choice. The regularizer is a soft constraint that forces the Jacobian matrices [math]\displaystyle{ \frac{\delta x_{k + 1}}{\delta x_{k}} }[/math] to preserve norm in the direction of the error [math]\displaystyle{ \frac{\delta \varepsilon}{\delta x_{k + 1}} }[/math].
Experimental Results
The Temporal Order Problem
The authors repeated the temporal order problem as the prototypical pathological problem for validating the cliping and regularizer devised. The temporal order problem involves generating a long sequence of discrete symbols, and at the beginning an [math]\displaystyle{ A }[/math] or a [math]\displaystyle{ B }[/math] symbol is placed at the beginning and middle of the sequence. The task is to correctly classify the order of [math]\displaystyle{ A }[/math] and [math]\displaystyle{ B }[/math] at the end of the sequence.
Three different RNN intializations were performed for the experiment:
- sigmoid unit network: [math]\displaystyle{ W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01) }[/math]
- basic tanh unit network: [math]\displaystyle{ W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.1) }[/math]
- smart tanh unit network: [math]\displaystyle{ W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01) }[/math]
Of the three RNN networks three different optimizer configurations were used:
- MSGD: Mini-batch Stochastic Gradient Decent
- MSGD-C: MSGD with Gradient Clipping
- MSGD-CR: MSGD-C with Regularization
Additional model parameters include:
- Spectral Radius of 0.95
- [math]\displaystyle{ b = b_{out} = 0 }[/math]
- 50 hidden neurons
- constant learning rate of 0.01
- clipping threshold of 1 (only for MSGD-C and MSGD-CR)
- regularization weight of 4 (MSGD-CR)
The experiment was performed 5 times, from the figure below we can observe the importance of gradient cliping and the regularizer, in all cases the combination of the two methods yielded the best results regardless of which unit network was used. Furthermore this experiment provided empirical evidence that exploding graidents correlates to tasks that require long memory traces, as can be seen as the sequence length of the problem increases clipping and regularization becomes more important.
Other Pathological Problems
The author repeated other pathological problems from <ref name="Hochreiter"></ref>. The results are listed below:
The authors did not discuss the experimental results in detail here.
Summary
The paper explores two different perspectives in explaining the exploding and vanishing gradient problems in training RNNs via the dynamical systems and geometric approach. The authors devised methods to mitigate the corresponding problems by introducing a gradient clipping and a gradient vanishing regularizer, their experimental results show that in all cases except for the Penn Treebank dataset, that cliping and regularizer has bested the state of the art for RNN in their respective experiment performances.