on the difficulty of training recurrent neural networks

From statwiki
Jump to navigation Jump to search

Introduction

Training Recurrent Neural Networks (RNN) is difficult, one of the most prominent problem in training RNN has been the vanishing and exploding gradient problem <ref name="yoshua1993">Yoshua Bengio, Paolo Frasconi, and Patrice Simard. The problem of learning long-term dependencies in recurrent networks. In Neural Networks, 1993., IEEE International Conference on, pages 1183–1188. IEEE, 1993.</ref> which prevents nerual networks from learning and fitting the data. In this paper the authors propose a gradient norm cliping stragtegy to deal with the exploding gradients and a soft constraint for the vanishing gradients problem.

Background

File:rnn 2.png
Recurrent Neural Network Unrolled in time <ref name="pascanu">Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.></ref>

A generic recurrent neural network has the form:

[math]\displaystyle{ x_{t} = F(x_{t -1}, u_{t}, \theta)\, }[/math]

Where:

  • [math]\displaystyle{ x_{t} }[/math]: is the state at time [math]\displaystyle{ t }[/math]
  • [math]\displaystyle{ u_{t} }[/math]: is the input at time [math]\displaystyle{ t }[/math]
  • [math]\displaystyle{ \theta }[/math]: is the parameters
  • [math]\displaystyle{ F() }[/math]: is the function that represents a neuron

In the theoreical sections the authors made use of specific parameterization:

[math]\displaystyle{ x_{t} = W_{rec} \sigma(x_{t - 1}) + W_{in} u_{t} + b\, }[/math]

Where:

  • [math]\displaystyle{ W_{rec} }[/math]: is the RNN weights matrix
  • [math]\displaystyle{ \sigma }[/math]: is an element wise function
  • [math]\displaystyle{ b }[/math]: is the bias
  • [math]\displaystyle{ W_{in} }[/math]: is the input weights matrix

The following are gradients equations for using the Back Propagation Through Time (BPTT) algorithm, the authors rewrote the equations in order to highlight the exploding gradents problem:

[math]\displaystyle{ \frac{\partial \varepsilon}{\partial \theta} = \sum_{1 \leq t \leq T} \frac{\partial \varepsilon}{\partial \theta} }[/math]

[math]\displaystyle{ \frac{\partial \varepsilon_{t}}{\partial \theta} = \sum_{1 \leq k \leq T} \left( \frac{\partial \varepsilon_{t}}{\partial x_{t}} \frac{\partial x_{t}}{\partial x_{k}} \frac{\partial^{+} x_{k}}{\partial \theta} \right) }[/math]

[math]\displaystyle{ \frac{\partial x_{t}}{\partial x_{k}} = \prod_{t \leq i \leq k} \frac{\partial x_{i}}{\partial x_{i - 1}} = \prod_{t \leq i \leq k} W^{T}_{rec} \textit{diag}(\sigma^{\prime}(x_{i - 1})) }[/math]

Where:

  • [math]\displaystyle{ \varepsilon_{t} }[/math]: is the error obtained from output at time [math]\displaystyle{ t }[/math]
  • [math]\displaystyle{ \frac{\partial^{+} x_{k}}{\partial \theta} }[/math]: is the immediate partial derivative of state [math]\displaystyle{ x_{k} }[/math]

Exploding and Vanishing Gradients

The Mechanics

It's known that [math]\displaystyle{ |\sigma^'(x)| }[/math] is bounded. Let [math]\displaystyle{ ||diag(\sigma^'(x_k))|| \leq \gamma \in R }[/math].

The paper first proves that it is sufficient for [math]\displaystyle{ \lambda_1 \lt \frac{1}{\gamma} }[/math], where [math]\displaystyle{ \lambda_1 }[/math] is the largest singular value of [math]\displaystyle{ \bold{W}_{rec} }[/math], for the vanishing gradient problem to occur. The Jacobian matrix [math]\displaystyle{ \frac{\partial x_{k+1}}{\partial x_k} }[/math] is given by [math]\displaystyle{ \bold{ W}_{rec}^{T}diag(\sigma^'(x_k)) }[/math]. Then, the 2-norm of this Jacobian is bounded by the product of the norms of the two matrices. This leads to [math]\displaystyle{ \forall k, ||\frac{\partial{x_{k+1}}}{\partial x_k}|| \leq ||\bold{W}_{rec}^T||||diag(\sigma^'(x_k))|| \lt \frac{1}{\gamma}\gamma \lt 1 }[/math]

Let [math]\displaystyle{ \eta \in R }[/math] be such that [math]\displaystyle{ \forall k, ||\frac{\partial {x_{k+1}}}{\partial x_k}|| \leq \eta \lt 1 }[/math]. By induction over [math]\displaystyle{ i }[/math], we can show that [math]\displaystyle{ ||\frac{\partial \varepsilon_t}{\partial x_t}(\prod_{i=k}^{t-1}{\frac{\partial x_{i+1}}{\partial x_i}})|| \leq \eta^{t-k}||\frac{\partial \varepsilon_t}{\partial x_t}|| }[/math]. Since [math]\displaystyle{ \eta \lt 1 }[/math], as [math]\displaystyle{ t-k }[/math] goes larger, the gradient goes to 0.

By inverting this proof, it also shows that when the largest singular value [math]\displaystyle{ \lambda_1 }[/math] is larger than [math]\displaystyle{ \frac{1}{\gamma} }[/math], we will have exploding gradients (otherwise the long term components would vanish instead of exploding).

From a dynamical systems perspective

Drawing from a dynamical systems perspective similiar to <ref name="yoshua1993"></ref><ref name="doya1993">Kenji Doya. Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on neural networks, 1:75–80, 1993.</ref>, dynamical systems theory states that as “[math]\displaystyle{ \theta }[/math] changes the asymptotic behaviour changes smoothly almost every where except for cetian crucial points where drasic changes occur” <ref name="pascanu"></ref>, this is because crossing these bifurcation has the potential to cause gradients to explode <ref name="doya1993"></ref>.

The authors of this paper argues, however, that crossing these bifurcation does not guarantee a sudden chage in gradients if the model state is not in the basin of an attractor. On the other hand if the model is in the basin of an attractor, crossing boundaries between basins will cause the gradients to explode.

File:dynamic perspective.png
Bifurcation diagram of Single Hidden Unit RNN <ref name="pascanu"></ref>

The figure above depicts argument, where the x-axis is the parameter [math]\displaystyle{ b }[/math] (bias) and the y-axis is the asymptotoc state [math]\displaystyle{ x_{\infty} }[/math], the plot line is the momvent of the final point attractor [math]\displaystyle{ x_{\infty} }[/math] as [math]\displaystyle{ b }[/math] changes. What this figure represents is the presence of two attractors, one emerging from [math]\displaystyle{ b_1 }[/math] and another disappearing at [math]\displaystyle{ b_2 }[/math]. The boundary between the two attractors is denoted with the dashed line, where the blue filled circles is Doya’s (1993) original hypothesis of exploding gradients, where a small change in [math]\displaystyle{ \theta }[/math] could (50% chance) cause [math]\displaystyle{ x }[/math] to change suddenly. Where as the unfilled green circles represents Pascanu’s (2013) extension of Doya’s hypothesis, where if the model is in the boundary range at time [math]\displaystyle{ 0 }[/math], a small change in [math]\displaystyle{ \theta }[/math] would result in a sudden large change in [math]\displaystyle{ x_{t} }[/math].

From a geometric perspective

Aside from a dynamical systems prospective to exploding and vanishing graidents, the authors also considered a geometric perspective, where a simple one hidden unit RNN was cosidered.

File:geometric perspective.png
Error Loss surface of single hidden unit RNN <ref name="pascanu"></ref>

Reusing the RNN state equation the authors defined:

[math]\displaystyle{ x_{t} = W_{rec} \sigma(x_{t - 1}) + W_{in} u_{t} + b }[/math]

By assuming no input, with [math]\displaystyle{ b = 0 }[/math] and initial state [math]\displaystyle{ x_{0} }[/math], the equation simplifies to:

[math]\displaystyle{ x_{t} = W_{rec}^{t} x_{0} }[/math]

Differentiating the above equation to the first and second order would give:

[math]\displaystyle{ \frac{\delta x_{t}}{\delta \omega} = t W_{rec}^{t - 1} x_{0} }[/math]

[math]\displaystyle{ \frac{\delta^{2} x_{t}}{\delta \omega^{2}} = t (t - 1) W_{rec}^{t - 2} x_{0} }[/math]

Which implies if the first order derivative explodes so will the second derivative. This means when the Stochastic Gradient decent (SGD) approaches the loss error surface and attempts to step into it, it will be deflected away, possibly hindering the learning process. (See figure above).

Dealing with Exploding and Vanishing Gradient

Related Work

  • L1/L2 Regularization: Helps with exploding gradients, but limits the RNN to a single point attractor at the origin, and prevents the model to learn generator models or exhibit long term memory traces.
  • Teacher Forcing: Proposed by <ref name="doya1993"></ref>, performs a cross bifurcation boundary if the model does not exhibit asymptotic behavior towards a desired target. This assumes the user knows what the behaviour might look like or how to intialize the model to reduce exploding gradients.
  • LTSM: The Long-Short Term Memory architecture by <ref name="graves2009">Alex Graves, Marcus Liwicki, Santiago Fern ́andez, Roman Bertolami, Horst Bunke, and Jurgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5):855–868, 2009</ref><ref name="Hochreiter">Sepp Hochreiter and Jurgen Schmidhuber. 9(8):1735–1780, 1997. Long short-term memory.Neural computation,</ref> is an attempt to deal with the vanishing gradient problem by using a linear unit that feedbacks to itself with a weight of [math]\displaystyle{ 1 }[/math]. This solution however does not deal with the exploding gradient
  • Hessian-Free optimizer with structural damping: Proposed by <ref>Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011</ref> addresses the vanishing and exploding gradient problem. <ref name="pascanu"></ref> reasons that this approach solves the vanishing gradient problem because of the high dimensionality of the spaces gives rise to a high probability for the long term components to be orthogonal to short term components. Additionally for exploding gradient the curvature of the gradient is taken into account.
  • Echo State Networks: <ref>Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.</ref> avoids the exploding and vanishing gradient problem by not learning the input and recurrent weights, they are instead hand crafted distributions that prevent information from getting loss, since a spectral radius for the recurrent weights matrix is usually smaller than 1.

Scaling Down the Gradients

File:gradient clipping.png

The intuition of this gradient clipping algorithm is simple, obtain the norm of the gradients, if it is larger than the set threshold then scale the gradients by a constant defined as the treshold divided by the norm of gradients. <ref name="pascanu"></ref> suggests using a threshold value from half to ten times the norm.

Vanishing Gradient Regularization

The vanishing gradient regularizer is as follows:

[math]\displaystyle{ \Omega = \sum_{k} \Omega_{k} = \sum_{k} \left( \frac{ \| \frac{\delta \varepsilon}{\delta x_{k + 1}} \frac{\delta x_{k + 1}}{\delta x_{k}} \| } { \| \frac{\delta \varepsilon}{\delta x_{k + 1}} \| } - 1 \right)^{2} }[/math]

The vanishing gradient problem occurs when at time [math]\displaystyle{ t }[/math] the inputs [math]\displaystyle{ u }[/math] may be irrelevant and noisy and the network starts to learn to ignore them. However this is not desirable as the model will end up not learning anything. The authors found that the sensitivity to all inputs [math]\displaystyle{ u_{t} \dots u_{k} }[/math] could be increased by increasing the norm of [math]\displaystyle{ \frac{\delta x_t}{\delta x_{t}} }[/math]. This imples that in order to increses the [math]\displaystyle{ \frac{\delta x_t}{\delta x_{t}} }[/math] norm the error must remain large, this however would prevent the model from converging, thus the authors argue a regularizer is a more natural choice. The regularizer is a soft constraint that forces the Jacobian matrices [math]\displaystyle{ \frac{\delta x_{k + 1}}{\delta x_{k}} }[/math] to preserve norm in the direction of the error [math]\displaystyle{ \frac{\delta \varepsilon}{\delta x_{k + 1}} }[/math].

Experimental Results

The Temporal Order Problem

The authors repeated the temporal order problem as the prototypical pathological problem for validating the cliping and regularizer devised. The temporal order problem involves generating a long sequence of discrete symbols, and at the beginning an [math]\displaystyle{ A }[/math] or a [math]\displaystyle{ B }[/math] symbol is placed at the beginning and middle of the sequence. The task is to correctly classify the order of [math]\displaystyle{ A }[/math] and [math]\displaystyle{ B }[/math] at the end of the sequence.

Three different RNN intializations were performed for the experiment:

  • sigmoid unit network: [math]\displaystyle{ W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01) }[/math]
  • basic tanh unit network: [math]\displaystyle{ W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.1) }[/math]
  • smart tanh unit network: [math]\displaystyle{ W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01) }[/math]

Of the three RNN networks three different optimizer configurations were used:

  • MSGD: Mini-batch Stochastic Gradient Decent
  • MSGD-C: MSGD with Gradient Clipping
  • MSGD-CR: MSGD-C with Regularization

Additional model parameters include:

  • Spectral Radius of 0.95
  • [math]\displaystyle{ b = b_{out} = 0 }[/math]
  • 50 hidden neurons
  • constant learning rate of 0.01
  • clipping threshold of 1 (only for MSGD-C and MSGD-CR)
  • regularization weight of 4 (MSGD-CR)

The experiment was performed 5 times, from the figure below we can observe the importance of gradient cliping and the regularizer, in all cases the combination of the two methods yielded the best results regardless of which unit network was used. Furthermore this experiment provided empirical evidence that exploding graidents correlates to tasks that require long memory traces, as can be seen as the sequence length of the problem increases clipping and regularization becomes more important.

File:experimental results.png

Other Pathological Problems

The author repeated other pathological problems from <ref name="Hochreiter"></ref>. The results are listed below:

image

The authors did not discuss the experimental results in detail here.

Summary

The paper explores two different perspectives in explaining the exploding and vanishing gradient problems in training RNNs via the dynamical systems and geometric approach. The authors devised methods to mitigate the corresponding problems by introducing a gradient clipping and a gradient vanishing regularizer, their experimental results show that in all cases except for the Penn Treebank dataset, that cliping and regularizer has bested the state of the art for RNN in their respective experiment performances.