Generating Text with Recurrent Neural Networks
Introduction
The goal of this paper is to introduce a new type of recurrent neural network for character-level language modelling, in which the input character at a given timestep multiplicatively gates the connections that make up the hidden-to-hidden weight matrix. The paper also addresses the problem of vanishing and exploding gradients by applying a technique called Hessian-Free optimization to effectively train a recurrent network that, when unrolled in time, has approximately 500 layers. At the time of publication, this network was arguably the deepest neural network ever trained successfully.
Language modelling is the problem of predicting the next character or word in a sequence given some number of preceding characters or words. Recurrent neural networks are naturally applicable to this problem, since they make predictions based on a current input and a hidden state whose value is determined by some number of previous inputs. Alternative methods that the authors compare their results to include a hierarchical Bayesian model called a 'sequence memoizer', and a mixture of context models referred to as PAQ, which actually includes word-level information (rather than strictly character-level information). The multiplicative RNN introduced in this paper improves on the state-of-the-art for solely character-level language modelling, but is somewhat worse than the state-of-the-art for text compression.
To give a brief review, an ordinary recurrent neural network is parameterized by three weight matrices, [math]\displaystyle{ \ W_{hi} }[/math], [math]\displaystyle{ \ W_{hh} }[/math], and [math]\displaystyle{ \ W_{oh} }[/math], and maps a sequence of [math]\displaystyle{ N }[/math] input states [math]\displaystyle{ \ [i_1, ... , i_N] }[/math] to a sequence of hidden states [math]\displaystyle{ \ [h_1, ... , h_N] }[/math] and a sequence of output states [math]\displaystyle{ \ [o_1, ... , o_N] }[/math]. The matrix [math]\displaystyle{ \ W_{hi} }[/math] parameterizes the mapping from the current input state to the current hidden state, while the matrix [math]\displaystyle{ \ W_{hh} }[/math] parameterizes the mapping from the previous hidden state to the current hidden state, so that the current hidden state is a function of the previous hidden state and the current input state. Finally, the matrix [math]\displaystyle{ \ W_{oh} }[/math] parameterizes the mapping from the current hidden state to the current output state. At a given timestep [math]\displaystyle{ t }[/math], the values of the hidden state and output state are as follows, where [math]\displaystyle{ \ b_h }[/math] and [math]\displaystyle{ \ b_o }[/math] are bias vectors:
- [math]\displaystyle{ \ h_t = \tanh(W_{hi}i_t + W_{hh}h_{t-1} + b_h) }[/math]
- [math]\displaystyle{ \ o_t = W_{oh}h_t + b_o }[/math]
Typically, the output state is converted into a probability distribution over characters or words using the softmax function. The network can then be treated as a generative model of text by sampling from this distribution and providing the sampled output as the input to the network at the next timestep.
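The recurrence and sampling procedure described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names (`rnn_step`, `softmax`, `init_params`, `sample_text`), the one-hot encoding of input characters, the zero initial hidden state, and the small random initialization are all assumptions made for the example.

```python
import numpy as np

def rnn_step(i_t, h_prev, W_hi, W_hh, W_oh, b_h, b_o):
    """One timestep: h_t = tanh(W_hi i_t + W_hh h_{t-1} + b_h), o_t = W_oh h_t + b_o."""
    h_t = np.tanh(W_hi @ i_t + W_hh @ h_prev + b_h)
    o_t = W_oh @ h_t + b_o
    return h_t, o_t

def softmax(o):
    """Convert an output state into a probability distribution."""
    e = np.exp(o - np.max(o))  # subtract the max for numerical stability
    return e / e.sum()

def init_params(vocab_size, hidden_size, rng):
    """Hypothetical initialization: small Gaussian weights, zero biases."""
    W_hi = 0.1 * rng.standard_normal((hidden_size, vocab_size))
    W_hh = 0.1 * rng.standard_normal((hidden_size, hidden_size))
    W_oh = 0.1 * rng.standard_normal((vocab_size, hidden_size))
    return W_hi, W_hh, W_oh, np.zeros(hidden_size), np.zeros(vocab_size)

def sample_text(params, vocab_size, first_char, length, rng):
    """Generate character indices by feeding each sampled output back in as input."""
    W_hi, W_hh, W_oh, b_h, b_o = params
    h = np.zeros(W_hh.shape[0])  # assumed zero initial hidden state
    char = first_char
    sequence = [char]
    for _ in range(length - 1):
        i_t = np.zeros(vocab_size)
        i_t[char] = 1.0  # assumed one-hot encoding of the current character
        h, o = rnn_step(i_t, h, W_hi, W_hh, W_oh, b_h, b_o)
        char = int(rng.choice(vocab_size, p=softmax(o)))
        sequence.append(char)
    return sequence

rng = np.random.default_rng(0)
params = init_params(vocab_size=5, hidden_size=8, rng=rng)
print(sample_text(params, vocab_size=5, first_char=0, length=12, rng=rng))
```

With untrained weights the sampled indices are close to uniformly random; the point is only to show how the hidden state carries context forward and how the sampled output becomes the next input.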
Recurrent networks are known to be very difficult to train due to the existence of a highly unstable relationship between a network's parameters and the gradient of its cost function. Intuitively, the surface of the cost function is intermittently punctuated by abrupt changes (which can lead to exploding gradients) and nearly flat plateaus (which lead to vanishing gradients); the plateaus can effectively act as poor local minima on which a network trained through gradient descent converges. Techniques for improving training include the use of Long Short-Term Memory networks <ref> Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780. </ref>, in which memory units are used to selectively preserve information from previous states, and the use of Echo State networks <ref> Jaeger, H. and H. Haas. "Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication." Science 304.5667 (2004): 78-80. </ref>, which learn only the output weights on a network with recurrent connections that implement a wide range of time-varying patterns.
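A small numerical illustration may help make the vanishing and exploding gradient problem concrete. The sketch below is a deliberate simplification and not taken from the paper: it ignores the tanh nonlinearity and uses a hypothetical hidden-to-hidden matrix that is a scaled orthogonal matrix, so that backpropagating a gradient vector through each timestep multiplies its norm by exactly the scale factor. The function name `backprop_norm` and all parameter values are assumptions chosen for the example.

```python
import numpy as np

def backprop_norm(scale, steps=500, size=16, seed=0):
    """Norm of a unit gradient after backpropagating through `steps` linear timesteps.

    Assumes W_hh = scale * Q with Q orthogonal, and ignores the tanh derivative.
    """
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((size, size)))  # random orthogonal matrix
    W_hh = scale * q
    g = np.ones(size) / np.sqrt(size)  # unit-norm gradient at the final timestep
    for _ in range(steps):
        g = W_hh.T @ g  # one step of backpropagation through time
    return float(np.linalg.norm(g))

for scale in (0.9, 1.0, 1.1):
    print(scale, backprop_norm(scale))
```

Over 500 timesteps, a scale slightly below 1 drives the gradient norm towards zero and a scale slightly above 1 makes it grow geometrically, which is why first-order gradient descent struggles on networks unrolled this deeply.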
Hessian-Free Optimization
While this optimization technique is described elsewhere in Martens (2010), its use is essential to obtaining the authors' successful results in this paper.
Multiplicative Recurrent Neural Networks
Quantitative Experiments
Qualitative Experiments
Discussion
Bibliography
<references />