Difference between revisions of "natural language processing (almost) from scratch."
(→Network Design) 
m (Conversion script moved page Natural language processing (almost) from scratch. to natural language processing (almost) from scratch.: Converting page titles to lowercase) 
(No difference)

Latest revision as of 09:46, 30 August 2017
Contents
Overview
Most current systems address a particular natural language processing (NLP) task by applying standard statistical models to adhoc, taskspecific features. These features are often output of preexisting systems. The authors argue that the goal of NLP is, ideally, is to translate natural language text into a data structure which fully and unambiguously describes the meaning of the task. In this light, the drawback of existing approaches is that their representation of the text is optimized for a specific task and is not necessarily helpful in developing a more general representation.
This paper describes creates a single learning system to perform four standard NLP tasks(described below), using minimal preprocessing and minimal existing linguistic knowledge.
Benchmark Tasks
Four standard NLP tasks were chosen for testing and comparison with other systems: part of speech labelling (POS), chunking (CHUNK), named entity recognition (NER), and semantic role labelling (SRL).
Part of Speech Labelling (POS)
The goal of part of speech labelling is to tag each word with its semantic role (verb, plural noun, adverb, etc.) The experimental setup is borrowed from Toutonova et. al (2003). Sections 018 of the Wall Street Journal (WSJ) data are used for training, sections 1921 for validation, and sections 2224 for testing.
Chunking (CHUNK)
Chunking (CHUNK) or shallow parsing is intended to break a sentence into its syntactic components (noun phrase, verb phrase, etc.). Each word is tagged with the single component it belongs to, sometimes encoded with separate tags for a word at the beginning of the component and a word in the middle of the component. Experimental setup was taken from the CONLL 2000 task and uses WSJ data.
Named Entity Recognition (NER)
In named entity recognition, words are tagged with a relevant category such as "person" or "location", and additionally, the tags reflect whether a word begins or is inside an entity. Experimental setup is taken from the relevant CONL 2003 task and uses Reuters data.
Semantic Role Labelling (SRL)
SRL aims to give a semantic role to a component of a sentence (in this case, identifying predicates of a verb). Words may have multiple tags if the sentence has multiple verbs. Stateoftheart SRL systems consist of several stages: producing a parse tree, identifying which parse tree nodes represent the arguments of a given verb, and finally classifying these nodes to compute the corresponding SRL tags. SRL was evaluated using the CoNLL 2005 shared task.
Network Design
Traditional NLP systems use handdesigned, taskspecific features relying both on linguistic intuition and empirical testing as input to standard classification algorithms (for example, support vector machines). The approach in this paper is to minimally preprocess features and instead use a neural network trained endtoend. Here the preprocess step of features is as little as possible and then a multilayer neural network (NN) architecture is used which is trained in an endtoend fashion. The architecture takes the input sentence and learns several layers of feature extraction that process the inputs. The features computed by the deep layers of the network are automatically trained by backpropagation to be relevant to the task. A general multilayer architecture suitable for all NLP tasks is described, which is generalizable to other NLP tasks as well.
Notation and HyperParameters
[math]\,f_\theta[/math](.) denotes any neural network with parameters [math]\,\theta[/math]. Any feedforward neural network with L layers can be written as a composition of functions as follows: [math]\,f_\theta = f_\theta^L(f_\theta^{L1}(...f_\theta^1(.)...))[/math].
[math]\,[A]_{i,j}[/math] denotes the element of a matrix A at row i and column j.
[math]\,\langle A\rangle_i^{d_{win}}[/math] denotes the matrix obtained by concatenating the d_{win} columns vectors around the i^{th} column vector of a matrix A ∈ R^{d1xd2}. i.e., [math]\,[\langle A\rangle_i^{d_{win}}]^T = ([A]_{1,id_{win}/2} ... [A]_{d_1,id_{win}/2},...., [A]_{1,i+d_{win}/2} ... [A]_{d_1,i+d_{win}/2})[/math].
[math]\,\langle A\rangle_i^1[/math] is a special case which denotes the i^{th} column vector of A.
[math]\,[x]_1^T[/math] denotes the sequences of elements [math]\left\{ x_1, x_2, ... , x_T \right\}[/math] and [math][x]_i[/math] denotes the i^{th} element of the sequence.
[math]\,d_{wrd}[/math] is the word vector size, given by the user.
[math]\,d_{wrd}^k[/math] is the vector size for a feature k, given by the user.
[math]\,k_{sz}[/math] is the window size, given by the user.
[math]\,n_{hu}^l[/math] is the number of hidden units for the l^{th} layer, given by the user.
Feature Vectors
Words are fed into the network as indices in a finite dictionary D. The first layer of the network maps each of the word indices to a feature vector.
For each word w ∈ D the lookup table layer LT_{w}(.) gives a d_{wrd}dimensional feature vector representation:
 [math]LT_W(.) = \langle W\rangle_w^1[/math]
Where W ∈ R^{dwrdxD}. is a matrix of parameters to be learnt.
For a sequence of words [math]\,[x]_1^T[/math], the lookup table layer produces the following output matrix:
 [math]LT_W([w]_1^T) = (\langle W\rangle_{[w]_1}^1 \langle W\rangle_{[w]_2}^1) ... \langle W\rangle_{[w]_T}^1)[/math]
To add additional features, we can represent the word as K discrete features w ∈ D_{1}×...×D_{k}, where D_{k} is the dictionary for the k^{th} feature. Each feature is associated with a lookup table LT_{w}^{k}(.) with parameters W^{k} ∈ R^{dwrdxD}, and the outputs are concatenated to form the final feature vector.
 [math]LT_{w^1,...,w^K}(w)= \left( \begin{array}{c} LT_{w^1}(w_1) \\ \vdots \\ LT_{w^2}(w_2)\end{array} \right) = \left( \begin{array}{c} \langle W\rangle_{[w]_1}^1 \\ \vdots \\ \langle W\rangle_{[w]_K}^1 \end{array} \right) [/math]
Windowed Approach
This approach assumes the tag for a word depends mostly on its surrounding words. Given a word to be tagged, we feed a window of k_{sz} words around the word of interest to the lookup table layer, which outputs a matrix of size d_{wrd}×k_{sz}. Finally, the columns of this output matrix are concatenated to use as input to the next layer.
 [math] f_\theta^1 = \langle LT_W([w]_1^T) \rangle _t ^{d_{win}} = \left( \begin{array}{c} \langle W\rangle_{[w]_{td_{win}/2}}^1 \\ \vdots \\ \langle W\rangle_{[w]_{t}}^1 \\ \vdots \\ \langle W\rangle_{[w]_{t+d_{win}/2}}^1 \end{array} \right) [/math]
The output from the first layer is fed to one or more linear layers:
 [math] f_\theta^l = \langle W \rangle^l f_\theta^{l1} + b^l[/math]
Where [math]\,W^l \in R^{n_{hu}^l × n_{hu}^l1}[/math] and [math]b^l \in R^{n_{hu}^l}[/math] are parameters to be trained.
Linear layers may be interleaved with layers which apply a hard hyperbolic tangent function over their inputs:
 [math]\left[f_\theta^l\right]_i = HardTanh(\left[f_\theta^{l1} \right]_i)[/math]
Where HardTanh is defined as:
 [math]HardTanh(x) = \left\{ \begin{array}{l} 1\ if\ x \lt 1 \\ x\ if\ 1 \leq x \leq 1 \\ 1\ if\ x \gt 1 \end{array} \right. [/math]
Finally, the output size of the last layer is the number of tags for a particular task, and each output entry can be interpreted as a score for that tag.
SentenceLevel Approach
The windowed approach described above is appropriate for tasks where only information about the surrounding words is needed, but is not appropriate for tasks like SRL where the appropriate tag for a word may depend on the whole sentence. A convolutional approach is taken to solve this problem.
The convolutional network produces features for each word in the sentence and for each verb in the sentence. For each word i, two extra features are added which encode the distance between i and the word to tag (i  pos_{w}), and i and the verb being considered (i  pos_{v}).
We can view this as generalizing the previous windowed approach. The t^{th} column of the l^{th} layer is given by:
 [math] \langle f_\theta^l \rangle _t^1 = \langle W \rangle^l \langle f_\theta^{l1} \rangle _t^{d_{win}} + b^l[/math]
Where W^{l} is the same for all windows in the sequence. Convolutional layers may be stacked, in which case they must be separated by a nonlinearity
The convolutional layers are followed by a max layer to force the output into a fixedlength feature vector:
 [math] \left[ f_\theta^l \right]_i = \max_t \left[ f_\theta^{l1} \right]_{i,t} \ \ \ \ 1 \leq i \leq n_{hu}^{l1} [/math]
The output of the max layer may then be used as input to standard linear layers (as described in the previous section).
Results
Approach  POS (PWA)  CHUNK (F1)  NER (F1)  SRL (F1) 

Benchmark Systems  97.24  94.29  89.31  77.92 
NN+WLL  96.31  89.13  79.53  55.40 
NN+SLL  96.37  90.33  81.47  70.99 
The results given above compared two versions of the neural nets described above (one using wordlevel log likelihood for training and the other using sentencelevel log likelihood) to benchmark results. Perword accuracy is reported for the POS task, and the F1 score is reported for all other tasks.
Unlabelled Data
The above method produces results that are below the benchmark results. To improve results, the authors use the entire English language wikipedia (with markup removed, and excluding any paragraph containing nonroman characters), along with Reuters RCV1 data (Lewis et. al, 2004). The dictionary was also expanded by adding the 30,000 most common words from the Reuters dataset.
The unlabelled data was used to train language models which score the acceptability of a piece of text (using the windowed approach describe previously). The goal was to train a model which will score a correct phrase more highly than an incorrect one. The following pairwise ranking criteria was used:
 [math] f_\theta \rightarrow \sum_{x \in x} \sum_{w \in D} \max \{ 0, 1  f_\theta(x) + f_\theta(x^{\left(w \right)}) \} [/math]
Where x is the text window being considered, X is the set of all possible text windows with d_{win} coming from the training set, D is the dictionary, and x^{(w)<\sup>} is the text window obtained by replacing the central word of the text window with w. The language model was obtained by minimizing the above ranking criterion using stochastic gradient descent.
SemiSupervised Results
Approach  POS (PWA)  CHUNK (F1)  NER (F1)  SRL (F1) 

Benchmark Systems  97.24  94.29  89.31  77.92 
NN+WLL  96.31  89.13  79.53  55.40 
NN+SLL  96.37  90.33  81.47  70.99 
NN+WLL+LM1  97.05  91.91  85.68  58.18 
NN+SLL+LM1  97.10  93.65  87.58  73.84 
NN+WLL+LM2  97.14  92.04  86.96  58.34 
NN+SLL+LM2  97.20  93.63  88.67  74.15 
Perword accuracy is reported for the POS task, and the F1 score is reported for all other tasks. For each network, two versions, one using wordlevel log likelihood (WLL) for training and the other using sentencelevel log likelihood(SLL), were tested.
LM1 refers to the language model trained on the data from the Englishlanguage wikipedia dataset. LM2 is the language model initialized from LM1 and additional trained on the Reuters dataset.
Discussion
 The multilayer neural networks described are around a decade old, and its feasibility depends not on modern methods but on modern hardware. However, a benefit is that the learning algorithm scales linearly with the number of examples.
 Considerable existing information about good NLP features for specific tasks already exists. The authors argue that a taskindependent approach is better because no single task can provide a complete representation of a text.
 Good results are only possible with a large amount of unlabelled data, and the resulting language model was very timeconsuming to train (approximately a month for one language model and approximately seven weeks for the other).
References
R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12:24932537, 2011.
K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. Featurerich partofspeech taggingwith a cyclic dependency network. InHLTNAACL, 2003.
D. Lewis, Y. Yang, T. Rose, and F. Li. Rcv1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361397, 2004.