Natural Language Processing (Almost) from Scratch
Overview
Benchmark Tasks
Part-of-Speech Tagging (POS)
Chunking (CHUNK)
Named Entity Recognition (NER)
Semantic Role Labelling (SRL)
Network Design
Words are preprocessed by lower-casing and encoding capitalization as an additional feature.
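A minimal sketch of this preprocessing step is given below. The function names and the four capitalization classes (`allcaps`, `initcap`, `hascap`, `nocaps`) are assumptions made for illustration; this page does not list the exact categories used.

```python
def caps_feature(token):
    """Return a hypothetical capitalization class for a raw token."""
    if token.isupper():
        return "allcaps"   # every cased character is upper-case
    if token[:1].isupper():
        return "initcap"   # only the first character is known to be upper-case
    if any(ch.isupper() for ch in token):
        return "hascap"    # a capital appears somewhere after the first character
    return "nocaps"


def preprocess(tokens):
    """Lower-case each token and keep its capitalization as a separate feature."""
    return [(tok.lower(), caps_feature(tok)) for tok in tokens]


if __name__ == "__main__":
    print(preprocess(["The", "NASA", "iPhone", "cat"]))
```

Each word then becomes a pair of discrete features, which fits the multi-feature lookup tables described under Feature Vectors.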
Notation and Hyper-Parameters
[math]\displaystyle{ \,f_\theta }[/math](.) denotes any neural network with parameters [math]\displaystyle{ \,\theta }[/math]. Any feed-forward neural network with L layers can be written as a composition of functions as follows: [math]\displaystyle{ \,f_\theta = f_\theta^L(f_\theta^{L-1}(...f_\theta^1(.)...)) }[/math].
[math]\displaystyle{ \,[A]_{i,j} }[/math] denotes the element of a matrix A at row i and column j.
[math]\displaystyle{ \,\langle A\rangle_i^{d_{win}} }[/math] denotes the vector obtained by concatenating the [math]\displaystyle{ \,d_{win} }[/math] column vectors around the ith column vector of a matrix [math]\displaystyle{ \,A \in \mathbb{R}^{d_1 \times d_2} }[/math], i.e., [math]\displaystyle{ \,[\langle A\rangle_i^{d_{win}}]^T = ([A]_{1,i-d_{win}/2} \ldots [A]_{d_1,i-d_{win}/2}, \ldots, [A]_{1,i+d_{win}/2} \ldots [A]_{d_1,i+d_{win}/2}) }[/math].
[math]\displaystyle{ \,\langle A\rangle_i^1 }[/math] is a special case which denotes the ith column vector of A.
[math]\displaystyle{ \,[x]_1^T }[/math] denotes the sequence of elements [math]\displaystyle{ \left\{ x_1, x_2, ... , x_T \right\} }[/math] and [math]\displaystyle{ [x]_i }[/math] denotes the ith element of the sequence.
[math]\displaystyle{ \,d_{wrd} }[/math] is the word vector size, given by the user.
[math]\displaystyle{ \,d_{wrd}^k }[/math] is the vector size for a feature k, given by the user.
[math]\displaystyle{ \,k_{sz} }[/math] is the window size, given by the user.
[math]\displaystyle{ \,n_{hu}^l }[/math] is the number of hidden units for the lth layer, given by the user.
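The window notation [math]\displaystyle{ \,\langle A\rangle_i^{d_{win}} }[/math] can be illustrated with a short NumPy sketch. The function name `window` is hypothetical, indices are 0-based (the notation above is 1-based), and the sketch assumes an odd window size and a column i far enough from the matrix edges; boundary handling is not covered by the notation.

```python
import numpy as np


def window(A, i, d_win):
    """Concatenate the d_win columns of A centred on column i (0-based)."""
    if d_win % 2 != 1:
        raise ValueError("this sketch assumes an odd window size")
    half = d_win // 2
    if i - half < 0 or i + half >= A.shape[1]:
        raise IndexError("window extends past the edge of A")
    cols = A[:, i - half : i + half + 1]   # shape (d1, d_win)
    # Column-major flattening stacks the selected columns one after another.
    return cols.reshape(-1, order="F")


if __name__ == "__main__":
    A = np.arange(12).reshape(3, 4)
    print(window(A, 1, 3))   # columns 0, 1 and 2 stacked into one vector
    print(window(A, 2, 1))   # special case d_win = 1: a single column
```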
Feature Vectors
Words are fed into the network as indices in a finite dictionary D. The first layer of the network maps each of the word indices to a feature vector.
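For each word [math]\displaystyle{ \,w \in D }[/math], the lookup table layer [math]\displaystyle{ \,LT_W(\cdot) }[/math] gives a [math]\displaystyle{ \,d_{wrd} }[/math]-dimensional feature vector representation:
- [math]\displaystyle{ LT_W(w) = \langle W\rangle_w^1 }[/math]
Where [math]\displaystyle{ \,W \in \mathbb{R}^{d_{wrd} \times |D|} }[/math] is a matrix of parameters to be learnt.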
For a sequence of words [math]\displaystyle{ \,[w]_1^T }[/math], the lookup table layer produces the following output matrix:
- [math]\displaystyle{ LT_W([w]_1^T) = (\langle W\rangle_{[w]_1}^1 \;\; \langle W\rangle_{[w]_2}^1 \;\; \ldots \;\; \langle W\rangle_{[w]_T}^1) }[/math]
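As a rough illustration, the lookup table layer amounts to selecting columns of the parameter matrix. The sketch below assumes 0-based word indices and uses a hypothetical function name `lookup`; the toy matrix stands in for learnt parameters.

```python
import numpy as np


def lookup(W, word_ids):
    """Return the d_wrd x T matrix whose t-th column is the vector of word t."""
    return W[:, word_ids]


if __name__ == "__main__":
    d_wrd, dict_size = 3, 4
    # Toy parameters; in practice W would be initialised randomly and trained.
    W = np.arange(d_wrd * dict_size, dtype=float).reshape(d_wrd, dict_size)
    out = lookup(W, [3, 1, 3])
    print(out.shape)   # one column per word in the sequence
    print(out)
```

Repeated words share the same column of W, so a word's vector is the same wherever it occurs in the sequence.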
To add additional features, we can represent each word as K discrete features [math]\displaystyle{ \,w \in D^1 \times \ldots \times D^K }[/math], where [math]\displaystyle{ \,D^k }[/math] is the dictionary for the kth feature. Each feature is associated with a lookup table [math]\displaystyle{ \,LT_{W^k}(\cdot) }[/math] with parameters [math]\displaystyle{ \,W^k \in \mathbb{R}^{d_{wrd}^k \times |D^k|} }[/math], and the outputs are concatenated to form the final feature vector.
- [math]\displaystyle{ LT_{W^1,...,W^K}(w)= \left( \begin{array}{c} LT_{W^1}(w_1) \\ \vdots \\ LT_{W^K}(w_K)\end{array} \right) = \left( \begin{array}{c} \langle W^1\rangle_{w_1}^1 \\ \vdots \\ \langle W^K\rangle_{w_K}^1 \end{array} \right) }[/math]
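The concatenation of several feature lookups might be sketched as follows. The name `lookup_features` and the two toy tables (a word table and a capitalization table) are assumptions for illustration, and indices are 0-based.

```python
import numpy as np


def lookup_features(tables, w):
    """Concatenate one column from each feature's lookup table.

    tables: list of K matrices, the k-th of shape (d_wrd^k, |D^k|)
    w:      list of K feature indices describing a single word
    """
    if len(tables) != len(w):
        raise ValueError("need exactly one index per feature table")
    return np.concatenate([W_k[:, w_k] for W_k, w_k in zip(tables, w)])


if __name__ == "__main__":
    W_word = np.arange(15, dtype=float).reshape(3, 5)   # d_wrd^1 = 3, |D^1| = 5
    W_caps = np.arange(8, dtype=float).reshape(2, 4)    # d_wrd^2 = 2, |D^2| = 4
    vec = lookup_features([W_word, W_caps], [2, 3])
    print(vec.shape)   # total size is the sum of the per-feature sizes
```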
Windowed Approach
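This approach assumes the tag for a word depends mostly on its surrounding words. Given a word to be tagged, we feed a window of [math]\displaystyle{ \,k_{sz} }[/math] words around the word of interest to the lookup table layer, which outputs a matrix of size [math]\displaystyle{ \,d_{wrd} \times k_{sz} }[/math]. Finally, the columns of this output matrix are concatenated to use as input to the next layer. In the formula below, the window size [math]\displaystyle{ \,k_{sz} }[/math] plays the role of [math]\displaystyle{ \,d_{win} }[/math] in the window notation.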
- [math]\displaystyle{ f_\theta^1 = \langle LT_W([w]_1^T) \rangle _t ^{d_{win}} = \left( \begin{array}{c} \langle W\rangle_{[w]_{t-d_{win}/2}}^1 \\ \vdots \\ \langle W\rangle_{[w]_{t}}^1 \\ \vdots \\ \langle W\rangle_{[w]_{t+d_{win}/2}}^1 \end{array} \right) }[/math]
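The first layer of the windowed approach can be sketched by combining the lookup with the window concatenation. This section does not say how words near the sentence boundaries are handled, so the sketch assumes a hypothetical padding index `pad_id` with its own column in W; the function name `window_input` and the 0-based indexing are likewise assumptions.

```python
import numpy as np


def window_input(W, word_ids, t, d_win, pad_id):
    """Build the input vector for tagging the word at position t (0-based)."""
    if d_win % 2 != 1:
        raise ValueError("this sketch assumes an odd window size")
    half = d_win // 2
    # Pad both ends so that every position has a full window (an assumption).
    padded = [pad_id] * half + list(word_ids) + [pad_id] * half
    ids = padded[t : t + d_win]            # window centred on original position t
    # Stack the selected word vectors column by column into one long vector.
    return W[:, ids].reshape(-1, order="F")


if __name__ == "__main__":
    W = np.arange(8, dtype=float).reshape(2, 4)   # d_wrd = 2, 4 dictionary entries
    x = window_input(W, word_ids=[1, 2, 3], t=0, d_win=3, pad_id=0)
    print(x.shape)   # d_wrd * d_win entries
```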
The output from the first layer is fed to one or more linear layers:
- [math]\displaystyle{ f_\theta^l = W^l f_\theta^{l-1} + b^l }[/math]
Where [math]\displaystyle{ \,W^l \in \mathbb{R}^{n_{hu}^l \times n_{hu}^{l-1}} }[/math] and [math]\displaystyle{ b^l \in \mathbb{R}^{n_{hu}^l} }[/math] are parameters to be trained.
Linear layers may be interleaved with layers which apply a hard hyperbolic tangent function over their inputs:
- [math]\displaystyle{ \left[f_\theta^l\right]_i = HardTanh(\left[f_\theta^{l-1} \right]_i) }[/math]
Where HardTanh is defined as:
- [math]\displaystyle{ HardTanh(x) = \left\{ \begin{array}{l} -1\ if\ x \lt -1 \\ x\ if\ -1 \leq x \leq 1 \\ 1\ if\ x \gt 1 \end{array} \right. }[/math]
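The piecewise definition above is equivalent to clipping each input to the interval [-1, 1], which suggests the following one-line NumPy sketch (the function name `hard_tanh` is chosen here for illustration).

```python
import numpy as np


def hard_tanh(x):
    """Element-wise HardTanh: -1 below -1, identity on [-1, 1], 1 above 1."""
    return np.clip(x, -1.0, 1.0)


if __name__ == "__main__":
    print(hard_tanh(np.array([-2.0, -0.5, 0.0, 0.5, 3.0])))
```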
Finally, the output size of the last layer is the number of tags for a particular task, and each output entry can be interpreted as a score for that tag.
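Putting the pieces together, the layers after the lookup table might be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the function name `tag_scores`, the choice of applying HardTanh after every linear layer except the last, and the toy parameter values are all hypothetical.

```python
import numpy as np


def hard_tanh(x):
    """Element-wise HardTanh, i.e. clipping to [-1, 1]."""
    return np.clip(x, -1.0, 1.0)


def tag_scores(x, layers):
    """Apply linear layers (W, b) in order, with HardTanh between them.

    x:      concatenated window vector from the first layer
    layers: list of (W, b) pairs; the last W has one row per tag
    """
    h = x
    for idx, (W, b) in enumerate(layers):
        h = W @ h + b
        if idx < len(layers) - 1:   # no non-linearity after the output layer
            h = hard_tanh(h)
    return h


if __name__ == "__main__":
    # Toy parameters: 2 inputs, 2 hidden units, 3 tags.
    layers = [
        (5.0 * np.eye(2), np.zeros(2)),
        (np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]), np.array([0.0, 0.0, 0.5])),
    ]
    scores = tag_scores(np.array([1.0, -1.0]), layers)
    print(scores, "-> predicted tag index:", int(np.argmax(scores)))
```

Taking the index of the highest score gives a per-word prediction; how the scores are trained is outside the scope of this sketch.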
Sentence-Level Approach
Results
Unlabelled Data
Discussion
Task-Specific Engineering
References
R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12:2493-2537, 2011.