Natural Language Processing (Almost) from Scratch
Overview
Benchmark Tasks
Part-of-Speech Tagging (POS)
Chunking (CHUNK)
Named Entity Recognition (NER)
Semantic Role Labelling (SRL)
Network Design
Words are preprocessed by lower-casing and encoding capitalization as an additional feature.
Notation and Hyper-Parameters
[math]\displaystyle{ \,f_\theta }[/math](.) denotes any neural network with parameters [math]\displaystyle{ \,\theta }[/math]. Any feed-forward neural network with L layers can be written as a composition of functions as follows: [math]\displaystyle{ \,f_\theta = f_\theta^L(f_\theta^{L-1}(...f_\theta^1(.)...)) }[/math].
[math]\displaystyle{ \,[A]_{i,j} }[/math] denotes the element of a matrix A at row i and column j.
[math]\displaystyle{ \,\langle A\rangle_i^{d_{win}} }[/math] denotes the matrix obtained by concatenating the [math]\displaystyle{ \,d_{win} }[/math] column vectors around the ith column vector of a matrix [math]\displaystyle{ \,A \in \mathbb{R}^{d_1 \times d_2} }[/math]. i.e., [math]\displaystyle{ \,[\langle A\rangle_i^{d_{win}}]^T = ([A]_{1,i-d_{win}/2} ... [A]_{d_1,i-d_{win}/2},...., [A]_{1,i+d_{win}/2} ... [A]_{d_1,i+d_{win}/2}) }[/math].
[math]\displaystyle{ \,\langle A\rangle_i^1 }[/math] is a special case which denotes the ith column vector of A.
[math]\displaystyle{ \,[x]_1^T }[/math] denotes the sequence of elements [math]\displaystyle{ \left\{ x_1, x_2, ... , x_T \right\} }[/math] and [math]\displaystyle{ [x]_i }[/math] denotes the ith element of the sequence.
[math]\displaystyle{ \,d_{wrd} }[/math] is the word vector size, given by the user.
[math]\displaystyle{ \,d_{wrd}^k }[/math] is the vector size for a feature k, given by the user.
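The window notation above can be made concrete with a short NumPy sketch that stacks the [math]\displaystyle{ \,d_{win} }[/math] columns around column i into one long vector (the function name and example sizes are my own, and boundary padding is ignored for simplicity):

```python
import numpy as np

def window(A, i, d_win):
    """Build <A>_i^{d_win}: concatenate the d_win column vectors of A
    centered on column i into a single vector of length d1 * d_win.
    Assumes all referenced columns exist (no padding at the edges)."""
    half = d_win // 2
    cols = A[:, i - half : i + half + 1]   # shape (d1, d_win)
    return cols.T.reshape(-1)              # column i-half first, top to bottom, then the next column, ...

A = np.arange(12).reshape(3, 4)  # d1 = 3, d2 = 4
v = window(A, 1, 3)              # columns 0, 1, 2 of A, stacked
```

Here `v` has length [math]\displaystyle{ \,d_1 \cdot d_{win} = 9 }[/math], matching the transpose expression in the definition: each column of the window contributes its [math]\displaystyle{ \,d_1 }[/math] entries in order.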
Feature Vectors
Words are fed into the network as indices in a finite dictionary D. The first layer of the network maps each of the word indices to a feature vector.
For each word w ∈ D the lookup table layer LTW(.) gives a [math]\displaystyle{ \,d_{wrd} }[/math]-dimensional feature vector representation:
- [math]\displaystyle{ LT_W(w) = \langle W\rangle_w^1 }[/math]
where [math]\displaystyle{ \,W \in \mathbb{R}^{d_{wrd} \times |D|} }[/math] is a matrix of parameters to be learnt.
For a sequence of words [math]\displaystyle{ \,[w]_1^T }[/math], the lookup table layer produces the following output matrix:
- [math]\displaystyle{ LT_W([w]_1^T) = (\langle W\rangle_{[w]_1}^1 \; \langle W\rangle_{[w]_2}^1 \; ... \; \langle W\rangle_{[w]_T}^1) }[/math]
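A minimal sketch of this lookup table layer, assuming illustrative sizes of my own choosing (in the paper W is learned by backpropagation; here it is just randomly initialized):

```python
import numpy as np

rng = np.random.default_rng(0)
d_wrd, vocab = 50, 10_000            # illustrative sizes, not from the paper
W = rng.normal(size=(d_wrd, vocab))  # parameters W in R^{d_wrd x |D|}, learned in practice

def LT(W, word_indices):
    """Lookup-table layer: select one column of W per word index."""
    return W[:, word_indices]        # shape (d_wrd, T)

sentence = [42, 7, 1999]             # word indices into the dictionary D
X = LT(W, sentence)                  # one d_wrd-dimensional column per word
```

The whole layer is just column selection, which is why its gradient updates touch only the columns of W indexed by the current sentence.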
To add additional features, we can represent the word as K discrete features w ∈ D1 × ... × DK, where Dk is the dictionary for the kth feature. Each feature k is associated with its own lookup table LTWk(.) with parameters [math]\displaystyle{ \,W^k \in \mathbb{R}^{d_{wrd}^k \times |D^k|} }[/math], and the outputs are concatenated to form the final feature vector.
- [math]\displaystyle{ LT_{W^1,...,W^K}(w)= \left( \begin{array}{c} LT_{W^1}(w^1) \\ \vdots \\ LT_{W^K}(w^K)\end{array} \right) = \left( \begin{array}{c} \langle W^1\rangle_{w^1}^1 \\ \vdots \\ \langle W^K\rangle_{w^K}^1 \end{array} \right) }[/math]
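As a sketch of the multi-feature case, the following builds one lookup table per feature and concatenates the results for a single word. The feature names and dimensions here are hypothetical (the paper's capitalization feature motivates the "caps" example, but these exact dictionaries and sizes are my own):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical feature set: word identity plus a capitalization feature.
dims   = {"word": 50, "caps": 5}      # d_wrd^k for each feature k (my choice)
sizes  = {"word": 10_000, "caps": 4}  # |D^k| for each feature k (my choice)
tables = {k: rng.normal(size=(dims[k], sizes[k])) for k in dims}

def lookup_concat(tables, feats):
    """Concatenate the per-feature lookup vectors for one word."""
    return np.concatenate([tables[k][:, i] for k, i in feats.items()])

v = lookup_concat(tables, {"word": 42, "caps": 1})  # length 50 + 5 = 55
```

The final vector length is the sum of the per-feature sizes [math]\displaystyle{ \,\sum_k d_{wrd}^k }[/math], and each feature's parameters are learned independently.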
Windowed Approach
Sentence-Level Approach
Results
Unlabelled Data
Discussion
Task-Specific Engineering
References
R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12:2493-2537, 2011.