natural language processing (almost) from scratch.
Overview
Benchmark Tasks
Part of Speech Labelling (POS)
The goal of Part-Of-Speech labelling is to tag each word with its syntactic role (verb, plural noun, adverb, etc.). Sections 0-18 of the Wall Street Journal (WSJ) data are used for training, sections 19-21 for validation, and sections 22-24 for testing. The experimental setup is borrowed from Toutanova et al. (2003).
Chunking (CHUNK)
Named Entity Recognition (NER)
Semantic Role Labelling (SRL)
Network Design
Words are preprocessed by lower-casing and encoding capitalization as an additional feature.
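As a rough illustration of this preprocessing, here is a minimal Python sketch; the feature names below are assumptions for illustration, not the paper's exact encoding:
<pre>
# Minimal sketch of the capitalization preprocessing (illustrative only;
# the feature categories here are assumptions, not the paper's scheme).
def caps_feature(word):
    # Encode the capitalization pattern as a discrete feature value.
    if word.isupper():
        return "allcaps"
    if word[:1].isupper():
        return "initcap"
    if any(c.isupper() for c in word):
        return "hascap"
    return "nocaps"

def preprocess(words):
    # Lower-case each word and keep its capitalization as a separate feature.
    return [(w.lower(), caps_feature(w)) for w in words]

print(preprocess(["The", "IMF", "said", "iPhone"]))
# [('the', 'initcap'), ('imf', 'allcaps'), ('said', 'nocaps'), ('iphone', 'hascap')]
</pre>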
Notation and Hyper-Parameters
[math]\displaystyle{ \,f_\theta }[/math](.) denotes any neural network with parameters [math]\displaystyle{ \,\theta }[/math]. Any feed-forward neural network with L layers can be written as a composition of functions as follows: [math]\displaystyle{ \,f_\theta = f_\theta^L(f_\theta^{L-1}(...f_\theta^1(.)...)) }[/math].
[math]\displaystyle{ \,[A]_{i,j} }[/math] denotes the element of a matrix A at row i and column j.
[math]\displaystyle{ \,\langle A\rangle_i^{d_{win}} }[/math] denotes the matrix obtained by concatenating the [math]\displaystyle{ \,d_{win} }[/math] column vectors around the ith column vector of a matrix [math]\displaystyle{ \,A \in R^{d_1 \times d_2} }[/math], i.e., [math]\displaystyle{ \,[\langle A\rangle_i^{d_{win}}]^T = ([A]_{1,i-d_{win}/2} \dots [A]_{d_1,i-d_{win}/2},\ \dots,\ [A]_{1,i+d_{win}/2} \dots [A]_{d_1,i+d_{win}/2}) }[/math] (a short code sketch of this operator follows this list).
[math]\displaystyle{ \,\langle A\rangle_i^1 }[/math] is a special case which denotes the ith column vector of A.
[math]\displaystyle{ \,[x]_1^T }[/math] denotes the sequence of elements [math]\displaystyle{ \left\{ x_1, x_2, ... , x_T \right\} }[/math] and [math]\displaystyle{ [x]_i }[/math] denotes the ith element of the sequence.
[math]\displaystyle{ \,d_{wrd} }[/math] is the word vector size, given by the user.
[math]\displaystyle{ \,d_{wrd}^k }[/math] is the vector size for a feature k, given by the user.
[math]\displaystyle{ \,d_{win} }[/math] is the window size, given by the user.
[math]\displaystyle{ \,n_{hu}^l }[/math] is the number of hidden units for the lth layer, given by the user.
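The windowing operator [math]\displaystyle{ \,\langle A\rangle_i^{d_{win}} }[/math] is the least standard piece of notation above, so here is a minimal NumPy sketch of it (0-based column indexing, and it assumes the window around column i stays inside the matrix; both are simplifying assumptions):
<pre>
import numpy as np

def window(A, i, d_win):
    # <A>_i^{d_win}: the d_win columns of A centered on column i,
    # concatenated column-by-column into a single vector.
    half = d_win // 2
    return A[:, i - half : i + half + 1].T.reshape(-1)

A = np.arange(12).reshape(3, 4)   # d_1 = 3, d_2 = 4
print(window(A, 1, 3))   # columns 0, 1 and 2 concatenated: 0 4 8 1 5 9 2 6 10
print(window(A, 2, 1))   # special case <A>_2^1, just column 2: 2 6 10
</pre>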
Feature Vectors
Words are fed into the network as indices in a finite dictionary D. The first layer of the network maps each of the word indices to a feature vector.
For each word w ∈ D, the lookup table layer [math]\displaystyle{ \,LT_W(.) }[/math] gives a [math]\displaystyle{ \,d_{wrd} }[/math]-dimensional feature vector representation:
- [math]\displaystyle{ LT_W(w) = \langle W\rangle_w^1 }[/math]
Where [math]\displaystyle{ \,W \in R^{d_{wrd} \times |D|} }[/math] is a matrix of parameters to be learnt.
For a sequence of words [math]\displaystyle{ \,[x]_1^T }[/math], the lookup table layer produces the following output matrix:
- [math]\displaystyle{ LT_W([w]_1^T) = (\langle W\rangle_{[w]_1}^1\ \langle W\rangle_{[w]_2}^1\ \dots\ \langle W\rangle_{[w]_T}^1) }[/math]
To add additional features, we can represent the word as K discrete features [math]\displaystyle{ \,w \in D^1 \times \dots \times D^K }[/math], where [math]\displaystyle{ \,D^k }[/math] is the dictionary for the kth feature. Each feature is associated with a lookup table [math]\displaystyle{ \,LT_{W^k}(.) }[/math] with parameters [math]\displaystyle{ \,W^k \in R^{d_{wrd}^k \times |D^k|} }[/math], and the outputs are concatenated to form the final feature vector.
- [math]\displaystyle{ LT_{W^1,...,W^K}(w)= \left( \begin{array}{c} LT_{W^1}(w_1) \\ \vdots \\ LT_{W^K}(w_K)\end{array} \right) = \left( \begin{array}{c} \langle W^1\rangle_{w_1}^1 \\ \vdots \\ \langle W^K\rangle_{w_K}^1 \end{array} \right) }[/math]
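A minimal NumPy sketch of this lookup-table layer with K = 2 features; the dictionary sizes and vector sizes below are arbitrary assumptions:
<pre>
import numpy as np

rng = np.random.default_rng(0)

# One lookup table W^k per discrete feature (sizes are arbitrary here):
W_word = rng.standard_normal((50, 10000))  # word feature, d_wrd^1 = 50
W_caps = rng.standard_normal((5, 4))       # capitalization feature, d_wrd^2 = 5

def lookup(W, w):
    # LT_W(w) = <W>_w^1, i.e. the w-th column of W.
    return W[:, w]

def lookup_concat(tables, features):
    # Concatenate the per-feature vectors into one feature vector.
    return np.concatenate([lookup(W, w) for W, w in zip(tables, features)])

vec = lookup_concat((W_word, W_caps), (42, 1))
print(vec.shape)   # (55,) = 50 + 5
</pre>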
Windowed Approach
This approach assumes the tag for a word depends mostly on its surrounding words. Given a word to be tagged, we feed a window of [math]\displaystyle{ \,d_{win} }[/math] words around the word of interest to the lookup table layer, which outputs a matrix of size [math]\displaystyle{ \,d_{wrd} \times d_{win} }[/math]. Finally, the columns of this output matrix are concatenated to use as input to the next layer.
- [math]\displaystyle{ f_\theta^1 = \langle LT_W([w]_1^T) \rangle _t ^{d_{win}} = \left( \begin{array}{c} \langle W\rangle_{[w]_{t-d_{win}/2}}^1 \\ \vdots \\ \langle W\rangle_{[w]_{t}}^1 \\ \vdots \\ \langle W\rangle_{[w]_{t+d_{win}/2}}^1 \end{array} \right) }[/math]
The output from the first layer is fed to one or more linear layers:
- [math]\displaystyle{ f_\theta^l = W^l f_\theta^{l-1} + b^l }[/math]
Where [math]\displaystyle{ \,W^l \in R^{n_{hu}^l \times n_{hu}^{l-1}} }[/math] and [math]\displaystyle{ \,b^l \in R^{n_{hu}^l} }[/math] are parameters to be trained.
Linear layers may be interleaved with layers which apply a hard hyperbolic tangent function over their inputs:
- [math]\displaystyle{ \left[f_\theta^l\right]_i = HardTanh(\left[f_\theta^{l-1} \right]_i) }[/math]
Where HardTanh is defined as:
- [math]\displaystyle{ HardTanh(x) = \left\{ \begin{array}{ll} -1 & \text{if } x \lt -1 \\ x & \text{if } -1 \leq x \leq 1 \\ 1 & \text{if } x \gt 1 \end{array} \right. }[/math]
Finally, the output size of the last layer is the number of tags for a particular task, and each output entry can be interpreted as a score for that tag.
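Putting the pieces together, here is a minimal NumPy sketch of a forward pass through the windowed network for one word position; all sizes, the initialization, and the assumption of a pre-padded sentence are illustrative choices rather than the paper's settings:
<pre>
import numpy as np

rng = np.random.default_rng(0)

d_wrd, d_win, n_hu, n_tags = 50, 5, 300, 45
vocab = 10000

W_emb = rng.standard_normal((d_wrd, vocab)) * 0.01      # lookup table W
W1 = rng.standard_normal((n_hu, d_wrd * d_win)) * 0.01  # first linear layer
b1 = np.zeros(n_hu)
W2 = rng.standard_normal((n_tags, n_hu)) * 0.01         # scoring layer
b2 = np.zeros(n_tags)

def hard_tanh(x):
    # HardTanh: clip to [-1, 1].
    return np.clip(x, -1.0, 1.0)

def tag_scores(word_indices, t):
    # Scores for every tag at position t; word_indices is assumed to be
    # padded so that the window around t stays inside the sequence.
    half = d_win // 2
    win = word_indices[t - half : t + half + 1]
    x = W_emb[:, win].T.reshape(-1)     # f^1: concatenated word vectors
    h = hard_tanh(W1 @ x + b1)          # linear layer + HardTanh
    return W2 @ h + b2                  # one score per tag

sentence = rng.integers(0, vocab, size=9)
print(tag_scores(sentence, t=4).shape)  # (45,), one entry per tag
</pre>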
Sentence-Level Approach
The windowed approach described above works for tasks where only information about the surrounding words is needed, but not for tasks like SRL, where the correct tag for a word may depend on the whole sentence. A convolutional approach is taken to solve this problem.
The convolutional network produces features for each word in the sentence and for each verb in the sentence. For each word i, two extra features are added which encode the distance between i and the word to tag ([math]\displaystyle{ \,i - pos_w }[/math]) and between i and the verb being considered ([math]\displaystyle{ \,i - pos_v }[/math]).
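Here is a rough NumPy sketch of such a convolutional layer with the two relative-position features; the max-over-time pooling that reduces the variable-length output to a fixed-size vector follows the original paper (Collobert et al., 2011), while all sizes and the clipping of distances into a fixed lookup range are illustrative assumptions:
<pre>
import numpy as np

rng = np.random.default_rng(0)

d_wrd, d_pos, d_win, n_hu = 50, 5, 3, 100
vocab, max_dist = 10000, 20

W_emb = rng.standard_normal((d_wrd, vocab)) * 0.01
W_dist = rng.standard_normal((d_pos, 2 * max_dist + 1)) * 0.01  # distance lookup table
d_in = d_wrd + 2 * d_pos          # word vector plus the two distance features
W_conv = rng.standard_normal((n_hu, d_in * d_win)) * 0.01
b_conv = np.zeros(n_hu)

def dist_vec(i, pos):
    # Embed the relative distance i - pos, clipped to the table's range.
    d = int(np.clip(i - pos, -max_dist, max_dist))
    return W_dist[:, d + max_dist]

def sentence_matrix(word_indices, pos_w, pos_v):
    # One column per word: word vector, distance to the word being
    # tagged (i - pos_w), and distance to the verb (i - pos_v).
    cols = [np.concatenate([W_emb[:, w], dist_vec(i, pos_w), dist_vec(i, pos_v)])
            for i, w in enumerate(word_indices)]
    return np.stack(cols, axis=1)    # shape d_in x T

def conv_max(word_indices, pos_w, pos_v):
    X = sentence_matrix(word_indices, pos_w, pos_v)
    half = d_win // 2
    # Convolution: the same linear map applied to every window of columns.
    outs = [W_conv @ X[:, t - half : t + half + 1].T.reshape(-1) + b_conv
            for t in range(half, X.shape[1] - half)]
    # Max over time (as in Collobert et al., 2011): a fixed-size vector.
    return np.max(np.stack(outs, axis=0), axis=0)

sentence = rng.integers(0, vocab, size=12)
print(conv_max(sentence, pos_w=3, pos_v=7).shape)   # (100,)
</pre>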
Results
Unlabelled Data
Discussion
Task-Specific Engineering
References
R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12:2493-2537, 2011.