Natural Language Processing (Almost) from Scratch



Overview

Benchmark Tasks

Part of Speech Labelling (POS)

Chunking (CHUNK)

Named Entity Recognition (NER)

Semantic Role Labelling (SRL)

Network Design

Words are preprocessed by lower-casing and encoding capitalization as an additional feature.

Notation and Hyper-Parameters

[math]\displaystyle{ \,f_\theta(\cdot) }[/math] denotes any neural network with parameters [math]\displaystyle{ \,\theta }[/math]. Any feed-forward neural network with [math]\displaystyle{ \,L }[/math] layers can be written as a composition of functions: [math]\displaystyle{ \,f_\theta(\cdot) = f_\theta^L(f_\theta^{L-1}(\dots f_\theta^1(\cdot)\dots)) }[/math].
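As a rough sketch of this layer-by-layer composition (illustrative only, not the authors' code; the two layers and their sizes are made-up assumptions):

<pre>
import numpy as np

# Compose per-layer functions f^1, ..., f^L into a single network f_theta,
# applying f^1 first and f^L last.
def compose(layers):
    def f_theta(x):
        for layer in layers:
            x = layer(x)
        return x
    return f_theta

# Hypothetical two-layer network: an affine map followed by a tanh squashing.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((5, 3)), np.zeros(5)
f = compose([lambda x: W1 @ x + b1, np.tanh])
print(f(np.ones(3)).shape)  # (5,)
</pre>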

[math]\displaystyle{ \,[A]_{i,j} }[/math] denotes the element of a matrix A at row i and column j.

[math]\displaystyle{ \,\langle A\rangle_i^{d_{win}} }[/math] denotes the vector obtained by concatenating the [math]\displaystyle{ \,d_{win} }[/math] column vectors of a matrix [math]\displaystyle{ \,A \in \mathbb{R}^{d_1 \times d_2} }[/math] centred on the ith column. i.e., [math]\displaystyle{ \,[\langle A\rangle_i^{d_{win}}]^T = ([A]_{1,i-d_{win}/2} \dots [A]_{d_1,i-d_{win}/2}, \dots, [A]_{1,i+d_{win}/2} \dots [A]_{d_1,i+d_{win}/2}) }[/math].

[math]\displaystyle{ \,\langle A\rangle_i^1 }[/math] is a special case which denotes the ith column vector of A.
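A minimal sketch of this window operator (illustrative assumptions: NumPy, and clamping indices at the matrix borders, whereas the paper pads sentences with a special token):

<pre>
import numpy as np

# Stack the d_win column vectors of A centred on column i into one
# (d_1 * d_win)-dimensional vector, i.e. <A>_i^{d_win}.
def window(A, i, d_win):
    d1, d2 = A.shape
    half = d_win // 2
    cols = []
    for j in range(i - half, i + half + 1):
        j = min(max(j, 0), d2 - 1)   # clamp at the borders (assumption)
        cols.append(A[:, j])
    return np.concatenate(cols)

A = np.arange(12).reshape(3, 4)      # d_1 = 3, d_2 = 4
print(window(A, 1, 3))               # columns 0, 1, 2 stacked
print(window(A, 2, 1))               # <A>_i^1: just column i
</pre>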

[math]\displaystyle{ \,[x]_1^T }[/math] denotes the sequence of elements [math]\displaystyle{ \left\{ x_1, x_2, ... , x_T \right\} }[/math] and [math]\displaystyle{ [x]_i }[/math] denotes the ith element of the sequence.

[math]\displaystyle{ \,d_{wrd} }[/math] is the word vector size, given by the user.

[math]\displaystyle{ \,d_{wrd}^k }[/math] is the vector size for a feature k, given by the user.

Feature Vectors

Words are fed into the network as indices in a finite dictionary D. The first layer of the network maps each of the word indices to a feature vector.

For each word [math]\displaystyle{ \,w \in D }[/math], the lookup table layer [math]\displaystyle{ \,LT_W(\cdot) }[/math] gives a [math]\displaystyle{ \,d_{wrd} }[/math]-dimensional feature vector representation:

[math]\displaystyle{ LT_W(w) = \langle W\rangle_w^1 }[/math]

where [math]\displaystyle{ \,W \in \mathbb{R}^{d_{wrd} \times |D|} }[/math] is a matrix of parameters to be learnt.

For a sequence of words [math]\displaystyle{ \,[w]_1^T }[/math], the lookup table layer produces the following output matrix:

[math]\displaystyle{ LT_W([w]_1^T) = \left( \langle W\rangle_{[w]_1}^1 \; \langle W\rangle_{[w]_2}^1 \; \dots \; \langle W\rangle_{[w]_T}^1 \right) }[/math]
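A minimal sketch of the lookup table layer (illustrative only; the toy dictionary, the choice [math]\displaystyle{ \,d_{wrd} = 4 }[/math], and the random initialization are assumptions):

<pre>
import numpy as np

D = {"the": 0, "cat": 1, "sat": 2}        # toy word -> index dictionary
d_wrd = 4
rng = np.random.default_rng(0)
W = rng.standard_normal((d_wrd, len(D)))  # parameters to be learnt

def lt(w):
    return W[:, w]                        # <W>_w^1: the w-th column of W

def lt_seq(word_indices):
    return W[:, word_indices]             # d_wrd x T output matrix

sentence = [D["the"], D["cat"], D["sat"]]
print(lt(D["cat"]).shape)                 # (4,)
print(lt_seq(sentence).shape)             # (4, 3)
</pre>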

To add additional features, we can represent a word as [math]\displaystyle{ \,K }[/math] discrete features [math]\displaystyle{ \,w \in D^1 \times \dots \times D^K }[/math], where [math]\displaystyle{ \,D^k }[/math] is the dictionary for the kth feature. Each feature is associated with its own lookup table [math]\displaystyle{ \,LT_{W^k}(\cdot) }[/math] with parameters [math]\displaystyle{ \,W^k \in \mathbb{R}^{d_{wrd}^k \times |D^k|} }[/math], and the outputs are concatenated to form the final feature vector.

[math]\displaystyle{ LT_{W^1,...,W^K}(w)= \left( \begin{array}{c} LT_{W^1}(w_1) \\ \vdots \\ LT_{W^K}(w_K) \end{array} \right) = \left( \begin{array}{c} \langle W^1\rangle_{w_1}^1 \\ \vdots \\ \langle W^K\rangle_{w_K}^1 \end{array} \right) }[/math]
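A minimal sketch of the concatenated lookup tables (the two features, e.g. word identity plus a capitalization flag, and their sizes are illustrative assumptions):

<pre>
import numpy as np

rng = np.random.default_rng(0)
feature_dims = [4, 2]                 # d_wrd^1, d_wrd^2 (assumed sizes)
dict_sizes = [3, 2]                   # |D^1|, |D^2| (assumed sizes)
Ws = [rng.standard_normal((d, n)) for d, n in zip(feature_dims, dict_sizes)]

def lt_multi(feature_indices):
    # feature_indices = (w_1, ..., w_K), one index per feature dictionary;
    # each feature's column is looked up and the results are concatenated.
    return np.concatenate([W[:, w] for W, w in zip(Ws, feature_indices)])

print(lt_multi((1, 0)).shape)         # (6,) = d_wrd^1 + d_wrd^2
</pre>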

Windowed Approach

Sentence-Level Approach

Results

Unlabelled Data

Discussion

Task-Specific Engineering

References

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12:2493-2537, 2011.