Natural Language Processing (Almost) from Scratch



Overview

Benchmark Tasks

Part of Speech Labelling (POS)

Chunking (CHUNK)

Named Entity Recognition (NER)

Semantic Role Labelling (SRL)

Network Design

Words are preprocessed by lower-casing and encoding capitalization as an additional feature.

Notation and Hyper-Parameters

[math]\displaystyle{ \,f_\theta(\cdot) }[/math] denotes any neural network with parameters [math]\displaystyle{ \,\theta }[/math]. Any feed-forward neural network with [math]\displaystyle{ \,L }[/math] layers can be written as a composition of functions: [math]\displaystyle{ \,f_\theta(\cdot) = f_\theta^L(f_\theta^{L-1}(\dots f_\theta^1(\cdot)\dots)) }[/math].
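As a rough sketch of this layer-by-layer composition (illustrative only, not the authors' code; the two layers and their sizes are made-up assumptions):

<pre>
import numpy as np

# Compose per-layer functions f^1, ..., f^L into a single network f_theta,
# applying f^1 first and f^L last.
def compose(layers):
    def f_theta(x):
        for layer in layers:
            x = layer(x)
        return x
    return f_theta

# Hypothetical two-layer network: an affine map followed by a tanh squashing.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((5, 3)), np.zeros(5)
f = compose([lambda x: W1 @ x + b1, np.tanh])
print(f(np.ones(3)).shape)  # (5,)
</pre>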

[math]\displaystyle{ \,[A]_{i,j} }[/math] denotes the element of a matrix A at row i and column j.

[math]\displaystyle{ \,\langle A\rangle_i^{d_{win}} }[/math] denotes the vector obtained by concatenating the [math]\displaystyle{ \,d_{win} }[/math] column vectors of a matrix [math]\displaystyle{ \,A \in \mathbb{R}^{d_1 \times d_2} }[/math] centred on the ith column. i.e., [math]\displaystyle{ \,[\langle A\rangle_i^{d_{win}}]^T = ([A]_{1,i-d_{win}/2} \dots [A]_{d_1,i-d_{win}/2}, \dots, [A]_{1,i+d_{win}/2} \dots [A]_{d_1,i+d_{win}/2}) }[/math].

[math]\displaystyle{ \,\langle A\rangle_i^1 }[/math] is a special case which denotes the ith column vector of A.
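A minimal sketch of this window operator (illustrative assumptions: NumPy, and clamping indices at the matrix borders, whereas the paper pads sentences with a special token):

<pre>
import numpy as np

# Stack the d_win column vectors of A centred on column i into one
# (d_1 * d_win)-dimensional vector, i.e. <A>_i^{d_win}.
def window(A, i, d_win):
    d1, d2 = A.shape
    half = d_win // 2
    cols = []
    for j in range(i - half, i + half + 1):
        j = min(max(j, 0), d2 - 1)   # clamp at the borders (assumption)
        cols.append(A[:, j])
    return np.concatenate(cols)

A = np.arange(12).reshape(3, 4)      # d_1 = 3, d_2 = 4
print(window(A, 1, 3))               # columns 0, 1, 2 stacked
print(window(A, 2, 1))               # <A>_i^1: just column i
</pre>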

[math]\displaystyle{ \,[x]_1^T }[/math] denotes the sequence of elements [math]\displaystyle{ \left\{ x_1, x_2, ... , x_T \right\} }[/math] and [math]\displaystyle{ [x]_i }[/math] denotes the ith element of the sequence.

[math]\displaystyle{ \,d_{wrd} }[/math] is the word vector size, given by the user.

[math]\displaystyle{ \,d_{wrd}^k }[/math] is the vector size for a feature k, given by the user.

Feature Vectors

Words are fed into the network as indices in a finite dictionary D. The first layer of the network maps each of the word indices to a feature vector.

For each word [math]\displaystyle{ \,w \in D }[/math], the lookup table layer [math]\displaystyle{ \,LT_W(\cdot) }[/math] gives a [math]\displaystyle{ \,d_{wrd} }[/math]-dimensional feature vector representation:

[math]\displaystyle{ LT_W(w) = \langle W\rangle_w^1 }[/math]

where [math]\displaystyle{ \,W \in \mathbb{R}^{d_{wrd} \times |D|} }[/math] is a matrix of parameters to be learnt.

For a sequence of words [math]\displaystyle{ \,[w]_1^T }[/math], the lookup table layer produces the following output matrix:

[math]\displaystyle{ LT_W([w]_1^T) = \left( \langle W\rangle_{[w]_1}^1 \; \langle W\rangle_{[w]_2}^1 \; \dots \; \langle W\rangle_{[w]_T}^1 \right) }[/math]
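A minimal sketch of the lookup table layer (illustrative only; the toy dictionary, the choice [math]\displaystyle{ \,d_{wrd} = 4 }[/math], and the random initialization are assumptions):

<pre>
import numpy as np

D = {"the": 0, "cat": 1, "sat": 2}        # toy word -> index dictionary
d_wrd = 4
rng = np.random.default_rng(0)
W = rng.standard_normal((d_wrd, len(D)))  # parameters to be learnt

def lt(w):
    return W[:, w]                        # <W>_w^1: the w-th column of W

def lt_seq(word_indices):
    return W[:, word_indices]             # d_wrd x T output matrix

sentence = [D["the"], D["cat"], D["sat"]]
print(lt(D["cat"]).shape)                 # (4,)
print(lt_seq(sentence).shape)             # (4, 3)
</pre>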

To add additional features, we can represent a word as [math]\displaystyle{ \,K }[/math] discrete features [math]\displaystyle{ \,w \in D^1 \times \dots \times D^K }[/math], where [math]\displaystyle{ \,D^k }[/math] is the dictionary for the kth feature. Each feature is associated with its own lookup table [math]\displaystyle{ \,LT_{W^k}(\cdot) }[/math] with parameters [math]\displaystyle{ \,W^k \in \mathbb{R}^{d_{wrd}^k \times |D^k|} }[/math], and the outputs are concatenated to form the final feature vector.

[math]\displaystyle{ LT_{W^1,...,W^K}(w)= \left( \begin{array}{c} LT_{W^1}(w_1) \\ \vdots \\ LT_{W^K}(w_K) \end{array} \right) = \left( \begin{array}{c} \langle W^1\rangle_{w_1}^1 \\ \vdots \\ \langle W^K\rangle_{w_K}^1 \end{array} \right) }[/math]
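A minimal sketch of the concatenated lookup tables (the two features, e.g. word identity plus a capitalization flag, and their sizes are illustrative assumptions):

<pre>
import numpy as np

rng = np.random.default_rng(0)
feature_dims = [4, 2]                 # d_wrd^1, d_wrd^2 (assumed sizes)
dict_sizes = [3, 2]                   # |D^1|, |D^2| (assumed sizes)
Ws = [rng.standard_normal((d, n)) for d, n in zip(feature_dims, dict_sizes)]

def lt_multi(feature_indices):
    # feature_indices = (w_1, ..., w_K), one index per feature dictionary;
    # each feature's column is looked up and the results are concatenated.
    return np.concatenate([W[:, w] for W, w in zip(Ws, feature_indices)])

print(lt_multi((1, 0)).shape)         # (6,) = d_wrd^1 + d_wrd^2
</pre>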

Windowed Approach

Sentence-Level Approach

Results

Unlabelled Data

Discussion

Task-Specific Engineering

References

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12:2493-2537, 2011.