
Overview

Benchmark Tasks

Part of Speech Labelling (POS)

Chunking (CHUNK)

Named Entity Recognition (NER)

Semantic Role Labelling (SRL)

Network Design

Words are preprocessed by lower-casing and encoding capitalization as an additional feature.
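The preprocessing step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the four capitalization classes and the names `caps_feature` and `preprocess` are assumptions made for the example, since the text above only says that capitalization is encoded as an additional feature.

```python
def caps_feature(word):
    """Return a coarse capitalization class for a raw token.

    The label set used here is hypothetical; it only illustrates the idea
    of keeping case information after lower-casing.
    """
    if word.isupper():
        return "all_caps"
    if word[:1].isupper():
        return "initial_cap"
    if any(ch.isupper() for ch in word):
        return "mixed_case"
    return "lower"


def preprocess(tokens):
    """Map raw tokens to (lower-cased word, capitalization class) pairs."""
    return [(tok.lower(), caps_feature(tok)) for tok in tokens]


pairs = preprocess(["The", "NASA", "probe", "iPhone"])
```

Each pair then supplies two discrete features per word: the lower-cased form, which is looked up in the word dictionary, and the capitalization class, which is looked up in its own small dictionary.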

Notation and Hyper-Parameters

[math]\displaystyle{ \,f_\theta }[/math](.) denotes any neural network with parameters [math]\displaystyle{ \,\theta }[/math]. Any feed-forward neural network with L layers can be written as a composition of functions as follows: [math]\displaystyle{ \,f_\theta = f_\theta^L(f_\theta^{L-1}(...f_\theta^1(.)...)) }[/math].
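As a small illustration of this composition, the sketch below chains a list of layer functions in order. The helper `compose` and the toy scalar layers are hypothetical and stand in for real parameterised layers.

```python
from functools import reduce


def compose(layers):
    """Build f = f^L(f^{L-1}(...f^1(.)...)) from layers given as [f^1, ..., f^L]."""
    return lambda x: reduce(lambda value, layer: layer(value), layers, x)


# Toy scalar "layers", purely illustrative.
f1 = lambda x: x + 1   # f^1
f2 = lambda x: 2 * x   # f^2
f3 = lambda x: x - 3   # f^3

f = compose([f1, f2, f3])
result = f(4)  # evaluates f3(f2(f1(4)))
```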

[math]\displaystyle{ \,[A]_{i,j} }[/math] denotes the element of a matrix A at row i and column j.

[math]\displaystyle{ \,\langle A\rangle_i^{d_{win}} }[/math] denotes the vector obtained by concatenating the d_win column vectors around the i^th column vector of a matrix [math]\displaystyle{ \,A \in \mathbb{R}^{d_1 \times d_2} }[/math], i.e., [math]\displaystyle{ \,[\langle A\rangle_i^{d_{win}}]^T = ([A]_{1,i-d_{win}/2} \ldots [A]_{d_1,i-d_{win}/2}, \ldots, [A]_{1,i+d_{win}/2} \ldots [A]_{d_1,i+d_{win}/2}) }[/math].

[math]\displaystyle{ \,\langle A\rangle_i^1 }[/math] is a special case which denotes the i^th column vector of A.
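The window notation can be made concrete with a short sketch. It assumes a matrix stored as a list of rows, 0-based column indices, an odd d_win (so that d_win/2 is read as integer division), and a window that lies fully inside the matrix; the function names are hypothetical.

```python
def column(A, j):
    """Return column j of A, where A is a list of d1 rows each of length d2."""
    return [row[j] for row in A]


def window(A, i, d_win):
    """Concatenate the d_win columns of A centred on column i (0-based).

    Assumes d_win is odd and that the window lies fully inside A.
    """
    half = d_win // 2
    d2 = len(A[0])
    if d_win % 2 == 0 or i - half < 0 or i + half >= d2:
        raise ValueError("window must be odd-sized and lie inside the matrix")
    out = []
    for j in range(i - half, i + half + 1):
        out.extend(column(A, j))
    return out


A = [[1, 2, 3, 4],
     [5, 6, 7, 8]]        # d1 = 2, d2 = 4
v = window(A, 1, 3)       # columns 0, 1, 2 stacked: [1, 5, 2, 6, 3, 7]
single = window(A, 2, 1)  # special case d_win = 1: column 2 only, [3, 7]
```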

[math]\displaystyle{ \,[x]_1^T }[/math] denotes the sequence of elements [math]\displaystyle{ \left\{ x_1, x_2, ... , x_T \right\} }[/math] and [math]\displaystyle{ [x]_i }[/math] denotes the i^th element of the sequence.

[math]\displaystyle{ \,d_{wrd} }[/math] is the word vector size, given by the user.

[math]\displaystyle{ \,d_{wrd}^k }[/math] is the vector size for a feature k, given by the user.

[math]\displaystyle{ \,k_{sz} }[/math] is the window size, given by the user.

[math]\displaystyle{ \,n_{hu}^l }[/math] is the number of hidden units for the l^th layer, given by the user.

Feature Vectors

Words are fed into the network as indices in a finite dictionary D. The first layer of the network maps each of the word indices to a feature vector.

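For each word w ∈ D, the lookup table layer LT_W(.) gives a d_wrd-dimensional feature vector representation:

[math]\displaystyle{ LT_W(w) = \langle W\rangle_w^1 }[/math]

where [math]\displaystyle{ \,W \in \mathbb{R}^{d_{wrd} \times |D|} }[/math] is a matrix of parameters to be learnt. The feature vector for word w is simply the w^th column of W.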

For a sequence of words [math]\displaystyle{ \,[w]_1^T }[/math], the lookup table layer produces the following output matrix:

[math]\displaystyle{ LT_W([w]_1^T) = \left( \langle W\rangle_{[w]_1}^1 \;\; \langle W\rangle_{[w]_2}^1 \;\; \ldots \;\; \langle W\rangle_{[w]_T}^1 \right) }[/math]
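A minimal sketch of the lookup table layer, under stated assumptions: W is stored as a list of d_wrd rows of length |D|, word indices are 0-based, and the parameters are randomly initialised rather than learnt. The toy dictionary and all function names are hypothetical.

```python
import random


def make_lookup_table(d_wrd, vocab_size, seed=0):
    """Random d_wrd x |D| parameter matrix W, stored as a list of rows."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(vocab_size)]
            for _ in range(d_wrd)]


def lookup(W, w):
    """LT_W(w): the column of W selected by word index w."""
    return [row[w] for row in W]


def lookup_sequence(W, words):
    """LT_W([w]_1^T): a d_wrd x T matrix whose t-th column is the vector of word t."""
    return [[row[w] for w in words] for row in W]


dictionary = {"the": 0, "cat": 1, "sat": 2}  # toy dictionary D
W = make_lookup_table(d_wrd=4, vocab_size=len(dictionary))
indices = [dictionary[t] for t in ["the", "cat", "sat", "the"]]
M = lookup_sequence(W, indices)  # 4 rows (d_wrd) by 4 columns (T)
```

Because the layer only selects columns, repeated words in the sentence (here "the") map to identical columns of the output matrix.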

To add additional features, we can represent each word as K discrete features w ∈ D^1×...×D^K, where D^k is the dictionary for the k^th feature. Each feature is associated with a lookup table LT_{W^k}(.) with parameters [math]\displaystyle{ \,W^k \in \mathbb{R}^{d_{wrd}^k \times |D^k|} }[/math], and the outputs are concatenated to form the final feature vector:

[math]\displaystyle{ LT_{W^1,\ldots,W^K}(w) = \left( \begin{array}{c} LT_{W^1}(w_1) \\ \vdots \\ LT_{W^K}(w_K)\end{array} \right) = \left( \begin{array}{c} \langle W^1\rangle_{w_1}^1 \\ \vdots \\ \langle W^K\rangle_{w_K}^1 \end{array} \right) }[/math]
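The concatenation of several feature lookups can be sketched as below. The two tables (a word table and a capitalization table) and their values are made up for illustration; each table is a list of rows with 0-based feature indices.

```python
def lookup(W, w):
    """Column w of the table W (a list of rows)."""
    return [row[w] for row in W]


def lookup_features(tables, w):
    """Concatenate the lookups of the K feature indices in w, one table per feature."""
    if len(tables) != len(w):
        raise ValueError("need exactly one table per feature")
    out = []
    for W_k, w_k in zip(tables, w):
        out.extend(lookup(W_k, w_k))
    return out


# Hypothetical tables: word vectors of size 3 over 2 words,
# capitalization vectors of size 2 over 3 classes.
W_word = [[0.1, 0.2],
          [0.3, 0.4],
          [0.5, 0.6]]
W_caps = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]

vec = lookup_features([W_word, W_caps], (1, 2))  # [0.2, 0.4, 0.6, 3.0, 6.0]
```

The resulting vector has length equal to the sum of the per-feature vector sizes, here 3 + 2 = 5.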

Windowed Approach

This approach assumes the tag for a word depends mostly on its surrounding words. Given a word to be tagged, we feed a window of k_sz words centred on it to the lookup table layer, which outputs a matrix of size d_wrd×k_sz. The columns of this matrix are then concatenated into a single vector of length d_wrd·k_sz, which is the input to the next layer (the window size k_sz plays the role of d_win in the formula).

[math]\displaystyle{ f_\theta^1 = \langle LT_W([w]_1^T) \rangle _t ^{d_{win}} = \left( \begin{array}{c} \langle W\rangle_{[w]_{t-d_{win}/2}}^1 \\ \vdots \\ \langle W\rangle_{[w]_{t}}^1 \\ \vdots \\ \langle W\rangle_{[w]_{t+d_{win}/2}}^1 \end{array} \right) }[/math]
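The first layer of the windowed approach can be sketched as follows. The handling of sentence boundaries is an assumption of this example: positions outside the sentence are mapped to a dedicated padding index `pad_index`, which is not described in the text above. The table values and function name are hypothetical.

```python
def window_input(W, words, t, d_win, pad_index):
    """Input vector for position t: lookups of the d_win words centred on t, concatenated.

    Positions outside the sentence use pad_index (an assumed padding entry in D).
    """
    if d_win % 2 == 0:
        raise ValueError("d_win is assumed to be odd")
    half = d_win // 2
    out = []
    for pos in range(t - half, t + half + 1):
        w = words[pos] if 0 <= pos < len(words) else pad_index
        out.extend(row[w] for row in W)  # column w of W
    return out


# Toy table with d_wrd = 2 and |D| = 4; index 3 is reserved for padding.
W = [[0.0, 1.0, 2.0, 9.0],
     [0.5, 1.5, 2.5, 9.5]]
sentence = [0, 1, 2]

x_first = window_input(W, sentence, t=0, d_win=3, pad_index=3)
x_mid = window_input(W, sentence, t=1, d_win=3, pad_index=3)
```

Every position yields a vector of the same fixed length d_wrd·d_win (6 here), which is what allows a standard fully connected layer to follow.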

Sentence-Level Approach

Results

Unlabelled Data

Discussion

Task-Specific Engineering

References

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12:2493-2537, 2011.
