stat441w18/A New Method of Region Embedding for Text Classification

From statwiki
Revision as of 22:27, 7 March 2018 by K285li (talk | contribs)
Jump to navigation Jump to search

Method

This paper focuses on representing small text regions which can preserve local internal structural information for specific text classification. It defines [math]\displaystyle{ region\left ( i,c\right ) }[/math] as the [math]\displaystyle{ 2\times c+1 }[/math] length region with middle word [math]\displaystyle{ \omega_i }[/math] which is the i-th word of the document. And then it uses word embeddings and the local context units to produce region embedding. In the following, we first introduce local context unit, then two architectures to generate the region embedding, and how to classify text.

Local context unit

The vocabulary is represented by a matrix [math]\displaystyle{ \mathbf{E}\in \mathbb{R}^{h \times v} }[/math] with a look up layer, denoted by the embedding [math]\displaystyle{ e_\omega }[/math]. The i-th column represents the embedding of [math]\displaystyle{ \omega_i }[/math], denoted by [math]\displaystyle{ \mathbf{e}_{\omega_i} }[/math].

For each word [math]\displaystyle{ \omega_i }[/math], we define the local context unit [math]\displaystyle{ \mathbf{K}_{\omega_i}\in \mathbb{R}^{h\times\left (2c+1\right )} }[/math]. Let [math]\displaystyle{ \mathbf{K}_{\omega_i,t} }[/math] be the (c+t)-th column in [math]\displaystyle{ \mathbf{K}_{\omega_i} \left (t \in \left [ -c,c \right ] \right ) }[/math], representing a distinctive linear projection function on [math]\displaystyle{ \mathbf{e}_{c+t} }[/math] in the local context [math]\displaystyle{ r\left (i,c\right ) }[/math]. Thus, we can utilize local ordered word information in terms of each word.

Define [math]\displaystyle{ \mathbf{p}_{\omega_i+t}^i }[/math] as the projected word embedding of [math]\displaystyle{ \omega_i+t }[/math] in i-th word’s view, computed by: [math]\displaystyle{ \mathbf{p}_{\omega_i+t}^i = \mathbf{K}_{\omega_i,t} \odot \mathbf{\omega_{i+t}} }[/math] where [math]\displaystyle{ \odot }[/math] denotes an element-wise multiplication.

Note local context units and embedding are learned as model parameters. Local context units can be learned to capture the semantic and syntactic influence of each word to its context.

Word-context region embedding

We proposed two architectures to perform the region embedding from different perspectives. The first one is word-context region embedding, and the second one is context-word region embedding. In this paper, we consider middle words of the regions, hence we can compose the semantics of a give region only by the middle words influences on the context words, or the context words’ influences on the middle word. The most important thing from these two different perspectives in this model is the interaction between words.

In the first proposed architecture, we focus on addressing the middle word's influences on the context words. For example, in the sentence The food is not very good in this hotel, the occurrence of word not might bring a semantic reversal to the local region.

The figure below is the architecture in a word-context view.

From the bottom to the top, the architecture is,

Step1: Find the middle word [math]\displaystyle{ w_{i} }[/math] and choose t from -c to c.

Step2: We use the local context unit of the middle word and the original word embeddings in a region and do element-wise multiply to perform the projected embeddings in a word-to-context view,

[math]\displaystyle{ \mathbf{p}_{w_{i+t}}^i = \mathbf{K}_{w_i,t}\cdot\mathbf{e}_{w_{i+t}} }[/math]

where the projected embeddings can reflect the middle word’s influences on the context words.

Step3: Once the projected embeddings are obtained, a max pooling operation is applied to extract the most predictive features in the region. The output of the max pooling operation can be regarded as a task related region embedding in a word-to-context view.

f1

context - Word region embedding

The second architecture goes as a different view, which addresses the local context words’ influences on the middle word in the region, and we call this Context-Word region embedding.

The difference is the projected embeddings are computed by the original word embedding of the middle word and the context units of all words in the region, then the Context-Word region embedding can be obtained by a max pooling operation through the column dimension of the projected embedding matrix:

f2

Two models take different ways to produce the projected word embeddings, the Word-Context model uses context units of middle words and word embeddings of context words, while the Context-Word model uses context units of context words and word embeddings of the middle word.