Semantic Relation Classification via Convolutional Neural Network

From statwiki
Revision as of 15:02, 22 November 2020 by R6gong (talk | contribs)

After all words in the sentence are featurized, a sentence of length [math] N [/math] can be expressed as a sequence of length [math] N [/math], $$e=[e_{1},e_{2},\ldots,e_{N}]$$ where each entry [math] e_{i} [/math] is the feature vector of one token. To apply the convolutional neural network, each window of [math] k [/math] consecutive features $$e_{i:i+k-1}=[e_{i},e_{i+1},\ldots,e_{i+k-1}]$$ is combined with a weight matrix [math] W\in\mathbb{R}^{(d^{w}+2d^{wp})\times k}[/math] to produce a new feature, defined as $$c_{i}=\tanh(W\cdot e_{i:i+k-1}+bias)$$ This process is applied to every window of length [math] k [/math], starting from the first one, which yields the mapped feature vector $$c=[c_{1},c_{2},\ldots,c_{N-k+1}]$$
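The windowed convolution described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `conv_feature_map` and the toy dimensions are hypothetical, the filter is stored as [math] k [/math] rows of length [math] d [/math] (the transpose of the [math] d\times k [/math] layout in the text), and [math] W\cdot e_{i:i+k-1} [/math] is read as the sum of elementwise products over the window.

```python
import math

def conv_feature_map(e, W, bias):
    """Slide one filter over the sentence and return c = [c_1, ..., c_{N-k+1}].

    e    : list of N token feature vectors, each of length d
    W    : one filter, stored as k rows of length d (assumed layout)
    bias : scalar bias term
    """
    k = len(W)
    N = len(e)
    c = []
    for i in range(N - k + 1):
        window = e[i:i + k]  # the k consecutive token vectors e_i .. e_{i+k-1}
        # Assumed reading of W . e_{i:i+k-1}: sum of elementwise products.
        total = sum(w * x
                    for w_row, e_row in zip(W, window)
                    for w, x in zip(w_row, e_row))
        c.append(math.tanh(total + bias))
    return c

# Toy example: N = 4 tokens, d = 2 features per token, window size k = 2.
e = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
W = [[0.5, -0.5], [0.25, 0.25]]
c = conv_feature_map(e, W, bias=0.0)
```

With [math] N=4 [/math] and [math] k=2 [/math], the resulting list `c` has [math] N-k+1=3 [/math] entries, one per window.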

Max pooling is then applied: the largest entry [math] \hat{c}=\max\{c\} [/math] of each mapped feature vector is kept. Different weight filters yield different mapped feature vectors, and each contributes one pooled value. In this way the original sentence [math] e [/math] is converted into a new representation [math] r_{x} [/math] with a fixed length. For example, if there are 5 filters, then 5 features ([math] \hat{c} [/math]) are picked to create [math] r_{x} [/math] for each [math] x [/math].
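As a small illustration of why pooling gives a fixed-length representation, the sketch below applies max pooling to the feature maps of five hypothetical filters for two sentences of different lengths. The function name `max_pool_representation` and all numbers are made up for the example.

```python
def max_pool_representation(feature_maps):
    """Keep the largest entry of each filter's feature map to build r_x."""
    return [max(c) for c in feature_maps]

# Five hypothetical filters applied to a short sentence (3 windows each).
short_maps = [
    [0.1, 0.7, -0.2],
    [-0.5, -0.1, -0.3],
    [0.0, 0.0, 0.9],
    [0.4, 0.2, 0.3],
    [-0.8, 0.6, 0.5],
]
# The same five filters applied to a longer sentence (5 windows each).
long_maps = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [-0.9, -0.8, -0.7, -0.6, -0.5],
    [0.3, 0.1, 0.0, 0.2, 0.1],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.6, -0.6, 0.6, -0.6, 0.6],
]

r_short = max_pool_representation(short_maps)
r_long = max_pool_representation(long_maps)
```

Both `r_short` and `r_long` have length 5, one entry per filter, regardless of how many windows each sentence produced.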

Then the score vector $$s(x)=W^{classes}r_{x}$$ is obtained, which holds one score for each class. The relation between the entities in [math] x [/math] is classified as the class with the highest score. The matrix [math] W^{classes} [/math] is the parameter being trained.
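The scoring step is a matrix-vector product followed by an argmax. A minimal sketch, assuming [math] W^{classes} [/math] is stored as one row per class; the helper names `class_scores` and `predict` and the toy values are hypothetical.

```python
def class_scores(W_classes, r_x):
    """Compute s(x) = W_classes r_x, one score per class."""
    return [sum(w * r for w, r in zip(row, r_x)) for row in W_classes]

def predict(W_classes, r_x):
    """Return the index of the class with the highest score."""
    s = class_scores(W_classes, r_x)
    return max(range(len(s)), key=s.__getitem__)

# Toy example: 3 classes, representation r_x of length 2.
W_classes = [[1.0, 0.0],
             [0.0, 1.0],
             [1.0, 1.0]]
r_x = [0.5, -0.25]

scores = class_scores(W_classes, r_x)
predicted = predict(W_classes, r_x)
```

Here the first class receives the highest score, so `predicted` is index 0.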

To improve performance, "Negative Sampling" is used. Consider a training data point [math] \tilde{x} [/math] with correct class [math] \tilde{y} [/math], and let [math] I=Y\setminus\{\tilde{y}\} [/math] represent the incorrect labels for [math] \tilde{x} [/math]. Basically, two quantities should be minimized: the gap between the positive margin and the correct score, and the negative distance (the negative margin plus the largest score among the incorrect classes). So the loss function is $$L=\log\left(1+e^{\gamma(m^{+}-s(\tilde{x})_{\tilde{y}})}\right)+\log\left(1+e^{\gamma(m^{-}+\max_{y'\in I}s(\tilde{x})_{y'})}\right)$$ with margins [math] m^{+} [/math], [math] m^{-} [/math], and penalty scale factor [math] \gamma [/math]. Training is based on the ACL Anthology corpus, which contains 25,938 papers with 136,772,370 tokens in total, of which 49,600 are unique.
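The loss can be sketched in a few lines of Python. This follows the verbal description above, in which the negative term is the negative margin plus the largest score among the incorrect classes. The function name `ranking_loss` and the values chosen for the margins and [math] \gamma [/math] are illustrative assumptions, not settings reported in the source.

```python
import math

def ranking_loss(scores, correct, m_pos, m_neg, gamma):
    """Two-term margin loss for one training example.

    scores  : list of class scores s(x)
    correct : index of the correct class
    m_pos   : positive margin m^+
    m_neg   : negative margin m^-
    gamma   : penalty scale factor
    """
    # Largest score among the incorrect classes I = Y \ {correct}.
    best_wrong = max(s for y, s in enumerate(scores) if y != correct)
    # Small when the correct score exceeds the positive margin.
    pos_term = math.log(1.0 + math.exp(gamma * (m_pos - scores[correct])))
    # Small when the best incorrect score lies well below -m_neg.
    neg_term = math.log(1.0 + math.exp(gamma * (m_neg + best_wrong)))
    return pos_term + neg_term

# Illustrative values only (assumed, not taken from the source).
m_pos, m_neg, gamma = 2.5, 0.5, 2.0

loss_good = ranking_loss([3.0, -1.0, -2.0], 0, m_pos, m_neg, gamma)
loss_bad = ranking_loss([0.0, 1.0, -2.0], 0, m_pos, m_neg, gamma)
```

In this toy comparison, the example whose correct class scores highest (`loss_good`) incurs a smaller loss than the example where an incorrect class outscores it (`loss_bad`).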