Semantic Relation Classification via Convolutional Neural Network

After featurizing all the words in the sentence, a sentence of length $N$ can be expressed as a sequence $$e=[e_{1},e_{2},\ldots,e_{N}],$$ where each entry $e_{i}$ is the feature vector of the $i$-th token. To apply the convolutional neural network, each window of $k$ consecutive token features $$e_{i:i+k-1}=[e_{i},e_{i+1},\ldots,e_{i+k-1}]$$ is convolved with a weight matrix $W\in\mathbb{R}^{(d^{w}+2d^{wp})\times k}$ to produce a new feature, defined as $$c_{i}=\tanh(W\cdot e_{i:i+k-1}+\mathrm{bias}).$$ This process is applied to every window of length $k$, starting from the first one, producing the mapped feature vector $$c=[c_{1},c_{2},\ldots,c_{N-k+1}].$$

A max-pooling operation is then applied: the value $\hat{c}=\max\{c\}$ is picked from each feature map. With different weight filters, different mapped feature vectors are obtained, so the original sentence $e$ is converted into a new representation $r_{x}$ of fixed length. For example, if there are 5 filters, then the 5 pooled features $\hat{c}$ are concatenated to form $r_{x}$ for each sentence $x$.

Next, the score vector $$s(x)=W^{classes}r_{x}$$ is obtained, which contains a score for each class; the relation between the entities in $x$ is classified as the class with the highest score. The matrix $W^{classes}$ here is the model being trained.

To improve performance, "negative sampling" is used. Given a training example $\tilde{x}$ with correct class $\tilde{y}$, let $I=Y\setminus\{\tilde{y}\}$ denote the set of incorrect labels for $\tilde{x}$. Intuitively, the score of the correct class should be pushed above the positive margin, while the highest score among the incorrect classes (the second-largest score when the classifier is right) should be pushed below the negative of the negative margin. The loss function is therefore $$L=\log\left(1+e^{\gamma\left(m^{+}-s(x)_{\tilde{y}}\right)}\right)+\log\left(1+e^{\gamma\left(m^{-}+\max_{y'\in I}s(x)_{y'}\right)}\right)$$ with margins $m^{+}$ and $m^{-}$ and penalty scale factor $\gamma$. The whole training is based on the ACL Anthology corpus, which contains 25,938 papers with 136,772,370 tokens in total, 49,600 of them unique.
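As a minimal sketch of the convolution, max-pooling, and scoring pipeline described above (NumPy; the sentence length $N$, feature size $d^{w}+2d^{wp}$, window width $k$, filter count, and class count are illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: N tokens, token features of size d = d_w + 2*d_wp,
# window width k, 5 filters, 3 relation classes.
N, d, k, n_filters, n_classes = 7, 10, 3, 5, 3

e = rng.normal(size=(N, d))                   # token features e_1 ... e_N
filters = rng.normal(size=(n_filters, d, k))  # one weight matrix W per filter
biases = rng.normal(size=n_filters)
W_classes = rng.normal(size=(n_classes, n_filters))

# Convolution: c_i = tanh(<W, e_{i:i+k-1}> + bias) for every window i,
# then max pooling keeps each filter's strongest response.
r_x = np.empty(n_filters)
for f in range(n_filters):
    c = [np.tanh(np.sum(filters[f] * e[i:i + k].T) + biases[f])
         for i in range(N - k + 1)]
    r_x[f] = max(c)

s = W_classes @ r_x                           # score vector s(x)
print("class scores:", s, "-> predicted class:", int(np.argmax(s)))
```

Note that each filter contributes exactly one entry of $r_{x}$, so the length of the representation equals the number of filters regardless of the sentence length.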
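The ranking loss above can likewise be sketched directly; the margin and $\gamma$ values here are placeholder hyperparameters, not ones reported in the source:

```python
import numpy as np

def ranking_loss(s, y_true, m_pos=2.5, m_neg=0.5, gamma=2.0):
    """L = log(1 + e^{gamma*(m_pos - s_y)})
         + log(1 + e^{gamma*(m_neg + s_wrong)}),
    where s_wrong is the best-scoring incorrect class."""
    s_correct = s[y_true]
    s_wrong = np.max(np.delete(s, y_true))  # max over I = Y \ {y_true}
    return (np.log1p(np.exp(gamma * (m_pos - s_correct)))
            + np.log1p(np.exp(gamma * (m_neg + s_wrong))))

scores = np.array([3.1, -2.0, -1.5])  # s(x) for three classes
# Low loss: the correct score exceeds m_pos and every incorrect
# score sits below -m_neg, so both margin terms are nearly inactive.
print(ranking_loss(scores, y_true=0))
```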