Semantic Relation Classification——via Convolution Neural Network
After featurizing all words in the sentence, a sentence of length <math> N </math> can be expressed as a vector of length <math> N </math>, which looks like
 
$$e=[e_{1},e_{2},\ldots,e_{N}]$$
 
 
and each entry represents a token of the word. Also, to apply a convolutional neural network, each subset of features
 
$$e_{i:i+k-1}=[e_{i},e_{i+1},\ldots,e_{i+k-1}]$$
is given to a weight matrix <math> W\in\mathbb{R}^{(d^{w}+2d^{wp})\times k} </math> to
 
produce a new feature, defined as
 
$$c_{i}=\tanh(W\cdot e_{i:i+k-1}+\text{bias})$$
This process is applied to all subsets of features with length <math> k </math> starting  
 
from the first one. Then a mapped feature vector
 
$$c=[c_{1},c_{2},\ldots,c_{N-k+1}]$$
 
 
is produced.
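The convolution step above can be sketched in NumPy (the dimensions and the explicit loop are illustrative only; a real implementation would use an optimized convolution):

```python
import numpy as np

def conv_feature_map(e, W, b):
    """Slide a length-k window over the per-token feature vectors and
    compute c_i = tanh(sum(W * e_{i:i+k-1}) + b) at each position."""
    d, k = W.shape                  # d = d^w + 2*d^wp feature dims, k = window length
    N = e.shape[0]                  # sentence length
    c = np.empty(N - k + 1)
    for i in range(N - k + 1):
        window = e[i:i + k].T       # (d, k) block of features e_{i:i+k-1}
        c[i] = np.tanh(np.sum(W * window) + b)
    return c                        # mapped feature vector [c_1, ..., c_{N-k+1}]
```

With a sentence of 5 tokens and a window of length 2, this yields the 4-entry mapped feature vector the text describes.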
Then the max pooling operation is applied, and <math> \hat{c}=\max\{c\} </math> is picked.
 
With different weight filters, different mapped feature vectors can be obtained. Finally, the original
sentence <math> e </math> can be converted into a new representation <math> r_{x} </math> with a fixed length. For example, if there are 5 filters,
then there are 5 features (<math> \hat{c} </math>) picked to create <math> r_{x} </math> for each <math> x </math>.
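The pooling step for the 5-filter example can be sketched as follows (the feature-map values are made up for illustration):

```python
import numpy as np

# Five hypothetical mapped feature vectors c, one per filter; each has
# length N - k + 1 for a sentence of length N (here N - k + 1 = 3).
feature_maps = [np.array([0.1, 0.9, 0.3]),
                np.array([-0.5, 0.2, 0.0]),
                np.array([0.7, 0.7, 0.1]),
                np.array([0.0, -0.2, 0.4]),
                np.array([0.6, 0.3, 0.5])]

# Max pooling keeps one value c_hat per filter, so r_x has fixed length 5
# no matter how long the original sentence is.
r_x = np.array([c.max() for c in feature_maps])
```

This is why <math> r_{x} </math> has a fixed length: it depends only on the number of filters, not on the sentence length.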
 
Then, the score vector  
 
$$s(x)=W^{classes}r_{x}$$
 
is obtained, which represents the score for each class; a given <math> x </math>'s entity relation is classified as the class with the highest score. The weight matrix <math> W^{classes} </math> here is what is being trained.
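The scoring and classification step amounts to a single matrix-vector product followed by an argmax; a minimal sketch (the sizes and random values are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_filters = 4, 5                              # illustrative sizes
W_classes = rng.standard_normal((n_classes, n_filters))  # the trained parameters
r_x = rng.standard_normal(n_filters)                     # fixed-length representation of x

s = W_classes @ r_x             # score vector s(x): one score per relation class
predicted = int(np.argmax(s))   # classify as the highest-scoring class
```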
 
 
To improve the performance, “negative sampling” was used. Given a training data point <math> \tilde{x} </math> and its correct class <math> \tilde{y} </math>, let <math> I=Y\setminus\{\tilde{y}\} </math> represent the incorrect labels for <math> \tilde{x} </math>. Roughly speaking, training pushes the score of the correct class above a positive margin while pushing the highest score among the incorrect classes below a negative margin. So the loss function is
 
$$L=\log(1+e^{\gamma(m^{+}-s(x)_{\tilde{y}})})+\log(1+e^{\gamma(m^{-}+\max_{y'\in I}s(x)_{y'})})$$
with margins <math> m^{+} </math> and <math> m^{-} </math>, and penalty scale factor <math> \gamma </math>.
 
The whole training is based on the ACL Anthology corpus, which contains 25,938 papers with 136,772,370 tokens in total, 49,600 of which are unique.

Revision as of 15:00, 22 November 2020
