Semantic Relation Classification via Convolution Neural Network
After featurizing all words in the sentence, a sentence of length <math> N </math> can be expressed as a vector of the same length, which looks like
$$e=[e_{1},e_{2},\ldots,e_{N}]$$
where each entry represents a token of the sentence. Also, to apply a convolutional neural network, each subset of features
$$e_{i:i+k-1}=[e_{i},e_{i+1},\ldots,e_{i+k-1}]$$
is given to a weight matrix <math> W\in\mathbb{R}^{(d^{w}+2d^{wp})\times k} </math> to produce a new feature, defined as
$$c_{i}=\tanh(W\cdot e_{i:i+k-1}+\mathrm{bias})$$
This process is applied to every subset of features with length <math> k </math>, starting from the first one. Then a mapped feature vector
$$c=[c_{1},c_{2},\ldots,c_{N-k+1}]$$
is produced.
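
To make the window-convolution step concrete, the sketch below slides a length-<math> k </math> window over the token features and applies <math> c_{i}=\tanh(W\cdot e_{i:i+k-1}+\mathrm{bias}) </math>. This is only a minimal NumPy illustration, not the authors' code: the sizes <math> d^{w}=4 </math>, <math> d^{wp}=2 </math>, <math> k=3 </math> are assumptions, and <math> W\cdot e_{i:i+k-1} </math> is read as an elementwise (Frobenius) inner product so that each window yields one scalar.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative sizes (assumptions, not taken from the paper's configuration)
d_w, d_wp, k, N = 4, 2, 3, 10           # word dim, position dim, filter width, sentence length
d = d_w + 2 * d_wp                      # per-token feature size: word embedding plus two position embeddings

rng = np.random.default_rng(0)
e = rng.normal(size=(N, d))             # featurized sentence: one row of features per token
W = rng.normal(size=(d, k))             # one convolution filter, W in R^{(d^w + 2 d^wp) x k}
bias = 0.1

def feature_map(e, W, bias, k):
    """Apply c_i = tanh(<W, e_{i:i+k-1}> + bias) to every length-k window of tokens."""
    n_tokens = e.shape[0]
    # Each window e[i:i+k] has shape (k, d); transpose to (d, k) to match W and take
    # the elementwise inner product, giving one scalar feature per window position.
    return np.array([np.tanh(np.sum(W * e[i:i + k].T) + bias)
                     for i in range(n_tokens - k + 1)])

c = feature_map(e, W, bias, k)
print(c.shape)                          # (N - k + 1,): the mapped feature vector c
</syntaxhighlight>
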
Next, the max pooling operation is applied, and <math> \hat{c}=\max\{c\} </math> is picked. With different weight filters, different mapped feature vectors can be obtained. Finally, the original sentence <math> e </math> can be converted into a new representation <math> r_{x} </math> with a fixed length. For example, if there are 5 filters, then there are 5 features (<math> \hat{c} </math>) picked to create <math> r_{x} </math> for each <math> x </math>.
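
The pooling step can be sketched in the same spirit: each filter's feature map is reduced to its maximum value <math> \hat{c} </math>, and stacking the pooled values from several filters gives the fixed-length representation <math> r_{x} </math>. The 5-filter setting mirrors the example above; all other sizes are illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
d, k, N, n_filters = 8, 3, 10, 5          # feature size, filter width, sentence length, number of filters (assumed)

e = rng.normal(size=(N, d))               # featurized sentence
filters = rng.normal(size=(n_filters, d, k))
biases = rng.normal(size=n_filters)

def pooled_feature(e, W, b, k):
    """Max-over-time pooling: c_hat = max_i tanh(<W, e_{i:i+k-1}> + b)."""
    scores = [np.tanh(np.sum(W * e[i:i + k].T) + b) for i in range(e.shape[0] - k + 1)]
    return max(scores)

# One pooled value per filter -> fixed-length sentence representation r_x.
r_x = np.array([pooled_feature(e, W, b, k) for W, b in zip(filters, biases)])
print(r_x.shape)                          # (5,), independent of the sentence length N
</syntaxhighlight>
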
Then, the score vector
$$s(x)=W^{classes}r_{x}$$
is obtained, which represents the score for each class; the relation between <math> x </math>'s entities will be classified as the one with the highest score. The <math> W^{classes} </math> here is the model being trained.
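
The scoring step itself is a single matrix-vector product followed by an argmax. In the sketch below, the number of relation classes and the entries of <math> W^{classes} </math> are stand-in assumptions; in the actual model <math> W^{classes} </math> is learned.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
n_classes, repr_dim = 4, 5                            # assumed: 4 relation classes, r_x from 5 filters

W_classes = rng.normal(size=(n_classes, repr_dim))    # stand-in for the trained W^{classes}
r_x = rng.normal(size=repr_dim)                       # fixed-length sentence representation

s_x = W_classes @ r_x                                 # s(x) = W^{classes} r_x: one score per class
predicted = int(np.argmax(s_x))                       # the relation is the class with the highest score
print(s_x.round(3), predicted)
</syntaxhighlight>
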
To improve the performance, "Negative Sampling" was used. Given a training data point <math> x </math> and its correct class <math> y </math>, let <math> I=Y\setminus\{y\} </math> represent the incorrect labels for <math> x </math>. Intuitively, the score of the correct class should be pushed above the positive margin <math> m^{+} </math>, while the largest score among the incorrect labels should be pushed below <math> -m^{-} </math>. So the loss function is
$$L=\log\left(1+e^{\gamma(m^{+}-s(x)_{y})}\right)+\log\left(1+e^{\gamma\left(m^{-}+\max_{y'\in I}s(x)_{y'}\right)}\right)$$
with margins <math> m^{+} </math> and <math> m^{-} </math>, and penalty scale factor <math> \gamma </math>.
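
This ranking loss can be transcribed directly; the sketch below uses assumed values for the margins and the scale factor, since the text does not fix them, and the highest-scoring incorrect label plays the role of <math> \max_{y'\in I}s(x)_{y'} </math>.

<syntaxhighlight lang="python">
import numpy as np

def ranking_loss(s_x, y, m_plus=2.5, m_minus=0.5, gamma=2.0):
    """L = log(1 + e^{gamma (m+ - s(x)_y)}) + log(1 + e^{gamma (m- + max_{y' in I} s(x)_{y'})})."""
    s_correct = s_x[y]
    s_best_wrong = np.max(np.delete(s_x, y))   # best score among the incorrect labels I = Y \ {y}
    return (np.log1p(np.exp(gamma * (m_plus - s_correct)))
            + np.log1p(np.exp(gamma * (m_minus + s_best_wrong))))

# Example: scores for 4 classes, correct class is index 0.
print(ranking_loss(np.array([1.2, -0.3, 0.8, -1.0]), y=0))
</syntaxhighlight>
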
The whole training is based on the ACL Anthology corpus, which contains 25,938 papers with 136,772,370 tokens in total, of which 49,600 are unique.